educe.external package

Interacting with annotations from 3rd party tools

Submodules

educe.external.coref module

Coreference chain output in the form of educe standoff annotations (at least as emitted by Stanford’s CoreNLP pipeline)

A coreference chain is considered to be a set of mentions. Each mention contains a set of tokens.

class educe.external.coref.Chain(mentions)

Bases: educe.annotation.Standoff

Chain of coreferences

class educe.external.coref.Mention(tokens, head, most_representative=False)

Bases: educe.annotation.Standoff

Mention of an entity
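
A minimal sketch of how these classes fit together. The tokens, offsets, and head choice below are invented for illustration, and we assume standoff tokens from educe.external.postag and spans from educe.annotation:

from educe.annotation import Span
from educe.external.coref import Chain, Mention
from educe.external.postag import RawToken, Token

# two standoff tokens for the phrase "the dog" (offsets are invented)
toks = [Token(RawToken('the', 'DT'), Span(0, 3)),
        Token(RawToken('dog', 'NN'), Span(4, 7))]

# a mention groups its tokens and singles one out as the head
mention = Mention(toks, head=toks[-1])

# a chain is a set of coreferring mentions
chain = Chain([mention])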

educe.external.corenlp module

Annotations from the CoreNLP pipeline

class educe.external.corenlp.CoreNlpDocument(tokens, trees, deptrees, chains)

Bases: educe.annotation.Standoff

All of the CoreNLP annotations for a particular document as instances of educe.annotation.Standoff or as structures that contain such instances.

class educe.external.corenlp.CoreNlpToken(t, offset, origin=None)

Bases: educe.external.postag.Token

A single token and its POS tag.

features

dict from str to str – Additional info found by CoreNLP about the token (e.g. x.features['lemma'])

class educe.external.corenlp.CoreNlpWrapper(corenlp_dir)

Bases: object

Wrapper for the CoreNLP parsing system.

process(txt_files, outdir, properties=[])

Run CoreNLP on text files

Parameters:
  • txt_files (list of strings) – Input files
  • outdir (string) – Output dir
  • properties (list of strings, optional) – Properties to control the behaviour of CoreNLP
Returns:

corenlp_outdir – Directory containing CoreNLP’s output files

Return type:

string
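
A hypothetical invocation; the paths are placeholders, and ssplit.eolonly is a standard CoreNLP property passed through unchanged:

from educe.external.corenlp import CoreNlpWrapper

# point the wrapper at an unpacked CoreNLP distribution
wrapper = CoreNlpWrapper('/opt/corenlp')

# run the pipeline over two input files
outdir = wrapper.process(['doc1.txt', 'doc2.txt'], 'tmp/corenlp',
                         properties=['ssplit.eolonly=true'])
# `outdir` names the directory holding CoreNLP's output files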

educe.external.parser module

Syntactic parser output into educe standoff annotations (at least as emitted by Stanford’s CoreNLP pipeline)

This currently builds off the NLTK Tree class, but if the NLTK dependency proves too heavy, we could consider doing without.

class educe.external.parser.ConstituencyTree(node, children, origin=None)

Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff

A variant of the NLTK Tree data structure which can be treated as an educe Standoff annotation.

This can be useful for representing syntactic parse trees in a way that can be later queried on the basis of Span enclosure.

Note that all children must have a span member of type Span.

The subtrees() function can be useful here.

classmethod build(tree, tokens)

Build an educe tree by combining an existing NLTK tree with some replacement leaves.

The replacement leaves should correspond 1:1 to the leaves of the original tree (for example, they may contain features related to those words).

Parameters:
  • tree (nltk.Tree) – Original NLTK tree.
  • tokens (iterable of Token) – Sequence of replacement leaves.
Returns:

ctree – ConstituencyTree where the internal nodes have the same labels as in the original NLTK tree and the leaves correspond to the given sequence of tokens.

Return type:

ConstituencyTree
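
For instance, one might rebuild an NLTK parse over "a dog" with standoff tokens as leaves. The spans below are invented, and we assume Token and Span as defined elsewhere in educe:

import nltk

from educe.annotation import Span
from educe.external.parser import ConstituencyTree
from educe.external.postag import RawToken, Token

# an ordinary NLTK tree whose leaves we want to replace
nltk_tree = nltk.Tree.fromstring('(NP (DT a) (NN dog))')

# replacement leaves, one per original leaf, with character spans
leaves = [Token(RawToken('a', 'DT'), Span(0, 1)),
          Token(RawToken('dog', 'NN'), Span(2, 5))]

ctree = ConstituencyTree.build(nltk_tree, leaves)
print(ctree.text_span())  # computed from the leaf tokens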

text_span()

Note: doc is ignored here

class educe.external.parser.DependencyTree(node, children, link, origin=None)

Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff

A variant of the NLTK Tree data structure for the representation of dependency trees. The dependency tree is also considered a Standoff annotation but not quite in the same way that a constituency tree might be. The spans roughly indicate the range covered by the tokens in the subtree (this glosses over any gaps). They are mostly useful for determining if the tree (at its root node) pertains to any given sentence based on its offsets.

Fields:

  • node is an annotation of type educe.annotation.Standoff
  • link is a string representing the link label between this node and its governor; None for the root node
classmethod build(deps, nodes, k, link=None)

Given two dictionaries

  • mapping node ids to a list of (link label, child node id) pairs
  • mapping node ids to some representation of those nodes

and the id for the root node, build a tree representation of the dependency tree
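
A toy illustration of the expected inputs. The node representations here are plain strings for readability; in practice they would be standoff annotations:

from educe.external.parser import DependencyTree

# child map: node id -> list of (link label, child node id)
deps = {0: [('root', 1)],
        1: [('nsubj', 2)],
        2: []}

# node map: node id -> some representation of the node
# (the 'ROOT' entry is illustrative)
nodes = {0: 'ROOT',
         1: 'sleeps',
         2: 'dogs'}

# 0 is the id of the root node
dtree = DependencyTree.build(deps, nodes, 0)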

is_root()

This is a dependency tree root (has a special node)

class educe.external.parser.SearchableTree(node, children)

Bases: nltk.tree.Tree

A tree with helper search functions

depth_first_iterator()

Iterate over the nodes of the tree, depth-first, pre-order.

topdown(pred, prunable=None)

Searching from the top down, return the biggest subtrees for which the predicate is True (or empty list if none are found).

The optional prunable function can be used to throw out subtrees for more efficient search (note that pred always overrides prunable though). Note that leaf nodes are ignored.

topdown_smallest(pred, prunable=None)

Searching from the top down, return the smallest subtrees for which the predicate is True (or empty list if none are found).

This is almost the same as topdown, except that if a subtree matches, we check for smaller matches in its subtrees.

Note that leaf nodes are ignored.
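
For example, assuming the usual NLTK Tree API (fromstring, label) is available on SearchableTree:

from educe.external.parser import SearchableTree

tree = SearchableTree.fromstring(
    '(S (NP (DT the) (NN dog)) (VP (VBZ sleeps)))')

# largest NP subtrees (leaves are never considered)
biggest = tree.topdown(lambda t: t.label() == 'NP')

# smallest NP subtrees: keeps descending into matching subtrees
smallest = tree.topdown_smallest(lambda t: t.label() == 'NP')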

educe.external.postag module

CONLL formatted POS tagger output into educe standoff annotations (at least as emitted by CMU’s ark-tweet-nlp).

Files are assumed to be UTF-8 encoded.

Note: NLTK has a CONLL reader too which looks a lot more general than this one

exception educe.external.postag.EducePosTagException(*args, **kw)

Bases: exceptions.Exception

Exceptions that arise during POS tagging or when reading POS tag resources

class educe.external.postag.RawToken(word, tag)

Bases: object

A token with a part of speech tag associated with it

class educe.external.postag.Token(tok, span)

Bases: educe.external.postag.RawToken, educe.annotation.Standoff

A token with a part of speech tag and some character offsets associated with it.

classmethod left_padding()

Return a special Token for left padding

educe.external.postag.generic_token_spans(text, tokens, offset=0, txtfn=None)

Given a string and a sequence of substrings within that string, infer a span for each of the substrings.

We infer these spans by walking the text as we consume substrings, skipping over any whitespace (including whitespace within the tokens themselves). For this to work, the substring sequence must be identical to the text modulo whitespace.

Spans are relative to the start of the string itself, but can be shifted by passing an offset (the start of the original string’s span). Empty tokens are accepted but have a zero-length span.

Note: this function is lazy so you can use it incrementally provided you can generate the tokens lazily too

You probably want token_spans instead; this function is meant for similar tasks outside of POS tagging

Parameters:
  • txtfn (function, optional) – Function to extract text from a token (default None, treated as the identity function)
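
A small sketch: the tokens here are plain dicts (names made up), so txtfn tells the function how to get at their text:

from educe.external.postag import generic_token_spans

text = 'hello  world'
tokens = [{'form': 'hello'}, {'form': 'world'}]

# lazily infer one span per token; txtfn extracts the token text
spans = list(generic_token_spans(text, tokens,
                                 txtfn=lambda t: t['form']))
# spans[0] covers 'hello' and spans[1] covers 'world',
# regardless of the extra whitespace between them
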
educe.external.postag.read_token_file(fname)

Return a list of lists of RawToken

The input file format is what I believe to be the CONLL format (at least as emitted by the CMU Twitter POS tagger)
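
A hypothetical call (the file name is made up, and we assume RawToken exposes word and tag attributes):

from educe.external.postag import read_token_file

# one list of RawToken per sentence in the file
for sentence in read_token_file('tweets.conll'):
    print([(tok.word, tok.tag) for tok in sentence])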

educe.external.postag.token_spans(text, tokens, offset=0)

Given a string and a sequence of RawToken representing tokens in that string, infer the span for each token. Return the results as a sequence of Token objects.

We infer these spans by walking the text as we consume tokens, and skipping over any whitespace in between. For this to work, the raw token text must be identical to the text modulo whitespace.

Spans are relative to the start of the string itself, but can be shifted by passing an offset (the start of the original string’s span).

Parameters:
  • text (str) – Base text.
  • tokens (sequence of RawToken) – Sequence of raw tokens in the text.
  • offset (int, defaults to 0) – Offset for spans.
Returns:

res – Sequence of proper educe Tokens with their span.

Return type:

list of Token
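
A short sketch of the round trip from raw tokens to standoff tokens (we assume the resulting Token objects expose a span attribute):

from educe.external.postag import RawToken, token_spans

text = 'John   sleeps'
raw = [RawToken('John', 'NNP'), RawToken('sleeps', 'VBZ')]

for tok in token_spans(text, raw):
    # each result pairs the original word/tag with character offsets
    print(tok.word, tok.tag, tok.span)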

educe.external.stanford_xml_reader module

Reader for Stanford CoreNLP pipeline outputs

Example of output:

<document>
  <sentences>
    <sentence id="1">
      <tokens>
      ...
      <token id="19">
      <word>direction</word>
      <lemma>direction</lemma>
      <CharacterOffsetBegin>135</CharacterOffsetBegin>
      <CharacterOffsetEnd>144</CharacterOffsetEnd>
      <POS>NN</POS>
      </token>
      <token id="20">
      <word>.</word>
      <lemma>.</lemma>
      <CharacterOffsetBegin>144</CharacterOffsetBegin>
      <CharacterOffsetEnd>145</CharacterOffsetEnd>
      <POS>.</POS>
      </token>
      ...
      <parse>(ROOT (S (PP (IN For) (NP (NP (DT a) (NN look)) (PP (IN at) (SBAR (WHNP (WP what)) (S (VP (MD might) (VP (VB lie) (ADVP (RB ahead)) (PP (IN for) (NP (NNP U.S.) (NNS forces)))))))))) (, ,) (VP (VB let) (S (NP (POS 's)) (VP (VB turn) (PP (TO to) (NP (NP (PRP$ our) (NNP Miles) (NNP O'Brien)) (PP (IN in) (NP (NNP Atlanta)))))))) (. .))) </parse>
      <basic-dependencies>
        <dep type="prep">
          <governor idx="13">let</governor>
          <dependent idx="1">For</dependent>
        </dep>
        ...
      </basic-dependencies>
      <collapsed-dependencies>
        <dep type="det">
          <governor idx="3">look</governor>
          <dependent idx="2">a</dependent>
        </dep>
        ...
      </collapsed-dependencies>
      <collapsed-ccprocessed-dependencies>
        <dep type="det">
          <governor idx="3">look</governor>
          <dependent idx="2">a</dependent>
        </dep>
        ...
      </collapsed-ccprocessed-dependencies>
    </sentence>
  </sentences>
</document>

IMPORTANT: Note that the Stanford pipeline uses RHS-inclusive offsets.

class educe.external.stanford_xml_reader.PreprocessingSource(encoding='utf-8')

Bases: object

Reads in document annotations produced by CoreNLP pipeline.

This works as a stateful object that stores and provides access to all annotations contained in a CoreNLP output file, once the read method has been called.

get_coref_chains()

Get all coreference chains

get_document_id()

Get the document id

get_offset2sentence_map()

Get the offset to each sentence

get_offset2token_maps()

Get the offset to each token

get_ordered_sentence_list(sort_attr='extent')

Get the list of sentences, ordered by sort_attr

get_ordered_token_list(sort_attr='extent')

Get the list of tokens, ordered by sort_attr

get_sentence_annotations()

Get the annotations of all sentences

get_token_annotations()

Get the annotations of all tokens

read(base_file, suffix='.raw.stanford')

Read and store the annotations from CoreNLP’s output.

This function does not return anything, it modifies the state of the object to store the annotations.
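
Typical usage might look like this. The file names are invented, and we assume read looks for base_file plus the given suffix:

from educe.external.stanford_xml_reader import PreprocessingSource

reader = PreprocessingSource()
# reads annotations from 'mydoc.txt.raw.stanford'
reader.read('mydoc.txt', suffix='.raw.stanford')

sentences = reader.get_ordered_sentence_list()
tokens = reader.get_ordered_token_list()
chains = reader.get_coref_chains()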

educe.external.stanford_xml_reader.test_file(base_filename, suffix='.raw.stanford')

Test that a file is effectively readable and print sentences

educe.external.stanford_xml_reader.xml_unescape(_str)

Get a proper string where special XML characters are unescaped.

Notes

You can also use xml.sax.saxutils.unescape