educe.external package¶
Interacting with annotations from third-party tools
Submodules¶
educe.external.coref module¶
Coreference chain output in the form of educe standoff annotations (at least as emitted by Stanford’s CoreNLP pipeline)
A coreference chain is considered to be a set of mentions. Each mention contains a set of tokens.
- class educe.external.coref.Chain(mentions)¶
Bases: educe.annotation.Standoff
Chain of coreferences.
- class educe.external.coref.Mention(tokens, head, most_representative=False)¶
Bases: educe.annotation.Standoff
Mention of an entity.
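The chain-of-mentions structure described above can be mimicked with plain Python classes. The sketch below uses illustrative stand-ins, not the actual educe implementations (which are Standoff annotations):

```python
# Minimal stand-ins for the structures described above: a chain is a
# set of mentions, and each mention holds a collection of tokens.
# These classes are illustrative, not the real educe classes.

class Mention(object):
    def __init__(self, tokens, head, most_representative=False):
        self.tokens = tokens
        self.head = head
        self.most_representative = most_representative

class Chain(object):
    def __init__(self, mentions):
        self.mentions = mentions

m1 = Mention(["Barack", "Obama"], "Obama", most_representative=True)
m2 = Mention(["he"], "he")
chain = Chain([m1, m2])

heads = [m.head for m in chain.mentions]
```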
educe.external.corenlp module¶
Annotations from the CoreNLP pipeline
- class educe.external.corenlp.CoreNlpDocument(tokens, trees, deptrees, chains)¶
Bases: educe.annotation.Standoff
All of the CoreNLP annotations for a particular document, as instances of educe.annotation.Standoff or as structures that contain such instances.
- class educe.external.corenlp.CoreNlpToken(t, offset, origin=None)¶
Bases: educe.external.postag.Token
A single token and its POS tag.
- features¶
dict from str to str – additional information found by CoreNLP about the token (e.g. x.features['lemma'])
- class educe.external.corenlp.CoreNlpWrapper(corenlp_dir)¶
Bases: object
Wrapper for the CoreNLP parsing system.
- process(txt_files, outdir, properties=[])¶
Run CoreNLP on text files.
Parameters:
- txt_files (list of strings) – input files
- outdir (string) – output directory
- properties (list of strings, optional) – properties to control the behaviour of CoreNLP
Returns: corenlp_outdir (string) – directory containing CoreNLP's output files
educe.external.parser module¶
Syntactic parser output into educe standoff annotations (at least as emitted by Stanford's CoreNLP pipeline).
This currently builds off the NLTK Tree class, but if the NLTK dependency proves too heavy, we could consider doing without.
- class educe.external.parser.ConstituencyTree(node, children, origin=None)¶
Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff
A variant of the NLTK Tree data structure which can be treated as an educe Standoff annotation.
This can be useful for representing syntactic parse trees in a way that can later be queried on the basis of Span enclosure.
Note that all children must have a span member of type Span.
The subtrees() function can be useful here.
- classmethod build(tree, tokens)¶
Build an educe tree by combining an existing NLTK tree with some replacement leaves.
The replacement leaves should correspond 1:1 to the leaves of the original tree (for example, they may contain features related to those words).
Parameters:
- tree (nltk.Tree) – original NLTK tree
- tokens (iterable of Token) – sequence of replacement leaves
Returns: ctree – ConstituencyTree where the internal nodes have the same labels as in the original NLTK tree and the leaves correspond to the given sequence of tokens.
Return type: ConstituencyTree
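The 1:1 leaf-replacement recipe behind build() can be sketched independently of NLTK. Here trees are modelled as (label, children) tuples with string leaves; the real classmethod operates on nltk.Tree instances and educe Tokens:

```python
# Sketch of the leaf-replacement idea behind build(): walk a tree and
# substitute each leaf with the next item drawn from a replacement
# sequence, keeping internal node labels unchanged.

def replace_leaves(tree, leaves):
    """Rebuild `tree`, drawing leaves in order from the iterator `leaves`."""
    if isinstance(tree, tuple):
        label, children = tree
        return (label, [replace_leaves(c, leaves) for c in children])
    return next(leaves)  # a leaf: consume one replacement

orig = ('S', [('NP', ['dogs']), ('VP', ['bark'])])
tokens = iter([('dogs', 'NNS'), ('bark', 'VBP')])
new_tree = replace_leaves(orig, tokens)
```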
- text_span()¶
Note: doc is ignored here
class
educe.external.parser.
DependencyTree
(node, children, link, origin=None)¶ Bases:
educe.external.parser.SearchableTree
,educe.annotation.Standoff
A variant of the NLTK Tree data structure for the representation of dependency trees. The dependency tree is also considered a Standoff annotation but not quite in the same way that a constituency tree might be. The spans roughly indicate the range covered by the tokens in the subtree (this glosses over any gaps). They are mostly useful for determining if the tree (at its root node) pertains to any given sentence based on its offsets.
Fields:
- node is an some annotation of type educe.annotation.Standoff
- link is a string representing the link label between this node and its governor; None for the root node
- classmethod build(deps, nodes, k, link=None)¶
Given two dictionaries:
- deps, mapping node ids to lists of (link label, child node id) pairs
- nodes, mapping node ids to some representation of those nodes
and k, the id of the root node, build a tree representation of the dependency tree.
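The recursion behind build() can be sketched with plain dicts. This is a hedged stand-in: the real classmethod returns DependencyTree instances, not dicts:

```python
# Sketch of the build() recipe: deps maps node ids to lists of
# (link label, child id) pairs, nodes maps ids to payloads, and k is
# the root id.  Each child is built recursively, carrying the label
# of the link to its governor; the root keeps link=None.

def build_deptree(deps, nodes, k, link=None):
    children = [build_deptree(deps, nodes, child_id, link=lbl)
                for lbl, child_id in deps.get(k, [])]
    return {'node': nodes[k], 'link': link, 'children': children}

deps = {0: [('root', 2)],
        2: [('nsubj', 1), ('dobj', 3)]}
nodes = {0: 'ROOT', 1: 'cats', 2: 'chase', 3: 'mice'}
tree = build_deptree(deps, nodes, 0)
```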
- is_root()¶
This is a dependency tree root (has a special node).
- class educe.external.parser.SearchableTree(node, children)¶
Bases: nltk.tree.Tree
A tree with helper search functions.
- depth_first_iterator()¶
Iterate on the nodes of the tree, depth-first, pre-order.
- topdown(pred, prunable=None)¶
Searching from the top down, return the biggest subtrees for which the predicate is True (or an empty list if none are found).
The optional prunable function can be used to throw out subtrees for more efficient search (note that pred always overrides prunable). Leaf nodes are ignored.
- topdown_smallest(pred, prunable=None)¶
Searching from the top down, return the smallest subtrees for which the predicate is True (or an empty list if none are found).
This is almost the same as topdown, except that if a subtree matches, we check for smaller matches in its subtrees. Leaf nodes are ignored.
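The topdown search can be sketched on plain (label, children) tuples. This is an illustrative reimplementation of the documented behaviour, not the SearchableTree code itself:

```python
# Sketch of topdown(): return the biggest (closest-to-root) subtrees
# whose label satisfies pred, without descending into a matching
# subtree.  pred is checked before prunable, so pred overrides it;
# leaves (plain strings) are ignored, as documented.

def topdown(tree, pred, prunable=None):
    if not isinstance(tree, tuple):      # leaf node: ignored
        return []
    if pred(tree):                       # biggest match: stop descending
        return [tree]
    if prunable is not None and prunable(tree):
        return []                        # pruned for efficiency
    label, children = tree
    results = []
    for child in children:
        results.extend(topdown(child, pred, prunable))
    return results

tree = ('S', [('NP', ['dogs']), ('VP', [('NP', ['cats'])])])
nps = topdown(tree, lambda t: t[0] == 'NP')
```

topdown_smallest differs only in that a matching subtree is still searched for smaller matches before being returned itself.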
educe.external.postag module¶
CONLL formatted POS tagger output into educe standoff annotations (at least as emitted by CMU's ark-tweet-nlp).
Files are assumed to be UTF-8 encoded.
Note: NLTK has a CONLL reader too which looks a lot more general than this one
- exception educe.external.postag.EducePosTagException(*args, **kw)¶
Bases: exceptions.Exception
Exceptions that arise during POS tagging or when reading POS tag resources.
- class educe.external.postag.RawToken(word, tag)¶
Bases: object
A token with a part-of-speech tag associated with it.
- class educe.external.postag.Token(tok, span)¶
Bases: educe.external.postag.RawToken, educe.annotation.Standoff
A token with a part-of-speech tag and some character offsets associated with it.
- classmethod left_padding()¶
Return a special Token for left padding.
- educe.external.postag.generic_token_spans(text, tokens, offset=0, txtfn=None)¶
Given a string and a sequence of substrings within that string, infer a span for each of the substrings.
We infer these spans by walking the text as we consume substrings, skipping over any whitespace (including whitespace within the tokens). For this to work, the substring sequence must be identical to the text modulo whitespace.
Spans are relative to the start of the string itself, but can be shifted by passing an offset (the start of the original string's span). Empty tokens are accepted but have a zero-length span.
Note: this function is lazy, so you can use it incrementally, provided you can generate the tokens lazily too.
You probably want token_spans instead; this function is meant to be used for similar tasks outside of POS tagging.
Parameters: txtfn – function to extract text from a token (default None, treated as the identity function)
- educe.external.postag.read_token_file(fname)¶
Return a list of lists of RawToken.
The input file format is what I believe to be the CONLL format (at least as emitted by the CMU Twitter POS tagger).
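A hedged sketch of reading such a file: the exact column layout used by the CMU tagger is an assumption here (one token per line as word<TAB>tag, blank lines separating sentences), and the real read_token_file() returns lists of educe RawToken objects rather than tuples:

```python
# Illustrative CONLL-style reader: one "word<TAB>tag" token per line,
# blank lines as sentence boundaries.  Returns a list of sentences,
# each a list of (word, tag) pairs.

def read_token_lines(lines):
    sentences, current = [], []
    for line in lines:
        line = line.rstrip('\n')
        if not line:                 # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
        else:
            word, tag = line.split('\t')[:2]
            current.append((word, tag))
    if current:                      # trailing sentence with no blank line
        sentences.append(current)
    return sentences

sample = ["dogs\tNNS", "bark\tVBP", "", "cats\tNNS"]
sents = read_token_lines(sample)
```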
- educe.external.postag.token_spans(text, tokens, offset=0)¶
Given a string and a sequence of RawToken representing tokens in that string, infer the span for each token. Return the results as a sequence of Token objects.
We infer these spans by walking the text as we consume tokens, skipping over any whitespace in between. For this to work, the raw token text must be identical to the text modulo whitespace.
Spans are relative to the start of the string itself, but can be shifted by passing an offset (the start of the original string's span).
Parameters:
- text (str) – base text
- tokens (sequence of RawToken) – sequence of raw tokens in the text
- offset (int, defaults to 0) – offset for spans
Returns: res (list of Token) – sequence of proper educe Tokens with their spans
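The whitespace-skipping walk described above can be sketched in a self-contained way. The real token_spans() returns educe Token objects carrying Span instances; here spans are plain (start, end) pairs:

```python
# Sketch of the span-inference walk: for each token, skip whitespace
# in the text, then match the token's non-whitespace characters one
# by one, asserting that token text equals the text modulo whitespace.

def infer_spans(text, words, offset=0):
    spans = []
    pos = 0
    for word in words:
        while pos < len(text) and text[pos].isspace():
            pos += 1                   # skip inter-token whitespace
        start = pos
        for ch in word:
            if ch.isspace():
                continue               # whitespace inside tokens is skipped
            while text[pos].isspace():
                pos += 1
            assert text[pos] == ch, "token text must match modulo whitespace"
            pos += 1
        spans.append((start + offset, pos + offset))
    return spans

text = "Hello  world !"
spans = infer_spans(text, ["Hello", "world", "!"])
```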
educe.external.stanford_xml_reader module¶
Reader for Stanford CoreNLP pipeline outputs
Example of output:
<document>
<sentences>
<sentence id="1">
<tokens>
...
<token id="19">
<word>direction</word>
<lemma>direction</lemma>
<CharacterOffsetBegin>135</CharacterOffsetBegin>
<CharacterOffsetEnd>144</CharacterOffsetEnd>
<POS>NN</POS>
</token>
<token id="20">
<word>.</word>
<lemma>.</lemma>
<CharacterOffsetBegin>144</CharacterOffsetBegin>
<CharacterOffsetEnd>145</CharacterOffsetEnd>
<POS>.</POS>
</token>
...
<parse>(ROOT (S (PP (IN For) (NP (NP (DT a) (NN look)) (PP (IN at) (SBAR (WHNP (WP what)) (S (VP (MD might) (VP (VB lie) (ADVP (RB ahead)) (PP (IN for) (NP (NNP U.S.) (NNS forces)))))))))) (, ,) (VP (VB let) (S (NP (POS 's)) (VP (VB turn) (PP (TO to) (NP (NP (PRP$ our) (NNP Miles) (NNP O'Brien)) (PP (IN in) (NP (NNP Atlanta)))))))) (. .))) </parse>
<basic-dependencies>
<dep type="prep">
<governor idx="13">let</governor>
<dependent idx="1">For</dependent>
</dep>
...
</basic-dependencies>
<collapsed-dependencies>
<dep type="det">
<governor idx="3">look</governor>
<dependent idx="2">a</dependent>
</dep>
...
</collapsed-dependencies>
<collapsed-ccprocessed-dependencies>
<dep type="det">
<governor idx="3">look</governor>
<dependent idx="2">a</dependent>
</dep>
...
</collapsed-ccprocessed-dependencies>
</sentence>
</sentences>
</document>
IMPORTANT: Note that the Stanford pipeline uses RHS inclusive offsets.
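Token entries like those in the example above can be pulled out with the standard library; this is only a sketch of the parsing step, since PreprocessingSource also collects parses, dependencies, and coreference chains:

```python
# Extract <token> elements (word, lemma, character offsets, POS) from
# a CoreNLP XML document of the shape shown above.
import xml.etree.ElementTree as ET

SAMPLE = """
<document><sentences><sentence id="1"><tokens>
  <token id="19">
    <word>direction</word><lemma>direction</lemma>
    <CharacterOffsetBegin>135</CharacterOffsetBegin>
    <CharacterOffsetEnd>144</CharacterOffsetEnd>
    <POS>NN</POS>
  </token>
</tokens></sentence></sentences></document>
"""

def read_tokens(xml_text):
    root = ET.fromstring(xml_text)
    tokens = []
    for tok in root.iter('token'):
        tokens.append({
            'word': tok.findtext('word'),
            'lemma': tok.findtext('lemma'),
            'begin': int(tok.findtext('CharacterOffsetBegin')),
            'end': int(tok.findtext('CharacterOffsetEnd')),
            'pos': tok.findtext('POS'),
        })
    return tokens

toks = read_tokens(SAMPLE)
```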
- class educe.external.stanford_xml_reader.PreprocessingSource(encoding='utf-8')¶
Bases: object
Reads in document annotations produced by the CoreNLP pipeline.
This works as a stateful object that stores and provides access to all annotations contained in a CoreNLP output file, once the read method has been called.
- get_coref_chains()¶
Get all coreference chains.
- get_document_id()¶
Get the document id.
- get_offset2sentence_map()¶
Get the mapping from offsets to sentences.
- get_offset2token_maps()¶
Get the mapping from offsets to tokens.
- get_ordered_sentence_list(sort_attr='extent')¶
Get the list of sentences, ordered by sort_attr.
- get_ordered_token_list(sort_attr='extent')¶
Get the list of tokens, ordered by sort_attr.
- get_sentence_annotations()¶
Get the annotations of all sentences.
- get_token_annotations()¶
Get the annotations of all tokens.
- read(base_file, suffix='.raw.stanford')¶
Read and store the annotations from CoreNLP's output.
This function does not return anything; it modifies the state of the object to store the annotations.
- educe.external.stanford_xml_reader.test_file(base_filename, suffix='.raw.stanford')¶
Test that a file is readable and print its sentences.
- educe.external.stanford_xml_reader.xml_unescape(_str)¶
Get a proper string where special XML characters are unescaped.
Notes
You can also use xml.sax.saxutils.unescape.
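For the predefined XML entities, the standard library helper mentioned in the notes behaves like this:

```python
# Unescape the predefined XML entities with the standard library.
from xml.sax.saxutils import unescape

raw = "AT&amp;T says &lt;hello&gt;"
clean = unescape(raw)  # AT&T says <hello>
```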