educe.ptb package

Conventions specific to the Penn Treebank.

The PTB isn’t a discourse corpus as such, but a supplementary resource to be combined with the RST DT or the PDTB.

Submodules

educe.ptb.annotation module

Educe representation of Penn Tree Bank annotations.

We actually just use the token and constituency tree representations from educe.external.postag and educe.external.parse, but included here are tools that can also be used to align the PTB with other corpora based on the same text (e.g. the RST Discourse Treebank).

educe.ptb.annotation.PTB_TO_TEXT = {"''": '"', '``': '"', '-LSB-': '[', '-RRB-': ')', '-LCB-': '{', '-LRB-': '(', '-RSB-': ']', '-RCB-': '}'}

Straight substitutions you can use to replace some PTB-isms with their likely original text
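
As an illustrative sketch (the `detokenize` helper below is not part of educe), these substitutions can be applied token by token to recover something close to the original text:

```python
# The mapping documented above, applied as straight substitutions.
PTB_TO_TEXT = {
    "''": '"', '``': '"',
    '-LRB-': '(', '-RRB-': ')',
    '-LSB-': '[', '-RSB-': ']',
    '-LCB-': '{', '-RCB-': '}',
}

def detokenize(words):
    """Replace PTB-isms with their likely original characters."""
    return [PTB_TO_TEXT.get(w, w) for w in words]

print(detokenize(['``', 'Hello', "''", '-LRB-', 'world', '-RRB-']))
# ['"', 'Hello', '"', '(', 'world', ')']
```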

class educe.ptb.annotation.TweakedToken(word, tag, tweaked_word=None, prefix=None)

Bases: educe.external.postag.RawToken

A token with word, part of speech, plus a “tweaked word” (what the token should be treated as when aligning with the corpus), and an offset (some tokens should skip parts of the text)

This intermediary class should only be used within the educe library itself. The context is that we sometimes want to align PTB annotations (see educe.external.postag.generic_token_spans) against text which is almost but not quite identical to the text that PTB annotations seem to represent. For example, the source text might have sentences that end in abbreviations, like “He moved to the U.S.”, and the PTB might annotate an extra full stop after this as an end-of-sentence marker. To deal with these, we use wrapped tokens to allow for some manual substitutions:

  • you could “delete” a token by assigning it an empty tweaked word (it would then be assigned a zero-length span)
  • you could skip some part of the text by supplying a prefix (this expands the tweaked word, and introduces an offset which you can subsequently use to adjust the detected token span)
  • or you could just replace the token text outright

These tweaked tokens are only used to obtain a span within the text you are trying to align against; they can be subsequently discarded.
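
The “U.S.” example above can be made concrete with a small standalone stand-in for the class (the real TweakedToken subclasses educe.external.postag.RawToken; `TweakedTokenSketch` and `effective_word` below are illustrative names, not educe API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TweakedTokenSketch:
    """Standalone sketch of the TweakedToken idea."""
    word: str
    tag: str
    tweaked_word: Optional[str] = None
    prefix: Optional[str] = None

    @property
    def effective_word(self):
        # what an aligner should try to match against the source text
        w = self.word if self.tweaked_word is None else self.tweaked_word
        return (self.prefix or '') + w

# The PTB adds a sentence-final "." after "U.S." that is absent from the
# source text, so we "delete" it with an empty tweaked word:
toks = [TweakedTokenSketch('U.S.', 'NNP'),
        TweakedTokenSketch('.', '.', tweaked_word='')]
print([t.effective_word for t in toks])  # ['U.S.', '']
```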

educe.ptb.annotation.basic_category(label)

Get the basic syntactic category of a label.

This is done by truncating whatever comes after a (non-word-initial) occurrence of one of the label_annotation_introducing_characters().

educe.ptb.annotation.is_empty_category(postag)

True if postag is the empty category, i.e. -NONE- in the PTB.

educe.ptb.annotation.is_non_empty(tree)

Filter (return False for) nodes that cover a totally empty span.

educe.ptb.annotation.is_nonword_token(text)

True if the text appears to correspond to some kind of non-textual token, for example, *T*-1 for some kind of trace. These seem to only appear with tokens tagged -NONE-.
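
A rough standalone approximation of such a check (the regular expression below is an assumption covering common PTB null-element shapes like *T*-1, *U*, * and 0, not educe's actual pattern):

```python
import re

# Assumed pattern: "*", optionally a typed trace like *T* or *EXP*,
# optionally a numeric index like -1; or the bare null element "0".
_NONWORD_RE = re.compile(r'^(\*([A-Z?]+\*)?(-\d+)?|0)$')

def is_nonword_token(text):
    """True if text looks like a PTB non-textual (trace/null) token."""
    return bool(_NONWORD_RE.match(text))

print(is_nonword_token('*T*-1'))  # True
print(is_nonword_token('U.S.'))   # False
```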

educe.ptb.annotation.post_basic_category_index(label)

Get the index of the first char after the basic label.

This should never match the first char of the label; if the first char is such a char, then a matched char is also not used iff there is something in between, e.g. (-LRB- => -LRB-) but (--PU => -).
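
The rule can be sketched in a hypothetical re-implementation; the set of annotation-introducing characters below ('-' and '=') is an assumption about what label_annotation_introducing_characters() returns:

```python
ANNOTATION_CHARS = '-='  # assumed set of annotation-introducing chars

def post_basic_category_index(label):
    """Index of the first char after the basic label (sketch)."""
    # a word-initial annotation char never counts as a match ...
    skip_next_match = bool(label) and label[0] in ANNOTATION_CHARS
    content_seen = False
    for i, c in enumerate(label[1:], start=1):
        if c in ANNOTATION_CHARS:
            # ... and neither does the next match, iff there is
            # something in between (e.g. -LRB- stays whole)
            if skip_next_match and content_seen:
                skip_next_match = False
                content_seen = False
                continue
            return i
        content_seen = True
    return len(label)

def basic_category(label):
    """Truncate the label at the first annotation-introducing char."""
    return label[:post_basic_category_index(label)]

print(basic_category('NP-SBJ'))  # 'NP'
print(basic_category('-LRB-'))   # '-LRB-'
print(basic_category('--PU'))    # '-'
```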

educe.ptb.annotation.prune_tree(tree, filter_func)

Prune a tree by applying filter_func recursively.

All children of filtered nodes are pruned as well. Nodes whose children have all been pruned are pruned too.

The filter function must be applicable to Tree but also non-Tree, as are leaves in an NLTK Tree.
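
The recursion can be sketched on a minimal tree type (the real function operates on nltk.tree.Tree, whose leaves are plain tokens, hence the filter must accept non-Tree values; the `Tree` class here is just a stand-in):

```python
class Tree:
    """Minimal stand-in for nltk.tree.Tree."""
    def __init__(self, label, children):
        self.label = label
        self.children = children
    def __repr__(self):
        return '(%s %s)' % (self.label, ' '.join(map(repr, self.children)))

def prune_tree(tree, filter_func):
    """Keep only nodes (and leaves) for which filter_func is True."""
    if not filter_func(tree):
        return None
    if not isinstance(tree, Tree):
        return tree  # leaf: kept as-is
    kids = [k for k in (prune_tree(c, filter_func) for c in tree.children)
            if k is not None]
    if not kids:
        return None  # all children pruned => prune this node too
    return Tree(tree.label, kids)

# drop the empty-category subtree (-NONE- *T*-1)
t = Tree('S', [Tree('NP', ['she']),
               Tree('-NONE-', ['*T*-1'])])
keep = lambda n: not (isinstance(n, Tree) and n.label == '-NONE-')
print(prune_tree(t, keep))
```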

educe.ptb.annotation.strip_punctuation(tokens)

Strip leading and trailing punctuation from a sequence of tokens.

Parameters:
  • tokens (list of Token) – Sequence of tokens.
Returns:
  tokens_strip – Corresponding list of tokens with no leading or trailing punctuation.
Return type:
  list of Token

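
A sketch of the behaviour, representing tokens as (word, tag) pairs and treating a small assumed set of PTB punctuation tags as punctuation (educe's actual criterion may differ):

```python
# Assumed punctuation tags; the PTB tags punctuation with symbols
# such as . , : `` '' -LRB- -RRB-
PUNC_TAGS = {'.', ',', ':', '``', "''", '-LRB-', '-RRB-'}

def strip_punctuation(tokens):
    """Drop leading and trailing punctuation (word, tag) pairs,
    keeping any punctuation in the middle of the sequence."""
    start, end = 0, len(tokens)
    while start < end and tokens[start][1] in PUNC_TAGS:
        start += 1
    while end > start and tokens[end - 1][1] in PUNC_TAGS:
        end -= 1
    return tokens[start:end]

toks = [('``', '``'), ('Yes', 'UH'), (',', ','), ('sir', 'NN'), ('.', '.')]
print(strip_punctuation(toks))
# [('Yes', 'UH'), (',', ','), ('sir', 'NN')]
```

Note that the internal comma survives; only the edges of the sequence are trimmed.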
educe.ptb.annotation.strip_subcategory(tree, retain_TMP_subcategories=False, retain_NPTMP_subcategories=False)

Transform tree to strip additional label annotation at each node.

educe.ptb.annotation.syntactic_node_seq(ptree, tokens)

Find the sequence of syntactic nodes covering a sequence of tokens.

Parameters:
  • ptree (nltk.tree.Tree) – Syntactic tree.
  • tokens (sequence of Token) – Sequence of tokens under scrutiny.
Returns:
  syn_nodes – Spanning sequence of nodes of the syntactic tree.
Return type:
  list of nltk.tree.Tree

educe.ptb.annotation.transform_tree(tree, transformer)

Transform a tree by applying a transformer at each level.

The tree is traversed depth-first, left-to-right, and the transformer is applied at each node.
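
A sketch of such a traversal, using nested lists [label, child, ...] as a stand-in for nltk.Tree, with a crude transformer in the spirit of strip_subcategory (both helpers are illustrative, not educe API):

```python
def transform_tree(tree, transformer):
    """Apply transformer at every node, children first (depth-first,
    left-to-right); leaves are returned unchanged."""
    if not isinstance(tree, list):
        return tree  # leaf
    new_kids = [transform_tree(c, transformer) for c in tree[1:]]
    return transformer([tree[0]] + new_kids)

def strip_label(node):
    """Crude subcategory stripping: NP-SBJ -> NP."""
    node[0] = node[0].split('-')[0] or node[0]
    return node

t = ['S', ['NP-SBJ', 'she'], ['VP', 'left']]
print(transform_tree(t, strip_label))
# ['S', ['NP', 'she'], ['VP', 'left']]
```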

educe.ptb.head_finder module

This submodule provides several functions that find heads in trees.

It uses head rules as described in (Collins 1999), Appendix A. See http://www.cs.columbia.edu/~mcollins/papers/heads, Bikel’s 2004 CL paper on the intricacies of Collins’ parser, and the classes in Stanford CoreNLP that inherit from AbstractCollinsHeadFinder.java.

educe.ptb.head_finder.find_edu_head(tree, hwords, wanted)

Find the head word of a set of wanted nodes from a tree.

The tree is traversed top-down, breadth first, until we reach a node headed by a word from wanted.

Return a pair of treepositions (head node, head word), or None if no occurrence of any word in wanted was found.

This function is typically called for each EDU, wanted being the set of tree positions of its tokens, after find_lexical_heads has been called on the entire tree (providing hwords).

Parameters:
  • tree (nltk.Tree with educe.external.postag.RawToken leaves) – PTB tree whose lexical heads we want.
  • hwords (dict(tuple(int), tuple(int))) – Map from each node of the constituency tree to its lexical head. Both nodes are designated by their (NLTK) tree position (a.k.a. Gorn address).
  • wanted (iterable of tuple(int)) – The tree positions of the tokens in the span of interest, e.g. in the EDU we are looking at.
Returns:
  • cur_treepos (tuple(int)) – Tree position of the head node, i.e. the highest node headed by a word from wanted.
  • cur_hw (tuple(int)) – Tree position of the head word.
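
The breadth-first search can be sketched without nltk by representing the tree as a map from each tree position to its children's positions (`children` and `root` below are stand-ins for walking an actual nltk.Tree):

```python
from collections import deque

def find_edu_head(children, hwords, wanted, root=()):
    """Top-down, breadth-first search for the highest node whose
    lexical head (per hwords) is one of the wanted tree positions.
    Returns (head node position, head word position), or None."""
    wanted = set(wanted)
    queue = deque([root])
    while queue:
        pos = queue.popleft()
        hw = hwords.get(pos)
        if hw in wanted:
            return pos, hw
        queue.extend(children.get(pos, []))
    return None

# (S (NP she) (VP left)): the root is headed by "left" at (1, 0),
# NP by "she" at (0, 0); the EDU covers only "she".
children = {(): [(0,), (1,)]}
hwords = {(): (1, 0), (0,): (0, 0), (1,): (1, 0)}
print(find_edu_head(children, hwords, {(0, 0)}))  # ((0,), (0, 0))
```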

educe.ptb.head_finder.find_lexical_heads(tree)

Find the lexical head at each node of a constituency tree.

The logic corresponds to Collins’ head finding rules.

This is typically used to find the lexical head of each node of a (clean) educe.external.parser.ConstituencyTree whose leaves are educe.external.postag.Token.

Parameters:
  • tree (nltk.Tree with educe.external.postag.RawToken leaves) – PTB tree whose lexical heads we want.
Returns:
  head_word – Map from each node of the constituency tree to its lexical head. Both nodes are designated by their (NLTK) tree position (a.k.a. Gorn address).
Return type:
  dict(tuple(int), tuple(int))
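
A toy head-percolation pass in the spirit of Collins’ rules can illustrate the shape of the result; the rule table below is a tiny assumed fragment (not the full Appendix A table), and nested lists with (word, tag) leaves stand in for the real tree type:

```python
# Assumed fragment of a Collins-style rule table:
# parent label -> (scan direction, priority list of child labels/tags)
HEAD_RULES = {
    'S':  ('right', ['VP', 'S']),
    'VP': ('left',  ['VBD', 'VB', 'VP']),
    'NP': ('right', ['NN', 'NNS', 'NP']),
}

def _label(node):
    # constituent label for subtrees, POS tag for (word, tag) leaves
    return node[0] if isinstance(node, list) else node[1]

def find_lexical_heads(tree, pos=()):
    """Map each tree position to the tree position of its head leaf."""
    heads = {}
    if not isinstance(tree, list):  # leaf token: its own head
        heads[pos] = pos
        return heads
    kid_positions = []
    for i, child in enumerate(tree[1:]):
        cpos = pos + (i,)
        kid_positions.append((cpos, child))
        heads.update(find_lexical_heads(child, cpos))
    direction, priorities = HEAD_RULES.get(tree[0], ('left', []))
    order = kid_positions if direction == 'left' else kid_positions[::-1]
    head_pos = order[0][0]  # fallback: first child in scan direction
    for wanted in priorities:
        found = next((cp for cp, c in order if _label(c) == wanted), None)
        if found is not None:
            head_pos = found
            break
    heads[pos] = heads[head_pos]  # percolate the child's head leaf up
    return heads

t = ['S', ['NP', ('she', 'PRP')], ['VP', ('left', 'VBD')]]
print(find_lexical_heads(t)[()])  # (1, 0): the root is headed by "left"
```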
educe.ptb.head_finder.load_head_rules(f)

Load the head rules from file f.

Return a dictionary from parent non-terminal to (direction, priority list).