educe.rst_dt package¶

Conventions specific to the RST discourse treebank project

Subpackages¶

Submodules¶

educe.rst_dt.annotation module¶

Educe-style representation for RST discourse treebank trees

class educe.rst_dt.annotation.EDU(num, span, text, context=None, origin=None)¶

Bases: educe.annotation.Standoff

An RST leaf node

context = None¶: See the RSTContext object

identifier()¶: A global identifier (assuming the origin can be used to uniquely identify an RST tree)

is_left_padding()¶: Returns True for left padding EDUs

classmethod left_padding(context=None, origin=None)¶: Return a left padding EDU

num = None¶: EDU number (as used in tree node edu_span)

raw_text = None¶

Text that was in the EDU annotation itself.

This is not the same as the text that was in the annotated document, on which all standoff annotations and spans are based.

set_context(context)¶: Update the context of this annotation.

set_origin(origin)¶

Update the origin of this annotation and any contained within

Parameters:	origin (FileId) – File identifier of the origin of this annotation.

span = None¶: text span

text()¶: Return the text associated with this EDU. We try to return the underlying annotated text if we have the necessary context; if not, we just fall back to the raw EDU text.

class educe.rst_dt.annotation.Node(nuclearity, edu_span, span, rel, context=None)¶

Bases: object

A node in an RSTTree or SimpleRSTTree.

context = None¶: See the RSTContext object

edu_span = None¶: pair of integers denoting edu span by count

is_nucleus()¶: A node can either be a nucleus, a satellite, or a root node. It may be easier to work with SimpleRSTTree, in which nodes can only be nucleus, satellite or, much more rarely, root.

is_satellite()¶: A node can either be a nucleus, a satellite, or a root node.

nuclearity = None¶: one of Nucleus, Satellite, Root

rel = None¶: relation label (see SimpleRSTTree for a note on the different interpretation of rel with this and RSTTree)

span = None¶: span

class educe.rst_dt.annotation.RSTContext(text, sentences, paragraphs)¶

Bases: object

Additional annotations or contextual information that could accompany a RST tree proper. The idea is to have each subtree pointing back to the same context object for easy retrieval.

paragraphs = None¶: Paragraph annotations pointing back to the text

sentences = None¶: sentence annotations pointing back to the text

text(span=None)¶: Return the text associated with these annotations (or None), optionally limited to a span

class educe.rst_dt.annotation.RSTTree(node, children, origin=None, verbose=False)¶

Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff

Representation of RST trees which sticks fairly closely to the raw RST discourse treebank one.

edu_span()¶: Return the span of the tree in terms of EDU count. Note that self.span refers instead to the character offsets.

get_spans(subtree_filter=None, exclude_root=False, span_type='edus')¶

Get the spans of a constituency tree.

Each span is described by a triplet (edu_span, nuclearity, relation).

Parameters:

subtree_filter (function, defaults to None) – Function to filter all local trees.
exclude_root (boolean, defaults to False) – If True, exclude the span of the root node. This cannot be expressed with subtree_filter because the latter is limited to properties local to each subtree in isolation. Or maybe I just missed something.
span_type (one of {'edus', 'chars'}) – Whether each span is expressed on EDU or character indices. Character indices are useful to compare spans from trees whose EDU segmentation differs.

Returns:	spans – List of tuples, each describing a span with a tuple ((edu_start, edu_end), nuclearity, relation).
Return type:	list of tuple((int, int), str, str)

set_origin(origin)¶

Update the origin of this annotation and any contained within

Parameters:	origin (FileId) – File identifier of the origin of this annotation.

text()¶: Return the text corresponding to this RST subtree. If the context is set, we return the appropriate segment from the subset of the text. If not we just concatenate the raw text of all EDU leaves.

text_span()¶

to_pdf(filename)¶: Image representation in PDF.

to_ps(filename)¶

Export as a PostScript image.

This function is used by _repr_png_.

exception educe.rst_dt.annotation.RSTTreeException(msg)¶

Bases: exceptions.Exception

Exceptions related to RST trees not looking like we would expect them to

class educe.rst_dt.annotation.SimpleRSTTree(node, children, origin=None)¶

Bases: educe.external.parser.SearchableTree, educe.annotation.Standoff

Possibly easier representation of RST trees to work with:

binary
relation labels on parent nodes instead of children

Note that RSTTree and SimpleRSTTree share the same Node type but because of the subtle difference in interpretation you should be extremely careful not to mix and match.

classmethod from_rst_tree(tree)¶: Build and return a SimpleRSTTree from an RSTTree

get_spans(subtree_filter=None, exclude_root=False, span_type='edus')¶

Get the spans of a constituency tree.

Each span is described by a triplet (edu_span, nuclearity, relation).

Parameters:

subtree_filter (function, defaults to None) – Function to filter all local trees.
exclude_root (boolean, defaults to False) – If True, exclude the span of the root node. This cannot be expressed with subtree_filter because the latter is limited to properties local to each subtree in isolation. Or maybe I just missed something.
span_type (one of {'edus', 'chars'}) – Whether each span is expressed on EDU or character indices. Character indices are useful to compare spans from trees whose EDU segmentation differs.

Returns:	spans – List of tuples, each describing a span with a tuple ((edu_start, edu_end), nuclearity, relation).
Return type:	list of tuple((int, int), str, str)

classmethod incorporate_nuclearity_into_label(tree)¶

Integrate nuclearity of the children into each node’s label.

Nuclearity of the children is incorporated in one of two forms, NN for multi- and NS for mono-nuclear relations.

Parameters:	tree (SimpleRSTTree) – The tree of which we want a version with nuclearity incorporated
Returns:	mod_tree – The same tree but with the type of nuclearity incorporated
Return type:	SimpleRSTTree

Note

This is probably not the best way to provide this functionality. In other words, refactoring is much needed here.
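The labelling rule described above can be sketched on plain strings. This is only an illustration, not the method's actual code: the helper name label_with_nuclearity and its list-of-strings input are assumptions, while the NN/NS suffixes and the hyphenated form (as in elaboration-NS) follow the descriptions on this page.

```python
def label_with_nuclearity(rel, child_nucs):
    """Append NN or NS to a relation label (hypothetical helper).

    child_nucs lists the nuclearity of each child, using the values
    "Nucleus" and "Satellite".
    """
    all_nuclei = all(nuc == "Nucleus" for nuc in child_nucs)
    # NN for multi-nuclear relations, NS for mono-nuclear ones
    suffix = "NN" if all_nuclei else "NS"
    return "{}-{}".format(rel, suffix)


print(label_with_nuclearity("elaboration", ["Nucleus", "Satellite"]))
print(label_with_nuclearity("list", ["Nucleus", "Nucleus"]))
```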

set_origin(origin)¶

Recursively update the origin for this annotation, i.e. a little link to the document metadata for this annotation.

Parameters:	origin (FileId) – File identifier of the origin of this annotation.

text_span()¶

classmethod to_binary_rst_tree(tree, rel='---', nuc='Root')¶

Build and return a binary RSTTree from a SimpleRSTTree.

This function is recursive: it essentially pushes the relation label from the parent to the satellite child (for mononuclear relations) or to all nucleus children (for multinuclear relations).

Parameters:

tree (SimpleRSTTree) – SimpleRSTTree to convert
rel (string, optional) – Relation for the root node of the output
nuc (string, optional) – Nuclearity for the root node of the output

Returns:	rtree – The (binary) RSTTree that corresponds to the given SimpleRSTTree
Return type:	RSTTree

educe.rst_dt.annotation.is_binary(tree)¶: True if the given RST tree or SimpleRSTTree is indeed binary

educe.rst_dt.corpus module¶

Corpus management (re-exported by educe.rst_dt)

class educe.rst_dt.corpus.Reader(corpusdir)¶

Bases: educe.corpus.Reader

See educe.corpus.Reader for details

files(doc_glob=None)¶

Parameters: doc_glob (str, optional) – Glob for document names, i.e. file basenames. A common pattern is doc_glob='wsj_*', which excludes documents whose file basenames are of the form fileX. fileX documents are damaged compared to wsj_XX documents: their text does not match that of the corresponding document in the PTB, and their text formatting is scrambled. For example, the figures reported by Li et al. (2014) indicate that they only consider wsj_XX files.
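The effect of such a glob can be illustrated with the standard fnmatch module. This is a sketch of the filtering idea only: the basenames below are made up for the example, and it makes no claim about how Reader.files applies the glob internally.

```python
from fnmatch import fnmatch

# Hypothetical document basenames, for illustration only.
basenames = ["wsj_0601", "wsj_1100", "file1", "file5"]

# Keep only the documents whose basename matches the glob.
doc_glob = "wsj_*"
selected = [name for name in basenames if fnmatch(name, doc_glob)]
print(selected)
```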

slurp_subcorpus(cfiles, verbose=False)¶: See educe.rst_dt.parse for a description of RSTTree

class educe.rst_dt.corpus.RstDtParser(corpus_dir, args, coarse_rels=False, fix_pseudo_rels=False, nary_enc='chain', nuc_in_label=False, exclude_file_docs=False)¶

Bases: object

Fake parser that gets annotation from the RST-DT.

Parameters:

corpus_dir (string) – TODO
args (TODO) – TODO
coarse_rels (boolean, optional) – If True, relation labels are converted to their coarse-grained equivalent.
nary_enc (string, optional) – Conversion method from constituency to dependency tree, for n-ary spans, n > 2, whose kids are all nuclei: ‘tree’ picks the leftmost nucleus as the head of all the others (effectively a tree), ‘chain’ attaches each nucleus to its predecessor (effectively a chain).
nuc_in_label (boolean, optional) – If True, incorporate nuclearity into the label (e.g. elaboration-NS); currently BROKEN (defined on SimpleRSTTree only).
exclude_file_docs (boolean, default False) – If True, ignore fileX files.
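The difference between the two nary_enc values can be sketched on a list of sibling nuclei. The function below is a hypothetical illustration, not part of educe: it takes EDU numbers and returns (head, dependent) pairs following the two descriptions given for 'tree' and 'chain'.

```python
def encode_nuclei(nuclei, nary_enc="chain"):
    """Return (head, dependent) pairs for sibling nuclei (hypothetical helper)."""
    if nary_enc == "tree":
        # the leftmost nucleus is the head of all the others
        return [(nuclei[0], dep) for dep in nuclei[1:]]
    if nary_enc == "chain":
        # each nucleus is attached to its predecessor
        return list(zip(nuclei[:-1], nuclei[1:]))
    raise ValueError("nary_enc must be 'tree' or 'chain'")


print(encode_nuclei([1, 2, 3, 4], "tree"))
print(encode_nuclei([1, 2, 3, 4], "chain"))
```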

decode(doc_key)¶

Decode a document from the RST-DT (gold)

Parameters:	doc_key (string ?) – Identifier (in corpus) of the document we want to decode.
Returns:	doc – Bunch of information about this document notably its list of EDUs and the structures defined on them: RSTTree, SimpleRSTTree, RstDepTree.
Return type:	DocumentPlus

parse(doc)¶: Parse the document using the RST-DT (gold).

segment(doc)¶: Segment the document into EDUs using the RST-DT (gold).

class educe.rst_dt.corpus.RstRelationConverter(relmap_file)¶

Bases: object

Converter for RST relations (labels)

Known to work on RSTTree, possibly SimpleRSTTree (untested).

convert_dtree(dtree)¶

Change relation labels in an RstDepTree using the label mapping.

See attribute self.convert_label.

Parameters:	dtree (RstDepTree) – RST dtree
Returns:	dtree – RST dtree with mapped labels.
Return type:	RstDepTree

convert_label(label)¶: Convert a label following the mapping, lowercased otherwise

convert_tree(rst_tree)¶: Change relation labels in rst_tree using the mapping

educe.rst_dt.corpus.id_to_path(k)¶

Given a fleshed out FileId (none of the fields are None), return a filepath for it following RST Discourse Treebank conventions.

You will likely want to add your own filename extensions to this path

educe.rst_dt.corpus.mk_key(doc)¶: Return a corpus key for a given document name

educe.rst_dt.deptree module¶

Convert RST trees to dependency trees and back.

class educe.rst_dt.deptree.RstDepTree(edus=[], origin=None, nary_enc='chain')¶

Bases: object

RST dependency tree

edus¶: list of EDU – List of the EDUs of this document.

origin¶: Document?, optional – TODO

nary_enc¶: one of {‘chain’, ‘tree’}, optional – Type of encoding used for n-ary relations: ‘chain’ or ‘tree’. This determines for example how fragmented EDUs are resolved.

add_dependencies(gov_num, dep_nums, labels=None, nucs=None, rank=None)¶

Add a set of dependencies with a unique governor and rank.

Parameters:

gov_num (int) – Number of the head EDU
dep_nums (list of int) – Numbers of the modifier EDUs
labels (list of string, optional) – Labels of the dependencies
nucs (list of string, one of [NUC_S, NUC_N]) – Nuclearity of the modifiers
rank (integer, optional) – Rank of the modifiers in the order of attachment to the head. None means it is not given declaratively and it is instead inferred from the rank of modifiers previously attached to the head.

add_dependency(gov_num, dep_num, label=None, nuc='Satellite', rank=None)¶

Add a dependency between two EDUs.

Parameters:

gov_num (int) – Number of the head EDU
dep_num (int) – Number of the modifier EDU
label (string, optional) – Label of the dependency
nuc (string, one of [NUC_S, NUC_N]) – Nuclearity of the modifier
rank (integer, optional) – Rank of the modifier in the order of attachment to the head. None means it is not given declaratively and it is instead inferred from the rank of modifiers previously attached to the head.

append_edu(edu)¶: Append an EDU to the list of EDUs

deps(gov_idx)¶: Get the ordered list of dependents of an EDU

fragmented_edus()¶

Get the fragmented EDUs in this RST tree.

Fragmented EDUs are made of two or more EDUs linked by “same-unit” relations.

Returns:	frag_edus – Each fragmented EDU is given as a tuple of the indices of the fragments.
Return type:	list of tuple of int

classmethod from_rst_tree(rtree, nary_enc='tree')¶

Converts an RSTTree to an RstDepTree.

Parameters:	nary_enc (one of {'chain', 'tree'}) – If ‘chain’, the given RSTTree is binarized first.

classmethod from_simple_rst_tree(rtree)¶: Converts a SimpleRSTTree to an RstDepTree

get_dependencies(lbl_type='rel')¶

Get the list of dependencies in this dependency tree.

Each dependency is a triple (gov, dep, label), gov and dep being EDUs.

Parameters:	lbl_type (one of {'rel', 'rel+nuc'} (TODO 'rel+nuc+rnk'?)) – Type of the labels.

real_roots_idx()¶: Get the list of the indices of the real roots

set_origin(origin)¶

Update the origin of this annotation.

Parameters:	origin (FileId) – File identifier of the origin of this annotation.

set_root(root_num)¶: Designate an EDU as a real root of the RST tree structure

spans()¶

For each EDU, get the tree span it dominates (on EDUs).

Dominance here is recursively defined.

Returns:

span_beg (array of int) – Index of the leftmost EDU dominated by an EDU.
span_end (array of int) – Index of the rightmost EDU dominated by an EDU.
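Recursive dominance can be sketched on a bare array of heads. The representation below is an assumption made for the example (heads[i] is the index of the governor of node i, None for the root, and the structure is acyclic), and the function returns plain lists rather than arrays; it is not the method's actual code.

```python
def dominated_spans(heads):
    """Compute, for each node, the leftmost and rightmost node it dominates.

    heads[i] is the index of the governor of node i, or None for the root
    (hypothetical input format; the tree is assumed to be acyclic).
    """
    n = len(heads)
    # every node dominates at least itself
    span_beg = list(range(n))
    span_end = list(range(n))
    for i in range(n):
        # widen the span of every ancestor of i so that it covers i
        anc = heads[i]
        while anc is not None:
            span_beg[anc] = min(span_beg[anc], i)
            span_end[anc] = max(span_end[anc], i)
            anc = heads[anc]
    return span_beg, span_end


# node 0 is the root; nodes 1 and 3 depend on 2; nodes 2 and 4 depend on 0
beg, end = dominated_spans([None, 2, 0, 2, 0])
print(beg, end)
```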

exception educe.rst_dt.deptree.RstDtException(msg)¶

Bases: exceptions.Exception

Exceptions related to conversion between RST and DT trees. The general expectation is that we only raise these on bad input, but in practice, you may see them more in cases of implementation error somewhere in the conversion process.

educe.rst_dt.deptree.binary_to_nary(nary_enc, pairs)¶

Retrieve nary relations from a set of binary relations.

Parameters:

nary_enc (one of {"chain", "tree"}) – Encoding from n-ary to binary relations.
pairs (iterable of pairs of identifier (ex: integer, string...)) – Binary relations.

Returns:	nary_rels – Nary relations.
Return type:	list of tuples of identifiers
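One way to picture this regrouping is sketched below. It is a hypothetical reimplementation under simplifying assumptions (pairs are (head, dependent) tuples, and in a chain each head has a single dependent); the actual function may order and validate its output differently.

```python
def group_binary_pairs(nary_enc, pairs):
    """Regroup (head, dependent) pairs into n-ary tuples (hypothetical helper)."""
    pairs = sorted(pairs)
    if nary_enc == "tree":
        # all the dependents of a given head belong to the same relation
        by_head = {}
        for gov, dep in pairs:
            by_head.setdefault(gov, [gov]).append(dep)
        return [tuple(members) for members in by_head.values()]
    if nary_enc == "chain":
        # follow the links head -> dependent until each chain ends
        successor = dict(pairs)
        dependents = set(successor.values())
        groups = []
        for start in successor:
            if start in dependents:
                continue  # not the first element of a chain
            chain = [start]
            while chain[-1] in successor:
                chain.append(successor[chain[-1]])
            groups.append(tuple(chain))
        return groups
    raise ValueError("nary_enc must be 'chain' or 'tree'")


print(group_binary_pairs("chain", [(1, 2), (2, 3), (5, 6)]))
print(group_binary_pairs("tree", [(1, 2), (1, 3), (5, 6)]))
```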

educe.rst_dt.document_plus module¶

This submodule implements a document with additional information.

class educe.rst_dt.document_plus.DocumentPlus(key, grouping, rst_context)¶

Bases: object

A document and relevant contextual information

align_with_doc_structure()¶

Align EDUs with the document structure (paragraph and sentence).

Determine which paragraph and sentence (if any) surrounds this EDU. Try to accommodate the occasional off-by-a-smidgen error by folks marking these EDU boundaries, e.g. original text:

Para1: “Magazines are not providing us in-depth information on circulation,” said Edgar Bronfman Jr., .. “How do readers feel about the magazine?... Research doesn’t tell us whether people actually do read the magazines they subscribe to.”

Para2: Reuben Mark, chief executive of Colgate-Palmolive, said...

Marked up EDU is wide to the left by three characters: “

Reuben Mark, chief executive of Colgate-Palmolive, said...

align_with_raw_words()¶

Compute for each EDU the raw tokens it contains

This is a dirty temporary hack to enable backwards compatibility. There should be one clean text per document, one tokenization and so on, but, well.

align_with_tokens(verbose=False)¶: Compute for each EDU the overlapping tokens.

align_with_trees(strict=False)¶: Compute for each EDU the overlapping trees

all_edu_pairs(ordered=True)¶

Generate all EDU pairs of a document.

Parameters:	ordered (boolean, defaults to True) – If True, generate all ordered pairs of EDUs, otherwise (half as many) unordered pairs.
Returns:	all_pairs – All pairs of EDUs in this document.
Return type:	[(EDU, EDU)]
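The "half as many" relationship between the two modes can be checked with itertools. This sketch assumes pairs of distinct EDUs and uses placeholder strings instead of EDU objects; whether the method itself includes any padding EDU is not covered here.

```python
from itertools import combinations, permutations

edus = ["e1", "e2", "e3"]  # placeholder EDUs, for illustration only

# ordered=True: (a, b) and (b, a) are both generated
ordered_pairs = list(permutations(edus, 2))
# ordered=False: each pair is generated once
unordered_pairs = list(combinations(edus, 2))
print(len(ordered_pairs), len(unordered_pairs))
```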

relations(du_pairs, lbl_type='rel', ordered=True)¶

Get the relation that holds in each of the DU pairs.

As of 2016-09-30, this function has a unique caller: doc_vectorizer.DocumentLabelExtractor._extract_labels().

Parameters:

du_pairs ([(DU, DU)]) – List of DU pairs.
lbl_type (one of {'rel', 'rel+nuc'}) – Type of label.
ordered (boolean, defaults to True) – If True, du_pairs are considered ordered, otherwise the label of either (edu1, edu2) or (edu2, edu1) is returned (if not None).

Returns:	erels – Relation for each pair of DUs.
Return type:	`list` of `str`

same_unit_candidates()¶

Generate all EDU pairs that could be a same-unit.

We use the following filters:

right-attachment: i < j,
same sentence: edu2sent[i] == edu2sent[j],
len > 1: i + 1 < j
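These three filters can be sketched as a generator over EDU indices. The function name and the bare edu2sent list are assumptions made for the example; the method itself works on the document's own EDUs.

```python
def same_unit_pairs(edu2sent):
    """Yield index pairs (i, j) that pass the three filters (hypothetical helper).

    edu2sent[i] is the index of the sentence that contains EDU i.
    """
    n = len(edu2sent)
    for i in range(n):
        # starting at i + 2 enforces both i < j and i + 1 < j
        for j in range(i + 2, n):
            if edu2sent[i] == edu2sent[j]:  # same sentence
                yield (i, j)


# three EDUs in sentence 0, then four EDUs in sentence 1
print(list(same_unit_pairs([0, 0, 0, 1, 1, 1, 1])))
```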

set_syn_ctrees(tkd_trees, lex_heads=None)¶

Set syntactic constituency trees for this document.

Parameters:

tkd_trees (list of nltk.tree.Tree) – Syntactic constituency trees for this document.
lex_heads (list of (TODO: see find_lexical_heads), optional) – List of lexical heads for each node of each tree.

set_tokens(tokens)¶

Set tokens for this document.

Parameters:	tokens (list of Token) – List of tokens for this document.

educe.rst_dt.document_plus.align_edus_with_paragraphs(doc_edus, doc_paras, text, strict=False)¶

Align EDUs with paragraphs, if any.

Parameters:	doc_edus – doc_paras – strict –
Returns:	edu2para – Map each EDU to the index of its enclosing paragraph. If an EDU is not properly enclosed in a paragraph, the associated index is None. For files with no paragraph marking (e.g. fileX files), returns None.
Return type:	list(int or None) or None

educe.rst_dt.document_plus.containing(span)¶

span -> anno -> bool

if this annotation encloses the given span
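The curried shape span -> anno -> bool can be sketched with a closure. The version below is an illustration under simplifying assumptions: spans and annotations are both reduced to (start, end) tuples, and the name make_containing is hypothetical.

```python
def make_containing(span):
    """Return a predicate that tests whether an annotation encloses span."""
    start, end = span

    def encloses(anno_span):
        anno_start, anno_end = anno_span
        return anno_start <= start and end <= anno_end

    return encloses


# typical use: keep the annotations that enclose a given span
paragraph_spans = [(0, 40), (40, 95), (95, 120)]
print(list(filter(make_containing((50, 60)), paragraph_spans)))
```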

educe.rst_dt.graph module¶

Converter from RST Discourse Treebank trees to educe-style hypergraphs

class educe.rst_dt.graph.DotGraph(anno_graph)¶

Bases: educe.graph.DotGraph

A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here

class educe.rst_dt.graph.Graph¶

Bases: educe.graph.Graph

classmethod from_doc(corpus, doc_key)¶

educe.rst_dt.parse module¶

From RST discourse treebank trees to Educe-style objects (reading the format from Di Eugenio’s corpus of instructional texts).

The main classes of interest are RSTTree and EDU. RSTTree can be treated as an NLTK Tree structure. It is also an educe Standoff object, which means that it points to other RST trees (their children) or to EDU.

educe.rst_dt.parse.parse_lightweight_tree(tstr)¶

Parse lightweight RST debug syntax into SimpleRSTTree, eg.

(R:attribution
   (N:elaboration (N foo) (S bar))
   (S quux))

This is mostly useful for debugging or for knocking out quick examples.

educe.rst_dt.parse.parse_rst_dt_tree(tstr, context=None)¶: Read a single RST tree from its RST DT string representation. If context is set, align the tree with it. You should really try to pass in a context if you can (see RSTContext); the None case is really intended for testing, or for cases where you don’t have an original text.

educe.rst_dt.parse.read_annotation_file(anno_filename, text_filename)¶: Read a single RST tree

educe.rst_dt.ptb module¶

Alignment of the RST-WSJ-corpus with the Penn Treebank

class educe.rst_dt.ptb.PtbParser(corpus_dir)¶

Bases: object

Gold parser that gets annotations from the PTB.

It uses an instantiated NLTK BracketedParseCorpusReader for the PTB section relevant to the RST DT corpus.

Note that the path you give to this will probably end with something like parsed/mrg/wsj

parse(doc)¶

Parse a document, using the gold PTB annotation.

Given a document, return a list of educified PTB parse trees (one per sentence).

These are almost the same as the trees that would be returned by the parsed_sents method, except that each leaf/node is associated with a span within the RST DT text.

Note: does nothing if there is no associated PTB corpus entry.

Parameters:	doc (DocumentPlus) – Rich representation of the document.
Returns:	doc – Rich representation of the document, with syntactic constituency trees.
Return type:	DocumentPlus

tokenize(doc)¶

Tokenize the document text using the PTB gold annotation.

Parameters:	doc (DocumentPlus) – Rich representation of the document.
Returns:	doc – Rich representation of the document, with tokenization.
Return type:	DocumentPlus

educe.rst_dt.ptb.align_edus_with_sentences(edus, syn_trees, strict=False)¶

Map each EDU to its sentence.

If an EDU span overlaps with more than one sentence span, the sentence with maximal overlap is chosen.

Parameters:

edus (list(EDU)) – List of EDUs.
syn_trees (list(Tree)) – List of syntactic trees, one per sentence.
strict (boolean, default False) – If True, raise an error if an EDU does not map to exactly one sentence.

Returns:	edu2sent – Map from EDU to (0-based) sentence index or None.
Return type:	list(int or None)
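The maximal-overlap rule can be sketched on character spans. This is an illustration only: spans are plain (start, end) tuples rather than EDU and Tree objects, the helper names are hypothetical, and breaking ties in favour of the earliest sentence is an assumption.

```python
def overlap(span_a, span_b):
    """Number of characters shared by two (start, end) spans."""
    return max(0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def align_spans(edu_spans, sent_spans):
    """Map each EDU span to the index of the sentence it overlaps most."""
    edu2sent = []
    for edu in edu_spans:
        overlaps = [overlap(edu, sent) for sent in sent_spans]
        best = max(overlaps, default=0)
        # None when the EDU overlaps no sentence at all
        edu2sent.append(overlaps.index(best) if best > 0 else None)
    return edu2sent


sentences = [(0, 10), (10, 25)]
edus = [(0, 4), (4, 10), (8, 20), (30, 35)]
print(align_spans(edus, sentences))
```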

educe.rst_dt.rst_wsj_corpus module¶

This module provides loaders for file formats found in the RST-WSJ-corpus.

educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_edus_file(f)¶

Load a file that contains the EDUs of a document.

Return clean text and the list of EDU offsets on the clean text.

educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_text_file(f)¶

Load a text file from the RST-WSJ-CORPUS.

Return the text plus its sentences and paragraphs.

The corpus contains two types of text files, so this function is mainly an entry point that delegates to the appropriate function.

educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_text_file_file(f)¶

Load a text file whose name is of the form file##

These files do not mark paragraphs. Each line contains a sentence preceded by two or three leading spaces.

educe.rst_dt.rst_wsj_corpus.load_rst_wsj_corpus_text_file_wsj(f)¶

Load a text file whose name is of the form wsj_##

By convention:

paragraphs are separated by double newlines
sentences by single newlines

Note that this segmentation isn’t particularly reliable, and seems to both over- (e.g. cut at some abbreviations, like “Prof.”) and under-segment (e.g. not separate contiguous sentences). It shouldn’t be taken too seriously, but if you need some sort of rough approximation, it may be helpful.
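The newline convention can be sketched with plain string splitting. This is not the loader itself, which returns the text together with its sentences and paragraphs; the helper name and the nested-list output are assumptions made for the example.

```python
def split_paragraphs_and_sentences(text):
    """Split text into paragraphs (double newline) and sentences (newline)."""
    paragraphs = []
    for para in text.split("\n\n"):
        # drop empty lines left over by extra newlines
        sentences = [line for line in para.split("\n") if line.strip()]
        if sentences:
            paragraphs.append(sentences)
    return paragraphs


sample = "A first sentence.\nA second one.\n\nA new paragraph."
print(split_paragraphs_and_sentences(sample))
```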

educe.rst_dt.sdrt module¶

Convert RST trees to SDRT style EDU/CDU annotations.

The core of the conversion is rst_to_sdrt which produces an intermediary pointer based representation (a single CDU pointing to other CDUs and EDUs).

A fancier variant, rst_to_glozz_sdrt wraps around this core and further converts the CDU into a Glozz-friendly form

class educe.rst_dt.sdrt.CDU(members, rel_insts)¶

Complex Discourse Unit.

A CDU contains one or more discourse units, and tracks relation instances between its members. Both CDU and EDU are discourse units.

members¶: list of Unit or Scheme – Immediate member units (EDUs and CDUs) of this CDU.

rel_insts¶: list of Relation – Relation instances between immediate members of this CDU.

class educe.rst_dt.sdrt.RelInst(source, target, type)¶

Relation instance.

educe.annotation calls these ‘Relation’s, which is really more in keeping with how Glozz classes them, but properly speaking, relation instance is a better name.

source¶: Unit? – Source of the relation instance.

target¶: Unit? – Target of the relation instance.

type¶: string – Name of the relation.

educe.rst_dt.sdrt.debug_du_to_tree(m)¶

Tree representation of CDU.

The set of relation instances is treated as the parent of each node. Loses information; should only be used for debugging purposes.

educe.rst_dt.sdrt.rst_to_glozz_sdrt(rst_tree, annotator='ldc')¶: From an RST tree to a STAC-like version using Glozz annotations. Uses rst_to_sdrt

educe.rst_dt.sdrt.rst_to_sdrt(tree)¶

From RSTTree to CDU or EDU (recursive, top-down transformation). We recognise three patterns walking down the tree (anything else is considered to be an error):

Pre-terminal nodes: Return the leaf EDU
Mono-nuclear, N satellites: Return a CDU with a relation instance from the nucleus to each satellite. As an informal example, given X(attribution:S1, N, explanation-argumentative:S2), we return a CDU with sdrt(N) -- attribution --> sdrt(S1) and sdrt(N) -- explanation-argumentative --> sdrt(S2)
Multi-nuclear, 0 satellites: Return a CDU with a relation instance across each successive nucleus (assume the same relation). As an informal example, given X(List:N1, List:N2, List:N3), we return a CDU containing sdrt(N1) -- List --> sdrt(N2) -- List --> sdrt(N3).
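The mono-nuclear and multi-nuclear patterns can be sketched for a single node. The code below is a simplified illustration, not the conversion itself: children are reduced to hypothetical (nuclearity, relation, unit) triples, relation instances to (source, relation, target) tuples, and neither the recursion nor the CDU wrapper is shown.

```python
def relation_instances(children):
    """Relation instances between the children of one node (hypothetical helper).

    children is a list of (nuclearity, relation, unit) triples.
    """
    nuclei = [kid for kid in children if kid[0] == "Nucleus"]
    satellites = [kid for kid in children if kid[0] == "Satellite"]
    if len(nuclei) == 1 and satellites:
        # mono-nuclear: one instance from the nucleus to each satellite;
        # the label carried by the nucleus itself is not used
        nucleus = nuclei[0][2]
        return [(nucleus, rel, unit) for _, rel, unit in satellites]
    if len(nuclei) > 1 and not satellites:
        # multi-nuclear: link each nucleus to its successor,
        # assuming all nuclei carry the same relation
        rel = nuclei[0][1]
        units = [unit for _, _, unit in nuclei]
        return [(src, rel, tgt) for src, tgt in zip(units, units[1:])]
    raise ValueError("unexpected nuclearity pattern")


mono = [("Satellite", "attribution", "S1"),
        ("Nucleus", "span", "N"),
        ("Satellite", "explanation-argumentative", "S2")]
multi = [("Nucleus", "List", "N1"),
         ("Nucleus", "List", "N2"),
         ("Nucleus", "List", "N3")]
print(relation_instances(mono))
print(relation_instances(multi))
```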

educe.rst_dt.text module¶

Educe-style annotations for RST discourse treebank text objects (paragraphs and sentences)

class educe.rst_dt.text.Paragraph(num, sentences)¶

Bases: educe.annotation.Standoff

A paragraph is a sequence of `Sentence`s (also standoff annotations).

classmethod left_padding(sentences)¶: Return a left padding Paragraph

num = None¶: paragraph ID in document

sentences = None¶: sentence-level annotations

class educe.rst_dt.text.Sentence(num, span)¶

Bases: educe.annotation.Standoff

Just a text span really

classmethod left_padding()¶: Return a left padding Sentence

num = None¶: sentence ID in document

text_span()¶

educe.rst_dt.text.clean_edu_text(text)¶: Strip metadata from EDU text and compress extraneous whitespace