educe package

Note: At the time of this writing, this is a slightly idealised representation of the package. See below for notes on where things get a bit messier.

The educe library provides utilities for working with annotated discourse corpora. It has a three-layer structure:

  • base layer (files, annotations, fusion, graphs)
  • tool layer (specific to tools, file formats, etc)
  • project layer (specific to particular corpora, currently stac)

Layers

Working our way up the tower, the base layer provides four sublayers:

  • file management (educe.corpus): basic model for corpus traversal, for selecting slices of the corpus
  • annotation (educe.annotation): representation of annotated texts, adhering closely to whatever annotation tool produced them
  • fusion (in progress): connections between annotations on different layers (eg. on speech acts for text spans, discourse relations), or from different tools (eg. from a POS tagger, a parser, etc)
  • graph (educe.graph): high-level/abstract representation of discourse structure, allowing for queries on the structures themselves (eg. give me all pairs for discourse units separated by at most 3 nodes in the graph)

Building on the base layer, we have modules that are specific to a particular set of annotation tools; currently this is only educe.glozz. We aim to add modules sparingly.

Finally, on top of this, we have the project layer (eg. educe.stac) which keeps track of conventions specific to a particular corpus. The hope would be for most of your script writing to deal with this layer directly, eg. for STAC:

          stac                             [project layer]
            |
   +--------+-------------+--------+
   |        |             |        |
   |        v             |        |
   |      glozz           |        |       [tool layer]
   |        |             |        |
   v        v             v        v
corpus -> annotation <- fusion <- graph    [base layer]

Support for other projects would consist of writing other project layer modules that map down to the tool layer.

Departures from the ideal (2013-05-23)

Educe is still in its early stages. Some departures you may want to be aware of:

  • fusion layer does not really exist yet; educe.annotation currently takes on some of the job (for example, the text_span function makes annotations of different types more or less comparable)
  • layer violations: ideally we want lower layers to be abstract from things above them, but you may find eg. glozz-specific assumptions in the base layer, which isn’t great.
  • inconsistency in encapsulation: educe.stac doesn’t wrap everything below it (it’s also not clear yet if it should). It currently wraps educe.glozz and educe.corpus (so by rights you shouldn’t really need to import them), but not the graph stuff for example.

Subpackages

Submodules

educe.annotation module

Low-level representation of corpus annotations, following somewhat faithfully the Glozz model for annotations.

This is low-level in the sense that we make little attempt to interpret the information stored in these annotations. For example, a relation might claim to link two units of id unit42 and unit43. This being a low-level representation, we simply note the fact. A higher-level representation might attempt to actually make the corresponding units available to you, or perhaps provide some sort of graph representation of them.

class educe.annotation.Annotation(anno_id, span, atype, features, metadata=None, origin=None)

Bases: educe.annotation.Standoff

Any sort of annotation.

Annotations tend to have:

  • span: some sort of location (what they are annotating)
  • type: some key label (which we call a type)
  • features: an attribute to value dictionary

identifier()

Global identifier if possible, else local identifier.

String representation of an identifier that should be unique to this corpus at least.

If the unit has an origin (see “FileId”), we use the

  • document
  • subdocument
  • stage
  • (but not the annotator!)
  • and the id from the XML file

If we don’t have an origin we fall back to just the id provided by the XML file.

See also position as potentially a safer alternative to this (and what we mean by safer)

local_id()

Local identifier.

An identifier which is sufficient to pick out this annotation within a single annotation file.

class educe.annotation.Document(units, relations, schemas, text)

Bases: educe.annotation.Standoff

A single (sub)-document.

This can be seen as collections of unit, relation, and schema annotations

annotations()

All annotations associated with this document

fleshout(origin)

See set_origin

global_id(local_id)

String representation of an identifier that should be unique to this corpus at least.

set_origin(origin)

If you have more than one document, it’s a good idea to set its origin to a file ID so that you can more reliably tell the annotations apart.

text(span=None)

Return the text associated with these annotations (or None), optionally limited to a span

class educe.annotation.RelSpan(t1, t2)

Bases: object

Which two units a relation connects.

t1 = None

string – id of an annotation

t2 = None

string – id of an annotation

class educe.annotation.Relation(rel_id, span, rtype, features, metadata=None)

Bases: educe.annotation.Annotation

An annotation between two annotations.

Relations are directed; see RelSpan for details

Use the source and target field to grab these respective annotations, but note that they are only instantiated after fleshout is called (corpus slurping normally fleshes out documents and thus their relations).

fleshout(objects)

Given a dictionary mapping ids to annotation objects, set this relation’s source and target fields.

source = None

source annotation; will be defined by fleshout

target = None

target annotation; will be defined by fleshout
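As a rough illustration of what fleshing out a relation involves, the sketch below resolves the two ids stored in a relation against a dictionary of annotation objects. MiniUnit and MiniRelation are hypothetical stand-ins written for this example; they are not the educe classes and only mimic the behaviour described above.

```python
class MiniUnit(object):
    """Hypothetical stand-in for a unit annotation."""
    def __init__(self, unit_id):
        self.unit_id = unit_id


class MiniRelation(object):
    """Hypothetical stand-in for a relation: holds ids until fleshed out."""
    def __init__(self, rel_id, t1, t2):
        self.rel_id = rel_id
        self.t1 = t1        # id of the source annotation
        self.t2 = t2        # id of the target annotation
        self.source = None  # only set once fleshout is called
        self.target = None  # only set once fleshout is called

    def fleshout(self, objects):
        """Look up the source and target objects by id."""
        self.source = objects[self.t1]
        self.target = objects[self.t2]


# Dictionary from ids to annotation objects, as fleshout expects
units = {u.unit_id: u for u in [MiniUnit("unit42"), MiniUnit("unit43")]}
rel = MiniRelation("rel1", "unit42", "unit43")
rel.fleshout(units)
```

Before the call, source and target are None; afterwards they point at the unit objects themselves rather than at their ids.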

class educe.annotation.Schema(rel_id, units, relations, schemas, stype, features, metadata=None)

Bases: educe.annotation.Annotation

An annotation between a set of annotations

Use the members field to grab the annotations themselves. But note that it is only created when fleshout is called.

fleshout(objects)

Given a dictionary mapping ids to annotation objects, set this schema’s members field to point to the appropriate objects

terminals()

All unit-level annotations contained in this schema (or, recursively, in schemas contained herein)
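The recursion can be pictured with a small sketch. Here a unit is modelled as a plain string and a schema as a list of units and further schemas; this encoding is an assumption made purely for illustration and is not how educe stores schemas.

```python
def terminals_sketch(item):
    """Collect the units under a (possibly nested) schema, depth first.

    In this sketch a unit is a string and a schema is a list whose
    members are units or other schemas.
    """
    if isinstance(item, str):
        return [item]
    found = []
    for member in item:
        found.extend(terminals_sketch(member))
    return found


inner_schema = ["u2", "u3"]
outer_schema = ["u1", inner_schema, "u4"]
flat = terminals_sketch(outer_schema)
```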

class educe.annotation.Span(start, end)

Bases: object

What portion of text an annotation corresponds to. Assumed to be in terms of character offsets

The way we interpret spans in educe amounts to how Python interprets array slice indices.

One way to understand them is to think of offsets as sitting in between individual characters

  h   o   w   d   y
0   1   2   3   4   5

So (0,5) covers the whole word above, and (1,2) picks out the letter “o”
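Since spans follow Python slice conventions, the same two examples can be checked directly with string slicing. This uses only built-in Python and no educe code.

```python
word = "howdy"

whole = word[0:5]   # offsets 0 and 5 bracket the whole word
letter = word[1:2]  # offsets 1 and 2 bracket the single letter "o"

# With this convention the length of a span is simply end minus start
letter_length = 2 - 1
```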

absolute(other)

Assuming this span is relative to some other span, return a suitably shifted “absolute” copy.

encloses(other)

Return True if this span includes the argument

Note that x.encloses(x) == True

Corner case: x.encloses(None) == False

See also educe.graph.EnclosureGraph if you might be repeating these checks

length()

Return the length of this span

merge(other)

Return a span that stretches from the beginning to the end of the two spans. Whereas overlaps can be thought of as returning the intersection of two spans, this can be thought of as returning the union.

classmethod merge_all(spans)

Return a span that stretches from the beginning to the end of all the spans in the list

overlaps(other, inclusive=False)

Return the overlapping region if two spans have regions in common, or else None.

Span(5, 10).overlaps(Span(8, 12)) == Span(8, 10)
Span(5, 10).overlaps(Span(11, 12)) == None

If inclusive == True, spans with touching edges are considered to overlap

Span(5, 10).overlaps(Span(10, 12)) == None
Span(5, 10).overlaps(Span(10, 12), inclusive=True) == Span(10, 10)
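
The behaviour documented for overlaps and merge can be reproduced with a few lines of standalone Python. MiniSpan is a hypothetical stand-in written for this sketch; it follows the examples above but is not taken from the educe source.

```python
class MiniSpan(object):
    """Hypothetical stand-in for a span of character offsets."""
    def __init__(self, start, end):
        self.start = start
        self.end = end

    def __eq__(self, other):
        return (isinstance(other, MiniSpan) and
                (self.start, self.end) == (other.start, other.end))

    def __ne__(self, other):
        return not self == other

    def __repr__(self):
        return "MiniSpan(%d, %d)" % (self.start, self.end)

    def overlaps(self, other, inclusive=False):
        """Intersection of the two spans, or None if there is none."""
        start = max(self.start, other.start)
        end = min(self.end, other.end)
        if start < end or (inclusive and start == end):
            return MiniSpan(start, end)
        return None

    def merge(self, other):
        """Smallest span covering both spans (the 'union')."""
        return MiniSpan(min(self.start, other.start),
                        max(self.end, other.end))


common = MiniSpan(5, 10).overlaps(MiniSpan(8, 12))
touching = MiniSpan(5, 10).overlaps(MiniSpan(10, 12), inclusive=True)
union = MiniSpan(5, 10).merge(MiniSpan(8, 12))
```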

relative(other)

Assuming this span and the other span are both absolute, return a copy of this span shifted so that it is relative to the other (the inverse of absolute).

shift(offset)

Return a copy of this span, shifted to the right (if offset is positive) or left (if negative).

It may be a bit more convenient to use ‘absolute/relative’ if you’re trying to work with spans that are within other spans.
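A minimal sketch of how shifting relates to the absolute and relative views, using plain (start, end) tuples. It assumes that converting between the two views means shifting by the start offset of the enclosing span, which the text above suggests but does not spell out; the function names are hypothetical.

```python
def shift_sketch(span, offset):
    """Copy of the span moved right (positive) or left (negative)."""
    start, end = span
    return (start + offset, end + offset)


def absolute_sketch(span, container):
    """span is relative to container; express it in the container's frame."""
    return shift_sketch(span, container[0])


def relative_sketch(span, container):
    """span shares the container's frame; express it relative to it."""
    return shift_sketch(span, -container[0])


sentence = (100, 150)          # a sentence within the document
word_in_sentence = (4, 9)      # a word, counted from the sentence start
word_in_document = absolute_sketch(word_in_sentence, sentence)
round_trip = relative_sketch(word_in_document, sentence)
```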

class educe.annotation.Standoff(origin=None)

Bases: object

A standoff object ultimately points to some piece of text.

The pointing is not necessarily direct though.

origin

educe.corpus.FileId, optional – FileId of the document supporting this standoff.

encloses(other)

True if this annotation’s span encloses the span of the other.

s1.encloses(s2) is shorthand for s1.text_span().encloses(s2.text_span())

Parameters:other (Standoff) – Other annotation.
Returns:res – True if this annotation’s span encloses the span of the other.
Return type:boolean
overlaps(other)

True if this annotation’s span overlaps with the span of the other.

s1.overlaps(s2) is shorthand for s1.text_span().overlaps(s2.text_span())

Parameters:other (Standoff) – Other annotation.
Returns:res – True if this annotation’s span overlaps with the span of the other.
Return type:boolean
text_span()

Return the span from the earliest terminal annotation contained here to the latest.

Corner case: if this is an empty non-terminal (which would be a very weird thing indeed), return None.

Returns:res – Span from the first character of the earliest terminal annotation contained here, to the last character of the latest terminal annotation ; None if this annotation has no terminal.
Return type:Span or None
class educe.annotation.Unit(unit_id, span, utype, features, metadata=None, origin=None)

Bases: educe.annotation.Annotation

Unit annotation.

An annotation over a span of text.

position()

The position is the set of “geographical” information only to identify an item. So instead of relying on some sort of name, we might rely on its text span. We assume that some name-based elements (document name, subdocument name, stage) can double as being positional.

If the unit has an origin (see “FileId”), we use the

  • document
  • subdocument
  • stage
  • (but not the annotator!)
  • and its text span

position vs identifier

This is a trade-off. On the one hand, you can see the position as being a safer way to identify a unit, because it obviates having to worry about your naming mechanism guaranteeing stability across the board (eg. two annotators stick an annotation in the same place; does it have the same name). On the other hand, it’s a bit harder to uniquely identify objects that may coincidentally fall in the same span. So how much do you trust your IDs?

educe.corpus module

Corpus management

class educe.corpus.FileId(doc, subdoc, stage, annotator)

Information needed to uniquely identify an annotation file.

Note that this includes the annotator, so if you want to do comparisons on the “same” file between annotators you’ll want to ignore this field.

Parameters:
  • doc (string) – document name
  • subdoc (string) – subdocument (often None); sometimes you may have a need to divide a document into smaller pieces (for example, when working with tools that require too much memory to process large documents). The subdocument identifies which piece of the document you are working with. If you don’t have a notion of subdocuments, just use None
  • stage (string) – annotation stage; for use if you have distinct files that correspond to different stages of your annotation process (or different processing tools)
  • annotator (string) – the annotator (or annotation tool) that generated this annotation file
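
To compare the “same” file across annotators, one option is to group file ids on every field except the annotator. The sketch below uses a namedtuple called FileIdSketch as a hypothetical stand-in for FileId; the stage name "units" is likewise invented for the example.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in with the four fields described above
FileIdSketch = namedtuple("FileIdSketch",
                          ["doc", "subdoc", "stage", "annotator"])


def group_by_file(file_ids):
    """Map (doc, subdoc, stage) to the annotators who worked on it."""
    groups = defaultdict(list)
    for fid in file_ids:
        groups[(fid.doc, fid.subdoc, fid.stage)].append(fid.annotator)
    return dict(groups)


file_ids = [
    FileIdSketch("pilot14", None, "units", "Alice"),
    FileIdSketch("pilot14", None, "units", "Bob"),
    FileIdSketch("pilot15", None, "units", "Alice"),
]
by_file = group_by_file(file_ids)
```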
mk_global_id(local_id)

String representation of an identifier that should be unique to this corpus at least.

If the unit has an origin (see “FileId”), we use the

  • document
  • subdocument
  • (but not the stage!)
  • (but not the annotator!)
  • and the id from the XML file

If we don’t have an origin we fall back to just the id provided by the XML file

See also position as potentially a safer alternative to this (and what we mean by safer)

class educe.corpus.Reader(root)

Reader provides little more than dictionaries from FileId to data.

Parameters:rootdir (str) – the top directory of the corpus

A potentially useful pattern to apply here is to take a slice of these dictionaries for processing. For example, you might not want to read the whole corpus, but only the files which are modified by certain annotators.

reader = Reader(corpus_dir)
files = reader.files()
subfiles = {k: v for k, v in files.items() if k.annotator in ['Bob', 'Alice']}
corpus = reader.slurp(subfiles)

Alternatively, having read in the entire corpus, you might be doing processing on various slices of it at a time

corpus = reader.slurp()
subcorpus = {k: v for k, v in corpus.items() if k.doc == 'pilot14'}

This is an abstract class; you should use the version from a data-set, eg. educe.stac.Reader instead
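
The slicing pattern itself needs nothing beyond dictionaries, so it can be tried out without a corpus on disk. In this sketch FileIdSketch, the stage name and the file names are all hypothetical placeholders, and filter_by_key is a local helper rather than an educe function.

```python
from collections import namedtuple

# Hypothetical stand-in for educe.corpus.FileId
FileIdSketch = namedtuple("FileIdSketch",
                          ["doc", "subdoc", "stage", "annotator"])

# Placeholder for what a reader might return: FileId -> file path(s)
files = {
    FileIdSketch("pilot14", None, "units", "Alice"): "pilot14-alice.aa",
    FileIdSketch("pilot14", None, "units", "Carol"): "pilot14-carol.aa",
    FileIdSketch("pilot15", None, "units", "Bob"): "pilot15-bob.aa",
}


def filter_by_key(d, pred):
    """Keep only the entries whose key satisfies the predicate."""
    return {k: v for k, v in d.items() if pred(k)}


by_annotator = filter_by_key(files, lambda k: k.annotator in ["Bob", "Alice"])
by_document = filter_by_key(files, lambda k: k.doc == "pilot14")
```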

files(doc_glob=None)

Return a dictionary from FileId to (tuples of) filepaths. The tuples correspond to files that are considered to ‘belong’ together; for example, in the case of standoff annotation, both the text file and its annotations

Derived classes should implement this function.

Parameters:doc_glob (str, optional) – Glob expression for names of game folders ; if None, subclasses are expected to use the wildcard ‘*’ that matches all strings.
filter(d, pred)

Convenience function equivalent to

{ k:v for k,v in d.items() if pred(k) }
slurp(cfiles=None, doc_glob=None, verbose=False)

Read the entire corpus if cfiles is None or else the subset specified by cfiles.

Return a dictionary from FileId to educe.annotation.Document

Parameters:
  • cfiles (dict, optional) – Dict of files like what Reader.files() would return.
  • doc_glob (str, optional) – Glob pattern for doc (folder) names ; ignored if cfiles is not None.
  • verbose (boolean, defaults to False) – If True, print what we’re reading to stderr.
slurp_subcorpus(cfiles, verbose=False)

Derived classes should implement this function

educe.glozz module

The Glozz file format in educe.annotation form

You’re likely most interested in slurp_corpus and read_annotation_file

class educe.glozz.GlozzDocument(hashcode, unit, rels, schemas, text)

Bases: educe.annotation.Document

Representation of a glozz document

set_origin(origin)
to_xml(settings=<educe.glozz.GlozzOutputSettings object>)
exception educe.glozz.GlozzException(*args, **kw)

Bases: exceptions.Exception

class educe.glozz.GlozzOutputSettings(feature_order, metadata_order)

Bases: object

Non-essential aspects of Glozz XML output, such as the order that feature structures or metadata are written out. Controlling these settings could be useful when you want to automatically modify an existing Glozz document, but produce only minimal textual diffs along the way for revision control, comparability, etc.

educe.glozz.glozz_annotation_to_xml(self, tag='annotation', settings=<educe.glozz.GlozzOutputSettings object>)
educe.glozz.glozz_relation_to_span_xml(self)
educe.glozz.glozz_schema_to_span_xml(self)
educe.glozz.glozz_unit_to_span_xml(self)
educe.glozz.hashcode(f)

Hashcode mechanism as documented in the Glozz manual appendix. Hint: use cStringIO to get the hashcode for a string

educe.glozz.ordered_keys(preferred, d)

Keys from a dictionary starting with ‘preferred’ ones in the order of preference
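
One way such an ordering could work is sketched below. The description above does not say how the non-preferred keys are ordered, so this sketch sorts them, which is an assumption; ordered_keys_sketch is a hypothetical name and not the educe function.

```python
def ordered_keys_sketch(preferred, d):
    """Preferred keys first (in order of preference), then the rest.

    Preferred keys that are absent from the dictionary are skipped.
    The remaining keys are sorted, which is an assumption of this sketch.
    """
    first = [k for k in preferred if k in d]
    rest = sorted(k for k in d if k not in preferred)
    return first + rest


features = {"date": 1, "zeta": 2, "author": 3, "alpha": 4}
keys = ordered_keys_sketch(["author", "date", "missing"], features)
```

A stable key order of this kind is what keeps textual diffs small when a document is rewritten automatically.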

educe.glozz.read_annotation_file(anno_filename, text_filename=None)

Read a single glozz annotation file and its corresponding text (if any).

educe.glozz.read_node(node, context=None)
educe.glozz.write_annotation_file(anno_filename, doc, settings=<educe.glozz.GlozzOutputSettings object>)

Write a GlozzDocument to XML in the given path

educe.graph module

Graph representation of discourse structure. Classes of interest:

  • Graph: the core structure, use the Graph.from_doc factory method to build one out of an educe.annotation document.
  • DotGraph: visual representation, built from Graph. You probably want a project-specific variant to get more helpful graphs, see eg. educe.stac.graph.DotGraph

Educe hypergraphs

Somewhat tricky hypergraph representation of discourse structure.

  • a node for every elementary discourse unit
  • a hyperedge for every relation instance [1]
  • a hyperedge for every complex discourse unit
  • (the tricky bit) for every (hyper)edge e_x in the graph, introduce a “mirror node” n_x for that edge (this node also has e_x as its “mirror edge”)

The tricky bit is a response to two issues that arise: (A) how do we point to a CDU? Our hypergraph formalism and library doesn’t have a notion of pointing to hyperedges (only nodes) and (B) what do we do about misannotations where we have relation instances pointing to relation instances? A is the most important one to address (in principle, we could just treat B as an error and raise an exception), but for now we decide to model both scenarios with the same “mirror” mechanism above.

The mirrors are a bit problematic because they are not part of the formal graph structure (think of them as extra labels). This could lead to some seriously unintuitive consequences when traversing the graph. For example, if you have two DUs A and B connected by an Elab instance, and if that instance is itself (bizarrely) connected to some other DU C, you might intuitively expect A, B, and C to all form one connected component:

     A
     |
Elab |
     o--------->C
     | Comment
     |
     v
     B

Alas, this is not so! The reality is a bit messier, with there being no formal relationship between edge and mirror

     A
     |
Elab |  n_ab
     |  o--------->C
     |    Comment
     |
     v
     B

The same goes for the connectedness of things pointing to CDUs and with their members. Looking at pictures, you might intuitively think that if a discourse unit (A) were connected to a CDU, it would also be connected to the discourse units within

     A
     |
Elab |
     |
     v
     +-----+
     | B C |
     +-----+

The reality is messier for the same reasons above

     A
     |
Elab |      +-----+ e_bc
     |      | B C |
     v      +-----+
     n_bc

[1] Just a binary hyperedge, ie. like an edge in a regular graph. As these are undirected, we take the convention that the first link is the tail (from) and the second link is the head (to).

Classes

class educe.graph.AttrsMixin

Attributes common to both the hypergraph and directed graph representation of discourse structure

annotation(x)

Return the annotation object corresponding to a node or edge

edge_attributes_dict(x)
edgeform(x)

Return the argument if it is an edge id, or its mirror if it’s a node id

(This is possible because every edge in the graph has a node that corresponds to it)

is_cdu(x)
is_edu(x)
is_relation(x)
mirror(x)

For objects (particularly, relations/CDUs) that have a mirror image, ie. an edge representation if it’s a node or vice-versa, return the identifier for that image

node(x)

DEPRECATED (renamed 2013-11-19): use self.nodeform(x) instead

node_attributes_dict(x)
nodeform(x)

Return the argument if it is a node id, or its mirror if it’s an edge id

(This is possible because every edge in the graph has a node that corresponds to it)

type(x)

Return if a node/edge is of type ‘EDU’, ‘rel’, or ‘CDU’

class educe.graph.DotGraph(anno_graph)

Bases: pydot.Dot

A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here

This is fairly abstract and unhelpful. You probably want the project-layer extension instead, eg. educe.stac.graph

exception educe.graph.DuplicateIdException(duplicate)

Bases: exceptions.Exception

Condition that arises in inconsistent corpora

class educe.graph.EnclosureDotGraph(enc_graph)

Bases: pydot.Dot

class educe.graph.EnclosureGraph(annotations, key=None)

Bases: pygraph.classes.digraph.digraph, educe.graph.AttrsMixin

Caching mechanism for span enclosure. Given an iterable of Annotation, return a directed graph where nodes point to the largest nodes they enclose (i.e. not to nodes that are enclosed by intermediary nodes they point to). As a slight twist, we also allow nodes to redundantly point to enclosed nodes of the same type.

This should give you a multipartite graph with each layer representing a different type of annotation, but no promises! We can’t guarantee that the graph will be nicely layered because the annotations may be buggy (either nodes wrongly typed, or nodes of the same type that wrongly enclose each other), so you should not rely on this property aside from treating it as an optimisation.

Note: there is a corner case for nodes that have the same span. Technically a span encloses itself, so the graph could have a loop. If you supply a sort key that differentiates two nodes, we use it as a tie-breaker (first node encloses second). Otherwise, we simply exclude both links.

NB: nodes are labelled by their annotation id

Initialisation parameters

  • annotations - iterable of Annotation
  • key - disambiguation key for nodes with same span
    (annotation -> sort key)
inside(annotation)

Given an annotation, return all annotations that are directly within it. Results are returned in the order of their local id

outside(annotation)

Given an annotation, return all annotations it is directly enclosed in. Results are returned in the order of their local id

class educe.graph.Graph

Bases: pygraph.classes.hypergraph.hypergraph, educe.graph.AttrsMixin

Hypergraph representation of discourse structure. See the section on Educe hypergraphs

You most likely want to use Graph.from_doc instead of instantiating an instance directly

Every node/hyperedge is represented as a string unique within the graph. Given one of these identifiers x and a graph g:

  • g.type(x) returns one of the strings “EDU”, “CDU”, “rel”
  • g.annotation(x) returns an educe.annotation object
  • for relations and CDUs, if e_x is the edge representation of the relation/cdu, g.mirror(x) will return its mirror node n_x and vice-versa

TODOS:

  • TODO: Currently we use educe.annotation objects to represent the EDUs, CDUs and relations, but this is likely a bit too low-level to be helpful. It may be nice to have higher-level EDU and CDU objects instead
cdu_members(cdu, deep=False)

Return the set of EDUs, CDUs, and relations which can be considered as members of this CDU.

This is shallow by default, in that we only return the immediate members of the CDU. If deep==True, also return members of CDUs that are members of (members of ..) this CDU.
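
The difference between the shallow and deep variants can be sketched with a plain dictionary from CDU names to their immediate members. The data structure and the names (cdu_outer, edu1, and so on) are hypothetical; the real method works on the hypergraph itself.

```python
def cdu_members_sketch(cdus, cdu, deep=False):
    """Members of a CDU; with deep=True, also members of member CDUs."""
    members = set(cdus[cdu])
    if deep:
        for member in cdus[cdu]:
            if member in cdus:  # the member is itself a CDU
                members |= cdu_members_sketch(cdus, member, deep=True)
    return members


# cdu_outer contains an EDU and another CDU
cdus = {
    "cdu_outer": ["edu1", "cdu_inner"],
    "cdu_inner": ["edu2", "edu3"],
}
shallow = cdu_members_sketch(cdus, "cdu_outer")
deep = cdu_members_sketch(cdus, "cdu_outer", deep=True)
```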

cdus()

Set of hyperedges representing complex discourse units.

See also cdu_members

connected_components()

Return a set of connected components.

Each connected component set can be passed to self.copy() to be copied as a subgraph.

This builds on python-graph’s version of a function with the same name but also adds awareness of our conventions about there being both a node/edge for relations/CDUs.
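
Why mirror awareness matters can be shown on the Elab/Comment example from the section on Educe hypergraphs. The sketch below is a standalone union-find over plain dictionaries, not the python-graph or educe implementation; the identifiers e_ab, n_ab, e_comment and n_comment are hypothetical names for the edges and their mirror nodes.

```python
from collections import defaultdict


def components(nodes, edges, mirrors=None):
    """Connected components; optionally tie each edge to its mirror node."""
    parent = dict((n, n) for n in nodes)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for edge_id, links in edges.items():
        for other in links[1:]:
            union(links[0], other)
        if mirrors is not None and links:
            # Mirror-aware: an edge's mirror node joins the edge's component
            union(links[0], mirrors[edge_id])

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return sorted(sorted(group) for group in groups.values())


nodes = ["A", "B", "C", "n_ab", "n_comment"]
edges = {"e_ab": ["A", "B"], "e_comment": ["n_ab", "C"]}
mirrors = {"e_ab": "n_ab", "e_comment": "n_comment"}

naive = components(nodes, edges)
mirror_aware = components(nodes, edges, mirrors)
```

In the naive version A and B end up separated from C, because the Comment edge attaches to the mirror node n_ab rather than to the Elab edge; linking each edge to its mirror restores the single component one would expect.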

containing_cdu(node)

Given an EDU (or CDU, or relation instance), return immediate containing CDU (the hyperedge) if there is one or None otherwise. If there is more than one containing CDU, return one of them arbitrarily.

containing_cdu_chain(node)

Given an annotation, return a list which represents its containing CDU, the container’s container, and so forth. Return the empty list if no CDU contains this one.
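
A sketch of the chain computation, assuming a hypothetical dictionary that maps each item to its immediate containing CDU (the real method consults the graph instead):

```python
def containing_cdu_chain_sketch(container_of, node):
    """Walk upwards from a node, collecting each enclosing CDU in turn."""
    chain = []
    current = container_of.get(node)
    while current is not None:
        chain.append(current)
        current = container_of.get(current)
    return chain


# edu2 sits inside cdu_inner, which sits inside cdu_outer
container_of = {"edu2": "cdu_inner", "cdu_inner": "cdu_outer"}
chain = containing_cdu_chain_sketch(container_of, "edu2")
top_level = containing_cdu_chain_sketch(container_of, "cdu_outer")
```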

copy(nodeset=None)

Return a copy of the graph, optionally restricted to a subset of EDUs and CDUs.

Note that if you include a CDU, then anything contained by that CDU will also be included.

You don’t specify (or otherwise have control over) what relations are copied. The graph will include all hyperedges whose links are all (a) members of the subset or (b) (recursively) hyperedges included because of (a) and (b)

Note that any non-EDUs you include in the copy set will be silently ignored.

This is a shallow copy in the sense that the underlying layer of annotations and documents remains the same.

Parameters:nodeset (iterable of strings) – only copy nodes with these names
edus()

Set of nodes representing elementary discourse units

classmethod from_doc(corpus, doc_key, could_include=<function <lambda>>, pred=<function <lambda>>)

Return a graph representation of a document

Note: check the project layer for a version of this function which may be more appropriate to your project

Parameters:
  • corpus (dict from FileId to documents) – educe corpus dictionary
  • doc_key (FileId) – key pointing to the document
  • could_include (annotation -> boolean) – predicate on unit level annotations that should be included regardless of whether or not we have links to them
  • pred (annotation -> boolean) – predicate on annotations providing some requirement they must satisfy in order to be taken into account (you might say that could_include gives; and pred takes away)

Given an edge in the graph, return a tuple of its source and target nodes.

If the edge has only a single link, we assume it’s a loop and return the same value for both

relations()

Set of relation edges representing the relations in the graph. By convention, the first link is considered the source and the second is considered the target.

educe.internalutil module

Utility functions which are meant to be used by educe but aren’t expected to be too useful outside of it

exception educe.internalutil.EduceXmlException(*args, **kw)

Bases: exceptions.Exception

educe.internalutil.indent_xml(elem, level=0)

From <http://effbot.org/zone/element-lib.htm>

WARNING: destructive

educe.internalutil.linebreak_xml(elem)

Insert a break after each element tag

You probably want indent_xml instead

educe.internalutil.on_single_element(root, default, f, name)

Return

  • the default if no elements
  • f(the node) if one element
  • an exception if more than one
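
The three cases can be sketched with the standard library’s ElementTree. This is an illustration under assumptions: it looks up children with findall and raises ValueError, whereas the real function may search differently and raise a different exception; on_single_element_sketch is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def on_single_element_sketch(root, default, f, name):
    """Apply f to the only child called name, or fall back to default."""
    nodes = root.findall(name)
    if len(nodes) == 0:
        return default
    if len(nodes) == 1:
        return f(nodes[0])
    raise ValueError("expected at most one <%s> element, found %d"
                     % (name, len(nodes)))


root = ET.fromstring("<unit><author>Alice</author><tag/><tag/></unit>")
author = on_single_element_sketch(root, None, lambda n: n.text, "author")
missing = on_single_element_sketch(root, "n/a", lambda n: n.text, "date")
```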
educe.internalutil.treenode(tree)

API-change padding for NLTK 2 vs NLTK 3 trees

educe.util module

Miscellaneous utility functions

educe.util.FILEID_FIELDS = ['stage', 'doc', 'subdoc', 'annotator']

String representation of fields recognised in an educe.corpus.FileId

educe.util.add_corpus_filters(parser, fields=None, choice_fields=None)

For help with script-building:

Augment an argparser with options to filter a corpus on the various attributes in an ‘educe.corpus.FileId’ (eg, document, annotator).

Parameters:
  • fields ([String]) – which flag names to include (defaults to FILEID_FIELDS)
  • choice_fields (Dict String [String]) – fields which accept a limited range of answers

Meant to be used in conjunction with mk_is_interesting

educe.util.add_subcommand(subparsers, module)

Add a subcommand to an argparser following some conventions:

  • the module can have an optional NAME constant (giving the name of the command); otherwise we assume it’s the unqualified module name
  • the first line of its docstring is its help text
  • subsequent lines (if any) form its epilog

Returns the resulting subparser for the module
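
The conventions listed above can be sketched with argparse alone. The helper name add_subcommand_sketch and the module tools.count are hypothetical, and the docstring handling is one plausible reading of the description rather than the educe implementation.

```python
import argparse
import types


def add_subcommand_sketch(subparsers, module):
    """Register a module as a subcommand using the conventions above."""
    # Optional NAME constant, else the unqualified module name
    name = getattr(module, "NAME", module.__name__.split(".")[-1])
    doc_lines = (module.__doc__ or "").strip().split("\n")
    help_text = doc_lines[0]                     # first docstring line
    epilog = "\n".join(doc_lines[1:]).strip()    # remaining lines, if any
    return subparsers.add_parser(name, help=help_text, epilog=epilog)


# A fake module standing in for a real command module
count_module = types.ModuleType("tools.count")
count_module.__doc__ = "Count annotations\n\nLonger explanation here."

parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers(dest="command")
subparser = add_subcommand_sketch(subparsers, count_module)
args = parser.parse_args(["count"])
```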

educe.util.concat(items)

:: Iterable (Iterable a) -> Iterable a

educe.util.concat_l(items)

:: [[a]] -> [a]
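
Read as type signatures, both functions flatten one level of nesting; the first is lazy and the second returns a list. A sketch of equivalent behaviour using itertools (the _sketch names are hypothetical, and the educe functions may be implemented differently):

```python
import itertools


def concat_sketch(items):
    """Iterable (Iterable a) -> Iterable a: lazily chain the inner iterables."""
    return itertools.chain.from_iterable(items)


def concat_l_sketch(items):
    """[[a]] -> [a]: same flattening, returned as a list."""
    return list(itertools.chain.from_iterable(items))


lazy = list(concat_sketch([[1, 2], (3,), []]))
flat_list = concat_l_sketch([[1], [2, 3]])
```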

educe.util.fields_without(unwanted)

Fields for add_corpus_filters without the unwanted members

educe.util.mk_is_interesting(args, preselected=None)

Return a function that when given a FileId returns ‘True’ if the FileId would be considered interesting according to the arguments passed in.

Parameters:preselected (Dict String [String]) – fields for which we already know what matches we want

Meant to be used in conjunction with add_corpus_filters
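
How the two functions are meant to cooperate can be sketched as follows. Everything here is an assumption made for illustration: the flag names (one per field), the use of regular expressions matched from the start of the value, and the helper names add_filters_sketch and mk_predicate_sketch. FileIdSketch is a hypothetical stand-in for educe.corpus.FileId.

```python
import argparse
import re
from collections import namedtuple

FileIdSketch = namedtuple("FileIdSketch",
                          ["doc", "subdoc", "stage", "annotator"])
FIELDS = ["stage", "doc", "subdoc", "annotator"]


def add_filters_sketch(parser, fields=FIELDS):
    """Add one optional regex flag per FileId field."""
    for field in fields:
        parser.add_argument("--" + field, metavar="REGEX",
                            help="only keep files whose %s matches" % field)


def mk_predicate_sketch(args):
    """Build a FileId -> bool function from the parsed arguments."""
    patterns = {}
    for field in FIELDS:
        value = getattr(args, field, None)
        if value is not None:
            patterns[field] = re.compile(value)

    def is_interesting(file_id):
        # Every filter that was supplied must match its field
        return all(pat.match(getattr(file_id, field) or "")
                   for field, pat in patterns.items())
    return is_interesting


parser = argparse.ArgumentParser()
add_filters_sketch(parser)
args = parser.parse_args(["--doc", "pilot", "--annotator", "Alice|Bob"])
is_interesting = mk_predicate_sketch(args)
```

The resulting predicate can then be used to slice the dictionaries returned by a reader.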

educe.util.relative_indices(group_indices, reverse=False, valna=None)

Generate a list of relative indices inside each group. Missing (None) values are handled specifically: each missing value is mapped to valna.

Parameters:
  • reverse (boolean, optional) – If True, compute indices relative to the end of each group.
  • valna (int or None, optional) – Relative index for missing values.
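
A sketch of one plausible reading of this description, assuming that a group is a consecutive run of equal values in group_indices (the description does not say this explicitly). relative_indices_sketch is a hypothetical name.

```python
import itertools


def relative_indices_sketch(group_indices, reverse=False, valna=None):
    """Position of each item within its run of equal group indices."""
    values = list(group_indices)
    if reverse:
        values.reverse()

    result = []
    for group_idx, run in itertools.groupby(values):
        if group_idx is None:
            # Missing values all map to valna
            result.extend(valna for _ in run)
        else:
            result.extend(i for i, _ in enumerate(run))

    if reverse:
        result.reverse()
    return result


groups = [1, 1, 2, 2, 2, None, 3]
from_start = relative_indices_sketch(groups)
from_end = relative_indices_sketch(groups, reverse=True)
with_default = relative_indices_sketch(groups, valna=-1)
```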