educe.stac package

Conventions specific to the STAC project

This includes things like

  • corpus layout (see corpus_files)
  • which annotations are of interest
  • renaming/deleting/collapsing annotation labels

Submodules

educe.stac.annotation module

STAC annotation conventions (re-exported in educe.stac)

STAC/Glozz annotations can be a bit confusing for two reasons: first, Glozz objects are used to annotate very different things; and second, annotations are done in different stages.

Stage 1 (units)

Glozz      Uses
units      doc structure, EDUs, resources, preferences
relations  coreference
schemas    composite resources

Stage 2 (discourse)

Glozz      Uses
units      doc structure, EDUs
relations  relation instances, coreference
schemas    CDUs

Units

There is a typology of unit types worth noting:

  • doc structure : type eg. Dialogue, Turn, paragraph
  • resources : subspans of segments (type Resource)
  • preferences : subspans of segments (type Preference)
  • EDUs : spans of text associated with a dialogue act (eg. type Offer, Accept) (during discourse stage, these are just type Segment)

Relations

  • coreference : (type Anaphora)
  • relation instances : links between EDUs, annotated with a relation label (eg. type Elaboration, type Contrast, etc). These can be further divided into subordinating or coordinating relation instances according to their label

Schemas

  • composite resources : boolean combinations of resources (eg. “sheep or ore”)
  • CDUs: type Complex_discourse_unit (discourse stage)
class educe.stac.annotation.PartialUnit

Bases: educe.stac.annotation.PartialUnit

Partially instantiated unit, for use when you want to programmatically insert annotations into a document

A partially instantiated unit does not have any metadata (creation date, etc), as these will be derived automatically

educe.stac.annotation.RENAMES = {'Strategic_comment': 'Other', 'Segment': 'Other'}

Dialogue acts that should be treated as a different one

class educe.stac.annotation.TurnId

Bases: tuple

Turn identifier akin to a Gorn address.

A Gorn address is a tuple of integers.

classmethod from_string(tid_str)

Create a TurnId from a string.

ex: (21.0.1)
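
A minimal standalone sketch of the idea, not educe's own implementation. It assumes the string form is a sequence of dot-separated integers, optionally wrapped in parentheses as in the example above; the helper name parse_turn_id is hypothetical.

```python
def parse_turn_id(tid_str):
    """Parse a dotted string such as '21.0.1' into a tuple of ints."""
    # Assumption: surrounding whitespace and parentheses are not significant.
    cleaned = tid_str.strip().strip("()")
    return tuple(int(part) for part in cleaned.split("."))


tid = parse_turn_id("21.0.1")
```

Because the result is a plain tuple of integers, identifiers compare in the usual lexicographic order, which is what makes a Gorn-style address convenient for sorting.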

educe.stac.annotation.addressees(anno)

The set of people spoken to during an edu annotation

Annotation -> Set String

Note: this returns None if the value is the default ‘Please choose...’; but otherwise, it preserves values like ‘All’ or ‘?’.

educe.stac.annotation.cleanup_comments(anno)

Strip out default comment text from features. This placeholder text was inserted as a UI aid during editing in Glozz, but isn’t actually the comment itself

educe.stac.annotation.create_units(_, doc, author, partial_units)

Create a collection of units from their partial specification.

Parameters:
  • _ (anything) – Anonymous parameter whose value is ignored. It was apparently supposed to contain a FileId. I suppose the intention was to follow a signature similar to other functions.
  • doc (Document) – Containing document.
  • author (string) – Author for the new units.
  • partial_units (iterable of PartialUnit) – Partial specification of the new units.
Returns:

res – Collection of instantiated new unit objects.

Return type:

list of Unit

Notes

As of 2016-05-11, this function does not seem to be used anymore in the codebase. It used to be called in irit-stac/segmentation/glozz-segment, which was deleted 2015-06-08 (commit e2373c03) because it was not used.

educe.stac.annotation.dialogue_act(anno)

Set of dialogue act (aka speech act) annotations for a Unit, taking into consideration STAC conventions like collapsing Strategic_comment into Other

By rights this should be a singleton set, but annotations used to carry more than one dialogue act, something we want to phase out?
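
A hedged sketch of the collapsing step, using the RENAMES mapping documented above. The helper normalise_acts is hypothetical and works on plain label strings rather than on educe annotations.

```python
# Mapping copied from the RENAMES entry in this module's documentation.
RENAMES = {'Strategic_comment': 'Other', 'Segment': 'Other'}


def normalise_acts(labels):
    """Map each raw label through RENAMES and return the resulting set."""
    return frozenset(RENAMES.get(label, label) for label in labels)


acts = normalise_acts({'Offer', 'Strategic_comment'})
```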

educe.stac.annotation.game_turns(doc, turns, gen=2)

Group a sequence of turns into a sequence of game turns.

A game turn corresponds to the sequence of events (turns) that happen within a player’s turn (in the SOC game).

Parameters:
  • doc (Document) – Containing document.
  • turns (list of educe.stac.Unit) – Events (of type Turn) from the game: server messages, player messages.
Returns:

gturn_beg – Index of the first Turn of each game turn.

Return type:

list of int

educe.stac.annotation.is_cdu(annotation)

See CDUs typology above

educe.stac.annotation.is_coordinating(annotation)

See Relation typology above

educe.stac.annotation.is_dialogue(annotation)

See Unit typology above

educe.stac.annotation.is_dialogue_act(annotation)

Deprecated in favour of is_edu

educe.stac.annotation.is_edu(annotation)

See Unit typology above

educe.stac.annotation.is_paragraph(annotation)

See Unit typology above

educe.stac.annotation.is_preference(annotation)

See Unit typology above

educe.stac.annotation.is_relation_instance(annotation)

See Relation typology above

educe.stac.annotation.is_resource(annotation)

See Unit typology above

educe.stac.annotation.is_structure(annotation)

Is one of the document-structure annotations, something an annotator is expected not to edit, create, delete

educe.stac.annotation.is_subordinating(annotation)

See Relation typology above

educe.stac.annotation.is_turn(annotation)

See Unit typology above

educe.stac.annotation.is_turn_star(annotation)

See Unit typology above

educe.stac.annotation.relation_labels(anno)

Set of relation labels (eg. Elaboration, Explanation), taking into consideration any applicable STAC-isms

educe.stac.annotation.set_addressees(anno, addr)

Set the addressee list for an annotation. If the value None is provided, the addressee list is deleted (if present)

(Iterable String, Annotation) -> IO ()
educe.stac.annotation.speaker(anno)

Return the speaker associated with a turn annotation. NB: crashes if there is none

educe.stac.annotation.split_turn_text(text)

STAC turn texts are prefixed with a turn number and speaker to help the annotators (eg. “379: Bob: I think it’s your go, Alice”).

Given the text for a turn, split the string into a prefix containing this turn/speaker information (eg. “379: Bob: ”), and a body containing the turn text itself (eg. “I think it’s your go, Alice”).

Mind your offsets! They’re based on the whole turn string.
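
A standalone sketch of the split, assuming the prefix always has the shape "<number>: <speaker>: ". The pattern and the helper name split_prefix are assumptions for illustration; educe's own pattern may differ.

```python
import re

# Assumption: a turn number, a colon, a speaker name without colons, a colon.
_PREFIX = re.compile(r"^\s*\d+\s*:\s*[^:]+:\s*")


def split_prefix(text):
    """Return (prefix, body); the prefix is empty if none is found."""
    match = _PREFIX.match(text)
    if match is None:
        return "", text
    return text[:match.end()], text[match.end():]


prefix, body = split_prefix("379: Bob: I think it's your go, Alice")
# The body starts at offset len(prefix) within the whole turn string.
body_offset = len(prefix)
```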

educe.stac.annotation.split_type(anno)

An object’s type as a (frozen)set of items. You’re probably looking for educe.stac.dialogue_act instead.

educe.stac.annotation.turn_id(anno)

Get the turn identifier for a turn annotation (or None).

Parameters:anno (Annotation) – Annotation
Returns:turn_id – Turn identifier ; None if the annotation has no feature ‘Identifier’.
Return type:tuple(int) or None
educe.stac.annotation.twin(corpus, anno, stage='units')

Given an annotation in a corpus, retrieve the equivalent annotation (by local identifier) from a different stage of the corpus. Return this “twin” annotation or None if it is not found.

Note that the annotation’s origin must be set

The typical use of this would be if you have an EDU in the ‘discourse’ stage and need to get its ‘units’ stage equivalent to have its dialogue act.

Parameters:twin_doc – unit-level document to fish twin from (None if you want educe to search for it in the corpus; NB: corpus can be None if you supply this)
educe.stac.annotation.twin_from(doc, anno)

Given a document and an annotation, return the first annotation in the document with a matching local identifier.
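
A sketch of the lookup by local identifier. The Anno tuple is a stand-in for an educe annotation and first_with_local_id is a hypothetical helper; only the matching logic is illustrated.

```python
from collections import namedtuple

# Stand-in for an annotation: only the local identifier matters here.
Anno = namedtuple("Anno", ["local_id", "type"])


def first_with_local_id(annotations, local_id):
    """Return the first annotation with this local id, or None."""
    return next((a for a in annotations if a.local_id == local_id), None)


units_stage = [Anno("edu_1", "Offer"), Anno("edu_2", "Accept")]
discourse_edu = Anno("edu_2", "Segment")
twin_anno = first_with_local_id(units_stage, discourse_edu.local_id)
```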

educe.stac.context module

The dialogue and turn surrounding an EDU along with some convenient information about it

class educe.stac.context.Context(turn, tstar, turn_edus, dialogue, dialogue_turns, doc_turns, tokens=None)

Bases: object

Representation of the surrounding context for an EDU, basically the relevant enclosing annotations: turns, dialogues. The idea is to potentially extend this to a somewhat richer notion of context, including things like a sentence count, etc.

Parameters:
  • turn – the turn surrounding this EDU
  • tstar – the tstar turn surrounding this EDU (a tstar turn is a sort of virtual turn made by merging consecutive turns in a dialogue that have the same speaker)
  • turn_edus – the EDUs in this turn
  • dialogue – the dialogue surrounding this EDU
  • dialogue_turns – all the turns in the dialogue surrounding this EDU (non-empty, sorted by first-widest span)
  • doc_turns – all the turns in the document
  • tokens – (may not be present): tokens contained within this EDU
classmethod for_edus(doc, postags=None)

Get a dictionary of context objects for each EDU in the doc.

Returns:contexts – A dictionary with a context for each EDU in the document.
Return type:dict(educe.glozz.Unit, Context)
speaker()

the speaker associated with the turn surrounding an edu

educe.stac.context.containing(span, annos)

Given an iterable of standoff, pick just those that enclose/contain the given span (ie. are bigger and around)

educe.stac.context.edus_in_span(doc, span)

Given a document and a text span, return the EDUs the document contains in that span

educe.stac.context.enclosed(span, annos)

Given an iterable of standoff, pick just those that are enclosed by the given span (ie. are smaller and within)
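
The two filters can be sketched on plain (start, end) tuples. This assumes that a span with identical boundaries counts as both containing and enclosed, which the documentation does not state; the names below are hypothetical.

```python
def encloses(outer, inner):
    """True if span `outer` covers span `inner` (boundaries included)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def pick_containing(span, spans):
    """Spans that are around the given span."""
    return [s for s in spans if encloses(s, span)]


def pick_enclosed(span, spans):
    """Spans that lie within the given span."""
    return [s for s in spans if encloses(span, s)]


spans = [(0, 10), (2, 5), (6, 8), (12, 20)]
around = pick_containing((3, 4), spans)
within = pick_enclosed((0, 10), spans)
```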

educe.stac.context.merge_turn_stars(doc)

Return a copy of the document in which consecutive turns by the same speaker have been merged.

Merging is done by taking the first turn in a grouping of consecutive speaker turns, and stretching its span over all the subsequent turns.

Additionally, turn prefix text (containing turn numbers and speakers) from the removed turns is stripped out.
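
A sketch of the span-stretching step only, on (speaker, start, end) tuples. It does not strip prefix text, and merge_consecutive is a hypothetical name rather than part of educe.

```python
from itertools import groupby


def merge_consecutive(turns):
    """Merge runs of consecutive turns that share a speaker.

    `turns` is a list of (speaker, start, end) tuples sorted by start.
    """
    merged = []
    for speaker, group in groupby(turns, key=lambda t: t[0]):
        group = list(group)
        # Keep the first turn and stretch its span to the end of the last.
        merged.append((speaker, group[0][1], group[-1][2]))
    return merged


turns = [("A", 0, 5), ("A", 6, 9), ("B", 10, 15), ("A", 16, 20)]
tstars = merge_consecutive(turns)
```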

educe.stac.context.sorted_first_widest(nodes)

Given a list of nodes, return the nodes ordered by their starting point, and in case of a tie their inverse width (ie. widest first).
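
The ordering can be sketched with a compound sort key on plain (start, end) tuples; first_widest is a hypothetical helper, not the educe function itself.

```python
def first_widest(spans):
    """Sort by starting point, then by decreasing width."""
    return sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))


ordered = first_widest([(5, 6), (0, 3), (0, 10), (0, 1)])
```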

educe.stac.context.speakers(contexts, anno)

Return a list of speakers of an EDU or CDU (in the textual order of the EDUs).

educe.stac.context.turns_in_span(doc, span)

Given a document and a text span, return the turns that the document contains in that span

educe.stac.corenlp module

STAC conventions for running the Stanford CoreNLP pipeline, saving the results, and reading them.

The most useful functions here are

  • run_pipeline
  • read_results
educe.stac.corenlp.from_corenlp_output_filename(f)

Return a tuple of FileId and turn id.

This relies entirely on the convention we established when calling corenlp, of course

educe.stac.corenlp.parsed_file_name(k, dir_name)

Given an educe.corpus.FileId and directory, return the file path within that directory that corresponds to the corenlp output

educe.stac.corenlp.read_corenlp_result(doc, corenlp_doc, tid=None)

Read CoreNLP’s output for a document.

Parameters:
Returns:

corenlp_doc – A CoreNlpDocument containing all information.

Return type:

CoreNlpDocument

educe.stac.corenlp.read_results(corpus, dir_name)

Read stored parser output from a directory, and convert them to educe.annotation.Standoff objects.

Return a dictionary mapping ‘FileId’s to sets of tokens.

educe.stac.corenlp.run_pipeline(corpus, outdir, corenlp_dir, split=False)

Run the standard corenlp pipeline on all the (unannotated) documents in the corpus and save the results in the specified directory.

If split=True, we output one file per turn, an experimental mode to account for switching between multiple speakers. We don’t have all the infrastructure to read these back in (it should just be a matter of some filename manipulation though) and hope to flesh this out later. We also intend to tweak the notion of splitting by aggregating consecutive turns with the same speaker, which may somewhat mitigate the loss of coreference information.

educe.stac.corenlp.turn_id_text(doc)

Return a list of (turn ids, text) tuples in span order (no speaker)

educe.stac.corpus module

Corpus layout conventions (re-exported by educe.stac)

class educe.stac.corpus.LiveInputReader(corpusdir)

Bases: educe.stac.corpus.Reader

Reader for unannotated ‘live’ data that we want to parse.

The data is assumed to be in a directory with one aa/ac file pair.

There is no notion of subdocument (subdoc = None) and the stage is ‘unannotated’

files(doc_glob=None)
Parameters:doc_glob (str, optional) – Glob expression for document (folder) names ; if None, it uses the wildcard ‘*’ for file basenames.
class educe.stac.corpus.Reader(corpusdir)

Bases: educe.corpus.Reader

See educe.corpus.Reader for details

files(doc_glob=None)

Gather files for docs whose folder name matches doc_glob.

Parameters:doc_glob (str, optional) – Glob expression for document (folder) names ; if None, it uses the wildcard ‘*’ to match all strings.
slurp_subcorpus(cfiles, verbose=False)
educe.stac.corpus.id_to_path(k)

Given a fleshed out FileId (none of the fields are None), return a filepath for it following STAC conventions.

You will likely want to add your own filename extensions to this path

educe.stac.corpus.is_metal(fileid)

If the annotator is one of the distinguished standard annotators

educe.stac.corpus.twin_key(key, stage)

Given an annotation key, return a copy shifted over to a different stage.

Note that when copying from unannotated to another stage, you will need to set the annotator

educe.stac.corpus.write_annotation_file(anno_filename, doc)

Write a GlozzDocument to XML in the given path

educe.stac.fake_graph module

Fake graphs for testing STAC algorithms

Specification for mini-language

The source string is parsed line by line; the data type of each line depends on its first character. Uppercase letters are speakers, lowercase letters are units. EDU names are arranged following alphabetical order (this does NOT apply to CDUs). Please arrange the lines in this order:

  • # : speaker line

    # Aabce Bdg Cfh
    
  • any lowercase : CDU line (top-level last)

    y(eg) x(wyz)
    
  • S or C : relation line

    Sabd bf ceCh
    

anything else : skip as comment
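
A hedged sketch of how the speaker and CDU lines might be read. It assumes each speaker token is one uppercase letter followed by that speaker's EDU names, and each CDU token is a name followed by its members in parentheses. Relation lines are left out because their grammar is not fully spelled out here. Both helpers are hypothetical and are not the parser used by LightGraph.

```python
import re


def parse_speakers(line):
    """Map each EDU name to its speaker, e.g. '# Aabce Bdg Cfh'."""
    speaker_of = {}
    for speaker, units in re.findall(r"([A-Z])([a-z]+)", line):
        for unit in units:
            speaker_of[unit] = speaker
    return speaker_of


def parse_cdus(line):
    """Map each CDU name to its member names, e.g. 'y(eg) x(wyz)'."""
    return {name: list(members)
            for name, members in re.findall(r"([a-z])\(([a-z]+)\)", line)}


speaker_of = parse_speakers("# Aabce Bdg Cfh")
cdus = parse_cdus("y(eg) x(wyz)")
```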

class educe.stac.fake_graph.LightGraph(src)

Structure holding only relevant information

Unit keys (sortable, hashable) must correspond to reading order. CDUs can be placed in any position with respect to their components.

get_doc()
get_edge(source, target)

Return an educe.annotation.Relation for the given LightGraph names for source and target

get_node(name)

Return an educe.annotation.Unit or Schema for the given LightGraph name

educe.stac.fusion module

Somewhat higher level representation of STAC documents than the usual Glozz layer.

Note that this is a relatively recent addition to Educe. Up to the time of this writing (2015-03), we had two options for dealing with STAC:

  • manually manipulating glozz objects via educe.annotation
  • dealing with some high-level but not particularly helpful hypergraph objects

We try to provide an intermediary in this layer by merging information from several layers in one place.

A typical example might be to print a listing of

(edu1_id, edu2_id, edu1_dialogue_act, edu2_dialogue_act, relation_label)

This has always been a bit awkward when dealing with Glozz, because there are separate annotations in different Glozz documents, the dialogue acts in the ‘units’ stage; and the linked units in the discourse stage. Combining these streams has always involved a certain amount of manual lookup, which we hope to avoid with this fusion layer.

At the time of this writing, this will have a bit of emphasis on feature extraction.

class educe.stac.fusion.Dialogue(anno, edus, relations)

Bases: object

STAC Dialogue.

Note that input EDUs should be sorted by span.

edu_pairs()

Generate all EDU pairs within this dialogue.

This includes pairs whose source is the left padding (fake root) EDU.

Yields:(source, target) (tuple of educe.stac.annotation.Unit) – Next candidate edge, as a pair of EDUs (source, target).
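
A sketch of candidate pair generation on plain identifiers. It assumes that pairs are ordered, that both directions are produced, and that the fake root only ever appears as a source; the method's actual behaviour may differ. candidate_pairs is a hypothetical name.

```python
# Same value as the ROOT constant documented in this module.
ROOT = 'ROOT'


def candidate_pairs(edu_ids):
    """Yield (source, target) pairs; the fake root is never a target."""
    edu_ids = list(edu_ids)
    for source in [ROOT] + edu_ids:
        for target in edu_ids:
            if source != target:
                yield source, target


pairs = list(candidate_pairs(['e1', 'e2']))
```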
class educe.stac.fusion.EDU(doc, discourse_anno, unit_anno)

Bases: educe.annotation.Unit

STAC EDU

A STAC EDU merges information from the unit and discourse annotation stages so that you can ignore the distinction between the two annotation stages.

It also tries to be usable as a drop-in substitute for both annotations and contexts

dialogue_act()

The (normalised) speech act associated with this EDU (None if unknown)

fleshout(context)

second phase of EDU initialisation; fill out contextual info

identifier()

Some kind of identifier string that uniquely identifies the EDU in the corpus. Because these are higher level annotations than in the Glozz layer we will use the ‘local’ identifier, which should be the same across stages

is_left_padding()

If this is a virtual EDU used in machine learning tasks

speaker()

the speaker associated with the turn surrounding an edu

subgrouping()

What abstract subgrouping the EDU is in (here: turn stars)

Returns:subgrouping
Return type:string
text()

The text for just this EDU

educe.stac.fusion.ROOT = 'ROOT'

distinguished fake EDU id for machine learning applications

educe.stac.fusion.fuse_edus(discourse_doc, unit_doc, postags)

Return a copy of the discourse level doc, merging info from both the discourse and units stage.

All EDUs will be converted to higher level EDUs.

Notes

  • The discourse stage is primary in that we work by going over what EDUs we find in the discourse stage and trying to enhance them with information we find on their units-level equivalents. Sometimes (rarely but it happens) annotations can go out of synch. EDUs missing on the units stage will be silently ignored (we try to make do without them). EDUs that were introduced on the units stage but not percolated to discourse will also be ignored.
  • We rely on annotation ids to match EDUs from both stages; it’s up to you to ensure that the annotations are really in synch.
  • This does not constitute a full merge of the documents. For a full merge, you would have to bring over other annotations such as Resources, Preference, Anaphor, Several_resources, taking care all the while to ensure there are no timestamp clashes with pre-existing annotations (it’s unlikely but best be on the safe side if you ever find yourself with automatically generated annotations, where all bets are off time-stamp wise).
Parameters:
  • discourse_doc (GlozzDocument) – Document from the “discourse” stage.
  • unit_doc (GlozzDocument) – Document from the “units” stage.
  • postags (list of Token) – Sequence of educe tokens predicted by the POS tagger for this document.
Returns:

doc – Deep copy of the discourse_doc with info from the units stage merged in.

Return type:

GlozzDocument

educe.stac.graph module

STAC-specific conventions related to graphs.

class educe.stac.graph.DotGraph(anno_graph)

Bases: educe.graph.DotGraph

A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here

class educe.stac.graph.EnclosureDotGraph(core)

Bases: educe.graph.EnclosureDotGraph

Conventions for visualising STAC enclosure graphs

class educe.stac.graph.EnclosureGraph(doc, postags=None)

Bases: educe.graph.EnclosureGraph

An enclosure graph based on STAC conventions

class educe.stac.graph.Graph

Bases: educe.graph.Graph

cdu_head(cdu, sloppy=False)

Get the head DU of a CDU.

The head of a CDU is defined here as the only DU that is not pointed to by any other member of this CDU.

This is meant to approximate the description in (Muller 2012) (/Constrained decoding for text-level discourse parsing/):

  1. the highest DU in its subgraph in terms of subordinate relations,
  2. in case of a tie in #1, the leftmost in terms of coordinate relations.

Corner cases:

  • Return None if the CDU has no members (annotation error)
  • If the CDU contains more than one head (annotation error) and if sloppy is True, return the textually leftmost one; otherwise, raise a MultiheadedCduException
Parameters:
  • cdu (CDU) – The CDU under examination.
  • sloppy (boolean, defaults to False) – If True, return the textually leftmost DU if the CDU contains more than one head ; if False, raise a MultiheadedCduException in such cases.
Returns:

cand – The head DU of this CDU ; it is None if no member of the CDU qualifies as a head (loop?).

Return type:

Unit or Schema? or None
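
The definition and corner cases above can be sketched on plain names and edge pairs. This is a simplified illustration rather than the method itself: find_head is a hypothetical helper, members are assumed to be listed in textual order, and ValueError stands in for MultiheadedCduException.

```python
def find_head(members, edges, sloppy=False):
    """Return the member not pointed to by any other member.

    `members` lists DU names in textual order; `edges` is a set of
    (source, target) pairs.
    """
    member_set = set(members)
    pointed_to = {tgt for src, tgt in edges
                  if src in member_set and tgt in member_set and src != tgt}
    candidates = [m for m in members if m not in pointed_to]
    if not candidates:
        return None  # empty CDU, or every member is pointed to (loop)
    if len(candidates) == 1 or sloppy:
        return candidates[0]  # textually leftmost candidate
    raise ValueError("CDU has more than one head")


head = find_head(['a', 'b', 'c'], {('a', 'b'), ('b', 'c')})
```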

first_outermost_dus()

Return discourse units in this graph, ordered by their starting point, and in case of a tie their inverse width (ie. widest first)

classmethod from_doc(corpus, doc_key, pred=<function <lambda>>)
is_cdu(x)
is_edu(x)
is_relation(x)
recursive_cdu_heads(sloppy=False)

A dictionary mapping each CDU to its recursive CDU head (see cdu_head)

sorted_first_outermost(annos)

Order nodes by their starting point, then inverse width.

Given a list of nodes, return the nodes ordered by their starting point, and in case of a tie their inverse width (ie. widest first).

strip_cdus(sloppy=False, mode='head')

Delete all CDUs in this graph.

Links involving a CDU will point to/from the elements of this CDU. Non-head modes may add new edges to the graph.

Parameters:
  • sloppy (boolean, default=False) – See cdu_head.
  • mode (string, default='head') – Strategy for replacing edges involving CDUs. head will relocate the edge on the recursive head of the CDU (see recursive_cdu_heads). broadcast will distribute the edge over all EDUs belonging to the CDU. A copy of the edge will be created for each of them. If the edge’s source and target are both distributed, a new copy will be created for each combination of EDUs. custom (or any other string) will distribute or relocate on the head depending on the relation label.
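
A sketch of the broadcast strategy on plain names, assuming CDU membership has already been flattened down to EDUs (no nested CDUs). The helper broadcast_edge and its arguments are hypothetical.

```python
from itertools import product


def broadcast_edge(edge, cdu_members):
    """Distribute an edge over the EDUs of any CDU endpoint.

    `edge` is a (source, target) pair; `cdu_members` maps a CDU name to
    the list of its EDU names. Endpoints that are not CDUs are kept.
    """
    source, target = edge
    sources = cdu_members.get(source, [source])
    targets = cdu_members.get(target, [target])
    # One copy of the edge per combination of source and target EDUs.
    return list(product(sources, targets))


one_side = broadcast_edge(('x', 'c'), {'x': ['a', 'b']})
both_sides = broadcast_edge(('x', 'y'), {'x': ['a', 'b'], 'y': ['c', 'd']})
```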
without_cdus(sloppy=False, mode='head')

Return a deep copy of this graph with all CDUs removed. Links involving these CDUs will point instead from/to their deep heads.

We’ll probably deprecate this function, since you could just as easily call deepcopy yourself

exception educe.stac.graph.MultiheadedCduException(cdu, *args, **kw)

Bases: exceptions.Exception

class educe.stac.graph.WrappedToken(token)

Bases: educe.annotation.Annotation

Thin wrapper around POS tagged token which adds a local_id field for use by the EnclosureGraph mechanism

educe.stac.postag module

STAC conventions for running a pos tagger, saving the results, and reading them.

educe.stac.postag.extract_turns(doc)

Return a string representation of the document’s turn text for use by a tagger

educe.stac.postag.read_tags(corpus, root_dir)

Read stored POS tagger output from a directory, and convert them to educe.annotation.Standoff objects.

Return a dictionary mapping ‘FileId’s to sets of tokens.

Parameters:
  • corpus (dict(FileId, GlozzDocument)) – Dictionary of documents keyed by their FileId.
  • root_dir (str) – Path to the directory containing the output of the POS tagger, one file per document.
Returns:

pos_tags – Map from each document id to the list of tokens predicted by a POS tagger.

Return type:

dict(FileId, list(Token))

educe.stac.postag.run_tagger(corpus, outdir, tagger_jar)

Run the ark-tweet-tagger on all the (unannotated) documents in the corpus and save the results in the specified directory

educe.stac.postag.sorted_by_span(annos)

Annotations sorted by text span

educe.stac.postag.tagger_cmd(tagger_jar, txt_file)

Command to run the POS tagger

educe.stac.postag.tagger_file_name(doc_key, root)

Get the file path to the output of the POS tagger for a document.

The returned file path is a .conll file within the given directory.

Parameters:
  • doc_key (educe.corpus.FileId) – FileId of the document
  • root (string) – Path to the folder containing annotations for this corpus, including the output of the POS tagger.
Returns:

res – Path to the .conll file output by the POS tagger.

Return type:

string

educe.stac.rfc module

Right frontier constraint and its variants

class educe.stac.rfc.BasicRfc(graph)

Bases: object

The vanilla right frontier constraint

1. X is textually last => RF(X)

2. Y
   | (sub)
   v
   X

   RF(Y) => RF(X)

3. X: +----+
      | Y  |
      +----+

   RF(Y) => RF(X)
frontier()

Return the list of nodes on the right frontier of the whole graph

violations()

Return a list of relation instance names, corresponding to the RF violations for the given graph.

You’ll need a stac graph object to interpret these names with.

Return type:[string]
class educe.stac.rfc.ThreadedRfc(graph)

Bases: educe.stac.rfc.BasicRfc

Same as BasicRfc except for point 1:

  1. X is the textual last utterance of any speaker => RF(X)
educe.stac.rfc.powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)
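
The output shown matches the powerset recipe from the Python itertools documentation. A sketch of that recipe follows; it is not necessarily the exact code used in educe.

```python
from itertools import chain, combinations


def powerset(iterable):
    """Yield every subset of the input, from smallest to largest."""
    items = list(iterable)
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1))


subsets = list(powerset([1, 2, 3]))
```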
educe.stac.rfc.speakers(contexts, anno)

Returns the speakers for the given annotation unit

Takes : contexts (Context dict), Annotation