educe.stac package¶
Conventions specific to the STAC project
This includes things like
- corpus layout (see corpus_files)
- which annotations are of interest
- renaming/deleting/collapsing annotation labels
Subpackages¶
Submodules¶
educe.stac.annotation module¶
STAC annotation conventions (re-exported in educe.stac)
STAC/Glozz annotations can be a bit confusing for two reasons: first, Glozz objects are used to annotate very different things; second, annotations are done in different stages.
Stage 1 (units)
Glozz | Uses |
---|---|
units | doc structure, EDUs, resources, preferences |
relations | coreference |
schemas | composite resources |
Stage 2 (discourse)
Glozz | Uses |
---|---|
units | doc structure, EDUs |
relations | relation instances, coreference |
schemas | CDUs |
Units
There is a typology of unit types worth noting:
- doc structure : type eg. Dialogue, Turn, paragraph
- resources : subspans of segments (type Resource)
- preferences : subspans of segments (type Preference)
- EDUs : spans of text associated with a dialogue act (eg. type Offer, Accept) (during discourse stage, these are just type Segment)
Relations
- coreference : (type Anaphora)
- relation instances : links between EDUs, annotated with a relation label (eg. type Elaboration, type Contrast, etc). These can be further divided into subordinating or coordinating relation instances according to their label
Schemas
- composite resources : boolean combinations of resources (eg. “sheep or ore”)
- CDUs: type Complex_discourse_unit (discourse stage)
-
class
educe.stac.annotation.
PartialUnit
¶ Bases:
educe.stac.annotation.PartialUnit
Partially instantiated unit, for use when you want to programmatically insert annotations into a document
A partially instantiated unit does not have any metadata (creation date, etc), as these will be derived automatically
-
educe.stac.annotation.
RENAMES
= {'Strategic_comment': 'Other', 'Segment': 'Other'}¶ Dialogue acts that should be treated as a different one
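A minimal sketch of how such a rename table can be applied when normalising labels. Only the RENAMES mapping is taken from the entry above; the helper normalise_labels is a hypothetical name introduced for illustration and is not part of educe.

```python
# Mapping copied from the RENAMES entry above.
RENAMES = {'Strategic_comment': 'Other', 'Segment': 'Other'}


def normalise_labels(labels):
    """Return the labels as a frozenset, collapsing renamed labels.

    Hypothetical helper (not part of educe): labels missing from
    RENAMES are kept unchanged.
    """
    return frozenset(RENAMES.get(label, label) for label in labels)


acts = normalise_labels(['Offer', 'Strategic_comment'])
```

Because the result is a set, two labels that collapse to the same target (here Strategic_comment and Segment) yield a single entry.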
-
class
educe.stac.annotation.
TurnId
¶ Bases:
tuple
Turn identifier akin to a Gorn address.
A Gorn address is a tuple of integers.
-
classmethod
from_string
(tid_str)¶ Create a TurnId from a string.
ex: (21.0.1)
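A minimal sketch of a Gorn-address-like identifier, assuming the string form is a dot-separated sequence of integers such as 21.0.1. The class name GornId is hypothetical; the actual TurnId implementation may differ.

```python
class GornId(tuple):
    """Hypothetical stand-in for a tuple-based turn identifier."""

    @classmethod
    def from_string(cls, tid_str):
        """Parse a dotted string such as '21.0.1' into a tuple of ints."""
        return cls(int(part) for part in tid_str.split('.'))

    def __str__(self):
        # Render back to the dotted form.
        return '.'.join(str(part) for part in self)


tid = GornId.from_string('21.0.1')
```

Since the identifier is a plain tuple of integers, identifiers compare lexicographically, so 21.0.1 sorts before 21.1.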
-
educe.stac.annotation.
addressees
(anno)¶ The set of people spoken to during an EDU annotation
Annotation -> Set String
Note: this returns None if the value is the default ‘Please choose...’; but otherwise, it preserves values like ‘All’ or ‘?’.
-
educe.stac.annotation.
cleanup_comments
(anno)¶ Strip out default comment text from features. This placeholder text was inserted as a UI aid during editing in Glozz, but isn’t actually the comment itself
-
educe.stac.annotation.
create_units
(_, doc, author, partial_units)¶ Create a collection of units from their partial specification.
Parameters: - _ (anything) – Anonymous parameter whose value is ignored. It was apparently supposed to contain a FileId. I suppose the intention was to follow a signature similar to other functions.
- doc (Document) – Containing document.
- author (string) – Author for the new units.
- partial_units (iterable of PartialUnit) – Partial specification of the new units.
Returns: res – Collection of instantiated new unit objects.
Return type: list of Unit
Notes
As of 2016-05-11, this function does not seem to be used anymore in the codebase. It used to be called in irit-stac/segmentation/glozz-segment, which was deleted 2015-06-08 (commit e2373c03) because it was not used.
-
educe.stac.annotation.
dialogue_act
(anno)¶ Set of dialogue act (aka speech act) annotations for a Unit, taking into consideration STAC conventions like collapsing Strategic_comment into Other
By rights, this should be a singleton set, but some units used to carry more than one dialogue act, which is presumably something to phase out.
-
educe.stac.annotation.
game_turns
(doc, turns, gen=2)¶ Group a sequence of turns into a sequence of game turns.
A game turn corresponds to the sequence of events (turns) that happen within a player’s turn (in the SOC game).
Parameters: - doc (Document) – Containing document.
- turns (list of educe.stac.Unit) – Events (of type Turn) from the game: server messages, player messages.
Returns: gturn_beg – Index of the first Turn of each game turn.
Return type: list of int
-
educe.stac.annotation.
is_cdu
(annotation)¶ See CDUs typology above
-
educe.stac.annotation.
is_coordinating
(annotation)¶ See Relation typology above
-
educe.stac.annotation.
is_dialogue
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
is_dialogue_act
(annotation)¶ Deprecated in favour of is_edu
-
educe.stac.annotation.
is_edu
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
is_paragraph
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
is_preference
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
is_relation_instance
(annotation)¶ See Relation typology above
-
educe.stac.annotation.
is_resource
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
is_structure
(annotation)¶ True if the annotation is one of the document-structure annotations, ie. something an annotator is expected not to edit, create or delete
-
educe.stac.annotation.
is_subordinating
(annotation)¶ See Relation typology above
-
educe.stac.annotation.
is_turn
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
is_turn_star
(annotation)¶ See Unit typology above
-
educe.stac.annotation.
relation_labels
(anno)¶ Set of relation labels (eg. Elaboration, Explanation), taking into consideration any applicable STAC-isms
-
educe.stac.annotation.
set_addressees
(anno, addr)¶ Set the addressee list for an annotation. If the value None is provided, the addressee list is deleted (if present)
(Iterable String, Annotation) -> IO ()
-
educe.stac.annotation.
speaker
(anno)¶ Return the speaker associated with a turn annotation. NB: crashes if there is none
-
educe.stac.annotation.
split_turn_text
(text)¶ STAC turn texts are prefixed with a turn number and speaker to help the annotators (eg. “379: Bob: I think it’s your go, Alice”).
Given the text for a turn, split the string into a prefix containing this turn/speaker information (eg. “379: Bob: ”), and a body containing the turn text itself (eg. “I think it’s your go, Alice”).
Mind your offsets! They’re based on the whole turn string.
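A minimal sketch of the prefix/body split, assuming the prefix always has the shape "turn number, colon, speaker, colon" as in the example above. The function split_prefix and its regular expression are illustrative assumptions, not the educe implementation.

```python
import re

# Assumed prefix shape: "<turn number>: <speaker>: "
_PREFIX_RE = re.compile(r'^(\d+(?:\.\d+)*\s*:\s*[^:]*?\s*:\s*)(.*)$',
                        re.DOTALL)


def split_prefix(text):
    """Split a turn text into (prefix, body).

    Hypothetical helper: returns an empty prefix if the text does not
    start with a turn number and speaker.
    """
    match = _PREFIX_RE.match(text)
    if match is None:
        return '', text
    return match.group(1), match.group(2)


text = "379: Bob: I think it's your go, Alice"
prefix, body = split_prefix(text)
# Offsets are relative to the whole turn string, so the body
# starts at len(prefix), not at 0.
body_start = len(prefix)
```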
-
educe.stac.annotation.
split_type
(anno)¶ An object’s type as a (frozen)set of items. You’re probably looking for educe.stac.dialogue_act instead.
-
educe.stac.annotation.
turn_id
(anno)¶ Get the turn identifier for a turn annotation (or None).
Parameters: anno (Annotation) – Annotation Returns: turn_id – Turn identifier ; None if the annotation has no feature ‘Identifier’. Return type: tuple(int) or None
-
educe.stac.annotation.
twin
(corpus, anno, stage='units')¶ Given an annotation in a corpus, retrieve the equivalent annotation (by local identifier) from a different stage of the corpus. Return this “twin” annotation or None if it is not found
Note that the annotation’s origin must be set
The typical use of this would be if you have an EDU in the ‘discourse’ stage and need to get its ‘units’ stage equivalent to have its dialogue act.
Parameters: twin_doc – unit-level document to fish twin from (None if you want educe to search for it in the corpus; NB: corpus can be None if you supply this)
-
educe.stac.annotation.
twin_from
(doc, anno)¶ Given a document and an annotation, return the first annotation in the document with a matching local identifier.
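A minimal sketch of a lookup by local identifier. The Anno tuple and the function first_with_local_id are hypothetical stand-ins; real educe annotations and documents expose richer interfaces.

```python
from collections import namedtuple

# Hypothetical stand-in for an annotation: a local id and a type.
Anno = namedtuple('Anno', ['local_id', 'type'])


def first_with_local_id(annos, local_id):
    """Return the first annotation with the given local id, else None."""
    for anno in annos:
        if anno.local_id == local_id:
            return anno
    return None


# The same EDU as seen in two stages: plain Segment in 'discourse',
# dialogue act type in 'units'.
discourse_edu = Anno('stac_1', 'Segment')
units_annos = [Anno('stac_0', 'Turn'), Anno('stac_1', 'Offer')]
twin_anno = first_with_local_id(units_annos, discourse_edu.local_id)
```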
educe.stac.context module¶
The dialogue and turn surrounding an EDU along with some convenient information about it
-
class
educe.stac.context.
Context
(turn, tstar, turn_edus, dialogue, dialogue_turns, doc_turns, tokens=None)¶ Bases:
object
Representation of the surrounding context for an EDU, basically the relevant enclosing annotations: turns, dialogues. The idea is to potentially extend this to a somewhat richer notion of context, including things like a sentence count, etc.
Parameters: - turn – the turn surrounding this EDU
- tstar – the tstar turn surrounding this EDU (a tstar turn is a sort of virtual turn made by merging consecutive turns in a dialogue that have the same speaker)
- turn_edus – the EDUs in this turn
- dialogue – the dialogue surrounding this EDU
- dialogue_turns – all the turns in the dialogue surrounding this EDU (non-empty, sorted by first-widest span)
- doc_turns – all the turns in the document
- tokens – (may not be present): tokens contained within this EDU
-
classmethod
for_edus
(doc, postags=None)¶ Get a dictionary of context objects for each EDU in the doc.
Returns: contexts – A dictionary with a context for each EDU in the document. Return type: dict(educe.glozz.Unit, Context)
-
speaker
()¶ the speaker associated with the turn surrounding an edu
-
educe.stac.context.
containing
(span, annos)¶ Given an iterable of standoff, pick just those that enclose/contain the given span (ie. are bigger and around)
-
educe.stac.context.
edus_in_span
(doc, span)¶ Given a document and a text span, return the EDUs the document contains in that span
-
educe.stac.context.
enclosed
(span, annos)¶ Given an iterable of standoff, pick just those that are enclosed by the given span (ie. are smaller and within)
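A minimal sketch of the two span filters (containing and enclosed) on bare (start, end) spans. The Span tuple and the helper encloses are assumptions for illustration. Whether identical spans count as enclosing each other is also an assumption here (they do in this sketch).

```python
from collections import namedtuple

# Hypothetical stand-in for a text span.
Span = namedtuple('Span', ['start', 'end'])


def encloses(outer, inner):
    """True if outer starts no later and ends no earlier than inner."""
    return outer.start <= inner.start and inner.end <= outer.end


def containing(span, spans):
    """Spans that are around the given span."""
    return [s for s in spans if encloses(s, span)]


def enclosed(span, spans):
    """Spans that are within the given span."""
    return [s for s in spans if encloses(span, s)]


dialogue, turn, edu = Span(0, 100), Span(10, 40), Span(15, 25)
around_edu = containing(edu, [dialogue, turn, Span(50, 60)])
within_turn = enclosed(turn, [dialogue, edu, Span(50, 60)])
```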
-
educe.stac.context.
merge_turn_stars
(doc)¶ Return a copy of the document in which consecutive turns by the same speaker have been merged.
Merging is done by taking the first turn in each group of consecutive turns by the same speaker, and stretching its span over all the subsequent turns in that group.
Additionally, turn prefix text (containing turn numbers and speakers) from the removed turns is stripped out.
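A minimal sketch of the grouping step only, on turns represented as (speaker, start, end) tuples in text order. The function merge_consecutive is a hypothetical illustration. It does not copy a document or strip turn prefixes.

```python
from itertools import groupby


def merge_consecutive(turns):
    """Merge runs of consecutive turns that share a speaker.

    turns: list of (speaker, start, end) in text order.
    Each run becomes one tuple spanning from the start of its first
    turn to the end of its last turn.
    """
    merged = []
    for speaker, group in groupby(turns, key=lambda t: t[0]):
        group = list(group)
        merged.append((speaker, group[0][1], group[-1][2]))
    return merged


turns = [('Bob', 0, 10), ('Bob', 11, 20), ('Alice', 21, 30), ('Bob', 31, 40)]
merged = merge_consecutive(turns)
```

Only adjacent turns are merged: the last Bob turn stays separate because an Alice turn intervenes.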
-
educe.stac.context.
sorted_first_widest
(nodes)¶ Given a list of nodes, return the nodes ordered by their starting point, and in case of a tie their inverse width (ie. widest first).
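A minimal sketch of this ordering on bare (start, end) pairs rather than educe nodes. The function name sorted_first_widest_spans is hypothetical.

```python
def sorted_first_widest_spans(spans):
    """Sort (start, end) pairs by start, then widest first on ties."""
    return sorted(spans, key=lambda s: (s[0], -(s[1] - s[0])))


spans = [(5, 8), (0, 3), (0, 10), (5, 20)]
ordered = sorted_first_widest_spans(spans)
```

With this ordering, an enclosing annotation always comes before the annotations it contains when they share a starting point.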
-
educe.stac.context.
speakers
(contexts, anno)¶ Return a list of speakers of an EDU or CDU (in the textual order of the EDUs).
-
educe.stac.context.
turns_in_span
(doc, span)¶ Given a document and a text span, return the turns that the document contains in that span
educe.stac.corenlp module¶
STAC conventions for running the Stanford CoreNLP pipeline, saving the results, and reading them.
The most useful functions here are
- run_pipeline
- read_results
-
educe.stac.corenlp.
from_corenlp_output_filename
(f)¶ Return a tuple of FileId and turn id.
This relies entirely on the file naming convention we established when calling CoreNLP.
-
educe.stac.corenlp.
parsed_file_name
(k, dir_name)¶ Given an educe.corpus.FileId and directory, return the file path within that directory that corresponds to the corenlp output
-
educe.stac.corenlp.
read_corenlp_result
(doc, corenlp_doc, tid=None)¶ Read CoreNLP’s output for a document.
Parameters: - doc (educe Document (?)) – The original document (?)
- corenlp_doc (educe.external.stanford_xml_reader.PreprocessingSource) – Object that contains all annotations for the document
- tid (turn id) – Turn id (?)
Returns: corenlp_doc – A CoreNlpDocument containing all information.
Return type:
-
educe.stac.corenlp.
read_results
(corpus, dir_name)¶ Read stored parser output from a directory, and convert it to educe.annotation.Standoff objects.
Return a dictionary mapping ‘FileId’s to sets of tokens.
-
educe.stac.corenlp.
run_pipeline
(corpus, outdir, corenlp_dir, split=False)¶ Run the standard corenlp pipeline on all the (unannotated) documents in the corpus and save the results in the specified directory.
If split=True, we output one file per turn, an experimental mode to account for switching between multiple speakers. We don’t have all the infrastructure to read these back in (it should just be a matter of some filename manipulation though) and hope to flesh this out later. We also intend to tweak the notion of splitting by aggregating consecutive turns with the same speaker, which may somewhat mitigate the loss of coreference information.
-
educe.stac.corenlp.
turn_id_text
(doc)¶ Return a list of (turn ids, text) tuples in span order (no speaker)
educe.stac.corpus module¶
Corpus layout conventions (re-exported by educe.stac)
-
class
educe.stac.corpus.
LiveInputReader
(corpusdir)¶ Bases:
educe.stac.corpus.Reader
Reader for unannotated ‘live’ data that we want to parse.
The data is assumed to be in a directory with one aa/ac file pair.
There is no notion of subdocument (subdoc = None) and the stage is ‘unannotated’
-
files
(doc_glob=None)¶ Parameters: doc_glob (str, optional) – Glob expression for document (folder) names ; if None, it uses the wildcard ‘*’ for file basenames.
-
-
class
educe.stac.corpus.
Reader
(corpusdir)¶ Bases:
educe.corpus.Reader
See educe.corpus.Reader for details
-
files
(doc_glob=None)¶ Gather files for docs whose folder name matches doc_glob.
Parameters: doc_glob (str, optional) – Glob expression for document (folder) names ; if None, it uses the wildcard ‘*’ to match all strings.
-
slurp_subcorpus
(cfiles, verbose=False)¶
-
-
educe.stac.corpus.
id_to_path
(k)¶ Given a fleshed out FileId (none of the fields are None), return a filepath for it following STAC conventions.
You will likely want to add your own filename extensions to this path
-
educe.stac.corpus.
is_metal
(fileid)¶ If the annotator is one of the distinguished standard annotators
-
educe.stac.corpus.
twin_key
(key, stage)¶ Given an annotation key, return a copy shifted over to a different stage.
Note that when copying from unannotated to another stage, you will need to set the annotator
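A minimal sketch of shifting a key to another stage. The Key tuple is a hypothetical stand-in for educe.corpus.FileId, and its field names (doc, subdoc, stage, annotator) are assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a FileId; field names are assumed.
Key = namedtuple('Key', ['doc', 'subdoc', 'stage', 'annotator'])


def shifted_key(key, stage):
    """Return a copy of key that points at a different stage."""
    return key._replace(stage=stage)


unannotated = Key('pilot01', '01', 'unannotated', None)
units = shifted_key(unannotated, 'units')
# An unannotated key carries no annotator, so one has to be set by hand.
units = units._replace(annotator='GOLD')
```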
-
educe.stac.corpus.
write_annotation_file
(anno_filename, doc)¶ Write a GlozzDocument to XML in the given path
educe.stac.fake_graph module¶
Fake graphs for testing STAC algorithms
Specification for mini-language
Source string is parsed line by line; the data type depends on the first character. Uppercase letters are speakers, lowercase letters are units. EDU names are arranged following alphabetical order (this does NOT apply to CDUs). Please arrange the lines in this order:
- # : speaker line, eg. # Aabce Bdg Cfh
- any lowercase : CDU line (top-level last), eg. y(eg) x(wyz)
- S or C : relation line, eg. Sabd bf ceCh
- anything else : skip as comment
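A minimal sketch of how lines of this mini-language could be dispatched on their first character. All function names are hypothetical, and the reading of speaker tokens (a speaker followed by their EDUs) and CDU tokens (a name followed by its members in parentheses) is an assumption drawn from the examples above. Relation lines are only recognised, not parsed.

```python
import re


def classify_line(line):
    """Return 'speaker', 'cdu', 'relation' or 'comment' for a line."""
    line = line.strip()
    if not line:
        return 'comment'
    first = line[0]
    if first == '#':
        return 'speaker'
    if first.islower():
        return 'cdu'
    if first in 'SC':
        return 'relation'
    return 'comment'


def parse_speaker_line(line):
    """Map each speaker (uppercase) to their EDU names (lowercase)."""
    speakers = {}
    for token in line.lstrip('#').split():
        speakers[token[0]] = list(token[1:])
    return speakers


def parse_cdu_line(line):
    """Map each CDU name to the names of its members."""
    return {name: list(members)
            for name, members in re.findall(r'([a-z])\(([a-z]+)\)', line)}


speakers = parse_speaker_line('# Aabce Bdg Cfh')
cdus = parse_cdu_line('y(eg) x(wyz)')
```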
-
class
educe.stac.fake_graph.
LightGraph
(src)¶ Structure holding only relevant information
Unit keys (sortable, hashable) must correspond to reading order. CDUs can be placed in any position with respect to their components.
-
get_doc
()¶
-
get_edge
(source, target)¶ Return an educe.annotation.Relation for the given LightGraph names for source and target
-
get_node
(name)¶ Return an educe.annotation.Unit or Schema for the given LightGraph name
-
educe.stac.fusion module¶
Somewhat higher level representation of STAC documents than the usual Glozz layer.
Note that this is a relatively recent addition to Educe. Up to the time of this writing (2015-03), we had two options for dealing with STAC:
- manually manipulating glozz objects via educe.annotation
- dealing with some high-level but not particularly helpful hypergraph objects
We try to provide an intermediary in this layer by merging information from several layers in one place.
A typical example might be to print a listing of
(edu1_id, edu2_id, edu1_dialogue_act, edu2_dialogue_act, relation_label)
This has always been a bit awkward when dealing with Glozz, because there are separate annotations in different Glozz documents, the dialogue acts in the ‘units’ stage; and the linked units in the discourse stage. Combining these streams has always involved a certain amount of manual lookup, which we hope to avoid with this fusion layer.
At the time of this writing, this will have a bit of emphasis on feature extraction.
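A minimal sketch of the manual lookup this layer is meant to spare you, using plain dictionaries and tuples instead of Glozz documents. The data and the function listing are hypothetical; identifiers stand for local ids shared across stages.

```python
# Hypothetical 'units' stage information: dialogue act per EDU local id.
unit_acts = {'e1': 'Offer', 'e2': 'Accept'}

# Hypothetical 'discourse' stage information: (source, target, label).
relations = [('e1', 'e2', 'Elaboration'), ('e2', 'e3', 'Contrast')]


def listing(relations, unit_acts):
    """Join relation instances with the dialogue acts of their EDUs.

    EDUs with no counterpart in the units stage get None as their
    dialogue act.
    """
    return [(src, tgt, unit_acts.get(src), unit_acts.get(tgt), label)
            for src, tgt, label in relations]


rows = listing(relations, unit_acts)
```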
-
class
educe.stac.fusion.
Dialogue
(anno, edus, relations)¶ Bases:
object
STAC Dialogue.
Note that input EDUs should be sorted by span.
-
edu_pairs
()¶ Generate all EDU pairs within this dialogue.
This includes pairs whose source is the left padding (fake root) EDU.
Yields: (source, target) (tuple of educe.stac.annotation.Unit) – Next candidate edge, as a pair of EDUs (source, target).
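A minimal sketch of candidate pair generation on EDU identifiers. It assumes that the fake root may only appear as a source and that both orders of each real pair are produced; the function candidate_pairs is a hypothetical illustration, not the educe method.

```python
ROOT = 'ROOT'


def candidate_pairs(edu_ids):
    """Yield (source, target) pairs, with ROOT as a possible source only."""
    targets = list(edu_ids)
    for source in [ROOT] + targets:
        for target in targets:
            if source != target:
                yield source, target


pairs = list(candidate_pairs(['a', 'b']))
```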
-
-
class
educe.stac.fusion.
EDU
(doc, discourse_anno, unit_anno)¶ Bases:
educe.annotation.Unit
STAC EDU
A STAC EDU merges information from the unit and discourse annotation stages so that you can ignore the distinction between the two annotation stages.
It also tries to be usable as a drop-in substitute for both annotations and contexts
-
dialogue_act
()¶ The (normalised) speech act associated with this EDU (None if unknown)
-
fleshout
(context)¶ second phase of EDU initialisation; fill out contextual info
-
identifier
()¶ Some kind of identifier string that uniquely identifies the EDU in the corpus. Because these are higher level annotations than in the Glozz layer we will use the ‘local’ identifier, which should be the same across stages
-
is_left_padding
()¶ If this is a virtual EDU used in machine learning tasks
-
speaker
()¶ the speaker associated with the turn surrounding an edu
-
subgrouping
()¶ What abstract subgrouping the EDU is in (here: turn stars)
Returns: subgrouping Return type: string
-
text
()¶ The text for just this EDU
-
-
educe.stac.fusion.
ROOT
= 'ROOT'¶ distinguished fake EDU id for machine learning applications
-
educe.stac.fusion.
fuse_edus
(discourse_doc, unit_doc, postags)¶ Return a copy of the discourse level doc, merging info from both the discourse and units stage.
All EDUs will be converted to higher level EDUs.
Notes
- The discourse stage is primary in that we work by going over what EDUs we find in the discourse stage and trying to enhance them with information we find on their units-level equivalents. Sometimes (rarely but it happens) annotations can go out of synch. EDUs missing on the units stage will be silently ignored (we try to make do without them). EDUs that were introduced on the units stage but not percolated to discourse will also be ignored.
- We rely on annotation ids to match EDUs from both stages; it’s up to you to ensure that the annotations are really in synch.
- This does not constitute a full merge of the documents. For a full merge, you would have to bring over other annotations such as Resources, Preference, Anaphor, Several_resources, taking care all the while to ensure there are no timestamp clashes with pre-existing annotations (it’s unlikely but best be on the safe side if you ever find yourself with automatically generated annotations, where all bets are off time-stamp wise).
Parameters: - discourse_doc (GlozzDocument) – Document from the “discourse” stage.
- unit_doc (GlozzDocument) – Document from the “units” stage.
- postags (list of Token) – Sequence of educe tokens predicted by the POS tagger for this document.
Returns: doc – Deep copy of the discourse_doc with info from the units stage merged in.
Return type:
educe.stac.graph module¶
STAC-specific conventions related to graphs.
-
class
educe.stac.graph.
DotGraph
(anno_graph)¶ Bases:
educe.graph.DotGraph
A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here
-
class
educe.stac.graph.
EnclosureDotGraph
(core)¶ Bases:
educe.graph.EnclosureDotGraph
Conventions for visualising STAC enclosure graphs
-
class
educe.stac.graph.
EnclosureGraph
(doc, postags=None)¶ Bases:
educe.graph.EnclosureGraph
An enclosure graph based on STAC conventions
-
class
educe.stac.graph.
Graph
¶ Bases:
educe.graph.Graph
-
cdu_head
(cdu, sloppy=False)¶ Get the head DU of a CDU.
The head of a CDU is defined here as the only DU that is not pointed to by any other member of this CDU.
This is meant to approximate the description in (Muller 2012) (/Constrained decoding for text-level discourse parsing/):
- the highest DU in its subgraph in terms of subordinate relations,
- in case of a tie on the first criterion, the leftmost in terms of coordinate relations.
Corner cases:
- Return None if the CDU has no members (annotation error)
- If the CDU contains more than one head (annotation error) and if sloppy is True, return the textually leftmost one; otherwise, raise a MultiheadedCduException
Parameters: - cdu (CDU) – The CDU under examination.
- sloppy (boolean, defaults to False) – If True, return the textually leftmost DU if the CDU contains more than one head ; if False, raise a MultiheadedCduException in such cases.
Returns: cand – The head DU of this CDU ; it is None if no member of the CDU qualifies as a head (loop?).
Return type: Unit or Schema? or None
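A minimal sketch of the head definition above on plain names, with the CDU members given in textual order and the relations as (source, target) pairs. The names find_head and MultipleHeadsError are hypothetical stand-ins for cdu_head and MultiheadedCduException.

```python
class MultipleHeadsError(Exception):
    """Hypothetical stand-in for MultiheadedCduException."""


def find_head(members, edges, sloppy=False):
    """Return the member that no other member points to.

    members: CDU members in textual order.
    edges: iterable of (source, target) pairs.
    """
    if not members:
        return None
    member_set = set(members)
    # Only edges internal to the CDU matter.
    pointed_to = {tgt for src, tgt in edges
                  if src in member_set and tgt in member_set and src != tgt}
    candidates = [m for m in members if m not in pointed_to]
    if not candidates:
        return None  # eg. the members form a loop
    if len(candidates) > 1 and not sloppy:
        raise MultipleHeadsError(candidates)
    return candidates[0]  # textually leftmost candidate


head = find_head(['a', 'b', 'c'], [('a', 'b'), ('b', 'c')])
```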
-
first_outermost_dus
()¶ Return discourse units in this graph, ordered by their starting point, and in case of a tie their inverse width (ie. widest first)
-
classmethod
from_doc
(corpus, doc_key, pred=<function <lambda>>)¶
-
is_cdu
(x)¶
-
is_edu
(x)¶
-
is_relation
(x)¶
-
recursive_cdu_heads
(sloppy=False)¶ A dictionary mapping each CDU to its recursive CDU head (see cdu_head)
-
sorted_first_outermost
(annos)¶ Order nodes by their starting point, then inverse width.
Given a list of nodes, return the nodes ordered by their starting point, and in case of a tie their inverse width (ie. widest first).
-
strip_cdus
(sloppy=False, mode='head')¶ Delete all CDUs in this graph.
Links involving a CDU will point to/from the elements of this CDU. Non-head modes may add new edges to the graph.
Parameters: - sloppy (boolean, default=False) – See cdu_head.
- mode (string, default='head') – Strategy for replacing edges involving CDUs. head will relocate the edge on the recursive head of the CDU (see recursive_cdu_heads). broadcast will distribute the edge over all EDUs belonging to the CDU. A copy of the edge will be created for each of them. If the edge’s source and target are both distributed, a new copy will be created for each combination of EDUs. custom (or any other string) will distribute or relocate on the head depending on the relation label.
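A minimal sketch of the head and broadcast strategies for a single edge, on plain names. The function relocate_edge and its arguments are hypothetical; CDU membership is assumed to be already flattened down to EDUs, and the custom mode is not covered.

```python
from itertools import product


def relocate_edge(edge, cdu_members, cdu_heads, mode='head'):
    """Replace CDU endpoints of an edge by EDUs.

    edge: (source, target) pair.
    cdu_members: maps each CDU to the EDUs it (recursively) contains.
    cdu_heads: maps each CDU to its recursive head.
    Returns the list of edges that replace the original one.
    """
    if mode not in ('head', 'broadcast'):
        raise ValueError('unsupported mode in this sketch: %s' % mode)

    def expand(node):
        if node not in cdu_members:
            return [node]  # plain EDU, left untouched
        if mode == 'head':
            return [cdu_heads[node]]
        return list(cdu_members[node])

    source, target = edge
    # One copy of the edge per combination of replacement endpoints.
    return list(product(expand(source), expand(target)))


members = {'X': ['a', 'b'], 'Y': ['c', 'd']}
heads = {'X': 'a', 'Y': 'c'}
on_heads = relocate_edge(('X', 'Y'), members, heads, mode='head')
spread = relocate_edge(('X', 'Y'), members, heads, mode='broadcast')
```

In head mode the number of edges is unchanged, whereas broadcast can multiply it, which is why non-head modes may add new edges to the graph.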
-
without_cdus
(sloppy=False, mode='head')¶ Return a deep copy of this graph with all CDUs removed. Links involving these CDUs will point instead from/to their deep heads
We’ll probably deprecate this function, since you could just as easily call deepcopy yourself
-
-
exception
educe.stac.graph.
MultiheadedCduException
(cdu, *args, **kw)¶ Bases:
exceptions.Exception
-
class
educe.stac.graph.
WrappedToken
(token)¶ Bases:
educe.annotation.Annotation
Thin wrapper around POS tagged token which adds a local_id field for use by the EnclosureGraph mechanism
educe.stac.postag module¶
STAC conventions for running a pos tagger, saving the results, and reading them.
-
educe.stac.postag.
extract_turns
(doc)¶ Return a string representation of the document’s turn text for use by a tagger
-
educe.stac.postag.
read_tags
(corpus, root_dir)¶ Read stored POS tagger output from a directory, and convert it to educe.annotation.Standoff objects.
Return a dictionary mapping ‘FileId’s to sets of tokens.
Parameters: - corpus (dict(FileId, GlozzDocument)) – Dictionary of documents keyed by their FileId.
- root_dir (str) – Path to the directory containing the output of the POS tagger, one file per document.
Returns: pos_tags – Map from each document id to the list of tokens predicted by a POS tagger.
Return type:
-
educe.stac.postag.
run_tagger
(corpus, outdir, tagger_jar)¶ Run the ark-tweet-tagger on all the (unannotated) documents in the corpus and save the results in the specified directory
-
educe.stac.postag.
sorted_by_span
(annos)¶ Annotations sorted by text span
-
educe.stac.postag.
tagger_cmd
(tagger_jar, txt_file)¶ Command to run the POS tagger
-
educe.stac.postag.
tagger_file_name
(doc_key, root)¶ Get the file path to the output of the POS tagger for a document.
The returned file path is a .conll file within the given directory.
Parameters: - doc_key (educe.corpus.FileId) – FileId of the document
- root (string) – Path to the folder containing annotations for this corpus, including the output of the POS tagger.
Returns: res – Path to the .conll file output by the POS tagger.
Return type: string
educe.stac.rfc module¶
Right frontier constraint and its variants
-
class
educe.stac.rfc.
BasicRfc
(graph)¶ Bases:
object
The vanilla right frontier constraint
1. X is textually last => RF(X)
2. If Y --(sub)--> X (Y points to X via a subordinating relation), then RF(Y) => RF(X)
3. If X is a complex unit that contains Y, then RF(Y) => RF(X)
-
frontier
()¶ Return the list of nodes on the right frontier of the whole graph
-
violations
()¶ Return a list of relation instance names, corresponding to the RF violations for the given graph.
You’ll need a stac graph object to interpret these names with.
Return type: [string]
-
-
class
educe.stac.rfc.
ThreadedRfc
(graph)¶ Bases:
educe.stac.rfc.BasicRfc
Same as BasicRfc except for point 1:
- X is the textual last utterance of any speaker => RF(X)
-
educe.stac.rfc.
powerset
([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)¶
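The signature line above matches the docstring of the powerset recipe from the Python itertools documentation. A minimal sketch of that recipe follows, on the assumption that educe uses something equivalent.

```python
from itertools import chain, combinations


def powerset(iterable):
    """Yield every subset of the input as a tuple, smallest first."""
    items = list(iterable)
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1))


subsets = list(powerset([1, 2, 3]))
```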
-
educe.stac.rfc.
speakers
(contexts, anno)¶ Return the speakers for the given annotation unit
Takes : contexts (Context dict), Annotation