educe.stac.util package

Submodules

educe.stac.util.annotate module

Readable text dumps of educe annotations.

The idea here is to dump the text to screen, and use some informal text markup to show annotations over the text. There’s a limit to how much we can display, but just breaking things up into paragraphs and [segments] seems to go a long way.

educe.stac.util.annotate.annotate(txt, annotations, inserts=None)

Decorate a text with arbitrary bracket symbols, as a visual guide to the annotations on that text. For example, in a chat corpus, you might use newlines to indicate turn boundaries and square brackets for segments.

FIXME: this needs to become a standard educe utility, maybe as part of the educe.annotation layer?

Parameters:
  • inserts – a dictionary from annotation type to a pair of its opening/closing brackets
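
The bracket decoration described above can be illustrated with a minimal sketch. It is not the educe implementation: the name annotate_sketch is hypothetical, and annotations are assumed to be plain (start, end, type) tuples rather than educe annotation objects.

```python
def annotate_sketch(txt, spans, inserts):
    """Insert bracket symbols around labelled character spans.

    Assumptions: `spans` is a list of (start, end, type) tuples and
    `inserts` maps a type to its (opening, closing) pair.
    """
    opening = {}
    closing = {}
    for start, end, kind in spans:
        if kind not in inserts:
            continue  # no bracket symbols configured for this type
        open_sym, close_sym = inserts[kind]
        opening.setdefault(start, []).append(open_sym)
        closing.setdefault(end, []).append(close_sym)
    pieces = []
    for idx in range(len(txt) + 1):
        # Close brackets before opening new ones at the same offset.
        pieces.extend(closing.get(idx, []))
        pieces.extend(opening.get(idx, []))
        if idx < len(txt):
            pieces.append(txt[idx])
    return "".join(pieces)


decorated = annotate_sketch(
    "hello world",
    [(0, 5, "Segment"), (6, 11, "Segment")],
    {"Segment": ("[", "]")},
)
```

Here each segment ends up enclosed in square brackets, while text outside any span is left untouched.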
educe.stac.util.annotate.annotate_doc(doc, span=None)

Pretty print an educe document and its annotations.

See the lower-level annotate for more details

educe.stac.util.annotate.reflow(text, width=40)

Wrap some text, at the same time ensuring that all original linebreaks are still in place
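
One way to get this behaviour, sketched under the assumption that the standard textwrap module is good enough (the name reflow_sketch is hypothetical and this may differ from what educe actually does), is to wrap each original line separately and rejoin the results:

```python
import textwrap


def reflow_sketch(text, width=40):
    """Wrap each original line on its own so existing linebreaks survive."""
    wrapped = []
    for line in text.split("\n"):
        # textwrap.wrap returns [] for an empty line; keep it as a blank line.
        wrapped.extend(textwrap.wrap(line, width) or [""])
    return "\n".join(wrapped)


result = reflow_sketch("aaa bbb ccc\n\nddd", width=7)
```

Wrapping the whole text in one call would instead treat the original linebreaks as ordinary whitespace and lose them.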

educe.stac.util.annotate.rough_type(anno)

Simplify STAC annotation types

educe.stac.util.annotate.schema_text(doc, anno)

(recursive) text preview of a schema and its contents. Members are enclosed in square brackets.

educe.stac.util.annotate.show_diff(doc_before, doc_after, span=None)

Display two educe documents (presumably two versions of the same document) side by side

educe.stac.util.args module

Command line options

educe.stac.util.args.add_commit_args(parser)

Augment a subcommand argparser with an option to emit a commit message for your version control tracking

educe.stac.util.args.add_usual_input_args(parser, doc_subdoc_required=False, help_suffix=None)

Augment a subcommand argparser with typical input arguments. Sometimes your subcommand may require slightly different input arguments, in which case, just don’t call this function.

Parameters:
  • parser (ArgumentParser) – Argument parser.
  • doc_subdoc_required (bool, defaults to False) – force user to supply --doc/--subdoc for this subcommand (note you’ll need to add stage/anno yourself)
  • help_suffix (string, defaults to None) – appended to --doc/--subdoc help strings
educe.stac.util.args.add_usual_output_args(parser, default_overwrite=False)

Augment a subcommand argparser with typical output arguments. Sometimes your subcommand may require slightly different output arguments, in which case, just don’t call this function.

educe.stac.util.args.anno_id(string)

Split AUTHOR_DATE string into tuple, complaining if we don’t have such a string. Used for argparse
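
A function of this kind is typically passed to argparse as a `type` callable. The sketch below shows the pattern; it assumes the date is an integer following the last underscore, and both the name anno_id_sketch and the --annotation flag are hypothetical.

```python
import argparse


def anno_id_sketch(string):
    """Split an AUTHOR_DATE string into an (author, date) tuple."""
    # Assumption: the date follows the last underscore and is an integer.
    author, sep, date = string.rpartition("_")
    try:
        if not (sep and author):
            raise ValueError("missing AUTHOR_ prefix")
        return author, int(date)
    except ValueError:
        raise argparse.ArgumentTypeError(
            "%r is not of the form AUTHOR_DATE" % string)


parser = argparse.ArgumentParser()
# Hypothetical flag name, for illustration only.
parser.add_argument("--annotation", type=anno_id_sketch)
args = parser.parse_args(["--annotation", "bob_1380000000000"])
```

Raising ArgumentTypeError lets argparse report the malformed value as a normal usage error.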

educe.stac.util.args.announce_output_dir(output_dir)

Tell the user where we saved the output

educe.stac.util.args.check_easy_settings(args)

Modify args to reflect user-friendly defaults.

Terminates the program if args.corpus is set but does not point to an existing folder; otherwise args.doc must be set and everything else is expected to be empty.

Notes

All callers for this function are in the scripts folder of the educe repository: scripts/stac-{util,edit,oneoff}.

Parameters:
  • args (Namespace) – Arguments of the argparser.

See also

educe.stac.sanity.main.easy_settings()

educe.stac.util.args.comma_span(string)

Split a comma delimited pair of integers into an educe span
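
As a rough illustration of the parsing step only, the sketch below returns a plain (start, end) tuple as a stand-in for an educe span; the name comma_span_sketch is hypothetical.

```python
import argparse


def comma_span_sketch(string):
    """Parse 'START,END' into a (start, end) tuple of integers."""
    parts = string.split(",")
    try:
        if len(parts) != 2:
            raise ValueError("expected exactly two fields")
        return int(parts[0]), int(parts[1])
    except ValueError:
        raise argparse.ArgumentTypeError(
            "%r is not of the form START,END" % string)


span = comma_span_sketch("347,363")
```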

educe.stac.util.args.get_output_dir(args, default_overwrite=False)

Return the output dir specified or inferred from command line args.

We try the following in order:

  1. If --output is given explicitly, we’ll just use/create that
  2. If default_overwrite is True, or the user specifies --overwrite on the command line (provided the command supports it), the output directory may well be the original corpus dir (gulp! Better use version control!)
  3. OK just make a temporary directory. Later on, you’ll probably want to call announce_output_dir.
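
The fallback order above can be sketched as follows. The attribute names args.output, args.overwrite and args.corpus are assumptions about the parsed namespace, and get_output_dir_sketch is a hypothetical name, not the educe function.

```python
import argparse
import os
import tempfile


def get_output_dir_sketch(args, default_overwrite=False):
    """Pick an output directory using the three-step fallback."""
    # 1. An explicit output directory wins; create it if needed.
    if args.output:
        os.makedirs(args.output, exist_ok=True)
        return args.output
    # 2. Overwrite mode: write back into the original corpus directory.
    if default_overwrite or getattr(args, "overwrite", False):
        return args.corpus
    # 3. Otherwise fall back to a fresh temporary directory.
    return tempfile.mkdtemp()


args = argparse.Namespace(output=None, overwrite=False, corpus="corpus-dir")
tmp_dir = get_output_dir_sketch(args)
```

Using getattr for the overwrite flag reflects the note that not every command supports it.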
educe.stac.util.args.read_corpus(args, preselected=None, verbose=True)

Read the section of the corpus specified in the command line arguments.

educe.stac.util.args.read_corpus_with_unannotated(args, verbose=True)

Read the section of the corpus specified in the command line arguments.

educe.stac.util.csv module

educe.stac.util.doc module

Utilities for large-scale changes to educe documents, for example, moving a chunk of text from one document to another

exception educe.stac.util.doc.StacDocException(msg)

Bases: exceptions.Exception

An exception that arises from trying to manipulate a stac document (typically moving things around, etc)

educe.stac.util.doc.compute_renames(avoid, incoming)

Given two sets of documents (i.e. corpora), return a dictionary which would allow us to rename ids in incoming so that they do not overlap with those in avoid.

Return type: author -> date -> date
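
To make the shape of the result concrete, here is a minimal sketch of the idea. It assumes ids are plain (author, date) pairs with integer dates, and that a clashing id gets the next unused date for that author; the real function works on corpora and may choose new dates differently. The name compute_renames_sketch is hypothetical.

```python
def compute_renames_sketch(avoid_ids, incoming_ids):
    """Return a nested dict author -> old date -> new date."""
    avoid_by_author = {}
    for author, date in avoid_ids:
        avoid_by_author.setdefault(author, set()).add(date)
    # Dates already in use per author, on either side.
    used = {author: set(dates) for author, dates in avoid_by_author.items()}
    for author, date in incoming_ids:
        used.setdefault(author, set()).add(date)
    renames = {}
    for author, date in sorted(set(incoming_ids)):
        if date in avoid_by_author.get(author, ()):
            # Clash: pick a date that neither side uses yet.
            fresh = max(used[author]) + 1
            used[author].add(fresh)
            renames.setdefault(author, {})[date] = fresh
    return renames


renames = compute_renames_sketch(
    avoid_ids=[("bob", 1), ("bob", 2), ("amy", 5)],
    incoming_ids=[("bob", 2), ("bob", 3), ("amy", 6)],
)
```

Only the clashing id appears in the result; ids that do not overlap are left alone.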

educe.stac.util.doc.evil_set_id(anno, author, date)

This is a bit evil as it’s using undocumented functionality from the educe.annotation.Standoff object

educe.stac.util.doc.evil_set_text(doc, text)

This is a bit evil as it’s using undocumented functionality from the educe.annotation.Document object

educe.stac.util.doc.move_portion(renames, src_doc, tgt_doc, src_split, tgt_split=-1)

Move part of the source document into the target document.

This returns an updated copy of both the source and target documents.

This can capture a couple of patterns:

  • reshuffling the boundary between the target and source document (tgt | src1 src2 ==> tgt src1 | src2; tgt_split=-1)
  • prepending the source document to the target (src | tgt ==> src tgt; src_split=-1; tgt_split=0)
  • inserting the whole source document into the other (tgt1 tgt2 + src ==> tgt1 src tgt2; src_split=-1)

There’s a bit of potential trickiness here:

  • we’d like to preserve the property that text has a single starting and ending space (no real reason, it just seems safer that way)
  • if we’re splicing documents together particularly at their respective ends, there’s a strong off-by-one risk because some annotations span the whole text (whitespace and all), particularly dialogues
Parameters:
  • renames (TODO) – TODO
  • src_doc (Document) – Source document
  • tgt_doc (Document) – Target document
  • src_split (int) – Split point for src_doc.
  • tgt_split (int, defaults to -1) – Split point for tgt_doc.
Returns:

  • new_src_doc (Document) – TODO
  • new_tgt_doc (Document) – TODO

educe.stac.util.doc.narrow_to_span(doc, span)

Return a deep copy of a document with only the text and annotations that are within the given span.

educe.stac.util.doc.rename_ids(renames, doc)

Return a deep copy of a document, with ids reassigned according to the renames dictionary

educe.stac.util.doc.retarget(doc, old_id, new_anno)

Replace all links to the old (unit-level) annotation with links to the new one.

We refer to the old annotation by id, but the new annotation must be passed in as an object. It must also be either an EDU or a CDU.

Return True if we replaced anything

educe.stac.util.doc.shift_annotations(doc, offset, point=None)

Return a deep copy of a document such that all annotations have been shifted by an offset.

If shifting right, we pad the document with whitespace to act as filler. If shifting left, we cut the text.

If a shift point is specified and the offset is positive, we only shift annotations that are to the right of the point. Likewise if the offset is negative, we only shift those that are to the left of the point.

educe.stac.util.doc.split_doc(doc, middle)

Given a split point, break a document into two pieces.

If the split point is None, we take the whole document (this is slightly different from having -1 as a split point)

Raise an exception if there are any annotations that span the point.

Parameters:
  • doc (Document) – The document we want to split.
  • middle (int) – Split point.
Returns:

  • doc_prefix (Document) – Deep copy of doc restricted to span [:middle]
  • doc_suffix (Document) – Deep copy of doc restricted to span [middle:]; the span of each annotation is shifted to match the new text.
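
The splitting logic can be illustrated on a simplified model where a document is just a text plus a list of (start, end) spans. This is a sketch under that assumption, not the educe code; split_sketch is a hypothetical name and ValueError stands in for whatever exception the library raises.

```python
def split_sketch(text, spans, middle):
    """Split (text, spans) at `middle` into a prefix and a suffix."""
    for start, end in spans:
        if start < middle < end:
            raise ValueError(
                "span (%d, %d) straddles split point %d" % (start, end, middle))
    prefix_spans = [(s, e) for s, e in spans if e <= middle]
    # Suffix spans are shifted left so they index into the new text.
    suffix_spans = [(s - middle, e - middle) for s, e in spans if e > middle]
    return (text[:middle], prefix_spans), (text[middle:], suffix_spans)


prefix, suffix = split_sketch("ab cd", [(0, 2), (3, 5)], middle=2)
```

After the split, every span still covers the same characters in its own half.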

educe.stac.util.doc.strip_fixme(act)

Remove the fixme string from a dialogue act annotation. These are automatically inserted when there is an annotation to review. We shouldn’t see them for any use cases like feature extraction though.

See educe.stac.dialogue_act which returns the set of dialogue acts for each annotation (by rights should be singleton set, but there used to be more than one, something we want to phase out?)

educe.stac.util.doc.unannotated_key(key)

Given a corpus key, return a copy of the equivalent key in the unannotated portion of the corpus (the parser outputs objects that are based in unannotated)

educe.stac.util.glozz module

STAC Glozz conventions

class educe.stac.util.glozz.PseudoTimestamper

Bases: object

Generator for the fake timestamps used as Glozz IDs

next()

Fresh timestamp

class educe.stac.util.glozz.TimestampCache

Bases: object

Generates and stores a unique timestamp entry for each key. You can use any hashable key, for example, a span, or a turn id.

get(tid)

Return a timestamp for this turn id, either generating and caching (if unseen) or fetching from the cache

reset()

Empty the cache (but maintain the timestamper state, so that different documents get different timestamps; the difference in timestamps is not mission-critical but potentially nice)
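
The interplay between the two classes can be sketched as below. The class names are hypothetical, and a simple counter stands in for the fake timestamps; how educe actually generates them is not stated here.

```python
import itertools


class PseudoTimestamperSketch(object):
    """Hands out a fresh fake timestamp on every call to next()."""

    def __init__(self, start=1):
        # Assumption: a simple increasing counter is good enough.
        self._counter = itertools.count(start)

    def next(self):
        return next(self._counter)


class TimestampCacheSketch(object):
    """One timestamp per hashable key, generated on first use."""

    def __init__(self):
        self._stamper = PseudoTimestamperSketch()
        self._cache = {}

    def get(self, key):
        if key not in self._cache:
            self._cache[key] = self._stamper.next()
        return self._cache[key]

    def reset(self):
        # Forget the keys but keep the timestamper state, so the next
        # document gets different timestamps.
        self._cache.clear()


cache = TimestampCacheSketch()
first = cache.get("turn-1")
again = cache.get("turn-1")
other = cache.get((0, 5))
cache.reset()
after_reset = cache.get("turn-1")
```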

educe.stac.util.glozz.anno_author(anno)

Annotation author

educe.stac.util.glozz.anno_date(anno)

Annotation creation date as an int

educe.stac.util.glozz.anno_id_from_tuple(author_date)

Glozz string representation of authors and dates (AUTHOR_DATE)

educe.stac.util.glozz.anno_id_to_tuple(string)

Read a Glozz string representation of authors and dates into a pair (date represented as an int, ms since 1970?)

educe.stac.util.glozz.get_turn(tid, doc)

Return the turn annotation with the desired ID

educe.stac.util.glozz.is_dialogue(anno)

Whether a Glozz annotation is a STAC dialogue.

educe.stac.util.glozz.set_anno_author(anno, author)

Replace the annotation author with the given author

educe.stac.util.glozz.set_anno_date(anno, date)

Replace the annotation creation date with the given integer

educe.stac.util.output module

Help writing out corpus files

educe.stac.util.output.mk_parent_dirs(filename)

Given a filepath that we want to write, create its parent directory as needed.

educe.stac.util.output.output_path_stub(odir, k)

Given an output directory and an educe corpus key, return a ‘stub’ output path in that directory. This is dirname and basename only; you probably want to tack a suffix onto it.

Example: given something like "/tmp/foo" and a key like {author:"bob", stage:units, doc:"pilot03", subdoc:"07"} you might get something like /tmp/foo/pilot03/units/pilot03_07
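
A minimal sketch of how such a stub could be assembled, assuming the key is a plain dictionary (real educe corpus keys are objects, so this is only an illustration) and using the hypothetical name output_path_stub_sketch:

```python
import os


def output_path_stub_sketch(odir, key):
    """Build <odir>/<doc>/<stage>/<doc>_<subdoc> without a file suffix."""
    relpath = os.path.join(key["doc"], key["stage"])
    basename = "%s_%s" % (key["doc"], key["subdoc"])
    return os.path.join(odir, relpath, basename)


key = {"author": "bob", "stage": "units", "doc": "pilot03", "subdoc": "07"}
stub = output_path_stub_sketch("/tmp/foo", key)
```

A caller would then append a suffix of its own to the stub before writing.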

educe.stac.util.output.save_document(output_dir, k, doc)

Save a document as a Glozz .ac/.aa pair

educe.stac.util.output.write_dot_graph(doc_key, odir, dot_graph, part=None, run_graphviz=True)

Write a dot graph and possibly run graphviz on it

educe.stac.util.prettifyxml module

Function to “prettify” XML: courtesy of http://www.doughellmann.com/PyMOTW/xml/etree/ElementTree/create.html

educe.stac.util.prettifyxml.prettify(elem, indent='')

Return a pretty-printed XML string for the Element.
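
The usual standard-library recipe for this is to serialise the element and reparse it with minidom. The sketch below follows that approach; the name prettify_sketch is hypothetical and the details may differ from the educe version.

```python
from xml.dom import minidom
from xml.etree import ElementTree


def prettify_sketch(elem, indent=""):
    """Return a pretty-printed XML string for an ElementTree element."""
    rough = ElementTree.tostring(elem, "utf-8")
    return minidom.parseString(rough).toprettyxml(indent=indent)


root = ElementTree.Element("annotations")
ElementTree.SubElement(root, "unit").text = "hello"
pretty = prettify_sketch(root, indent="  ")
```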

educe.stac.util.showscores module

class educe.stac.util.showscores.Score(reference, test)

Precision/recall type scores for a given data set.

This class is really just about holding on to sets of things. The actual maths is handled by NLTK.
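
To show what holding on to sets of things amounts to, here is a sketch that does the arithmetic by hand instead of delegating to NLTK. The class name is hypothetical, and the meanings of shared, missing and spurious are inferred from the method names rather than stated in the source.

```python
class ScoreSketch(object):
    """Set-based precision/recall over a reference and a test set."""

    def __init__(self, reference, test):
        self.reference = set(reference)
        self.test = set(test)

    def shared(self):
        return self.reference & self.test

    def missing(self):
        # In the reference but not found in the test set.
        return self.reference - self.test

    def spurious(self):
        # In the test set but not in the reference.
        return self.test - self.reference

    def precision(self):
        if not self.test:
            return None
        return len(self.shared()) / float(len(self.test))

    def recall(self):
        if not self.reference:
            return None
        return len(self.shared()) / float(len(self.reference))

    def f_measure(self):
        p, r = self.precision(), self.recall()
        if p is None or r is None:
            return None
        if p + r == 0:
            return 0.0
        return 2 * p * r / (p + r)


score = ScoreSketch(reference={1, 2, 3, 4}, test={3, 4, 5})
```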

f_measure()
missing()
precision()
recall()
shared()
spurious()
educe.stac.util.showscores.banner(t)
educe.stac.util.showscores.show_multi(k, score)
educe.stac.util.showscores.show_pair(k, score)