educe.stac.oneoff package

Toolkit for one-off corpus-editing operations: things we don’t expect to come up very frequently, such as mass renames of one annotation type to another.

Submodules

educe.stac.oneoff.weave module

Combines annotations from an augmented ‘source’ document (likely containing extra text) with those in a ‘target’ document. This involves copying missing annotations over and shifting the text spans of any matching annotations.

class educe.stac.oneoff.weave.Updates

Bases: educe.stac.oneoff.weave.Updates

Expected updates to the target document.

We expect to see five types of annotation:

  1. target annotations for which there exists a source annotation in the equivalent span
  2. target annotations for which there is no equivalent source annotation (e.g. Resources, Preferences, but also annotation moves)
  3. source annotations for which there is at least one target annotation at the equivalent span (the mirror to case 1; note that these are not represented in this structure because we don’t need to say much about them)
  4. source annotations for which there is no match in the target side
  5. source annotations that lie in between the matching bits of text
Parameters:
  • shift_if_ge (dict(int, int)) – (case 1 and 2) shift points and offsets for characters in the target document (see shift_spans)
  • abnormal_src_only ([Annotation]) – (case 4) annotations that only occur in the source document (weird, found in matches)
  • abnormal_tgt_only ([Annotation]) – (case 2) annotations that only occur in the target document (weird, found in matches)
  • expected_src_only ([Annotation]) – (case 5) annotations that only occur in the source doc (ok, found in gaps)
map(fun)

Return an Updates in which a function has been applied to all annotations in this one, and to all spans (useful for previewing, for example).

exception educe.stac.oneoff.weave.WeaveException(*args, **kw)

Bases: exceptions.Exception

Unexpected alignment issues between the source and target documents.

educe.stac.oneoff.weave.check_matches(tgt_doc, matches, strict=True)

Check that the target document text is indeed a subsequence of the source document text (the source document is expected to be an “augmented” version of the target, with new text interspersed throughout).

Parameters:
  • tgt_doc
  • matches (list of (int, int, int)) – List of triples (i, j, n) representing matching subsequences: a[i:i+n] == b[j:j+n]. See difflib.SequenceMatcher.get_matching_blocks.
  • strict (boolean) – If True, raise an exception if there are match gaps in the target document, otherwise just print the gaps to stderr.
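
The (i, j, n) triples and the subsequence check can be illustrated with the standard library alone. The following is a minimal sketch, not the module's implementation: the helper names matching_blocks and target_gaps and the two sample texts are hypothetical, and it assumes the source text is passed as sequence a and the target text as sequence b.

```python
from difflib import SequenceMatcher


def matching_blocks(src_text, tgt_text):
    """Return (i, j, n) triples with src_text[i:i+n] == tgt_text[j:j+n]."""
    matcher = SequenceMatcher(None, src_text, tgt_text, autojunk=False)
    # get_matching_blocks() always ends with a zero-length sentinel; drop it
    return [(m.a, m.b, m.size) for m in matcher.get_matching_blocks() if m.size]


def target_gaps(tgt_text, matches):
    """Return (offset, size) pairs for target text not covered by a match."""
    gaps, last_end = [], 0
    for _i, j, n in matches:
        if j > last_end:
            gaps.append((last_end, j - last_end))
        last_end = j + n
    if last_end < len(tgt_text):
        gaps.append((last_end, len(tgt_text) - last_end))
    return gaps


# hypothetical texts: the source adds a "Server" turn to the target
src = "A: hi\nServer: dice rolled\nB: hello\n"
tgt = "A: hi\nB: hello\n"
matches = matching_blocks(src, tgt)
gaps = target_gaps(tgt, matches)
if gaps:
    # roughly what a strict check would do; a lenient one might only report
    raise ValueError("target text is not a subsequence of source: %r" % gaps)
```

If every target character falls inside some match, the gap list is empty and the target text is a subsequence of the source text.
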
educe.stac.oneoff.weave.compute_structural_updates(src_doc, tgt_doc, matches, updates, verbose=0)

Transfer structural annotations from tgt_doc to src_doc.

This is the transposition of compute_updates to structural units (dialogues only, for the moment).

educe.stac.oneoff.weave.compute_updates(src_doc, tgt_doc, matches)

Return updates that would need to be made on the target document.

Given matches between the source and target document, return span updates along with any source annotations that do not have an equivalent in the target document. The latter may indicate that resegmentation has taken place, or that there is some kind of problem.

Parameters:
Returns:

updates

Return type:

Updates

educe.stac.oneoff.weave.find_continuous_seqs(doc, spans, annos)

Find continuous sequences of annotations, ignoring whitespaces.

Parameters:
  • doc (Document) – Annotated document
  • spans (list of Span) – Spans that support the annotations
  • annos (list of Annotation) – Annotations of interest
  • ignore_whitespaces (boolean, optional) – If True, whitespaces are ignored when assessing continuity.
Returns:

seqs – List of sequences of indices (in annos and spans)

Return type:

list of list of integers
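
To make the notion of a continuous sequence concrete, here is a simplified sketch that works on a plain string and (start, end) offset pairs rather than on Document, Span and Annotation objects. The name continuous_seqs and the sample data are hypothetical, and the spans are assumed to be sorted by start offset.

```python
def continuous_seqs(text, spans):
    """Group span indices into runs separated by nothing but whitespace.

    `spans` is a list of (start, end) character offsets sorted by start.
    """
    seqs = []
    for idx, (start, _end) in enumerate(spans):
        if seqs:
            prev_end = spans[seqs[-1][-1]][1]
            gap = text[prev_end:start]
            # same run if the previous span ends where this one starts,
            # give or take whitespace
            if start >= prev_end and gap.strip() == "":
                seqs[-1].append(idx)
                continue
        seqs.append([idx])
    return seqs


text = "one two XXX three"
spans = [(0, 3), (4, 7), (12, 17)]  # "one", "two", "three"
seqs = continuous_seqs(text, spans)
```

Here the first two spans are separated only by a space and form one sequence, while the unannotated "XXX" breaks continuity before the third span.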

educe.stac.oneoff.weave.hollow_out_missing_turn_text(src_doc, tgt_doc, doc_span_src=None, doc_span_tgt=None)

Return a version of the source text where all characters in turns present in src_doc but not in tgt_doc are replaced with a nonsense char (tab).

Parameters:

Notes

We use difflib’s SequenceMatcher to compare the original (but annotated) corpus against the augmented corpus containing nonplayer turns. This gives us the ability to shift annotation spans into the appropriate place within the augmented corpus. By rights the diff should yield only inserts (of the nonplayer turns). But if the inserted text happens to contain the same sorts of substrings as you might find in the rest of the corpus, the diff algorithm can be fooled.
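
The idea of hollowing out can be sketched on plain strings: characters in the extra turns are overwritten with tabs so that the diff has nothing to match them against, while all offsets stay the same. The function hollow_out and the sample text below are hypothetical; the real function works on documents and locates the missing turns itself.

```python
def hollow_out(text, spans, filler="\t"):
    """Replace every character inside the given (start, end) spans."""
    chars = list(text)
    for start, end in spans:
        end = min(end, len(chars))  # guard against spans past the end
        for k in range(start, end):
            chars[k] = filler
    return "".join(chars)


# hypothetical source text with one turn that the target lacks
src = "A: hi\nServer: dice rolled\nB: hello\n"
extra_turn = "Server: dice rolled"
start = src.index(extra_turn)
hollowed = hollow_out(src, [(start, start + len(extra_turn))])
```

Because the text keeps its length, match offsets computed on the hollowed text remain valid for the original source text.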

educe.stac.oneoff.weave.shift_char(position, updates)

Given a character position and an updates tuple, return a shifted-over position which reflects the update.

The basic idea is that we have a set of “shift points” and their corresponding offsets. If a character position ‘c’ occurs after one of the points, we take the offset of the largest such point and add it to the character.

Our assumption here is that the update always consists in adding more text so offsets are always positive.

Parameters:
  • position (int) – initial position
  • updates (Updates) –
Returns:

shifted position

Return type:

int
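
The shift-point lookup described above can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: shift_position is a hypothetical name, it takes the shift_if_ge dictionary directly instead of an Updates tuple, and it treats a position equal to a shift point as shifted (as the field name shift_if_ge suggests).

```python
def shift_position(position, shift_if_ge):
    """Add the offset of the largest shift point that is <= position."""
    points = [p for p in shift_if_ge if position >= p]
    offset = shift_if_ge[max(points)] if points else 0
    return position + offset


# hypothetical shift table: 20 characters were inserted before target
# offset 6, and 25 more before target offset 30 (offsets are cumulative)
shift_if_ge = {6: 20, 30: 45}

print(shift_position(5, shift_if_ge))   # before any shift point
print(shift_position(6, shift_if_ge))   # at the first shift point
print(shift_position(30, shift_if_ge))  # at the second shift point
```

With this table the three calls print 5, 26 and 75. Only the offset of the largest applicable point is used, which is why the offsets have to be cumulative.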

educe.stac.oneoff.weave.shift_dialogues(doc_src, doc_res, updates, gen)

Transpose dialogue split from target to source document.

Remove all dialogues from updates.

Parameters:
  • doc_src (Document) – Source (augmented) document.
  • doc_res (Document) – Result document, originally a copy of doc_tgt with unshifted annotations. This function modifies doc_res by shifting the boundaries of its dialogues according to updates, and stretching the first and last dialogues so as to cover the same span as dialogues from doc_src.
  • updates (Updates) – Updates computed by compute_updates.
  • gen (int) – Generation of annotations included in doc_src and the output.
Returns:

updates – Trimmed down set of updates: no more dialogue.

Return type:

Updates

educe.stac.oneoff.weave.shift_span(span, updates, stretch_right=False)

Given a span and an updates tuple, return a Span that is shifted over to reflect the updates

Parameters:
  • span (Span) –
  • updates (Updates) –
  • stretch_right (boolean, optional) – If True, stretch the right boundary of an annotation that butts up against the left of a new annotation. This is recommended for annotations that should fully cover a given span, like dialogues for documents.
Returns:

span

Return type:

Span

See also

shift_char() – for details on how this works
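
One plausible reading of stretch_right is sketched below, using (start, end) tuples in place of Span objects. The names shift_position and shift_span_sketch are hypothetical, and the handling of the right boundary is an assumption: without stretching, the end is derived from the last character inside the span, so text inserted immediately after the span is not absorbed; with stretching, the end offset itself is shifted.

```python
def shift_position(position, shift_if_ge):
    """Add the offset of the largest shift point that is <= position."""
    points = [p for p in shift_if_ge if position >= p]
    offset = shift_if_ge[max(points)] if points else 0
    return position + offset


def shift_span_sketch(span, shift_if_ge, stretch_right=False):
    """Shift a (start, end) span; optionally stretch its right boundary."""
    start, end = span
    new_start = shift_position(start, shift_if_ge)
    if start == end:
        # empty span: both ends move together
        return (new_start, new_start)
    if stretch_right:
        new_end = shift_position(end, shift_if_ge)
    else:
        # shift the last character of the span, then step past it
        new_end = 1 + shift_position(end - 1, shift_if_ge)
    return (new_start, new_end)


shift_if_ge = {6: 20}  # hypothetical: 20 characters inserted at offset 6
plain = shift_span_sketch((0, 6), shift_if_ge)
stretched = shift_span_sketch((0, 6), shift_if_ge, stretch_right=True)
moved = shift_span_sketch((6, 15), shift_if_ge)
```

Under these assumptions the span (0, 6) stays at (0, 6) by default and grows to (0, 26) when stretched, while (6, 15) moves to (26, 35) either way.
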
educe.stac.oneoff.weave.src_gaps(matches)

Given matches between the source and target document, return the spaces between these matches as source offset and size (a bit like the matches). Note that we assume that the target document text is a subsequence of the source document.
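
A minimal sketch of the gap computation, assuming matches are (source offset, target offset, size) triples in increasing order, as returned by difflib.SequenceMatcher.get_matching_blocks. The function name gaps_in_source and the sample data are hypothetical.

```python
def gaps_in_source(matches):
    """Return (offset, size) pairs for source text lying between matches."""
    gaps = []
    last_end = 0
    for src_start, _tgt_start, size in matches:
        if src_start > last_end:
            gaps.append((last_end, src_start - last_end))
        last_end = src_start + size
    return gaps


# hypothetical example: the source has one extra turn in the middle
src = "A: hi\nServer: dice rolled\nB: hello\n"
matches = [(0, 0, 6), (26, 6, 9), (35, 15, 0)]  # last triple is the sentinel
gaps = gaps_in_source(matches)
gap_texts = [src[offset:offset + size] for offset, size in gaps]
```

The single gap found here covers exactly the inserted turn. Keeping the zero-length sentinel in the match list means that any trailing source text after the last real match is also reported as a gap.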

educe.stac.oneoff.weave.stretch_match(updates, src_doc, tgt_doc, doc_span_src, doc_span_tgt, annos_src, annos_tgt, verbose=0)

Compute stretch matches between annos_src and annos_tgt.

Parameters:
  • updates (Updates) –
  • src_doc (Document) –
  • tgt_doc (Document) –
  • doc_span_src (Span) –
  • doc_span_tgt (Span) –
  • annos_src (list of educe.annotation) – Unmatched annotations in span_src.
  • annos_tgt (list of educe.annotation) – Unmatched annotations in span_tgt.
  • verbose (int) – Verbosity level
Returns:

res – Possibly trimmed version of updates.

Return type:

Updates

educe.stac.oneoff.weave.stretch_match_many(updates, src_doc, tgt_doc, doc_span_src, doc_span_tgt, annos_src, annos_tgt, verbose=0)

Compute n-m stretch matches between annos_src and annos_tgt.

Parameters:
  • updates (Updates) –
  • src_doc (Document) –
  • tgt_doc (Document) –
  • doc_span_src (Span) –
  • doc_span_tgt (Span) –
  • annos_src (list of educe.annotation) – Unmatched annotations in span_src.
  • annos_tgt (list of educe.annotation) – Unmatched annotations in span_tgt.
  • verbose (int) – Verbosity level
Returns:

res – Possibly trimmed version of updates.

Return type:

Updates

educe.stac.oneoff.weave.tgt_gaps(matches)

Given matches between the source and target document, return the spaces between these matches as target offset and size (a bit like the matches). By rights this should be empty, but you never know.

educe.stac.oneoff.weave.update_updates(updates, annos_src, annos_tgt, verbose=0)

Update the sets of updates given a match (annos_src, annos_tgt).

Parameters:
  • updates (Updates) – Summary of extra updates between source and target.
  • annos_src (list of Annotation) – Matched annotations from source doc.
  • annos_tgt (list of Annotation) – Matched annotations from target doc.
  • verbose (int) – Verbosity.
Returns:

updates – updates updated to take the given match into account.

Return type:

Updates