educe.stac.sanity.checks package

Submodules

educe.stac.sanity.checks.annotation module

STAC sanity-check: annotation oversights

class educe.stac.sanity.checks.annotation.FeatureItem(doc, contexts, anno, attrs, status='missing')

Bases: educe.stac.sanity.common.ContextItem

Annotations that are missing some feature(s)

annotations()
html()
educe.stac.sanity.checks.annotation.is_blank_edu(anno)

True if the annotation looks like it may be an unannotated EDU

educe.stac.sanity.checks.annotation.is_cross_dialogue(contexts)

The units connected by this relation (or CDU) do not inhabit the same dialogue.

educe.stac.sanity.checks.annotation.is_fixme(feature_value)

True if a feature value has a fixme value

educe.stac.sanity.checks.annotation.is_review_edu(anno)

True if the annotation has a FIXME tagged type

educe.stac.sanity.checks.annotation.missing_features(doc, anno)

Return set of attribute names for any expected features that may be missing for this annotation

educe.stac.sanity.checks.annotation.run(inputs, k)

Add any annotation omission errors to the current report

educe.stac.sanity.checks.annotation.search_for_fixme_features(inputs, k)

Return a ReportItem for any annotations in the document whose features have a fixme type

educe.stac.sanity.checks.annotation.search_for_missing_rel_feats(inputs, k)

Return ReportItems for any relations that are missing expected features

educe.stac.sanity.checks.annotation.search_for_missing_unit_feats(inputs, k)

Return ReportItems for any EDUs and CDUs that are missing expected features

educe.stac.sanity.checks.annotation.search_for_unexpected_feats(inputs, k)

Return ReportItems for any annotations that have features we were not expecting them to have

educe.stac.sanity.checks.annotation.unexpected_features(_, anno)

Return set of attribute names for any features that we were not expecting to see in the given annotations

educe.stac.sanity.checks.glozz module

Sanity checker: low-level Glozz errors

class educe.stac.sanity.checks.glozz.BadIdItem(doc, contexts, anno, expected_id)

Bases: educe.stac.sanity.common.ContextItem

An annotation whose identifier does not match its metadata

text()
class educe.stac.sanity.checks.glozz.DuplicateItem(doc, contexts, anno, others)

Bases: educe.stac.sanity.common.ContextItem

An annotation which shares an id with another

text()
class educe.stac.sanity.checks.glozz.IdMismatch(doc, contexts, unit1, unit2)

Bases: educe.stac.sanity.common.ContextItem

An annotation which seems to have an equivalent in some twin but with the wrong identifier

annotations()
html()
exception educe.stac.sanity.checks.glozz.MissingDocumentException(k)

Bases: exceptions.Exception

A document we are trying to cross check does not have the expected twin

class educe.stac.sanity.checks.glozz.MissingItem(status, doc1, contexts1, unit, doc2, contexts2, approx)

Bases: educe.stac.sanity.report.ReportItem

An annotation which is missing in some document twin (or which looks like it may have been unexpectedly added)

excess_status = 'ADDED'
html()
missing_status = 'DELETED'
status_len = 7
text_span()

Return the span for the annotation in question

class educe.stac.sanity.checks.glozz.OffByOneItem(doc, contexts, unit)

Bases: educe.stac.sanity.common.UnitItem

An annotation whose boundaries might be off by one

html()
html_turn_info(parent, turn)

Given a turn annotation, append a prettified HTML representation of the turn text (highlighting parts of it, such as the turn number)

class educe.stac.sanity.checks.glozz.OverlapItem(doc, contexts, anno, overlaps)

Bases: educe.stac.sanity.common.ContextItem

An annotation whose span overlaps with that of another

annotations()
html()
educe.stac.sanity.checks.glozz.bad_ids(inputs, k)

Return annotations whose identifiers do not match their metadata

educe.stac.sanity.checks.glozz.check_unit_ids(inputs, key1, key2)

Return annotations that match in the two documents modulo identifiers. This might arise if somebody creates a duplicate annotation in place and annotates that

educe.stac.sanity.checks.glozz.cross_check_against(inputs, key1, stage='unannotated')

Compare annotations with their equivalents on a twin document in the corpus

educe.stac.sanity.checks.glozz.cross_check_units(inputs, key1, key2, status)

Return tuples for certain corpus[key1] units not present in corpus[key2]
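
The cross-check idea can be sketched with plain tuples. This is only an illustration, not the library's implementation: `cross_check` and the `(type, start, end)` unit representation are hypothetical stand-ins, and matching is by exact equality, whereas the real check also seems to track approximate matches (see the `approx` argument of MissingItem). The status strings are the `missing_status` and `excess_status` values listed for MissingItem.

```python
def cross_check(units1, units2, status):
    """Return (status, unit) tuples for units in units1 with no match in units2.

    Units are hypothetical (type, start, end) tuples; matching is plain equality.
    """
    present = set(units2)
    return [(status, unit) for unit in units1 if unit not in present]


unannotated = [("Segment", 0, 12), ("Segment", 13, 30)]
annotated = [("Segment", 0, 12), ("Segment", 13, 25)]

# Units that vanished from the annotated twin, then units that appeared in it.
missing = cross_check(unannotated, annotated, "DELETED")
excess = cross_check(annotated, unannotated, "ADDED")
print(missing)
print(excess)
```

Running the comparison in both directions is what distinguishes a deleted unit from an added one.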

educe.stac.sanity.checks.glozz.duplicate_annotations(inputs, k)

Multiple annotations with the same local_id()
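
A minimal sketch of duplicate detection, assuming annotations can be reduced to `(local_id, payload)` pairs. The helper name `find_duplicate_ids` and the sample ids are hypothetical; the real function works on Glozz annotations and returns report items.

```python
from collections import defaultdict


def find_duplicate_ids(annotations):
    """Group (local_id, payload) pairs by id and keep ids seen more than once."""
    by_id = defaultdict(list)
    for local_id, payload in annotations:
        by_id[local_id].append(payload)
    return {key: group for key, group in by_id.items() if len(group) > 1}


annos = [("stac_1", "EDU a"), ("stac_2", "EDU b"), ("stac_1", "EDU c")]
duplicates = find_duplicate_ids(annos)
print(duplicates)
```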

educe.stac.sanity.checks.glozz.filter_matches(unit, other_units)

Return any unit-level annotations in other_units that look like they may be the same as the given annotation

educe.stac.sanity.checks.glozz.is_maybe_off_by_one(text, anno)

True if an annotation has non-whitespace characters on its immediate left/right
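
The boundary test can be illustrated on a raw string and a half-open `(start, end)` span. This is a sketch under assumptions: `looks_off_by_one` is a hypothetical name, it takes offsets rather than an annotation object, and it treats the document edges as acceptable boundaries, which may differ from the library's behaviour.

```python
def looks_off_by_one(text, start, end):
    """True if the character just before `start` or at `end` is not whitespace."""
    left_bad = start > 0 and not text[start - 1].isspace()
    right_bad = end < len(text) and not text[end].isspace()
    return left_bad or right_bad


text = "I have wood. Anyone want some?"

print(looks_off_by_one(text, 2, 6))   # span covers "have"
print(looks_off_by_one(text, 3, 6))   # span covers "ave", clipping the "h"
```

A span that clips a word leaves a non-whitespace neighbour on one side, which is the symptom this check looks for.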

educe.stac.sanity.checks.glozz.overlapping(inputs, k, is_overlap)

Return items for annotations that have overlaps
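
As a rough illustration of how an `is_overlap` predicate might be applied pairwise, the sketch below uses bare `(start, end)` tuples. Both `spans_overlap` and `find_overlaps` are hypothetical helpers; the real function receives the sanity-checker inputs and a file id, and returns report items rather than a dict.

```python
from itertools import combinations


def spans_overlap(span1, span2):
    """True if two half-open (start, end) character spans share any offset."""
    return span1[0] < span2[1] and span2[0] < span1[1]


def find_overlaps(spans, is_overlap=spans_overlap):
    """Map each span to the spans it overlaps with, omitting clean spans."""
    found = {}
    for left, right in combinations(spans, 2):
        if is_overlap(left, right):
            found.setdefault(left, []).append(right)
            found.setdefault(right, []).append(left)
    return found


turn_spans = [(0, 10), (8, 20), (20, 30)]
overlaps = find_overlaps(turn_spans)
print(overlaps)
```

Spans that merely touch, such as (8, 20) and (20, 30), are not counted as overlapping in this sketch.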

educe.stac.sanity.checks.glozz.overlapping_structs(inputs, k)

Return items for structural annotations that have overlaps

educe.stac.sanity.checks.glozz.run(inputs, k)

Add any glozz errors to the current report

educe.stac.sanity.checks.glozz.search_glozz_off_by_one(inputs, k)

EDUs which have non-whitespace (or boundary) characters either on their right or left

educe.stac.sanity.checks.graph module

Sanity checker: fancy graph-based errors

educe.stac.sanity.checks.graph.BACKWARDS_WHITELIST = ['Conditional']

Relations that are allowed to go backwards

class educe.stac.sanity.checks.graph.CduOverlapItem(doc, contexts, anno, cdus)

Bases: educe.stac.sanity.common.ContextItem

EDUs that appear in more than one CDU

annotations()
html()
educe.stac.sanity.checks.graph.PAIRS_WHITELIST = [('Contrast', 'Comment'), ('Narration', 'Result'), ('Narration', 'Continuation'), ('Parallel', 'Continuation'), ('Parallel', 'Background'), ('Comment', 'Acknowledgement'), ('Parallel', 'Acknowledgement'), ('Question-answer_pair', 'Contrast'), ('Question-answer_pair', 'Parallel')]

Pairs of relations that are explicitly allowed between the same source/target DUs

educe.stac.sanity.checks.graph.are_single_headed_cdus(inputs, k, gra)

Check that each CDU has exactly one head DU.

Parameters:
  • gra (Graph) – Graph for the discourse structure.
Returns:

report_items – List of report items, one per faulty CDU.

Return type:

list of ReportItem
educe.stac.sanity.checks.graph.dialogue_graphs(k, doc, contexts)

Return a dict from dialogue annotations to subgraphs containing at least everything in that dialogue (and perhaps some connected items).

Parameters:
  • k (FileId) – File identifier
  • doc (TODO) – TODO
  • contexts (dict(Annotation, Context)) – Context for each annotation.
Returns:

graphs – Graph for each dialogue.

Return type:

dict(Dialogue, Graph)

Notes

MM: I could not find any caller for this function in either educe or irit-stac, as of 2017-03-17.

educe.stac.sanity.checks.graph.horrible_context_kludge(graph, simplified_graph, contexts)

Given a graph and its copy, and given a context dictionary, return a copy of the context dictionary that corresponds to the simplified graph. Ugh

educe.stac.sanity.checks.graph.is_arrow_inversion(gra, _, rel)

Relation in a graph that goes from textual right to left (may not be a problem)
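
A sketch of the right-to-left test on plain spans. It assumes, without confirmation from this page, that relation types in BACKWARDS_WHITELIST are exempt from the check; `looks_inverted` is a hypothetical stand-in that takes a relation type and two `(start, end)` spans instead of a graph.

```python
BACKWARDS_WHITELIST = ["Conditional"]


def looks_inverted(rel_type, source_span, target_span):
    """True if the source starts after the target and the type is not whitelisted."""
    goes_backwards = source_span[0] > target_span[0]
    return goes_backwards and rel_type not in BACKWARDS_WHITELIST


print(looks_inverted("Result", (20, 30), (0, 10)))
print(looks_inverted("Conditional", (20, 30), (0, 10)))
print(looks_inverted("Result", (0, 10), (20, 30)))
```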

educe.stac.sanity.checks.graph.is_bad_relset(gra, contexts, relset)

True if a set of relation instances has more than one member and it is not whitelisted.

Parameters:
  • gra (Graph) – Graph for the discourse structure.
  • contexts (TODO) – TODO
  • relset (set of relation instances) – Set of relation instances on the same DUs; each instance is a pair (udir, rel), where udir is one of {‘src_tgt’, ‘tgt_src’} and rel is the identifier of a relation.
Returns:

res – True if relset contains more than one element and is_whitelisted_relpair returns False.

Return type:

boolean
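
The interplay between the whitelist and this predicate can be sketched without a graph. Several things here are assumptions: each instance carries the relation label directly instead of a relation identifier, the whitelist is treated as unordered, and `is_whitelisted_pair` and `is_bad` are hypothetical names. Only PAIRS_WHITELIST is copied from this page.

```python
PAIRS_WHITELIST = [
    ("Contrast", "Comment"),
    ("Narration", "Result"),
    ("Narration", "Continuation"),
    ("Parallel", "Continuation"),
    ("Parallel", "Background"),
    ("Comment", "Acknowledgement"),
    ("Parallel", "Acknowledgement"),
    ("Question-answer_pair", "Contrast"),
    ("Question-answer_pair", "Parallel"),
]


def is_whitelisted_pair(relset):
    """True for exactly two same-direction instances whose labels are whitelisted."""
    if len(relset) != 2:
        return False
    (dir1, label1), (dir2, label2) = sorted(relset)
    if dir1 != dir2:
        return False
    # Assumption: the order of labels within a whitelisted pair does not matter.
    return (label1, label2) in PAIRS_WHITELIST or (label2, label1) in PAIRS_WHITELIST


def is_bad(relset):
    """True if the set has more than one member and is not whitelisted."""
    return len(relset) > 1 and not is_whitelisted_pair(relset)


print(is_bad({("src_tgt", "Narration"), ("src_tgt", "Result")}))
print(is_bad({("src_tgt", "Narration"), ("tgt_src", "Result")}))
```

The second call differs from the first only in direction, which is enough to take the pair out of the whitelist.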

educe.stac.sanity.checks.graph.is_disconnected(gra, contexts, node)

Return True if an EDU is disconnected from a discourse structure.

An EDU is considered disconnected unless:

  • it has an incoming link or
  • it has an outgoing Conditional link or
  • it’s at the beginning of a dialogue

In principle we don’t need to look at EDUs that are disconnected on the outgoing end because (1) it can be legitimate for non-dialogue-ending EDUs to not have outgoing links and (2) such information would be redundant with the incoming anyway.
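
The three exemptions above can be sketched as a predicate over a flat list of links. The `(source, target, rel_type)` triples, the `dialogue_first_edus` set, and the name `looks_disconnected` are hypothetical simplifications; the real function inspects the graph and the annotation contexts.

```python
def looks_disconnected(edu, links, dialogue_first_edus):
    """True unless the EDU has an incoming link, an outgoing Conditional
    link, or opens a dialogue.

    links: iterable of (source, target, rel_type) triples.
    """
    has_incoming = any(tgt == edu for _src, tgt, _type in links)
    has_outgoing_conditional = any(
        src == edu and rel_type == "Conditional" for src, _tgt, rel_type in links
    )
    starts_dialogue = edu in dialogue_first_edus
    return not (has_incoming or has_outgoing_conditional or starts_dialogue)


links = [
    ("e1", "e2", "Question-answer_pair"),
    ("e3", "e4", "Conditional"),
    ("e5", "e2", "Comment"),
]
first_edus = {"e1"}

for edu in ["e1", "e2", "e3", "e4", "e5"]:
    print(edu, looks_disconnected(edu, links, first_edus))
```

Here only e5 is flagged: its sole link is an outgoing Comment, and outgoing links other than Conditional do not count as attachment.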

educe.stac.sanity.checks.graph.is_dupe_rel(gra, _, rel)

Relation instance for which there are relation instances between the same source/target DUs (regardless of direction)

educe.stac.sanity.checks.graph.is_non2sided_rel(gra, _, rel)

Relation instance which does not have exactly one source link and one target link in the graph

How this can possibly happen is a mystery

educe.stac.sanity.checks.graph.is_puncture(gra, _, rel)

Relation in a graph that traverses a CDU boundary

educe.stac.sanity.checks.graph.is_weird_ack(gra, contexts, rel)

Relation in a graph that represents a question-answer pair which either does not start with a question, or which ends in a question.

Note the detection process is a lot sloppier when one of the endpoints is a CDU. If all EDUs in the CDU are by the same speaker, we can check as usual; otherwise, all bets are off, so we ignore the relation.

Note: slightly curried to accept contexts as an argument

educe.stac.sanity.checks.graph.is_weird_qap(gra, contexts, rel)

Return True if rel is a weird Question-Answer Pair relation.

Parameters:
  • gra (TODO) – Graph?
  • contexts (TODO) – Surrounding context
  • rel (TODO) – Relation.
Returns:

res – True if rel is a relation that represents a question answer pair which either does not start with a question, or which ends in a question.

Return type:

boolean

educe.stac.sanity.checks.graph.is_whitelisted_relpair(gra, _, relset)

True if a pair of instance relations is in PAIRS_WHITELIST.

Parameters:
  • gra (Graph) – Graph for the discourse structure.
  • contexts (TODO) – TODO
  • relset (set of relation instances) – Set of relation instances on the same DUs; each instance is a pair (udir, rel), where udir is one of {‘src_tgt’, ‘tgt_src’} and rel is the identifier of a relation.
Returns:

res – True if relset is a pair of relation instances with the same direction and the corresponding pair of relations is explicitly allowed in the whitelist.

Return type:

boolean

return ReportItem for a graph relation

educe.stac.sanity.checks.graph.rfc_violations(inputs, k, gra)

Repackage right frontier constraint violations in a somewhat friendlier way

educe.stac.sanity.checks.graph.run(inputs, k)

Add any graph errors to the current report

educe.stac.sanity.checks.graph.search_graph_cdu_overlap(inputs, k, gra)

Return a ReportItem for every EDU that appears in more than one CDU
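
A sketch of the membership inversion behind this check, assuming CDUs are given as a dict from CDU id to direct EDU members. The helper `edus_in_several_cdus` is hypothetical, and whether the library also looks through nested CDUs is not stated here.

```python
from collections import defaultdict


def edus_in_several_cdus(cdu_members):
    """Invert a {cdu: [edus]} mapping and keep EDUs that sit in more than one CDU."""
    containers = defaultdict(list)
    for cdu, members in cdu_members.items():
        for edu in members:
            containers[edu].append(cdu)
    return {edu: cdus for edu, cdus in containers.items() if len(cdus) > 1}


cdus = {"cdu_1": ["e1", "e2"], "cdu_2": ["e2", "e3"], "cdu_3": ["e4"]}
shared = edus_in_several_cdus(cdus)
print(shared)
```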

educe.stac.sanity.checks.graph.search_graph_cdus(inputs, k, gra, pred)

Return a ReportItem for any CDU in the graph for which the given predicate is True

educe.stac.sanity.checks.graph.search_graph_edus(inputs, k, gra, pred)

Return a ReportItem for any EDU within the graph for which some predicate is true

educe.stac.sanity.checks.graph.search_graph_relations(inputs, k, gra, pred)

Return a ReportItem for any relation instance within the graph for which some predicate is true

educe.stac.sanity.checks.graph.search_graph_relations_same_dus(inputs, k, gra, pred)

Return a list of ReportItem (one per member of the set) for any set of relation instances within the graph for which some predicate is True.

Parameters:
  • inputs (educe.stac.sanity.main.SanityChecker) – SanityChecker, with attributes corpus and contexts.
  • k (FileId) – Identifier of the desired Glozz document.
  • gra (educe.stac.graph.Graph) – Graph that corresponds to the discourse structure (?).
  • pred (function from (gra, contexts, rel_set) to boolean) – Predicate function.
Returns:

report_items – One ReportItem for each relation instance that belongs to a set of instances, on the same DUs, where pred is True.

Return type:

list of ReportItem
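
The grouping step can be sketched as follows. This is an illustration under assumptions, not the library's code: relations are hypothetical `(rel_id, source, target)` triples, direction is defined relative to the sorted order of the two DU ids, and the predicate takes only the relation set rather than `(gra, contexts, rel_set)`.

```python
from collections import defaultdict


def group_by_dus(relations):
    """Map each unordered DU pair to its set of (udir, rel_id) instances."""
    groups = defaultdict(set)
    for rel_id, source, target in relations:
        first, _second = sorted((source, target))
        # Assumption: 'src_tgt' means the source is the first DU in sorted order.
        udir = "src_tgt" if source == first else "tgt_src"
        groups[frozenset((source, target))].add((udir, rel_id))
    return groups


def flag_relations(relations, pred):
    """Return every relation id belonging to a set for which pred is True."""
    flagged = []
    for relset in group_by_dus(relations).values():
        if pred(relset):
            flagged.extend(rel_id for _udir, rel_id in sorted(relset))
    return flagged


rels = [("r1", "e1", "e2"), ("r2", "e2", "e1"), ("r3", "e2", "e3")]
flagged = flag_relations(rels, lambda relset: len(relset) > 1)
print(flagged)
```

Each member of an offending set is reported individually, mirroring the "one per member of the set" behaviour described above.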

educe.stac.sanity.checks.type_err module

STAC sanity-check: type errors

educe.stac.sanity.checks.type_err.has_non_du_member(anno)

True if anno is a relation that points to another relation, or if it’s a CDU that has relation members

educe.stac.sanity.checks.type_err.is_non_du(anno)

True if the annotation is neither an EDU nor a CDU

educe.stac.sanity.checks.type_err.is_non_preference(anno)

True if the annotation is NOT a preference

educe.stac.sanity.checks.type_err.is_non_resource(anno)

True if the annotation is NOT a resource

educe.stac.sanity.checks.type_err.run(inputs, k)

Add any annotation type errors to the current report

educe.stac.sanity.checks.type_err.search_anaphora(inputs, k, pred)

Return a ReportItem for any anaphora annotation in which at least one member (not the annotation itself) satisfies the given predicate

educe.stac.sanity.checks.type_err.search_preferences(inputs, k, pred)

Return a ReportItem for any Preferences schema which has at least one member for which the predicate is True

educe.stac.sanity.checks.type_err.search_resource_groups(inputs, k, pred)

Return a ReportItem for any Several_resources schema which has at least one member for which the predicate is True