educe.stac.sanity.checks package¶

Submodules¶

educe.stac.sanity.checks.annotation module¶

STAC sanity-check: annotation oversights

class educe.stac.sanity.checks.annotation.FeatureItem(doc, contexts, anno, attrs, status='missing')¶

Bases: educe.stac.sanity.common.ContextItem

Annotations that are missing some feature(s)

annotations()¶

html()¶

educe.stac.sanity.checks.annotation.is_blank_edu(anno)¶: True if the annotation looks like it may be an unannotated EDU

educe.stac.sanity.checks.annotation.is_cross_dialogue(contexts)¶: The units connected by this relation (or cdu) do not inhabit the same dialogue.

educe.stac.sanity.checks.annotation.is_fixme(feature_value)¶: True if a feature value has a fixme value

educe.stac.sanity.checks.annotation.is_review_edu(anno)¶: True if the annotation has a FIXME tagged type

educe.stac.sanity.checks.annotation.missing_features(doc, anno)¶: Return set of attribute names for any expected features that may be missing for this annotation

educe.stac.sanity.checks.annotation.run(inputs, k)¶: Add any annotation omission errors to the current report

educe.stac.sanity.checks.annotation.search_for_fixme_features(inputs, k)¶: Return a ReportItem for any annotations in the document whose features have a fixme type

educe.stac.sanity.checks.annotation.search_for_missing_rel_feats(inputs, k)¶: Return ReportItems for any relations that are missing expected features

educe.stac.sanity.checks.annotation.search_for_missing_unit_feats(inputs, k)¶: Return ReportItems for any EDUs and CDUs that are missing expected features

educe.stac.sanity.checks.annotation.search_for_unexpected_feats(inputs, k)¶: Return ReportItems for any annotations that have features we were not expecting them to have

educe.stac.sanity.checks.annotation.unexpected_features(_, anno)¶: Return set of attribute names for any features that we were not expecting to see in the given annotations

educe.stac.sanity.checks.glozz module¶

Sanity checker: low-level Glozz errors

class educe.stac.sanity.checks.glozz.BadIdItem(doc, contexts, anno, expected_id)¶

Bases: educe.stac.sanity.common.ContextItem

An annotation whose identifier does not match its metadata

text()¶

class educe.stac.sanity.checks.glozz.DuplicateItem(doc, contexts, anno, others)¶

Bases: educe.stac.sanity.common.ContextItem

An annotation which shares an id with another

text()¶

class educe.stac.sanity.checks.glozz.IdMismatch(doc, contexts, unit1, unit2)¶

Bases: educe.stac.sanity.common.ContextItem

An annotation which seems to have an equivalent in some twin but with the wrong identifier

annotations()¶

html()¶

exception educe.stac.sanity.checks.glozz.MissingDocumentException(k)¶

Bases: exceptions.Exception

A document we are trying to cross check does not have the expected twin

class educe.stac.sanity.checks.glozz.MissingItem(status, doc1, contexts1, unit, doc2, contexts2, approx)¶

Bases: educe.stac.sanity.report.ReportItem

An annotation which is missing in some document twin (or which looks like it may have been unexpectedly added)

excess_status = 'ADDED'¶

html()¶

missing_status = 'DELETED'¶

status_len = 7¶

text_span()¶: Return the span for the annotation in question

class educe.stac.sanity.checks.glozz.OffByOneItem(doc, contexts, unit)¶

Bases: educe.stac.sanity.common.UnitItem

An annotation whose boundaries might be off by one

html()¶

html_turn_info(parent, turn)¶: Given a turn annotation, append a prettified HTML representation of the turn text (highlighting parts of it, such as the turn number)

class educe.stac.sanity.checks.glozz.OverlapItem(doc, contexts, anno, overlaps)¶

Bases: educe.stac.sanity.common.ContextItem

An annotation whose span overlaps with that of another

annotations()¶

html()¶

educe.stac.sanity.checks.glozz.bad_ids(inputs, k)¶: Return annotations whose identifiers do not match their metadata

educe.stac.sanity.checks.glozz.check_unit_ids(inputs, key1, key2)¶: Return annotations that match in the two documents modulo identifiers. This might arise if somebody creates a duplicate annotation in place and annotates that

educe.stac.sanity.checks.glozz.cross_check_against(inputs, key1, stage='unannotated')¶: Compare annotations with their equivalents on a twin document in the corpus

educe.stac.sanity.checks.glozz.cross_check_units(inputs, key1, key2, status)¶: Return tuples for certain corpus[key1] units not present in corpus[key2]

educe.stac.sanity.checks.glozz.duplicate_annotations(inputs, k)¶: Multiple annotations with the same local_id()
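
The duplicate check can be pictured as grouping annotations by identifier and keeping the groups with more than one member. This is a minimal sketch, not the library's code: `find_duplicates` is a hypothetical name, and annotations are represented as plain `(local_id, label)` pairs rather than educe annotation objects.

```python
from collections import defaultdict


def find_duplicates(annotations):
    """Group (local_id, annotation) pairs by id; keep ids used more than once.

    Hypothetical stand-in: the real check reads ids from annotation objects.
    """
    by_id = defaultdict(list)
    for local_id, anno in annotations:
        by_id[local_id].append(anno)
    return {i: annos for i, annos in by_id.items() if len(annos) > 1}


annos = [("u1", "EDU A"), ("u2", "EDU B"), ("u1", "EDU C")]
dupes = find_duplicates(annos)
```

Here only `u1` is reported, together with both annotations that share it.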

educe.stac.sanity.checks.glozz.filter_matches(unit, other_units)¶: Return any unit-level annotations in other_units that look like they may be the same as the given annotation

educe.stac.sanity.checks.glozz.is_maybe_off_by_one(text, anno)¶: True if an annotation has non-whitespace characters on its immediate left/right
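
A minimal sketch of the off-by-one heuristic, assuming a span is a plain `(start, end)` pair of character offsets with `end` exclusive. The name `maybe_off_by_one` and the tuple representation are hypothetical; the real function takes an annotation object. The sketch also assumes that the start and end of the text count as whitespace, which may differ from how the library treats boundaries.

```python
def maybe_off_by_one(text, span):
    """Return True if a non-whitespace character touches either edge of span.

    Hypothetical stand-in: span is a (start, end) pair, end exclusive.
    Text boundaries are treated like whitespace (an assumption).
    """
    start, end = span
    left = text[start - 1] if start > 0 else " "
    right = text[end] if end < len(text) else " "
    return not left.isspace() or not right.isspace()


text = "I have sheep. Anyone want some?"
clean = maybe_off_by_one(text, (0, 13))    # covers "I have sheep."
clipped = maybe_off_by_one(text, (0, 12))  # covers "I have sheep", drops the period
```

The second span stops just before the period, so the character to its right is not whitespace and the span is flagged.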

educe.stac.sanity.checks.glozz.overlapping(inputs, k, is_overlap)¶: Return items for annotations that have overlaps

educe.stac.sanity.checks.glozz.overlapping_structs(inputs, k)¶: Return items for structural annotations that have overlaps

educe.stac.sanity.checks.glozz.run(inputs, k)¶: Add any glozz errors to the current report

educe.stac.sanity.checks.glozz.search_glozz_off_by_one(inputs, k)¶: EDUs which have non-whitespace (or boundary) characters either on their right or left

educe.stac.sanity.checks.graph module¶

Sanity checker: fancy graph-based errors

educe.stac.sanity.checks.graph.BACKWARDS_WHITELIST = ['Conditional']¶: relations that are allowed to go backwards

class educe.stac.sanity.checks.graph.CduOverlapItem(doc, contexts, anno, cdus)¶

Bases: educe.stac.sanity.common.ContextItem

EDUs that appear in more than one CDU

annotations()¶

html()¶

educe.stac.sanity.checks.graph.PAIRS_WHITELIST = [('Contrast', 'Comment'), ('Narration', 'Result'), ('Narration', 'Continuation'), ('Parallel', 'Continuation'), ('Parallel', 'Background'), ('Comment', 'Acknowledgement'), ('Parallel', 'Acknowledgement'), ('Question-answer_pair', 'Contrast'), ('Question-answer_pair', 'Parallel')]¶: pairs of relations that are explicitly allowed between the same source/target DUs

educe.stac.sanity.checks.graph.are_single_headed_cdus(inputs, k, gra)¶

Check that each CDU has exactly one head DU.

Parameters:	gra (Graph) – Graph for the discourse structure.
Returns:	report_items – List of report items, one per faulty CDU.
Return type:	list of ReportItem

educe.stac.sanity.checks.graph.dialogue_graphs(k, doc, contexts)¶

Return a dict from dialogue annotations to subgraphs containing at least everything in that dialogue (and perhaps some connected items).

Parameters:	k (FileId) – File identifier
	doc (TODO) – TODO
	contexts (dict(Annotation, Context)) – Context for each annotation.
Returns:	graphs – Graph for each dialogue.
Return type:	dict(Dialogue, Graph)

Notes

MM: I could not find any caller for this function in either educe or irit-stac, as of 2017-03-17.

educe.stac.sanity.checks.graph.horrible_context_kludge(graph, simplified_graph, contexts)¶: Given a graph and its copy, and given a context dictionary, return a copy of the context dictionary that corresponds to the simplified graph. Ugh

educe.stac.sanity.checks.graph.is_arrow_inversion(gra, _, rel)¶: Relation in a graph that goes from textual right to left (may not be a problem)

educe.stac.sanity.checks.graph.is_bad_relset(gra, contexts, relset)¶

True if a set of relation instances has more than one member and it is not whitelisted.

Parameters:	gra (Graph) – Graph for the discourse structure.
	contexts (TODO) – TODO
	relset (set of relation instances) – Set of relation instances on the same DUs; each instance is a pair (udir, rel), where udir is one of {‘src_tgt’, ‘tgt_src’} and rel is the identifier of a relation.
Returns:	res – True if relset contains more than one element and is_whitelisted_relpair returns False.
Return type:	boolean

educe.stac.sanity.checks.graph.is_disconnected(gra, contexts, node)¶

Return True if an EDU is disconnected from a discourse structure.

An EDU is considered disconnected unless:

it has an incoming link or
it has an outgoing Conditional link or
it’s at the beginning of a dialogue

In principle we don’t need to look at EDUs that are disconnected on the outgoing end because (1) it can be legitimate for non-dialogue-ending EDUs to not have outgoing links and (2) such information would be redundant with the incoming anyway.
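
The three exemptions can be sketched as a chain of early returns. This is an illustration under simplifying assumptions, not the library's implementation: `looks_disconnected` is a hypothetical name, and the graph is reduced to two dictionaries mapping an EDU id to the labels of its incoming and outgoing relations, plus a set of dialogue-initial EDU ids.

```python
def looks_disconnected(edu, incoming, outgoing, dialogue_starts):
    """Apply the three exemptions; anything left over is disconnected.

    Hypothetical simplification: incoming/outgoing map an EDU id to a list
    of relation labels, dialogue_starts is a set of EDU ids.
    """
    if incoming.get(edu):
        return False  # has an incoming link
    if "Conditional" in outgoing.get(edu, []):
        return False  # has an outgoing Conditional link
    if edu in dialogue_starts:
        return False  # opens a dialogue
    return True


incoming = {"e2": ["Question-answer_pair"]}
outgoing = {"e1": ["Question-answer_pair"], "e3": ["Conditional"]}
starts = {"e1"}
status = {
    e: looks_disconnected(e, incoming, outgoing, starts)
    for e in ["e1", "e2", "e3", "e4"]
}
```

Only `e4`, which has no links at all and does not open a dialogue, is reported as disconnected.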

educe.stac.sanity.checks.graph.is_dupe_rel(gra, _, rel)¶: Relation instance for which there are other relation instances between the same source/target DUs (regardless of direction)

educe.stac.sanity.checks.graph.is_non2sided_rel(gra, _, rel)¶

Relation instance which does not have exactly one source link and one target link in the graph

How this can possibly happen is a mystery

educe.stac.sanity.checks.graph.is_puncture(gra, _, rel)¶: Relation in a graph that traverses a CDU boundary
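
One way to read "traverses a CDU boundary" is that exactly one endpoint of the relation lies inside some CDU. The sketch below follows that reading, which is an assumption about the check rather than a description of the library's code; `crosses_cdu_boundary` and the `cdu_members` dictionary are hypothetical.

```python
def crosses_cdu_boundary(source, target, cdu_members):
    """True if exactly one endpoint of a relation lies inside some CDU.

    Hypothetical stand-in: cdu_members maps a CDU id to the set of ids
    of the units it contains.
    """
    return any(
        (source in members) != (target in members)
        for members in cdu_members.values()
    )


cdus = {"c1": {"e1", "e2"}}
inside = crosses_cdu_boundary("e1", "e2", cdus)    # both endpoints within c1
puncture = crosses_cdu_boundary("e2", "e3", cdus)  # e3 is outside c1
whole = crosses_cdu_boundary("c1", "e3", cdus)     # links the CDU as a unit
```

A relation attached to the CDU itself does not count as a puncture here, since neither endpoint is a member of the CDU.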

educe.stac.sanity.checks.graph.is_weird_ack(gra, contexts, rel)¶

Relation in a graph that represents a question-answer pair which either does not start with a question, or which ends in a question.

Note the detection process is a lot sloppier when one of the endpoints is a CDU. If all EDUs in the CDU are by the same speaker, we can check as usual; otherwise, all bets are off, so we ignore the relation.

Note: slightly curried to accept contexts as an argument

educe.stac.sanity.checks.graph.is_weird_qap(gra, contexts, rel)¶

Return True if rel is a weird Question-Answer Pair relation.

Parameters:	gra (TODO) – Graph?
	contexts (TODO) – Surrounding context
	rel (TODO) – Relation.
Returns:	res – True if rel is a relation that represents a question answer pair which either does not start with a question, or which ends in a question.
Return type:	boolean
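
The condition in the return value can be sketched on plain strings. How the library decides that an EDU is a question is not stated here, so the sketch assumes a trailing question mark; `looks_like_question` and `weird_qap` are hypothetical names, and the relation is reduced to its label plus the source and target text.

```python
def looks_like_question(text):
    """Assumption: a unit is a question if its text ends with '?'."""
    return text.rstrip().endswith("?")


def weird_qap(label, source_text, target_text):
    """True for a Question-answer_pair that does not start with a question,
    or that ends in one. Other relation labels are never flagged."""
    if label != "Question-answer_pair":
        return False
    return not looks_like_question(source_text) or looks_like_question(target_text)


qap = "Question-answer_pair"
fine = weird_qap(qap, "Anyone have wood?", "Yes, I do.")
no_question = weird_qap(qap, "I have wood.", "Good.")
answer_is_question = weird_qap(qap, "Wood?", "Why?")
```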

educe.stac.sanity.checks.graph.is_whitelisted_relpair(gra, _, relset)¶

True if a pair of instance relations is in PAIRS_WHITELIST.

Parameters:	gra (Graph) – Graph for the discourse structure.
	contexts (TODO) – TODO
	relset (set of relation instances) – Set of relation instances on the same DUs; each instance is a pair (udir, rel), where udir is one of {‘src_tgt’, ‘tgt_src’} and rel is the identifier of a relation.
Returns:	res – True if relset is a pair of relation instances with the same direction and the corresponding pair of relations is explicitly allowed in the whitelist.
Return type:	boolean
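
The whitelist test and the "bad relation set" test that builds on it can be sketched as follows. This is a simplified illustration: the real functions look relation types up in the graph, whereas the sketch takes a hypothetical `labels` dictionary from relation identifier to relation type. Because a set has no order, the sketch accepts a whitelisted pair in either order; whether the library does the same is an assumption.

```python
PAIRS_WHITELIST = [
    ("Contrast", "Comment"),
    ("Narration", "Result"),
    ("Narration", "Continuation"),
    ("Parallel", "Continuation"),
    ("Parallel", "Background"),
    ("Comment", "Acknowledgement"),
    ("Parallel", "Acknowledgement"),
    ("Question-answer_pair", "Contrast"),
    ("Question-answer_pair", "Parallel"),
]


def whitelisted(relset, labels):
    """True if relset is a same-direction pair whose types are whitelisted."""
    if len(relset) != 2:
        return False
    if len({udir for udir, _ in relset}) != 1:
        return False  # the two instances point in opposite directions
    pair = tuple(labels[rel] for _, rel in relset)
    # Set iteration order is arbitrary, so try both orders (assumption).
    return pair in PAIRS_WHITELIST or pair[::-1] in PAIRS_WHITELIST


def bad_relset(relset, labels):
    """True if relset has more than one member and is not whitelisted."""
    return len(relset) > 1 and not whitelisted(relset, labels)


labels = {"r1": "Narration", "r2": "Result", "r3": "Elaboration"}
allowed = bad_relset({("src_tgt", "r1"), ("src_tgt", "r2")}, labels)
opposite = bad_relset({("src_tgt", "r1"), ("tgt_src", "r2")}, labels)
unlisted = bad_relset({("src_tgt", "r1"), ("src_tgt", "r3")}, labels)
single = bad_relset({("src_tgt", "r1")}, labels)
```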

educe.stac.sanity.checks.graph.rel_link_item(doc, contexts, gra, rel)¶: Return a ReportItem for a graph relation

educe.stac.sanity.checks.graph.rfc_violations(inputs, k, gra)¶: Repackage right frontier constraint violations in a somewhat friendlier way

educe.stac.sanity.checks.graph.run(inputs, k)¶: Add any graph errors to the current report

educe.stac.sanity.checks.graph.search_graph_cdu_overlap(inputs, k, gra)¶: Return a ReportItem for every EDU that appears in more than one CDU
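
Finding EDUs that belong to more than one CDU amounts to inverting the CDU membership map. A minimal sketch, assuming membership is available as a plain dictionary from CDU id to member ids; `edus_in_multiple_cdus` is a hypothetical name and the real function works on the graph and returns report items.

```python
from collections import defaultdict


def edus_in_multiple_cdus(cdu_members):
    """Map each EDU found in two or more CDUs to the CDUs that contain it.

    Hypothetical stand-in: cdu_members maps a CDU id to a list of EDU ids.
    """
    containing = defaultdict(list)
    for cdu in sorted(cdu_members):
        for edu in cdu_members[cdu]:
            containing[edu].append(cdu)
    return {edu: cdus for edu, cdus in containing.items() if len(cdus) > 1}


cdus = {"c1": ["e1", "e2"], "c2": ["e2", "e3"]}
overlaps = edus_in_multiple_cdus(cdus)
```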

educe.stac.sanity.checks.graph.search_graph_cdus(inputs, k, gra, pred)¶: Return a ReportItem for any CDU in the graph for which the given predicate is True

educe.stac.sanity.checks.graph.search_graph_edus(inputs, k, gra, pred)¶: Return a ReportItem for any EDU within the graph for which some predicate is true

educe.stac.sanity.checks.graph.search_graph_relations(inputs, k, gra, pred)¶: Return a ReportItem for any relation instance within the graph for which some predicate is true

educe.stac.sanity.checks.graph.search_graph_relations_same_dus(inputs, k, gra, pred)¶

Return a list of ReportItem (one per member of the set) for any set of relation instances within the graph for which some predicate is True.

Parameters:	inputs (educe.stac.sanity.main.SanityChecker) – SanityChecker, with attributes corpus and contexts.
	k (FileId) – Identifier of the desired Glozz document.
	gra (educe.stac.graph.Graph) – Graph that corresponds to the discourse structure (?).
	pred (function from (gra, contexts, rel_set) to boolean) – Predicate function.
Returns:	report_items – One ReportItem for each relation instance that belongs to a set of instances, on the same DUs, where pred is True.
Return type:	list of ReportItem
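
The grouping step behind this search can be sketched on plain tuples. The sketch is an illustration under stated assumptions: relations are `(rel_id, source, target)` triples, the direction tag is derived by sorting the two endpoint ids, and the predicate receives only the relation set instead of `(gra, contexts, rel_set)`. The names `group_by_dus` and `search_same_dus` are hypothetical.

```python
from collections import defaultdict


def group_by_dus(relations):
    """Group relation instances by their unordered pair of endpoints.

    Each group is a set of (udir, rel_id) pairs, where udir records whether
    the instance runs along or against the sorted endpoint order (assumption).
    """
    groups = defaultdict(set)
    for rel_id, source, target in relations:
        first, second = sorted([source, target])
        udir = "src_tgt" if (source, target) == (first, second) else "tgt_src"
        groups[(first, second)].add((udir, rel_id))
    return groups


def search_same_dus(relations, pred):
    """Return one entry per relation in every group for which pred is True."""
    flagged = []
    for relset in group_by_dus(relations).values():
        if pred(relset):
            flagged.extend(sorted(rel for _, rel in relset))
    return flagged


relations = [("r1", "e1", "e2"), ("r2", "e2", "e1"), ("r3", "e2", "e3")]
flagged = search_same_dus(relations, lambda relset: len(relset) > 1)
```

Both `r1` and `r2` are reported because they connect the same two DUs, even though they point in opposite directions.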

educe.stac.sanity.checks.type_err module¶

STAC sanity-check: type errors

educe.stac.sanity.checks.type_err.has_non_du_member(anno)¶: True if anno is a relation that points to another relation, or if it’s a CDU that has relation members

educe.stac.sanity.checks.type_err.is_non_du(anno)¶: True if the annotation is neither an EDU nor a CDU

educe.stac.sanity.checks.type_err.is_non_preference(anno)¶: True if the annotation is NOT a preference

educe.stac.sanity.checks.type_err.is_non_resource(anno)¶: True if the annotation is NOT a resource

educe.stac.sanity.checks.type_err.run(inputs, k)¶: Add any annotation type errors to the current report

educe.stac.sanity.checks.type_err.search_anaphora(inputs, k, pred)¶: Return a ReportItem for any anaphora annotation in which at least one member (not the annotation itself) satisfies the given predicate

educe.stac.sanity.checks.type_err.search_preferences(inputs, k, pred)¶: Return a ReportItem for any Preferences schema which has at least one member for which the predicate is True

educe.stac.sanity.checks.type_err.search_resource_groups(inputs, k, pred)¶: Return a ReportItem for any Several_resources schema which has at least one member for which the predicate is True