educe.stac.sanity.checks package¶
Submodules¶
educe.stac.sanity.checks.annotation module¶
STAC sanity-check: annotation oversights
-
class
educe.stac.sanity.checks.annotation.
FeatureItem
(doc, contexts, anno, attrs, status='missing')¶ Bases:
educe.stac.sanity.common.ContextItem
Annotations that are missing some feature(s)
-
annotations
()¶
-
html
()¶
-
-
educe.stac.sanity.checks.annotation.
is_blank_edu
(anno)¶ True if the annotation looks like it may be an unannotated EDU
-
educe.stac.sanity.checks.annotation.
is_cross_dialogue
(contexts)¶ The units connected by this relation (or CDU) do not inhabit the same dialogue.
-
educe.stac.sanity.checks.annotation.
is_fixme
(feature_value)¶ True if a feature value has a fixme value
-
educe.stac.sanity.checks.annotation.
is_review_edu
(anno)¶ True if the annotation has a FIXME tagged type
-
educe.stac.sanity.checks.annotation.
missing_features
(doc, anno)¶ Return set of attribute names for any expected features that may be missing for this annotation
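As a rough illustration of the idea, a missing-feature check can be sketched as a filter over expected attribute names. This is not the educe implementation: `missing_feature_names`, its two parameters, and the sample feature names are hypothetical stand-ins, and treating an empty value as missing is an assumption.

```python
def missing_feature_names(expected, present):
    """Return the expected feature names that are absent or empty.

    `expected` is an iterable of attribute names and `present` maps
    attribute names to values. Both are hypothetical stand-ins for
    whatever the real function derives from `doc` and `anno`.
    """
    # Assumption: a feature with an empty value counts as missing.
    return {name for name in expected if not present.get(name)}


# Illustrative feature names only, not checked against the STAC schema.
expected = ["Addressee", "Surface_act"]
features = {"Addressee": "All", "Surface_act": ""}
missing = missing_feature_names(expected, features)
```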
-
educe.stac.sanity.checks.annotation.
run
(inputs, k)¶ Add any annotation omission errors to the current report
-
educe.stac.sanity.checks.annotation.
search_for_fixme_features
(inputs, k)¶ Return a ReportItem for any annotations in the document whose features have a fixme type
-
educe.stac.sanity.checks.annotation.
search_for_missing_rel_feats
(inputs, k)¶ Return ReportItems for any relations that are missing expected features
-
educe.stac.sanity.checks.annotation.
search_for_missing_unit_feats
(inputs, k)¶ Return ReportItems for any EDUs and CDUs that are missing expected features
-
educe.stac.sanity.checks.annotation.
search_for_unexpected_feats
(inputs, k)¶ Return ReportItems for any annotations that have features we were not expecting them to have
-
educe.stac.sanity.checks.annotation.
unexpected_features
(_, anno)¶ Return set of attribute names for any features that we were not expecting to see in the given annotation
educe.stac.sanity.checks.glozz module¶
Sanity checker: low-level Glozz errors
-
class
educe.stac.sanity.checks.glozz.
BadIdItem
(doc, contexts, anno, expected_id)¶ Bases:
educe.stac.sanity.common.ContextItem
An annotation whose identifier does not match its metadata
-
text
()¶
-
-
class
educe.stac.sanity.checks.glozz.
DuplicateItem
(doc, contexts, anno, others)¶ Bases:
educe.stac.sanity.common.ContextItem
An annotation which shares an id with another
-
text
()¶
-
-
class
educe.stac.sanity.checks.glozz.
IdMismatch
(doc, contexts, unit1, unit2)¶ Bases:
educe.stac.sanity.common.ContextItem
An annotation which seems to have an equivalent in some twin but with the wrong identifier
-
annotations
()¶
-
html
()¶
-
-
exception
educe.stac.sanity.checks.glozz.
MissingDocumentException
(k)¶ Bases:
exceptions.Exception
A document we are trying to cross check does not have the expected twin
-
class
educe.stac.sanity.checks.glozz.
MissingItem
(status, doc1, contexts1, unit, doc2, contexts2, approx)¶ Bases:
educe.stac.sanity.report.ReportItem
An annotation which is missing in some document twin (or which looks like it may have been unexpectedly added)
-
excess_status
= 'ADDED'¶
-
html
()¶
-
missing_status
= 'DELETED'¶
-
status_len
= 7¶
-
text_span
()¶ Return the span for the annotation in question
-
-
class
educe.stac.sanity.checks.glozz.
OffByOneItem
(doc, contexts, unit)¶ Bases:
educe.stac.sanity.common.UnitItem
An annotation whose boundaries might be off by one
-
html
()¶
-
html_turn_info
(parent, turn)¶ Given a turn annotation, append a prettified HTML representation of the turn text (highlighting parts of it, such as the turn number)
-
-
class
educe.stac.sanity.checks.glozz.
OverlapItem
(doc, contexts, anno, overlaps)¶ Bases:
educe.stac.sanity.common.ContextItem
An annotation whose span overlaps with that of another
-
annotations
()¶
-
html
()¶
-
-
educe.stac.sanity.checks.glozz.
bad_ids
(inputs, k)¶ Return annotations whose identifiers do not match their metadata
-
educe.stac.sanity.checks.glozz.
check_unit_ids
(inputs, key1, key2)¶ Return annotations that match in the two documents modulo identifiers. This might arise if somebody creates a duplicate annotation in place and annotates that
-
educe.stac.sanity.checks.glozz.
cross_check_against
(inputs, key1, stage='unannotated')¶ Compare annotations with their equivalents on a twin document in the corpus
-
educe.stac.sanity.checks.glozz.
cross_check_units
(inputs, key1, key2, status)¶ Return tuples for certain corpus[key1] units not present in corpus[key2]
-
educe.stac.sanity.checks.glozz.
duplicate_annotations
(inputs, k)¶ Multiple annotations with the same local_id()
-
educe.stac.sanity.checks.glozz.
filter_matches
(unit, other_units)¶ Return any unit-level annotations in other_units that look like they may be the same as the given annotation
-
educe.stac.sanity.checks.glozz.
is_maybe_off_by_one
(text, anno)¶ True if an annotation has non-whitespace characters on its immediate left/right
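The off-by-one test described here can be sketched on plain strings and offsets. The helper below is a hypothetical simplification: it takes `start` and `end` offsets instead of an annotation object, and it assumes that touching the start or end of the text is acceptable, which the description does not confirm.

```python
def maybe_off_by_one(text, start, end):
    """True if text[start:end] has a non-whitespace neighbour.

    `start` and `end` are character offsets (end exclusive); they stand
    in for the span of a real annotation.
    """
    # Assumption: the very start and end of the text count as clean edges.
    left_ok = start == 0 or text[start - 1].isspace()
    right_ok = end >= len(text) or text[end].isspace()
    return not (left_ok and right_ok)


text = "hello world"
clean = maybe_off_by_one(text, 0, 5)      # span covers "hello"
clipped = maybe_off_by_one(text, 1, 5)    # span covers "ello"
```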
-
educe.stac.sanity.checks.glozz.
overlapping
(inputs, k, is_overlap)¶ Return items for annotations that have overlaps
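A pairwise overlap search with a pluggable predicate might look roughly like the sketch below. It works on a plain dict of `(start, end)` tuples rather than on educe annotations; `spans_overlap` and `find_overlaps` are hypothetical names, and the half-open span convention is an assumption.

```python
from itertools import combinations


def spans_overlap(span1, span2):
    """True if two (start, end) spans share at least one character."""
    # Assumption: spans are half-open, so touching spans do not overlap.
    return span1[0] < span2[1] and span2[0] < span1[1]


def find_overlaps(spans, is_overlap=spans_overlap):
    """Map each annotation id to the ids it overlaps with."""
    result = {}
    for (id1, span1), (id2, span2) in combinations(sorted(spans.items()), 2):
        if is_overlap(span1, span2):
            result.setdefault(id1, []).append(id2)
            result.setdefault(id2, []).append(id1)
    return result


overlaps = find_overlaps({"a": (0, 5), "b": (3, 8), "c": (8, 10)})
```

Passing the predicate as an argument mirrors the `is_overlap` parameter in the signature above, so stricter or looser notions of overlap can reuse the same loop.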
-
educe.stac.sanity.checks.glozz.
overlapping_structs
(inputs, k)¶ Return items for structural annotations that have overlaps
-
educe.stac.sanity.checks.glozz.
run
(inputs, k)¶ Add any glozz errors to the current report
-
educe.stac.sanity.checks.glozz.
search_glozz_off_by_one
(inputs, k)¶ EDUs which have non-whitespace (or boundary) characters either on their right or left
educe.stac.sanity.checks.graph module¶
Sanity checker: fancy graph-based errors
-
educe.stac.sanity.checks.graph.
BACKWARDS_WHITELIST
= ['Conditional']¶ relations that are allowed to go backwards
-
class
educe.stac.sanity.checks.graph.
CduOverlapItem
(doc, contexts, anno, cdus)¶ Bases:
educe.stac.sanity.common.ContextItem
EDUs that appear in more than one CDU
-
annotations
()¶
-
html
()¶
-
-
educe.stac.sanity.checks.graph.
PAIRS_WHITELIST
= [('Contrast', 'Comment'), ('Narration', 'Result'), ('Narration', 'Continuation'), ('Parallel', 'Continuation'), ('Parallel', 'Background'), ('Comment', 'Acknowledgement'), ('Parallel', 'Acknowledgement'), ('Question-answer_pair', 'Contrast'), ('Question-answer_pair', 'Parallel')]¶ pairs of relations that are explicitly allowed between the same source/target DUs
-
educe.stac.sanity.checks.graph.
are_single_headed_cdus
(inputs, k, gra)¶ Check that each CDU has exactly one head DU.
Parameters: gra (Graph) – Graph for the discourse structure.
Returns: report_items – List of report items, one per faulty CDU.
Return type: list of ReportItem
-
educe.stac.sanity.checks.graph.
dialogue_graphs
(k, doc, contexts)¶ Return a dict from dialogue annotations to subgraphs containing at least everything in that dialogue (and perhaps some connected items).
Parameters: - k (FileId) – File identifier
- doc (TODO) – TODO
- contexts (dict(Annotation, Context)) – Context for each annotation.
Returns: graphs – Graph for each dialogue.
Return type:
Notes
MM: I could not find any caller for this function in either educe or irit-stac, as of 2017-03-17.
-
educe.stac.sanity.checks.graph.
horrible_context_kludge
(graph, simplified_graph, contexts)¶ Given a graph and its copy, and given a context dictionary, return a copy of the context dictionary that corresponds to the simplified graph. Ugh
-
educe.stac.sanity.checks.graph.
is_arrow_inversion
(gra, _, rel)¶ Relation in a graph that goes from textual right to left (may not be a problem)
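A right-to-left test of this kind can be sketched by comparing the spans of the two endpoints. The helper below is hypothetical and works on plain tuples; consulting `BACKWARDS_WHITELIST` at this point is an assumption, since the description does not say where the whitelist is applied.

```python
BACKWARDS_WHITELIST = ['Conditional']  # value as documented for this module


def goes_backwards(src_span, tgt_span, rel_type,
                   whitelist=BACKWARDS_WHITELIST):
    """True if the source starts textually after the target.

    Spans are (start, end) tuples, compared lexicographically.
    """
    # Assumption: whitelisted relation types are never reported.
    return src_span > tgt_span and rel_type not in whitelist


# 'Elaboration' is only an illustrative relation label.
flagged = goes_backwards((10, 20), (0, 5), 'Elaboration')
allowed = goes_backwards((10, 20), (0, 5), 'Conditional')
```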
-
educe.stac.sanity.checks.graph.
is_bad_relset
(gra, contexts, relset)¶ True if a set of relation instances has more than one member and it is not whitelisted.
Parameters: - gra (Graph) – Graph for the discourse structure.
- contexts (TODO) – TODO
- relset (set of relation instances) – Set of relation instances on the same DUs; each instance is a pair (udir, rel), where: udir is one of {‘src_tgt’, ‘tgt_src’} and rel is the identifier of a relation.
Returns: res – True if relset contains more than one element and is_whitelisted_relpair returns False.
Return type: boolean
-
educe.stac.sanity.checks.graph.
is_disconnected
(gra, contexts, node)¶ Return True if an EDU is disconnected from a discourse structure.
An EDU is considered disconnected unless:
- it has an incoming link or
- it has an outgoing Conditional link or
- it’s at the beginning of a dialogue
In principle we don’t need to look at EDUs that are disconnected on the outgoing end because (1) it can be legitimate for non-dialogue-ending EDUs to not have outgoing links and (2) such information would be redundant with the incoming anyway.
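The three exemptions listed above can be sketched as a chain of early returns. The function and its dict and set arguments are hypothetical stand-ins for lookups on the real graph and contexts.

```python
def looks_disconnected(edu, incoming, outgoing_types, dialogue_starts):
    """True unless one of the three documented exemptions applies.

    incoming:        dict from EDU id to the ids that link to it
    outgoing_types:  dict from EDU id to the types of its outgoing links
    dialogue_starts: set of EDU ids that open a dialogue
    """
    if incoming.get(edu):
        return False  # it has an incoming link
    if 'Conditional' in outgoing_types.get(edu, ()):
        return False  # it has an outgoing Conditional link
    if edu in dialogue_starts:
        return False  # it is at the beginning of a dialogue
    return True


incoming = {"e2": ["e1"]}
outgoing_types = {"e3": ["Conditional"]}
starts = {"e1"}
status = {edu: looks_disconnected(edu, incoming, outgoing_types, starts)
          for edu in ["e1", "e2", "e3", "e4"]}
```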
-
educe.stac.sanity.checks.graph.
is_dupe_rel
(gra, _, rel)¶ Relation instance for which there are relation instances between the same source/target DUs (regardless of direction)
-
educe.stac.sanity.checks.graph.
is_non2sided_rel
(gra, _, rel)¶ Relation instance which does not have exactly a source and target link in the graph
How this can possibly happen is a mystery
-
educe.stac.sanity.checks.graph.
is_puncture
(gra, _, rel)¶ Relation in a graph that traverses a CDU boundary
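One possible reading of "crossing a CDU boundary" is that exactly one endpoint of the relation sits inside some CDU. The sketch below encodes that reading only; the helper name, the dict of member sets, and the assumption that nested membership has already been flattened are all hypothetical.

```python
def punctures(src, tgt, cdu_members):
    """True if exactly one endpoint lies inside some CDU.

    `cdu_members` maps a CDU id to the set of ids it contains.
    Assumption: nested members are already included in each set.
    """
    for members in cdu_members.values():
        if (src in members) != (tgt in members):
            return True
    return False


cdus = {"c1": {"e1", "e2"}}
crossing = punctures("e1", "e3", cdus)
internal = punctures("e1", "e2", cdus)
```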
-
educe.stac.sanity.checks.graph.
is_weird_ack
(gra, contexts, rel)¶ Relation in a graph that represents a question answer pair which either does not start with a question, or which ends in a question.
Note the detection process is a lot sloppier when one of the endpoints is a CDU. If all EDUs in the CDU are by the same speaker, we can check as usual; otherwise, all bets are off, so we ignore the relation.
Note: slightly curried to accept contexts as an argument
-
educe.stac.sanity.checks.graph.
is_weird_qap
(gra, contexts, rel)¶ Return True if rel is a weird Question-Answer Pair relation.
Parameters: - gra (TODO) – Graph?
- contexts (TODO) – Surrounding context
- rel (TODO) – Relation.
Returns: res – True if rel is a relation that represents a question answer pair which either does not start with a question, or which ends in a question.
Return type: boolean
-
educe.stac.sanity.checks.graph.
is_whitelisted_relpair
(gra, _, relset)¶ True if a pair of instance relations is in PAIRS_WHITELIST.
Parameters: - gra (Graph) – Graph for the discourse structure.
- contexts (TODO) – TODO
- relset (set of relation instances) – Set of relation instances on the same DUs; each instance is a pair (udir, rel), where: udir is one of {‘src_tgt’, ‘tgt_src’} and rel is the identifier of a relation.
Returns: res – True if relset is a pair of relation instances with the same direction and the corresponding pair of relations is explicitly allowed in the whitelist.
Return type: boolean
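The whitelist test, together with the `is_bad_relset` rule documented earlier in this module, can be sketched on plain data. Here `rel_types` is a hypothetical dict from relation identifier to relation type, standing in for a graph lookup, and accepting a whitelisted pair in either order is an assumption.

```python
PAIRS_WHITELIST = [
    ('Contrast', 'Comment'), ('Narration', 'Result'),
    ('Narration', 'Continuation'), ('Parallel', 'Continuation'),
    ('Parallel', 'Background'), ('Comment', 'Acknowledgement'),
    ('Parallel', 'Acknowledgement'), ('Question-answer_pair', 'Contrast'),
    ('Question-answer_pair', 'Parallel'),
]  # values as documented for this module


def pair_is_whitelisted(relset, rel_types):
    """True for two same-direction instances whose types are allowed."""
    if len(relset) != 2:
        return False
    (dir1, rel1), (dir2, rel2) = relset
    if dir1 != dir2:
        return False
    type1, type2 = rel_types[rel1], rel_types[rel2]
    # Assumption: pairs are unordered, so both orders are tried.
    return ((type1, type2) in PAIRS_WHITELIST
            or (type2, type1) in PAIRS_WHITELIST)


def relset_is_bad(relset, rel_types):
    """True if the set has several members and is not whitelisted."""
    return len(relset) > 1 and not pair_is_whitelisted(relset, rel_types)


rel_types = {"r1": "Contrast", "r2": "Comment", "r3": "Result"}
ok_pair = {("src_tgt", "r1"), ("src_tgt", "r2")}
bad_pair = {("src_tgt", "r1"), ("src_tgt", "r3")}
```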
-
educe.stac.sanity.checks.graph.
rel_link_item
(doc, contexts, gra, rel)¶ Return a ReportItem for a graph relation
-
educe.stac.sanity.checks.graph.
rfc_violations
(inputs, k, gra)¶ Repackage right frontier constraint violations in a somewhat friendlier way
-
educe.stac.sanity.checks.graph.
run
(inputs, k)¶ Add any graph errors to the current report
-
educe.stac.sanity.checks.graph.
search_graph_cdu_overlap
(inputs, k, gra)¶ Return a ReportItem for every EDU that appears in more than one CDU
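Finding EDUs shared between CDUs amounts to inverting the membership relation and keeping the entries with more than one container. A minimal sketch on plain dicts follows; the function name and the input shape are hypothetical, and the real function returns ReportItems rather than a dict.

```python
from collections import defaultdict


def edus_in_several_cdus(cdu_members):
    """Map each EDU found in more than one CDU to the CDUs holding it.

    `cdu_members` maps a CDU id to the list of EDU ids it contains.
    """
    containing = defaultdict(list)
    for cdu, members in sorted(cdu_members.items()):
        for edu in members:
            containing[edu].append(cdu)
    return {edu: cdus for edu, cdus in containing.items() if len(cdus) > 1}


shared = edus_in_several_cdus({"c1": ["e1", "e2"], "c2": ["e2", "e3"]})
```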
-
educe.stac.sanity.checks.graph.
search_graph_cdus
(inputs, k, gra, pred)¶ Return a ReportItem for any CDU in the graph for which the given predicate is True
-
educe.stac.sanity.checks.graph.
search_graph_edus
(inputs, k, gra, pred)¶ Return a ReportItem for any EDU within the graph for which some predicate is true
-
educe.stac.sanity.checks.graph.
search_graph_relations
(inputs, k, gra, pred)¶ Return a ReportItem for any relation instance within the graph for which some predicate is true
-
educe.stac.sanity.checks.graph.
search_graph_relations_same_dus
(inputs, k, gra, pred)¶ Return a list of ReportItem (one per member of the set) for any set of relation instances within the graph for which some predicate is True.
Parameters: - inputs (educe.stac.sanity.main.SanityChecker) – SanityChecker, with attributes corpus and contexts.
- k (FileId) – Identifier of the desired Glozz document.
- gra (educe.stac.graph.Graph) – Graph that corresponds to the discourse structure (?).
- pred (function from (gra, contexts, rel_set) to boolean) – Predicate function.
Returns: report_items – One ReportItem for each relation instance that belongs to a set of instances, on the same DUs, where pred is True.
Return type: list of ReportItem
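The grouping step can be sketched as follows: relation instances are bucketed by their unordered pair of DUs, each instance is tagged with a direction, and the predicate is applied to each bucket. Everything here is a hypothetical simplification. In particular the predicate takes only the set (the documented one also receives `gra` and `contexts`), and deriving `udir` from the sorted order of the two endpoint ids is an assumption.

```python
from collections import defaultdict


def relations_on_same_dus(relations, pred):
    """Return sorted relation ids from every flagged same-DU set.

    `relations` maps a relation id to a (source, target) pair of DU ids.
    """
    groups = defaultdict(set)
    for rel_id, (src, tgt) in relations.items():
        # Assumption: direction is relative to the sorted endpoint ids.
        first, _ = sorted((src, tgt))
        udir = 'src_tgt' if src == first else 'tgt_src'
        groups[frozenset((src, tgt))].add((udir, rel_id))
    flagged = []
    for relset in groups.values():
        if pred(relset):
            flagged.extend(rel_id for _, rel_id in relset)
    return sorted(flagged)


relations = {"r1": ("a", "b"), "r2": ("b", "a"), "r3": ("a", "c")}
doubled = relations_on_same_dus(relations, lambda relset: len(relset) > 1)
```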
educe.stac.sanity.checks.type_err module¶
STAC sanity-check: type errors
-
educe.stac.sanity.checks.type_err.
has_non_du_member
(anno)¶ True if anno is a relation that points to another relation, or if it’s a CDU that has relation members
-
educe.stac.sanity.checks.type_err.
is_non_du
(anno)¶ True if the annotation is neither an EDU nor a CDU
-
educe.stac.sanity.checks.type_err.
is_non_preference
(anno)¶ True if the annotation is NOT a preference
-
educe.stac.sanity.checks.type_err.
is_non_resource
(anno)¶ True if the annotation is NOT a resource
-
educe.stac.sanity.checks.type_err.
run
(inputs, k)¶ Add any annotation type errors to the current report
-
educe.stac.sanity.checks.type_err.
search_anaphora
(inputs, k, pred)¶ Return a ReportItem for any anaphora annotation which has at least one member (not the annotation itself) for which the given predicate is True
-
educe.stac.sanity.checks.type_err.
search_preferences
(inputs, k, pred)¶ Return a ReportItem for any Preferences schema which has at least one member for which the predicate is True
-
educe.stac.sanity.checks.type_err.
search_resource_groups
(inputs, k, pred)¶ Return a ReportItem for any Several_resources schema which has at least one member for which the predicate is True