educe package¶
Note: At the time of this writing, this is a slightly idealised representation of the package. See below for notes on where things get a bit messier.
The educe library provides utilities for working with annotated discourse corpora. It has a three-layer structure:
- base layer (files, annotations, fusion, graphs)
- tool layer (specific to tools, file formats, etc)
- project layer (specific to particular corpora, currently stac)
Layers¶
Working our way up the tower, the base layer provides four sublayers:
- file management (educe.corpus): basic model for corpus traversal, for selecting slices of the corpus
- annotation: (educe.annotation), representation of annotated texts, adhering closely to whatever annotation tool produced it.
- fusion (in progress): connections between annotations on different layers (eg. on speech acts for text spans, discourse relations), or from different tools (eg. from a POS tagger, a parser, etc)
- graph (educe.graph): high-level/abstract representation of discourse structure, allowing for queries on the structures themselves (eg. give me all pairs for discourse units separated by at most 3 nodes in the graph)
Building on the base layer, we have modules that are specific to a particular set of annotation tools; currently this is only educe.glozz. We aim to add modules sparingly.
Finally, on top of this, we have the project layer (eg. educe.stac), which keeps track of conventions specific to a particular corpus. The hope is that most of your script writing will deal with this layer directly, eg. for STAC:
            stac                            [project layer]
              |
     +--------+-------------+--------+
     |        |             |        |
     |        v             |        |
     |      glozz           |        |      [tool layer]
     |        |             |        |
     v        v             v        v
  corpus -> annotation <- fusion <- graph   [base layer]
Support for other projects would consist of writing other project-layer modules that map down to the tool layer.
Departures from the ideal (2013-05-23)¶
Educe is still in its early stages. Some departures you may want to be aware of:
- fusion layer does not really exist yet; educe.annotation currently takes on some of the job (for example, the text_span function makes annotations of different types more or less comparable)
- layer violations: ideally we want lower layers to be abstract from things above them, but you may find eg. glozz-specific assumptions in the base layer, which isn’t great.
- inconsistency in encapsulation: educe.stac doesn’t wrap everything below it (it’s also not clear yet if it should). It currently wraps educe.glozz and educe.corpus (so by rights you shouldn’t really need to import them), but not the graph stuff for example.
Subpackages¶
- educe.external package
- educe.learning package
- educe.pdtb package
- educe.ptb package
- educe.rst_dt package
- Subpackages
- Submodules
- educe.rst_dt.annotation module
- educe.rst_dt.corpus module
- educe.rst_dt.deptree module
- educe.rst_dt.document_plus module
- educe.rst_dt.graph module
- educe.rst_dt.parse module
- educe.rst_dt.ptb module
- educe.rst_dt.rst_wsj_corpus module
- educe.rst_dt.sdrt module
- educe.rst_dt.text module
- educe.stac package
- Subpackages
- Submodules
- educe.stac.annotation module
- educe.stac.context module
- educe.stac.corenlp module
- educe.stac.corpus module
- educe.stac.fake_graph module
- educe.stac.fusion module
- educe.stac.graph module
- educe.stac.postag module
- educe.stac.rfc module
Submodules¶
educe.annotation module¶
Low-level representation of corpus annotations, following somewhat faithfully the Glozz model for annotations.
This is low-level in the sense that we make little attempt to interpret the information stored in these annotations. For example, a relation might claim to link two units of id unit42 and unit43. This being a low-level representation, we simply note the fact. A higher-level representation might attempt to actually make the corresponding units available to you, or perhaps provide some sort of graph representation of them
-
class
educe.annotation.
Annotation
(anno_id, span, atype, features, metadata=None, origin=None)¶ Bases:
educe.annotation.Standoff
Any sort of annotation.
Annotations tend to have:
- span: some sort of location (what they are annotating)
- type: some key label (what we call a type)
- features: an attribute to value dictionary
-
identifier
()¶ Global identifier if possible, else local identifier.
String representation of an identifier that should be unique to this corpus at least.
If the unit has an origin (see “FileId”), we use the
- document
- subdocument
- stage
- (but not the annotator!)
- and the id from the XML file
If we don’t have an origin we fall back to just the id provided by the XML file.
See also position as potentially a safer alternative to this (and what we mean by safer)
-
local_id
()¶ Local identifier.
An identifier which is sufficient to pick out this annotation within a single annotation file.
-
-
class
educe.annotation.
Document
(units, relations, schemas, text)¶ Bases:
educe.annotation.Standoff
A single (sub)-document.
This can be seen as collections of unit, relation, and schema annotations
-
annotations
()¶ All annotations associated with this document
-
fleshout
(origin)¶ See set_origin
-
global_id
(local_id)¶ String representation of an identifier that should be unique to this corpus at least.
-
set_origin
(origin)¶ If you have more than one document, it’s a good idea to set its origin to a file ID so that you can more reliably tell the annotations apart.
-
text
(span=None)¶ Return the text associated with these annotations (or None), optionally limited to a span
-
-
class
educe.annotation.
RelSpan
(t1, t2)¶ Bases:
object
Which two units a relation connects.
-
t1
= None¶ string – id of an annotation
-
t2
= None¶ string – id of an annotation
-
-
class
educe.annotation.
Relation
(rel_id, span, rtype, features, metadata=None)¶ Bases:
educe.annotation.Annotation
An annotation between two annotations.
Relations are directed; see RelSpan for details
Use the source and target field to grab these respective annotations, but note that they are only instantiated after fleshout is called (corpus slurping normally fleshes out documents and thus their relations).
-
fleshout
(objects)¶ Given a dictionary mapping ids to annotation objects, set this relation’s source and target fields.
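To make the two-step life cycle concrete (ids first, objects after fleshing out), here is a minimal sketch. MiniRelation is a hypothetical stand-in, not the educe class, and treating t1 as the source and t2 as the target is an assumption made for illustration.

```python
class MiniRelation:
    """Hypothetical stand-in for a relation that only knows ids at first."""

    def __init__(self, rel_id, t1, t2):
        self.rel_id = rel_id
        self.t1 = t1          # id of one endpoint (assumed here: the source)
        self.t2 = t2          # id of the other endpoint (assumed: the target)
        self.source = None    # filled in by fleshout
        self.target = None    # filled in by fleshout

    def fleshout(self, objects):
        """Resolve the stored ids against a dict from ids to objects."""
        self.source = objects[self.t1]
        self.target = objects[self.t2]


# Plain strings play the role of unit annotations here
units = {"unit42": "first unit", "unit43": "second unit"}

rel = MiniRelation("rel1", "unit42", "unit43")
before = (rel.source, rel.target)   # nothing resolved yet
rel.fleshout(units)
after = (rel.source, rel.target)    # ids replaced by the objects themselves
print(before, after)
```

Until fleshout has run, code that reads source or target only sees None, which is why corpus slurping normally does the fleshing out for you.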
-
source
= None¶ source annotation; will be defined by fleshout
-
target
= None¶ target annotation; will be defined by fleshout
-
-
class
educe.annotation.
Schema
(rel_id, units, relations, schemas, stype, features, metadata=None)¶ Bases:
educe.annotation.Annotation
An annotation between a set of annotations
Use the members field to grab the annotations themselves. But note that it is only created when fleshout is called.
-
fleshout
(objects)¶ Given a dictionary mapping ids to annotation objects, set this schema’s members field to point to the appropriate objects
-
terminals
()¶ All unit-level annotations contained in this schema (or, recursively, in schemas contained herein)
-
-
class
educe.annotation.
Span
(start, end)¶ Bases:
object
What portion of text an annotation corresponds to. Assumed to be in terms of character offsets
The way we interpret spans in educe amounts to how Python interprets array slice indices.
One way to understand them is to think of offsets as sitting in between individual characters
 h   o   w   d   y
0   1   2   3   4   5
So (0,5) covers the whole word above, and (1,2) picks out the letter “o”
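Because spans follow Python's slice conventions, ordinary string slicing is a quick way to sanity-check a pair of offsets. The snippet below uses only built-in slicing, not educe itself.

```python
word = "howdy"

# Offsets sit between characters, so (0, 5) covers the whole word
whole = word[0:5]

# ... and (1, 2) picks out the single character between offsets 1 and 2
letter = word[1:2]

# Under this convention the length of a span is simply end - start
length = 2 - 1

print(whole, letter, length)
```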
-
absolute
(other)¶ Assuming this span is relative to some other span, return a suitably shifted “absolute” copy.
-
encloses
(other)¶ Return True if this span includes the argument
Note that x.encloses(x) == True
Corner case: x.encloses(None) == False
See also educe.graph.EnclosureGraph if you might be repeating these checks
-
length
()¶ Return the length of this span
-
merge
(other)¶ Return a span that stretches from the beginning to the end of the two spans. Whereas overlaps can be thought of as returning the intersection of two spans, this can be thought of as returning the union.
-
classmethod
merge_all
(spans)¶ Return a span that stretches from the beginning to the end of all the spans in the list
-
overlaps
(other, inclusive=False)¶ Return the overlapping region if two spans have regions in common, or else None.
Span(5, 10).overlaps(Span(8, 12)) == Span(8, 10)
Span(5, 10).overlaps(Span(11, 12)) == None
If inclusive == True, spans with touching edges are considered to overlap
Span(5, 10).overlaps(Span(10, 12)) == None
Span(5, 10).overlaps(Span(10, 12), inclusive=True) == Span(10, 10)
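The intersection behaviour described above can be sketched in a few lines. MiniSpan and the free function overlaps are hypothetical stand-ins written for this illustration; they are not the educe implementation, only one way to reproduce the documented examples.

```python
from collections import namedtuple

# Hypothetical stand-in for educe.annotation.Span (illustration only)
MiniSpan = namedtuple("MiniSpan", ["start", "end"])


def overlaps(a, b, inclusive=False):
    """Return the region common to a and b, or None if there is none."""
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    # A non-empty intersection always counts; an empty one (touching
    # edges) only counts when inclusive is set
    if start < end or (inclusive and start == end):
        return MiniSpan(start, end)
    return None


r1 = overlaps(MiniSpan(5, 10), MiniSpan(8, 12))
r2 = overlaps(MiniSpan(5, 10), MiniSpan(11, 12))
r3 = overlaps(MiniSpan(5, 10), MiniSpan(10, 12))
r4 = overlaps(MiniSpan(5, 10), MiniSpan(10, 12), inclusive=True)
print(r1, r2, r3, r4)
```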
-
relative
(other)¶ Assuming this span is absolute and lies within some other span, return a suitably shifted “relative” copy (the inverse of absolute).
-
shift
(offset)¶ Return a copy of this span, shifted to the right (if offset is positive) or left (if negative).
It may be a bit more convenient to use ‘absolute/relative’ if you’re trying to work with spans that are within other spans.
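As a rough illustration of how shifting relates to the absolute/relative pair, the sketch below assumes that both simply shift by the start offset of the enclosing span, in opposite directions. That reading is an assumption, and MiniSpan and the helper functions are hypothetical stand-ins rather than educe code.

```python
from collections import namedtuple

# Hypothetical stand-in for educe.annotation.Span (illustration only)
MiniSpan = namedtuple("MiniSpan", ["start", "end"])


def shift(span, offset):
    """Copy of span moved right (positive offset) or left (negative)."""
    return MiniSpan(span.start + offset, span.end + offset)


def absolute(span, container):
    """Assumed reading: span is relative to container; make it absolute."""
    return shift(span, container.start)


def relative(span, container):
    """Assumed reading: span is absolute; make it relative to container."""
    return shift(span, -container.start)


sentence = MiniSpan(100, 150)        # absolute offsets in the document
word_in_sentence = MiniSpan(4, 9)    # offsets counted from the sentence start

word_in_document = absolute(word_in_sentence, sentence)
round_trip = relative(word_in_document, sentence)
print(word_in_document, round_trip)
```

Under this assumption the two operations are inverses, which is what makes them convenient for spans nested inside other spans.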
-
-
class
educe.annotation.
Standoff
(origin=None)¶ Bases:
object
A standoff object ultimately points to some piece of text.
The pointing is not necessarily direct though.
-
origin
¶ educe.corpus.FileId, optional – FileId of the document supporting this standoff.
-
encloses
(other)¶ True if this annotation’s span encloses the span of the other.
s1.encloses(s2) is shorthand for s1.text_span().encloses(s2.text_span())
Parameters: other (Standoff) – Other annotation.
Returns: res – True if this annotation’s span encloses the span of the other.
Return type: boolean
-
overlaps
(other)¶ True if this annotation’s span overlaps with the span of the other.
s1.overlaps(s2) is shorthand for s1.text_span().overlaps(s2.text_span())
Parameters: other (Standoff) – Other annotation.
Returns: res – True if this annotation’s span overlaps with the span of the other.
Return type: boolean
-
text_span
()¶ Return the span from the earliest terminal annotation contained here to the latest.
Corner case: if this is an empty non-terminal (which would be a very weird thing indeed), return None.
Returns: res – Span from the first character of the earliest terminal annotation contained here, to the last character of the latest terminal annotation; None if this annotation has no terminal.
Return type: Span or None
-
-
class
educe.annotation.
Unit
(unit_id, span, utype, features, metadata=None, origin=None)¶ Bases:
educe.annotation.Annotation
Unit annotation.
An annotation over a span of text.
-
position
()¶ The position is the set of purely “geographical” information used to identify an item. So instead of relying on some sort of name, we might rely on its text span. We assume that some name-based elements (document name, subdocument name, stage) can double as being positional.
If the unit has an origin (see “FileId”), we use the
- document
- subdocument
- stage
- (but not the annotator!)
- and its text span
position vs identifier
This is a trade-off. On the one hand, you can see the position as being a safer way to identify a unit, because it obviates having to worry about your naming mechanism guaranteeing stability across the board (eg. two annotators stick an annotation in the same place; does it have the same name). On the other hand, it’s a bit harder to uniquely identify objects that may coincidentally fall in the same span. So how much do you trust your IDs?
-
educe.corpus module¶
Corpus management
-
class
educe.corpus.
FileId
(doc, subdoc, stage, annotator)¶ Information needed to uniquely identify an annotation file.
Note that this includes the annotator, so if you want to do comparisons on the “same” file between annotators you’ll want to ignore this field.
Parameters: - doc (string) – document name
- subdoc (string) – subdocument (often None); sometimes you may have a need to divide a document into smaller pieces (for example, when working with tools that require too much memory to process large documents). The subdocument identifies which piece of the document you are working with. If you don’t have a notion of subdocuments, just use None
- stage (string) – annotation stage; for use if you have distinct files that correspond to different stages of your annotation process (or different processing tools)
- annotator (string) – the annotator (or annotation tool) that generated this annotation file
-
mk_global_id
(local_id)¶ String representation of an identifier that should be unique to this corpus at least.
If the unit has an origin (see “FileId”), we use the
- document
- subdocument
- (but not the stage!)
- (but not the annotator!)
- and the id from the XML file
If we don’t have an origin we fall back to just the id provided by the XML file
See also position as potentially a safer alternative to this (and what we mean by safer)
-
class
educe.corpus.
Reader
(root)¶ Reader provides little more than dictionaries from FileId to data.
Parameters: rootdir (str) – the top directory of the corpus
A potentially useful pattern to apply here is to take a slice of these dictionaries for processing. For example, you might not want to read the whole corpus, but only the files which are modified by certain annotators.
reader = Reader(corpus_dir)
files = reader.files()
subfiles = {k: v for k, v in files.items() if k.annotator in ['Bob', 'Alice']}
corpus = reader.slurp(subfiles)
Alternatively, having read in the entire corpus, you might be doing processing on various slices of it at a time
corpus = reader.slurp()
subcorpus = {k: v for k, v in corpus.items() if k.doc == 'pilot14'}
This is an abstract class; you should use the version from a data-set, eg. educe.stac.Reader instead
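The slicing pattern can be tried out without a corpus on disk. In the sketch below, FileId is a hypothetical namedtuple standing in for educe.corpus.FileId, and the dictionary contents (document names, file paths) are made up purely for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for educe.corpus.FileId (illustration only)
FileId = namedtuple("FileId", ["doc", "subdoc", "stage", "annotator"])

# Made-up stand-in for what reader.files() might return:
# a dictionary from FileId to tuples of file paths
files = {
    FileId("pilot14", None, "units", "Bob"): ("a.aa", "a.ac"),
    FileId("pilot14", None, "units", "Carol"): ("b.aa", "b.ac"),
    FileId("pilot21", None, "discourse", "Alice"): ("c.aa", "c.ac"),
}

# Slice by annotator: keep only the entries whose key satisfies the test
by_annotator = {k: v for k, v in files.items()
                if k.annotator in ["Bob", "Alice"]}

# Slice by document name
by_doc = {k: v for k, v in files.items() if k.doc == "pilot14"}

print(sorted(k.annotator for k in by_annotator))
print(sorted(k.annotator for k in by_doc))
```

Since a slice is just another dictionary keyed by FileId, the same comprehension works equally well on the result of slurp.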
-
files
(doc_glob=None)¶ Return a dictionary from FileId to (tuples of) filepaths. The tuples correspond to files that are considered to ‘belong’ together; for example, in the case of standoff annotation, both the text file and its annotations
Derived classes
Parameters: doc_glob (str, optional) – Glob expression for names of game folders ; if None, subclasses are expected to use the wildcard ‘*’ that matches all strings.
-
filter
(d, pred)¶ Convenience function equivalent to
{ k:v for k,v in d.items() if pred(k) }
-
slurp
(cfiles=None, doc_glob=None, verbose=False)¶ Read the entire corpus if cfiles is None or else the subset specified by cfiles.
Return a dictionary from FileId to educe.annotation.Document
Parameters: - cfiles (dict, optional) – Dict of files like what Reader.files() would return.
- doc_glob (str, optional) – Glob pattern for doc (folder) names ; ignored if cfiles is not None.
- verbose (boolean, defaults to False) – If True, print what we’re reading to stderr.
-
slurp_subcorpus
(cfiles, verbose=False)¶ Derived classes should implement this function
-
educe.glozz module¶
The Glozz file format in educe.annotation form
You’re likely most interested in slurp_corpus and read_annotation_file
-
class
educe.glozz.
GlozzDocument
(hashcode, unit, rels, schemas, text)¶ Bases:
educe.annotation.Document
Representation of a glozz document
-
set_origin
(origin)¶
-
to_xml
(settings=<educe.glozz.GlozzOutputSettings object>)¶
-
-
exception
educe.glozz.
GlozzException
(*args, **kw)¶ Bases:
exceptions.Exception
-
class
educe.glozz.
GlozzOutputSettings
(feature_order, metadata_order)¶ Bases:
object
Non-essential aspects of Glozz XML output, such as the order that feature structures or metadata are written out. Controlling these settings could be useful when you want to automatically modify an existing Glozz document, but produce only minimal textual diffs along the way for revision control, comparability, etc.
-
educe.glozz.
glozz_annotation_to_xml
(self, tag='annotation', settings=<educe.glozz.GlozzOutputSettings object>)¶
-
educe.glozz.
glozz_relation_to_span_xml
(self)¶
-
educe.glozz.
glozz_schema_to_span_xml
(self)¶
-
educe.glozz.
glozz_unit_to_span_xml
(self)¶
-
educe.glozz.
hashcode
(f)¶ Hashcode mechanism as documented in the Glozz manual appendix. Hint: use cStringIO to get the hashcode for a string
-
educe.glozz.
ordered_keys
(preferred, d)¶ Keys from a dictionary starting with ‘preferred’ ones in the order of preference
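One plausible reading of this ordering is sketched below. The function name ordered_keys_sketch is hypothetical, and sorting the remaining keys alphabetically is an assumption; the description above only fixes the order of the preferred keys.

```python
def ordered_keys_sketch(preferred, d):
    """Preferred keys first (in order of preference), then the rest."""
    # Only preferred keys that actually occur in the dictionary are kept
    first = [k for k in preferred if k in d]
    # Assumption: remaining keys are emitted in sorted order
    rest = sorted(k for k in d if k not in preferred)
    return first + rest


metadata = {"lastModifier": "n/a", "author": "bob", "creation-date": "1"}
keys = ordered_keys_sketch(["author", "creation-date", "missing"], metadata)
print(keys)
```

A stable key order of this kind is what keeps textual diffs small when a document is read and written back out.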
-
educe.glozz.
read_annotation_file
(anno_filename, text_filename=None)¶ Read a single glozz annotation file and its corresponding text (if any).
-
educe.glozz.
read_node
(node, context=None)¶
-
educe.glozz.
write_annotation_file
(anno_filename, doc, settings=<educe.glozz.GlozzOutputSettings object>)¶ Write a GlozzDocument to XML in the given path
educe.graph module¶
Graph representation of discourse structure. Classes of interest:
- Graph: the core structure, use the Graph.from_doc factory method to build one out of an educe.annotation document.
- DotGraph: visual representation, built from Graph. You probably want a project-specific variant to get more helpful graphs, see eg. educe.stac.Graph.DotGraph
Educe hypergraphs¶
Somewhat tricky hypergraph representation of discourse structure.
- a node for every elementary discourse unit
- a hyperedge for every relation instance [1]
- a hyperedge for every complex discourse unit
- (the tricky bit) for every (hyper)edge e_x in the graph, introduce a “mirror node” n_x for that edge (this node also has e_x as its “mirror edge”)
The tricky bit is a response to two issues that arise: (A) how do we point to a CDU? Our hypergraph formalism and library don’t have a notion of pointing to hyperedges (only nodes) and (B) what do we do about misannotations where we have relation instances pointing to relation instances? A is the most important one to address (in principle, we could just treat B as an error and raise an exception), but for now we have decided to model both scenarios with the same “mirror” mechanism described above.
The mirrors are a bit problematic because they are not part of the formal graph structure (think of them as extra labels). This could lead to some seriously unintuitive consequences when traversing the graph. For example, if you have two DUs A and B connected by an Elab instance, and if that instance is itself (bizarrely) connected to some other DU C, you might intuitively expect A, B, and C to all form one connected component
      A
      |
 Elab |
      o--------->C
      | Comment
      |
      v
      B
Alas, this is not so! The reality is a bit messier, with there being no formal relationship between edge and mirror
      A
      |
 Elab |  n_ab
      |  o--------->C
      |    Comment
      |
      v
      B
The same goes for the connectedness of things pointing to CDUs and with their members. Looking at pictures, you might intuitively think that if a discourse unit (A) were connected to a CDU, it would also be connected to the discourse units within
      A
      |
 Elab |
      |
      v
   +-----+
   | B C |
   +-----+
The reality is messier for the same reasons above
      A
      |
 Elab |      +-----+ e_bc
      |      | B C |
      v      +-----+
     n_bc
[1] Just a binary hyperedge, ie. like an edge in a regular graph. As these are undirected, we take the convention that the first link is the tail (from) and the second link is the head (to).
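The connectivity surprise in the Elab/Comment example can be reproduced with plain dictionaries. This is a toy model, not the educe or python-graph data structures: the names edges, mirror and components are hypothetical, and the mirror node of the Comment edge is left out for brevity.

```python
def components(nodes, links):
    """Connected components of an undirected graph given as node pairs."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n not in comp:
                comp.add(n)
                stack.extend(neighbours[n] - comp)
        seen |= comp
        result.append(comp)
    return result


nodes = ["A", "B", "C", "n_ab"]
# Elab links A to B; Comment links the mirror node n_ab to C
edges = {"e_ab": ("A", "B"), "e_comment": ("n_ab", "C")}
mirror = {"e_ab": "n_ab"}

# Formal structure only: nothing ties n_ab to the edge it mirrors
formal = components(nodes, list(edges.values()))

# Mirror-aware: treat each mirror node as attached to its edge's links
extra = [(mirror[e], link) for e in mirror for link in edges[e]]
aware = components(nodes, list(edges.values()) + extra)

print(formal)
print(aware)
```

The formal traversal yields two components, whereas adding the mirror convention by hand recovers the single component one would intuitively expect.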
Classes¶
-
class
educe.graph.
AttrsMixin
¶ Attributes common to both the hypergraph and directed graph representation of discourse structure
-
annotation
(x)¶ Return the annotation object corresponding to a node or edge
-
edge_attributes_dict
(x)¶
-
edgeform
(x)¶ Return the argument if it is an edge id, or its mirror if it’s a node id
(This is possible because every edge in the graph has a node that corresponds to it)
-
is_cdu
(x)¶
-
is_edu
(x)¶
-
is_relation
(x)¶
-
mirror
(x)¶ For objects (particularly, relations/CDUs) that have a mirror image, ie. an edge representation if it’s a node or vice-versa, return the identifier for that image
-
node
(x)¶ DEPRECATED (renamed 2013-11-19): use self.nodeform(x) instead
-
node_attributes_dict
(x)¶
-
nodeform
(x)¶ Return the argument if it is a node id, or its mirror if it’s an edge id
(This is possible because every edge in the graph has a node that corresponds to it)
-
type
(x)¶ Return if a node/edge is of type ‘EDU’, ‘rel’, or ‘CDU’
-
-
class
educe.graph.
DotGraph
(anno_graph)¶ Bases:
pydot.Dot
A dot representation of this graph for visualisation. The to_string() method is most likely to be of interest here
This is fairly abstract and unhelpful. You probably want the project-layer extension instead, eg. educe.stac.graph
-
exception
educe.graph.
DuplicateIdException
(duplicate)¶ Bases:
exceptions.Exception
Condition that arises in inconsistent corpora
-
class
educe.graph.
EnclosureDotGraph
(enc_graph)¶ Bases:
pydot.Dot
-
class
educe.graph.
EnclosureGraph
(annotations, key=None)¶ Bases:
pygraph.classes.digraph.digraph
,educe.graph.AttrsMixin
Caching mechanism for span enclosure. Given an iterable of Annotation, return a directed graph where nodes point to the largest nodes they enclose (i.e. not to nodes that are enclosed by intermediary nodes they point to). As a slight twist, we also allow nodes to redundantly point to enclosed nodes of the same type.
This should give you a multipartite graph with each layer representing a different type of annotation, but no promises! We can’t guarantee that the graph will be nicely layered because the annotations may be buggy (either nodes wrongly typed, or nodes of the same type that wrongly enclose each other), so you should not rely on this property aside from treating it as an optimisation.
Note: there is a corner case for nodes that have the same span. Technically a span encloses itself, so the graph could have a loop. If you supply a sort key that differentiates two nodes, we use it as a tie-breaker (first node encloses second). Otherwise, we simply exclude both links.
NB: nodes are labelled by their annotation id
Initialisation parameters
- annotations - iterable of Annotation
- key - disambiguation key for nodes with same span
- (annotation -> sort key)
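The core idea (link each annotation only to the largest things it encloses, skipping anything reachable through an intermediary) can be sketched over bare offset pairs. This is a simplified illustration rather than the educe algorithm: the function names are hypothetical, annotations with identical spans are never linked (as when no key is supplied), and the same-type twist is ignored.

```python
def encloses(outer, inner):
    """True if the (start, end) pair outer contains inner."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def direct_enclosure(spans):
    """Map each id to the sorted ids it directly encloses.

    spans is a dict from annotation id to (start, end).
    """
    def strictly(a, b):
        # Equal spans are excluded, so no loops or mutual links arise
        return spans[a] != spans[b] and encloses(spans[a], spans[b])

    graph = {}
    for a in spans:
        inside = [b for b in spans if strictly(a, b)]
        # Keep b only if no other enclosed annotation sits between a and b
        graph[a] = sorted(
            b for b in inside
            if not any(strictly(c, b) for c in inside if c != b)
        )
    return graph


spans = {"turn": (0, 20), "seg1": (0, 8), "seg2": (9, 20), "tok": (0, 3)}
graph = direct_enclosure(spans)
print(graph)
```

Here the turn points to both segments but not to the token, which is only reachable through the first segment.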
-
inside
(annotation)¶ Given an annotation, return all annotations that are directly within it. Results are returned in the order of their local id
-
outside
(annotation)¶ Given an annotation, return all annotations it is directly enclosed in. Results are returned in the order of their local id
-
class
educe.graph.
Graph
¶ Bases:
pygraph.classes.hypergraph.hypergraph
,educe.graph.AttrsMixin
Hypergraph representation of discourse structure. See the section on Educe hypergraphs
You most likely want to use Graph.from_doc instead of instantiating an instance directly
Every node/hyperedge is represented as a string unique within the graph. Given one of these identifiers x and a graph g:
- g.type(x) returns one of the strings “EDU”, “CDU”, “rel”
- g.annotation(x) returns an educe.annotation object
- for relations and CDUs, if e_x is the edge representation of the relation/cdu, g.mirror(x) will return its mirror node n_x and vice-versa
TODOS:
- TODO: Currently we use educe.annotation objects to represent the EDUs, CDUs and relations, but this is likely a bit too low-level to be helpful. It may be nice to have higher-level EDU and CDU objects instead
-
cdu_members
(cdu, deep=False)¶ Return the set of EDUs, CDUs, and relations which can be considered as members of this CDU.
This is shallow by default, in that we only return the immediate members of the CDU. If deep==True, also return members of CDUs that are members of (members of ..) this CDU.
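The shallow versus deep distinction amounts to a recursive walk over CDU membership. The sketch below models CDUs as a plain dictionary from CDU name to member names. This representation and the function name are assumptions made for illustration (the real graph distinguishes hyperedges from their mirror nodes), and the sketch assumes membership contains no cycles.

```python
def cdu_members_sketch(cdus, cdu, deep=False):
    """Members of a CDU; with deep=True, also members of member CDUs."""
    members = set(cdus[cdu])
    if deep:
        for member in cdus[cdu]:
            if member in cdus:  # the member is itself a CDU
                members |= cdu_members_sketch(cdus, member, deep=True)
    return members


# Made-up structure: an outer CDU containing an EDU and an inner CDU
cdus = {
    "cdu_outer": ["edu1", "cdu_inner"],
    "cdu_inner": ["edu2", "edu3"],
}

shallow = cdu_members_sketch(cdus, "cdu_outer")
deep = cdu_members_sketch(cdus, "cdu_outer", deep=True)
print(sorted(shallow), sorted(deep))
```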
-
cdus
()¶ Set of hyperedges representing complex discourse units.
See also cdu_members
-
connected_components
()¶ Return a set of connected components.
Each connected component set can be passed to self.copy() to be copied as a subgraph.
This builds on python-graph’s version of a function with the same name but also adds awareness of our conventions about there being both a node/edge for relations/CDUs.
-
containing_cdu
(node)¶ Given an EDU (or CDU, or relation instance), return immediate containing CDU (the hyperedge) if there is one or None otherwise. If there is more than one containing CDU, return one of them arbitrarily.
-
containing_cdu_chain
(node)¶ Given an annotation, return a list which represents its containing CDU, the container’s container, and so forth. Return the empty list if no CDU contains this one.
-
copy
(nodeset=None)¶ Return a copy of the graph, optionally restricted to a subset of EDUs and CDUs.
Note that if you include a CDU, then anything contained by that CDU will also be included.
You don’t specify (or otherwise have control over) what relations are copied. The graph will include all hyperedges whose links are all (a) members of the subset or (b) (recursively) hyperedges included because of (a) and (b)
Note that any non-EDUs you include in the copy set will be silently ignored.
This is a shallow copy in the sense that the underlying layer of annotations and documents remains the same.
Parameters: nodeset (iterable of strings) – only copy nodes with these names
-
edus
()¶ Set of nodes representing elementary discourse units
-
classmethod
from_doc
(corpus, doc_key, could_include=<function <lambda>>, pred=<function <lambda>>)¶ Return a graph representation of a document
Note: check the project layer for a version of this function which may be more appropriate to your project
Parameters: - corpus (dict from FileId to documents) – educe corpus dictionary
- doc_key (FileId) – key pointing to the document
- could_include (annotation -> boolean) – predicate on unit level annotations that should be included regardless of whether or not we have links to them
- pred (annotation -> boolean) – predicate on annotations providing some requirement they must satisfy in order to be taken into account (you might say that could_include gives; and pred takes away)
-
rel_links
(edge)¶ Given an edge in the graph, return a tuple of its source and target nodes.
If the edge has only a single link, we assume it’s a loop and return the same value for both
-
relations
()¶ Set of relation edges representing the relations in the graph. By convention, the first link is considered the source and the second is considered the target.
educe.internalutil module¶
Utility functions which are meant to be used by educe but aren’t expected to be too useful outside of it
-
exception
educe.internalutil.
EduceXmlException
(*args, **kw)¶ Bases:
exceptions.Exception
-
educe.internalutil.
indent_xml
(elem, level=0)¶ From <http://effbot.org/zone/element-lib.htm>
WARNING: destructive
-
educe.internalutil.
linebreak_xml
(elem)¶ Insert a break after each element tag
You probably want indent_xml instead
-
educe.internalutil.
on_single_element
(root, default, f, name)¶ Return
- the default if no elements
- f(the node) if one element
- an exception if more than one
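The three cases can be sketched with the standard library's ElementTree. The name single_element and the use of ValueError are assumptions made for this illustration; the library presumably signals the error with its own exception type.

```python
import xml.etree.ElementTree as ET


def single_element(root, default, f, name):
    """Apply f to the only child called name, or fall back to default."""
    nodes = root.findall(name)
    if not nodes:
        return default
    if len(nodes) > 1:
        # Stand-in exception type, chosen for this sketch only
        raise ValueError("expected at most one <%s>, found %d"
                         % (name, len(nodes)))
    return f(nodes[0])


root = ET.fromstring("<unit><type>Segment</type><tag/><tag/></unit>")

# Exactly one <type> child: f is applied to it
utype = single_element(root, None, lambda n: n.text, "type")

# No <author> child: the default comes back
missing = single_element(root, "n/a", lambda n: n.text, "author")

print(utype, missing)
```

Asking for "tag" in this document would raise, since two such children exist.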
-
educe.internalutil.
treenode
(tree)¶ API-change padding for NLTK 2 vs NLTK 3 trees
educe.util module¶
Miscellaneous utility functions
-
educe.util.
FILEID_FIELDS
= ['stage', 'doc', 'subdoc', 'annotator']¶ String representation of fields recognised in an educe.corpus.FileId
-
educe.util.
add_corpus_filters
(parser, fields=None, choice_fields=None)¶ For help with script-building:
Augment an argparser with options to filter a corpus on the various attributes in a ‘educe.corpus.FileId’ (eg, document, annotator).
Parameters: - fields ([String]) – which flag names to include (defaults to FILEID_FIELDS)
- choice_fields (Dict String [String]) – fields which accept a limited range of answers
Meant to be used in conjunction with mk_is_interesting
-
educe.util.
add_subcommand
(subparsers, module)¶ Add a subcommand to an argparser following some conventions:
- the module can have an optional NAME constant (giving the name of the command); otherwise we assume it’s the unqualified module name
- the first line of its docstring is its help text
- subsequent lines (if any) form its epilog
Returns the resulting subparser for the module
-
educe.util.
concat
(items)¶ :: Iterable (Iterable a) -> Iterable a
-
educe.util.
concat_l
(items)¶ :: [[a]] -> [a]
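Read as type signatures, both functions flatten one level of nesting. A minimal sketch with itertools follows; the _sketch names are hypothetical, and this is one way to get the stated behaviour, not necessarily how educe implements it.

```python
import itertools


def concat_sketch(items):
    """Iterable (Iterable a) -> Iterable a: lazily flatten one level."""
    return itertools.chain.from_iterable(items)


def concat_l_sketch(items):
    """[[a]] -> [a]: flatten one level into a list."""
    return list(concat_sketch(items))


lazy = concat_sketch([[1, 2], (3,), []])
flat = concat_l_sketch([["a"], ["b", "c"]])
print(list(lazy), flat)
```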
-
educe.util.
fields_without
(unwanted)¶ Fields for add_corpus_filters without the unwanted members
-
educe.util.
mk_is_interesting
(args, preselected=None)¶ Return a function that when given a FileId returns ‘True’ if the FileId would be considered interesting according to the arguments passed in.
Parameters: preselected (Dict String [String]) – fields for which we already know what matches we want
Meant to be used in conjunction with add_corpus_filters
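The division of labour between the two helpers (one adds command-line flags, the other turns the parsed arguments into a predicate on FileId) might look roughly like the sketch below. Everything here is an assumption made for illustration: the flag names, the regular-expression matching, and the helper names add_filters and make_predicate are hypothetical and are not the educe API.

```python
import argparse
import re
from collections import namedtuple

# Hypothetical stand-in for educe.corpus.FileId (illustration only)
FileId = namedtuple("FileId", ["doc", "subdoc", "stage", "annotator"])
FIELDS = ["stage", "doc", "subdoc", "annotator"]


def add_filters(parser, fields=FIELDS):
    """Add one optional --<field> REGEX flag per FileId field."""
    for field in fields:
        parser.add_argument("--" + field, metavar="REGEX",
                            help="only %s values matching REGEX" % field)


def make_predicate(args, fields=FIELDS):
    """Build a FileId -> bool function from the parsed arguments."""
    patterns = {}
    for field in fields:
        value = getattr(args, field, None)
        if value is not None:
            patterns[field] = re.compile(value)

    def is_interesting(file_id):
        # Every constrained field must be present and match its pattern
        return all(
            getattr(file_id, f) is not None
            and pat.match(getattr(file_id, f)) is not None
            for f, pat in patterns.items()
        )
    return is_interesting


parser = argparse.ArgumentParser()
add_filters(parser)
args = parser.parse_args(["--doc", "pilot", "--annotator", "Bob|Alice"])
is_interesting = make_predicate(args)

print(is_interesting(FileId("pilot14", None, "units", "Bob")))
```

The resulting predicate is the kind of function that can be handed to a dictionary filter over corpus keys.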
-
educe.util.
relative_indices
(group_indices, reverse=False, valna=None)¶ Generate a list of relative indices inside each group. Missing (None) values are handled specifically: each missing value is mapped to valna.
Parameters: - reverse (boolean, optional) – If True, compute indices relative to the end of each group.
- valna (int or None, optional) – Relative index for missing values.
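A small sketch may help pin down what relative indices inside each group means. It assumes that group_indices lists, item by item, the index of the group each item belongs to, and that members of a group are contiguous. The function name and these assumptions are introduced for illustration and are not taken from the educe source.

```python
import itertools


def relative_indices_sketch(group_indices, reverse=False, valna=None):
    """Position of each item within its (contiguous) group."""
    values = list(group_indices)
    if reverse:
        # Counting from the end is counting from the start of the reversal
        values.reverse()
    result = []
    for group, run in itertools.groupby(values):
        run = list(run)
        if group is None:
            # Missing group index: every such item maps to valna
            result.extend(valna for _ in run)
        else:
            result.extend(range(len(run)))
    if reverse:
        result.reverse()
    return result


groups = [0, 0, 0, 1, 1, None, 2]
forward = relative_indices_sketch(groups)
backward = relative_indices_sketch(groups, reverse=True, valna=-1)
print(forward, backward)
```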