educe.learning package¶
Submodules¶
educe.learning.csv module¶
educe.learning.edu_input_format module¶
This module implements a dumper for the EDU input format.
See https://github.com/irit-melodi/attelo/blob/master/doc/input.rst
-
educe.learning.edu_input_format.
dump_all
(X_gen, y_gen, f, class_mapping, docs, instance_generator)¶ Dump a whole dataset: features (in svmlight) and EDU pairs.
Parameters: - X_gen (iterable of int arrays) – Feature vectors.
- y_gen (iterable of int) – Ground truth labels.
- f (str) – Output features file path
- class_mapping (dict(str, int)) – Mapping from label to int.
- docs (list of DocumentPlus) – Documents
- instance_generator (function from doc to iterable of pairs) – TODO
-
educe.learning.edu_input_format.
dump_edu_input_file
(docs, f)¶ Dump a dataset in the EDU input format.
Each document must have:
- edus: sequence of edu objects
- grouping: string (some sort of document id)
- edu2sent: int -> int or string or None (edu num to sentence num)
The EDUs must provide:
- identifier(): string
- text(): string
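As a rough illustration of the interface listed above, the following sketch builds duck-typed stand-ins that expose the required attributes and methods. `FakeEdu` and `FakeDoc` are hypothetical names, not educe classes, and modelling `edu2sent` as a list indexed by EDU position is an assumption.

```python
class FakeEdu:
    """Hypothetical stand-in for an EDU object (not an educe class)."""

    def __init__(self, ident, text):
        self._ident = ident
        self._text = text

    def identifier(self):
        # The EDU input format expects a string identifier.
        return self._ident

    def text(self):
        # Raw text of the EDU, also a string.
        return self._text


class FakeDoc:
    """Hypothetical stand-in for a document (not an educe class)."""

    def __init__(self, grouping, edus, edu2sent):
        self.grouping = grouping  # some sort of document id
        self.edus = edus          # sequence of EDU objects
        self.edu2sent = edu2sent  # EDU number -> sentence number (or None)


doc = FakeDoc(
    grouping="doc_001",
    edus=[
        FakeEdu("doc_001_e1", "The cat sat"),
        FakeEdu("doc_001_e2", "because it was tired."),
    ],
    edu2sent=[0, 0],  # assumption: both EDUs belong to sentence 0
)
```

Objects shaped like `doc` carry everything the description above asks for; whether educe imposes further requirements is not stated here.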
-
educe.learning.edu_input_format.
dump_pairings_file
(epairs, f)¶ Dump the EDU pairings
-
educe.learning.edu_input_format.
labels_comment
(class_mapping)¶ Return a string listing class labels in the format that attelo expects
-
educe.learning.edu_input_format.
load_labels
(f)¶ Read label set into a dictionary mapping labels to indices
educe.learning.keygroup_vectorizer module¶
This module provides ways to transform lists of PairKeys to sparse vectors.
-
class
educe.learning.keygroup_vectorizer.
KeyGroupVectorizer
¶ Bases:
object
Transforms lists of KeyGroups to sparse vectors.
-
vocabulary_
¶ dict(str, int) – Vocabulary mapping.
-
fit_transform
(vectors)¶ Learn the vocabulary dictionary and return instances
-
transform
(vectors)¶ Transform documents to EDU pair feature matrix.
Extract features out of documents using the vocabulary fitted with fit.
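The fit/transform split can be sketched with the standard library alone. This is not educe's implementation: the function names, the `(feature_name, value)` input shape, and the choice to silently drop features unseen at fit time are all assumptions made for illustration.

```python
def fit_transform(instances):
    """Learn a feature-name -> column-index vocabulary and return sparse rows.

    `instances` is an iterable of lists of (feature_name, value) pairs
    (an assumed input shape, chosen for this sketch).
    """
    vocabulary = {}
    rows = []
    for features in instances:
        row = []
        for name, value in features:
            # A name seen for the first time gets the next free index.
            index = vocabulary.setdefault(name, len(vocabulary))
            row.append((index, value))
        rows.append(sorted(row))
    return vocabulary, rows


def transform(instances, vocabulary):
    """Map instances to sparse rows using an already fitted vocabulary."""
    rows = []
    for features in instances:
        # Assumption: features absent from the vocabulary are dropped.
        row = [(vocabulary[name], value)
               for name, value in features if name in vocabulary]
        rows.append(sorted(row))
    return rows


train = [
    [("word_first=the", 1), ("num_tokens", 3.0)],
    [("word_first=because", 1), ("num_tokens", 4.0)],
]
vocabulary, rows = fit_transform(train)
unseen = transform([[("word_first=a", 1), ("num_tokens", 2.0)]], vocabulary)
```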
-
educe.learning.keys module¶
Feature extraction keys.
A key is basically a feature name, its type, some help text.
We also provide a notion of groups that allow us to organise keys into sections
-
class
educe.learning.keys.
Key
(substance, name, description)¶ Bases:
object
Feature name plus a bit of metadata
-
classmethod
basket
(name, description)¶ A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to int (collections.Counter would be a good bet for collecting these)
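A basket value of the kind described above might be collected like this; the token list is made up for the example.

```python
from collections import Counter

# Hypothetical tokens for one EDU.
tokens = ["the", "cat", "saw", "the", "dog"]

# Counter is a dict subclass mapping each string to its number of occurrences.
basket = Counter(tokens)
```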
-
classmethod
continuous
(name, description)¶ A key for fields that take values from a continuous range (e.g. numbers)
-
classmethod
discrete
(name, description)¶ A key for fields that have a finite set of possible values
-
substance
= None¶ see Substance
-
class
educe.learning.keys.
KeyGroup
(description, keys)¶ Bases:
dict
A set of related features.
Note that a KeyGroup can be used as a dictionary, but instead of using Keys as values, you use the key names
-
DEBUG
= True¶
-
NAME_WIDTH
= 35¶
-
one_hot_values_gen
(suffix='')¶ Get a one-hot encoded version of this KeyGroup as a generator.
suffix is added to the feature name.
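One way to picture one-hot encoding with a suffix is the sketch below. It is not taken from educe: the `(kind, value)` input shape and the `name=value` naming scheme are assumptions, and the real method may name or order features differently.

```python
def one_hot_values(features, suffix=""):
    """Yield (feature_name, numeric_value) pairs.

    `features` maps a feature name to a (kind, value) tuple, where kind is
    "continuous", "discrete" or "basket" (an assumed shape for this sketch).
    """
    for name, (kind, value) in features.items():
        full_name = name + suffix
        if kind == "continuous":
            # Numbers are passed through unchanged.
            yield (full_name, value)
        elif kind == "discrete":
            # One binary feature per observed (name, value) combination.
            yield ("{}={}".format(full_name, value), 1)
        elif kind == "basket":
            # One feature per basket entry, weighted by its count.
            for item, count in value.items():
                yield ("{}={}".format(full_name, item), count)
        else:
            raise ValueError("unknown kind: {}".format(kind))


features = {
    "num_tokens": ("continuous", 3.0),
    "first_word": ("discrete", "the"),
    "pos_tags": ("basket", {"DT": 1, "NN": 2}),
}
encoded = list(one_hot_values(features, suffix="_EDU1"))
```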
-
-
class
educe.learning.keys.
MagicKey
(substance, function)¶ Bases:
educe.learning.keys.Key
Somewhat fancier variant of Key that is built from a function. The goal of the magic key is to reduce the amount of boilerplate needed to define keys.
-
classmethod
basket_fn
(function)¶ A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to int (collections.Counter would be a good bet for collecting these)
-
classmethod
continuous_fn
(function)¶ A key for fields that take values from a continuous range (e.g. numbers)
-
classmethod
discrete_fn
(function)¶ A key for fields that have a finite set of possible values
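The boilerplate saving could work along the following lines. This sketch assumes that the key name comes from the function's name and the description from its docstring; the text above does not confirm that, and `SketchKey` and `SketchMagicKey` are hypothetical classes, not educe's.

```python
class SketchKey:
    """Hypothetical plain key: name plus a bit of metadata."""

    def __init__(self, substance, name, description):
        self.substance = substance
        self.name = name
        self.description = description


class SketchMagicKey(SketchKey):
    """Hypothetical key built from a function."""

    def __init__(self, substance, function):
        # Assumption: name and help text are read off the function itself,
        # so they do not have to be repeated by hand.
        super().__init__(substance, function.__name__, function.__doc__)
        self.function = function

    @classmethod
    def discrete_fn(cls, function):
        # "discrete" is a placeholder for a real substance constant.
        return cls("discrete", function)


def first_word(text):
    """first word of the text"""
    return text.split()[0]


key = SketchMagicKey.discrete_fn(first_word)
```

With a plain key the name and description would have to be passed explicitly, alongside a separately defined extraction function.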
-
class
educe.learning.keys.
MergedKeyGroup
(description, groups)¶ Bases:
educe.learning.keys.KeyGroup
A key group that is formed by fusing several key groups into one.
Note that for now all the keys in a merged group are lumped into the same object.
The help text tries to preserve the internal breakdown into the subgroups, however. It comes with a “level 1” section header, eg.
=======================================================
big block of features
=======================================================
-
class
educe.learning.keys.
Substance
¶ Bases:
object
The kind of the variable represented by this key.
- continuous
- discrete
- string (for meta vars; you probably want discrete instead)
If we ever reach a point where we’re happy to switch to Python 3 wholesale, we should subclass Enum
-
BASKET
= 4¶
-
CONTINUOUS
= 1¶
-
DISCRETE
= 2¶
-
STRING
= 3¶
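The Enum subclass mentioned in the note above might look like the following on Python 3. This is only a sketch of that suggestion, reusing the four constant values listed here; it is not code from educe.

```python
from enum import Enum


class Substance(Enum):
    """Sketch of the kind of variable represented by a key."""

    CONTINUOUS = 1
    DISCRETE = 2
    STRING = 3
    BASKET = 4
```

Members can then be looked up by value (`Substance(4)`) or by name (`Substance["BASKET"]`), and compared by identity.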
educe.learning.svmlight_format module¶
This module implements a dumper for the svmlight format
See sklearn.datasets.svmlight_format
-
educe.learning.svmlight_format.
dump_svmlight_file
(X_gen, y_gen, f, zero_based=True, comment=None, query_id=None)¶ Dump the dataset in svmlight file format.
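For orientation, a line in svmlight format has the shape `label [qid:N] index:value index:value ...`. The helper below is a hypothetical stand-alone sketch of that layout, not the function documented above; the `(index, value)` row representation is an assumption.

```python
def format_svmlight_line(label, row, zero_based=True, query_id=None):
    """Format one instance as an svmlight line.

    `row` is a list of (zero-based column index, value) pairs
    (an assumed representation for this sketch).
    """
    # With zero_based=False, column indices are written starting from 1.
    offset = 0 if zero_based else 1
    parts = [str(label)]
    if query_id is not None:
        parts.append("qid:{}".format(query_id))
    # Feature indices are written in increasing order.
    parts.extend("{}:{}".format(index + offset, value)
                 for index, value in sorted(row))
    return " ".join(parts)


zero_based_line = format_svmlight_line(2, [(3, 1), (0, 0.5)])
one_based_line = format_svmlight_line(2, [(3, 1), (0, 0.5)],
                                      zero_based=False, query_id=7)
```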
educe.learning.util module¶
Common helper functions for feature extraction.
-
educe.learning.util.
space_join
(str1, str2)¶ join two strings with a space
-
educe.learning.util.
tuple_feature
(combine)¶ (a -> a -> b) -> ((current, cache, edu) -> a) -> ((current, cache, edu, edu) -> b)
Combine the results of a single-EDU feature function to make a pair feature
-
educe.learning.util.
underscore
(str1, str2)¶ join two strings with an underscore
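The type signature of tuple_feature suggests a decorator factory. The sketch below re-implements that idea from the signature alone, so details such as the handling of missing values may differ from educe; `first_word_pair` is a hypothetical feature, and EDUs are represented as plain strings for simplicity.

```python
def underscore(str1, str2):
    """Join two strings with an underscore (local stand-in)."""
    return "{}_{}".format(str1, str2)


def tuple_feature(combine):
    """Turn a single-EDU feature function into a pair feature function."""
    def decorator(single_edu_feature):
        def pair_feature(current, cache, edu1, edu2):
            # Apply the single-EDU feature to each EDU, then combine.
            val1 = single_edu_feature(current, cache, edu1)
            val2 = single_edu_feature(current, cache, edu2)
            return combine(val1, val2)
        return pair_feature
    return decorator


@tuple_feature(underscore)
def first_word_pair(current, cache, edu):
    # Hypothetical single-EDU feature: first word of the EDU text.
    return edu.split()[0]


result = first_word_pair(None, None, "the cat sat", "because it was tired")
```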
educe.learning.vocabulary_format module¶
This module implements a loader and dumper for vocabularies.
-
educe.learning.vocabulary_format.
dump_vocabulary
(vocabulary, f)¶ Dump the vocabulary as a tab-separated file.
-
educe.learning.vocabulary_format.
load_vocabulary
(f)¶ Read vocabulary file into a dictionary of feature name and index
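A tab-separated round trip can be sketched as follows. The column order (feature name, then index), the index base, and the helper names `dump_vocab` and `load_vocab` are assumptions; the actual file layout written by educe may differ.

```python
import io


def dump_vocab(vocabulary, stream):
    """Write one 'name<TAB>index' line per feature, ordered by index."""
    for name, index in sorted(vocabulary.items(), key=lambda item: item[1]):
        stream.write("{}\t{}\n".format(name, index))


def load_vocab(stream):
    """Read 'name<TAB>index' lines back into a dict."""
    vocabulary = {}
    for line in stream:
        name, index = line.rstrip("\n").split("\t")
        vocabulary[name] = int(index)
    return vocabulary


# An in-memory buffer stands in for a real file here.
buffer = io.StringIO()
dump_vocab({"num_tokens": 1, "first_word=the": 0}, buffer)
buffer.seek(0)
restored = load_vocab(buffer)
```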