educe.learning package

Submodules

educe.learning.csv module

educe.learning.edu_input_format module

This module implements a dumper for the EDU input format.

See https://github.com/irit-melodi/attelo/blob/master/doc/input.rst

educe.learning.edu_input_format.dump_all(X_gen, y_gen, f, class_mapping, docs, instance_generator)

Dump a whole dataset: features (in svmlight format) and EDU pairs.

Parameters:
  • X_gen (iterable of int arrays) – Feature vectors.
  • y_gen (iterable of int) – Ground truth labels.
  • f (str) – Path of the output features file.
  • class_mapping (dict(str, int)) – Mapping from label to int.
  • docs (list of DocumentPlus) – Documents.
  • instance_generator (function from doc to iterable of pairs) – TODO

educe.learning.edu_input_format.dump_edu_input_file(docs, f)

Dump a dataset in the EDU input format.

Each document must have:

  • edus: sequence of edu objects
  • grouping: string (some sort of document id)
  • edu2sent: int -> int or string or None (edu num to sentence num)

The EDUs must provide:

  • identifier(): string
  • text(): string

educe.learning.edu_input_format.dump_pairings_file(epairs, f)

Dump the EDU pairings

educe.learning.edu_input_format.labels_comment(class_mapping)

Return a string listing class labels in the format that attelo expects

educe.learning.edu_input_format.load_labels(f)

Read the label set into a dictionary mapping labels to indices.

educe.learning.keygroup_vectorizer module

This module provides ways to transform lists of PairKeys to sparse vectors.

class educe.learning.keygroup_vectorizer.KeyGroupVectorizer

Bases: object

Transforms lists of KeyGroups to sparse vectors.

vocabulary_

dict(str, int) – Vocabulary mapping.

fit_transform(vectors)

Learn the vocabulary dictionary and return instances

transform(vectors)

Transform documents to EDU pair feature matrix.

Extract features out of documents using the vocabulary fitted with fit.
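The split between fit_transform and transform can be illustrated with a minimal, self-contained sketch. This is not educe's implementation: the class TinyVectorizer is hypothetical, its input (one list of (feature name, value) pairs per instance) is an assumption, and sparse rows are returned as sorted lists of (index, value) pairs rather than as a matrix.

```python
class TinyVectorizer(object):
    """Hypothetical stand-in, NOT educe's KeyGroupVectorizer."""

    def __init__(self):
        self.vocabulary_ = {}  # feature name -> column index

    def fit_transform(self, instances):
        """Learn the vocabulary, then vectorize the same instances."""
        instances = list(instances)
        for inst in instances:
            for name, _ in inst:
                # setdefault gives the next free index to unseen names
                self.vocabulary_.setdefault(name, len(self.vocabulary_))
        return self.transform(instances)

    def transform(self, instances):
        """Vectorize with the fitted vocabulary; unknown names are dropped."""
        rows = []
        for inst in instances:
            row = [(self.vocabulary_[name], value)
                   for name, value in inst
                   if name in self.vocabulary_ and value != 0]
            rows.append(sorted(row))
        return rows


vec = TinyVectorizer()
train = [[("word_first=the", 1), ("num_tokens", 7)],
         [("word_first=a", 1), ("num_tokens", 3)]]
X_train = vec.fit_transform(train)
# "word_first=zebra" was never seen during fitting, so it is ignored
X_new = vec.transform([[("word_first=zebra", 1), ("num_tokens", 5)]])
```

The point of the sketch is that only fit_transform may grow the vocabulary; transform reuses it, so column indices stay stable between training and test data.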

educe.learning.keys module

Feature extraction keys.

A key is essentially a feature name, its type, and some help text.

We also provide a notion of groups that allow us to organise keys into sections.

class educe.learning.keys.Key(substance, name, description)

Bases: object

Feature name plus a bit of metadata

classmethod basket(name, description)

A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to int (collections.Counter would be a good bet for collecting these)
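As a small illustration of the basket shape described above, the following sketch builds a multiset with collections.Counter. The function name pos_tag_basket and the tag list are hypothetical; only the "dictionary from string to int" shape comes from the description.

```python
from collections import Counter


def pos_tag_basket(tags):
    """Hypothetical feature function: count each tag occurrence."""
    return Counter(tags)


basket = pos_tag_basket(["DT", "NN", "VBZ", "DT", "NN"])
# A Counter is a dict subclass mapping each value to its count
as_dict = dict(basket)
```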

classmethod continuous(name, description)

A key for fields whose values fall in a range (e.g. numbers)

classmethod discrete(name, description)

A key for fields that have a finite set of possible values

substance = None

See Substance.

class educe.learning.keys.KeyGroup(description, keys)

Bases: dict

A set of related features.

Note that a KeyGroup can be used as a dictionary, but instead of using Keys as values, you use the key names

DEBUG = True

NAME_WIDTH = 35

one_hot_values_gen(suffix='')

Get a one-hot encoded version of this KeyGroup as a generator

suffix is added to the feature name
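To make the idea of one-hot encoding a group of features concrete, here is a rough sketch. It does not reproduce educe's code: the function one_hot_pairs, the (kind, value) input layout, and the name=value naming scheme are all assumptions made for illustration.

```python
def one_hot_pairs(features, suffix=''):
    """Yield (feature name, numeric value) pairs.

    `features` maps a name to a (kind, value) tuple; this layout and the
    `name=value` naming scheme are assumptions made for this sketch.
    """
    for name, (kind, value) in sorted(features.items()):
        if value is None:
            continue  # nothing to encode
        full_name = name + suffix
        if kind == 'discrete':
            # one indicator feature per observed value
            yield ('{}={}'.format(full_name, value), 1)
        elif kind == 'continuous':
            # numeric values are passed through unchanged
            yield (full_name, value)
        elif kind == 'basket':
            # one feature per basket item, weighted by its count
            for item, count in sorted(value.items()):
                yield ('{}={}'.format(full_name, item), count)
        else:
            raise ValueError('unknown kind: {}'.format(kind))


feats = {
    'word_first': ('discrete', 'the'),
    'num_tokens': ('continuous', 7),
    'pos_tags': ('basket', {'DT': 1, 'NN': 2}),
}
pairs = list(one_hot_pairs(feats, suffix='_EDU1'))
```

In this sketch the suffix is appended to the base feature name before the value part, which keeps features from the two EDUs of a pair distinct.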

class educe.learning.keys.MagicKey(substance, function)

Bases: educe.learning.keys.Key

Somewhat fancier variant of Key that is built from a function. The goal of the magic key is to reduce the amount of boilerplate needed to define keys.

classmethod basket_fn(function)

A key for fields that represent a multiset of possible values. Baskets should be dictionaries from string to int (collections.Counter would be a good bet for collecting these)

classmethod continuous_fn(function)

A key for fields whose values fall in a range (e.g. numbers)

classmethod discrete_fn(function)

A key for fields that have a finite set of possible values

class educe.learning.keys.MergedKeyGroup(description, groups)

Bases: educe.learning.keys.KeyGroup

A key group that is formed by fusing several key groups into one.

Note that for now all the keys in a merged group are lumped into the same object.

The help text tries to preserve the internal breakdown into the subgroups, however. It comes with a “level 1” section header, e.g.:

=======================================================
big block of features
=======================================================

class educe.learning.keys.Substance

Bases: object

The kind of the variable represented by this key.

  • continuous
  • discrete
  • string (for meta vars; you probably want discrete instead)
  • basket (multiset of values, as a dictionary from string to int)

If we ever reach a point where we’re happy to switch to Python 3 wholesale, we should subclass Enum

BASKET = 4
CONTINUOUS = 1
DISCRETE = 2
STRING = 3

educe.learning.svmlight_format module

This module implements a dumper for the svmlight format.

See sklearn.datasets.svmlight_format

educe.learning.svmlight_format.dump_svmlight_file(X_gen, y_gen, f, zero_based=True, comment=None, query_id=None)

Dump the dataset in svmlight file format.
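For readers unfamiliar with the format, each svmlight line holds a label followed by index:value pairs in ascending index order. The sketch below writes such lines with the standard library only. The function write_svmlight and its input layout are hypothetical, and it leaves out the comment and query_id handling of the real dumper.

```python
import io


def write_svmlight(rows, labels, out, zero_based=True):
    """Write one `label index:value ...` line per instance.

    `rows` holds lists of (index, value) pairs with zero-based indices.
    """
    # svmlight files are often one-based; shift indices if requested
    offset = 0 if zero_based else 1
    for row, label in zip(rows, labels):
        feats = ' '.join('{}:{}'.format(idx + offset, val)
                         for idx, val in sorted(row))
        out.write('{} {}\n'.format(label, feats))


buf = io.StringIO()
write_svmlight([[(3, 1), (0, 2)], [(1, 5)]], [1, 0], buf, zero_based=False)
text = buf.getvalue()
```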

educe.learning.util module

Common helper functions for feature extraction.

educe.learning.util.space_join(str1, str2)

Join two strings with a space.

educe.learning.util.tuple_feature(combine)
(a -> a -> b) ->
((current, cache, edu) -> a) ->
((current, cache, edu, edu) -> b)

Combine the results of a single-EDU feature function to make a pair feature.
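The type signature above reads as a decorator factory: given a combining function, it turns a feature function over one EDU into a feature function over a pair. A minimal sketch under that reading follows. The name pair_feature is hypothetical, the helper underscore is redefined locally, and EDUs are stood in for by plain strings.

```python
from functools import wraps


def pair_feature(combine):
    """Hypothetical re-creation of the signature above (not educe's code)."""
    def decorator(single_fn):
        @wraps(single_fn)
        def wrapped(current, cache, edu1, edu2):
            # apply the single-EDU function to each EDU, then combine
            val1 = single_fn(current, cache, edu1)
            val2 = single_fn(current, cache, edu2)
            return combine(val1, val2)
        return wrapped
    return decorator


def underscore(str1, str2):
    """Join two strings with an underscore."""
    return str1 + '_' + str2


@pair_feature(underscore)
def first_word(current, cache, edu):
    # `edu` is a plain string here; real EDU objects would differ
    return edu.split()[0].lower()


result = first_word(None, None, "The cat sat", "on the mat")
```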

educe.learning.util.underscore(str1, str2)

Join two strings with an underscore.

educe.learning.vocabulary_format module

This module implements a loader and dumper for vocabularies.

educe.learning.vocabulary_format.dump_vocabulary(vocabulary, f)

Dump the vocabulary as a tab-separated file.

educe.learning.vocabulary_format.load_vocabulary(f)

Read a vocabulary file into a dictionary mapping feature names to indices.
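A round trip through a tab-separated vocabulary file might look like the sketch below. The column order (feature name, then index) and the ordering of lines by index are assumptions, and dump_vocab and load_vocab are hypothetical stand-ins rather than the functions documented above.

```python
import io


def dump_vocab(vocabulary, out):
    """Write one `name<TAB>index` line per feature, ordered by index."""
    for name, idx in sorted(vocabulary.items(), key=lambda kv: kv[1]):
        out.write(u'{}\t{}\n'.format(name, idx))


def load_vocab(lines):
    """Read `name<TAB>index` lines back into a dict."""
    vocabulary = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines
        # split on the last tab so names may themselves contain tabs
        name, idx = line.rsplit('\t', 1)
        vocabulary[name] = int(idx)
    return vocabulary


vocab = {'num_tokens': 1, 'word_first=the': 0}
buf = io.StringIO()
dump_vocab(vocab, buf)
restored = load_vocab(buf.getvalue().splitlines())
```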