educe.rst_dt.learning package

Submodules

educe.rst_dt.learning.args module

Command line options for learning commands

class educe.rst_dt.learning.args.FeatureSetAction(option_strings, dest, nargs=None, **kwargs)

Bases: argparse.Action

Select the desired feature set

educe.rst_dt.learning.args.add_usual_input_args(parser)

Augment a subcommand argparser with typical input arguments. If your subcommand requires slightly different input arguments, simply do not call this function.
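
As a rough illustration of the two entries above, the sketch below pairs a custom argparse.Action with a helper that adds input arguments to a subcommand parser. It is not educe's code: the FEATURE_SETS registry, the FeatureSetChoice and add_input_args names, and the corpus and --feature_set arguments are all assumptions made for the example.

```python
import argparse

# Hypothetical registry of feature sets; names and values are illustrative only.
FEATURE_SETS = {"dev": "features_dev", "li2014": "features_li2014"}


class FeatureSetChoice(argparse.Action):
    """Map a feature-set name given on the command line to a registry entry."""

    def __call__(self, parser, namespace, values, option_string=None):
        if values not in FEATURE_SETS:
            parser.error("unknown feature set: {}".format(values))
        setattr(namespace, self.dest, FEATURE_SETS[values])


def add_input_args(parser):
    """Add a typical (assumed) set of input arguments to a subcommand parser."""
    parser.add_argument("corpus", metavar="DIR", help="corpus directory")
    parser.add_argument("--feature_set", action=FeatureSetChoice,
                        default=FEATURE_SETS["dev"],
                        help="feature set to use")


parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="command")
# A subcommand that needs different arguments would simply skip this call.
add_input_args(subparsers.add_parser("extract"))

args = parser.parse_args(["extract", "data/", "--feature_set", "li2014"])
print(args.command, args.corpus, args.feature_set)
```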

educe.rst_dt.learning.base module

Basics for feature extraction

class educe.rst_dt.learning.base.DocumentPlusPreprocessor(token_filter=None, word2clust=None)

Bases: object

Preprocessor for feature extraction on a DocumentPlus

This pre-processor currently does not explicitly impute missing values, but it probably should eventually. As the ultimate output is features in a sparse format, the current strategy amounts to imputing missing values as 0, which is most certainly not optimal.
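
The effect of implicit imputation can be seen in a small, library-free sketch. The vectorize and densify helpers and the feature names are hypothetical and only stand in for a sparse vectorization step; the point is that a feature absent from an EDU becomes indistinguishable from a feature whose value is 0.

```python
def vectorize(feature_dicts):
    """Build a vocabulary and one sparse row (column index -> value) per dict."""
    vocab = {}
    for feats in feature_dicts:
        for name in feats:
            vocab.setdefault(name, len(vocab))
    rows = [{vocab[name]: val for name, val in feats.items()}
            for feats in feature_dicts]
    return vocab, rows


def densify(row, n_cols):
    """Expand a sparse row; columns that were never stored read as 0."""
    return [row.get(j, 0) for j in range(n_cols)]


# Hypothetical basic features: the second EDU has no 'sentence_id' at all.
edus = [{"num_tokens": 7, "sentence_id": 0}, {"num_tokens": 3}]
vocab, rows = vectorize(edus)
dense = [densify(row, len(vocab)) for row in rows]
print(dense)  # the missing 'sentence_id' now looks like sentence 0
```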

preprocess(doc, strict=False)

Preprocess a document and output basic features for each EDU.

Parameters:doc (DocumentPlus) – Document to be processed.
Returns:
  • edu_infos (list of dict of features) – List of basic features for each EDU; each feature is a (basic_feat_name, basic_feat_val) pair.
  • para_infos (list of dict of features) – List of basic features for each paragraph; each feature is a (basic_feat_name, basic_feat_val) pair.
exception educe.rst_dt.learning.base.FeatureExtractionException(msg)

Bases: exceptions.Exception

Exceptions related to RST trees not looking like we would expect them to

educe.rst_dt.learning.base.edu_feature(wrapped)

Lift a function from edu -> feature to single_function_input -> feature

educe.rst_dt.learning.base.edu_pair_feature(wrapped)

Lift a function from (edu, edu) -> f to pair_function_input -> f

educe.rst_dt.learning.base.lowest_common_parent(treepositions)

Find tree position of the lowest common parent of a list of nodes.

Parameters:treepositions (list of tree positions) – see nltk.tree.Tree.treepositions()
Returns:tpos_parent – Tree position of the lowest common parent to all the given tree positions.
Return type:tree position
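
One way to picture this computation, assuming tree positions are tuples of child indices as returned by nltk.tree.Tree.treepositions(), is as the longest prefix shared by all positions. The function below is a sketch under that assumption and may differ from educe's implementation; note that for a single position it returns the position itself.

```python
def common_prefix_parent(treepositions):
    """Return the longest prefix shared by all the given tree positions."""
    if not treepositions:
        raise ValueError("need at least one tree position")
    prefix = tuple(treepositions[0])
    for tpos in treepositions[1:]:
        common = 0
        for a, b in zip(prefix, tpos):
            if a != b:
                break
            common += 1
        # Keep only the part of the prefix that this position also shares.
        prefix = prefix[:common]
    return prefix


nodes = [(0, 0, 1), (0, 1), (0, 0, 0)]
parent = common_prefix_parent(nodes)
print(parent)
```

Here all three nodes sit under the first child of the root, so the shared prefix is (0,); positions with no shared prefix yield the root position ().
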
educe.rst_dt.learning.base.on_first_bigram(wrapped)

Lift a function from a -> string to [a] -> string. The function is applied to up to the first two elements of the list and the results are concatenated. Returns None if the list is empty.

educe.rst_dt.learning.base.on_first_unigram(wrapped)

Lift a function from a -> b to [a] -> b, taking the first item or returning None if the list is empty.

educe.rst_dt.learning.base.on_last_bigram(wrapped)

Lift a function from a -> string to [a] -> string. The function is applied to up to the last two elements of the list and the results are concatenated. Returns None if the list is empty.

educe.rst_dt.learning.base.on_last_unigram(wrapped)

Lift a function from a -> b to [a] -> b, taking the last item or returning None if the list is empty.
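
The four lifting helpers above can be pictured as decorators along the following lines. This is a sketch rather than educe's code: the _sketch names are made up, and the separator used to concatenate bigram results (SEP) is an assumption, since the documentation does not specify it.

```python
from functools import wraps

SEP = " "  # assumed separator for concatenated bigram results


def on_first_unigram_sketch(wrapped):
    """Lift a -> b to [a] -> b using the first item (None if empty)."""
    @wraps(wrapped)
    def inner(things):
        return wrapped(things[0]) if things else None
    return inner


def on_last_unigram_sketch(wrapped):
    """Lift a -> b to [a] -> b using the last item (None if empty)."""
    @wraps(wrapped)
    def inner(things):
        return wrapped(things[-1]) if things else None
    return inner


def on_first_bigram_sketch(wrapped):
    """Lift a -> str to [a] -> str over up to the first two items."""
    @wraps(wrapped)
    def inner(things):
        return SEP.join(wrapped(x) for x in things[:2]) if things else None
    return inner


def on_last_bigram_sketch(wrapped):
    """Lift a -> str to [a] -> str over up to the last two items."""
    @wraps(wrapped)
    def inner(things):
        return SEP.join(wrapped(x) for x in things[-2:]) if things else None
    return inner


@on_first_bigram_sketch
def first_two_words(token):
    return token.lower()


@on_last_unigram_sketch
def last_word(token):
    return token.lower()


tokens = ["The", "Cat", "Sat"]
print(first_two_words(tokens), "|", last_word(tokens), "|", last_word([]))
```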

educe.rst_dt.learning.doc_vectorizer module

This submodule implements document vectorizers

class educe.rst_dt.learning.doc_vectorizer.DocumentCountVectorizer(instance_generator, feature_set, lecsie_data_dir=None, max_df=1.0, min_df=1, max_features=None, vocabulary=None, separator='=', split_feat_space=None)

Bases: object

Fancy vectorizer for the RST-DT treebank.

See sklearn.feature_extraction.text.CountVectorizer for reference.

build_analyzer()

Return a callable that extracts feature vectors from a doc

decode(doc)

Decode the input into a DocumentPlus.

Currently a no-op except for type checking.

Parameters:doc (educe.rst_dt.document_plus.DocumentPlus) – Rich representation of the document.
Returns:doc – Rich representation of the document.
Return type:educe.rst_dt.document_plus.DocumentPlus
fit(raw_documents, y=None)

Learn a vocabulary dictionary of all features from the documents

fit_transform(raw_documents, y=None)

Learn the vocabulary dictionary and generate a feature matrix per document.

transform(raw_documents)

Transform documents to a feature matrix.

Generate a feature matrix, one row per instance.

Parameters:raw_documents (TODO) – TODO
Yields:row ((row, (tgt, src))) – Feature vector for the next instance.
class educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtractor(instance_generator, ordered_pairs=True, unknown_label='__UNK__', labelset=None)

Bases: object

Label extractor for the RST-DT treebank.

fixed_labelset_

boolean – True if the labelset has been fixed, i.e. self has been fit.

labelset_

dict – A mapping of labels to indices.

build_analyzer()

Return a callable that extracts feature vectors from a doc

decode(doc)

Currently a no-op if doc is a DocumentPlus.

Raises an exception otherwise. Previously, this method decoded the input into a DocumentPlus.

Parameters:doc (DocumentPlus) – Rich representation of the document.
Returns:doc – Rich representation of doc.
Return type:DocumentPlus
fit(raw_documents)

Learn a labelset from the documents

fit_transform(raw_documents)

Learn the label encoder and return a vector of labels

There is one label per instance extracted from raw_documents.

transform(raw_documents)

Transform documents to a label vector
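
The fit / transform pattern with an unknown label can be sketched as follows. The class below is a simplified stand-in that works on plain lists of label strings rather than on documents; its name, the placement of the unknown label at index 0, and the example labels are assumptions, not educe's behaviour.

```python
class LabelEncoderSketch(object):
    """Map labels to indices, falling back to an 'unknown' label."""

    def __init__(self, unknown_label="__UNK__", labelset=None):
        self.unknown_label = unknown_label
        # A labelset given up front is treated as fixed.
        self.fixed_labelset_ = labelset is not None
        self.labelset_ = dict(labelset) if labelset is not None else {}
        # Assumption: the unknown label always has an index of its own.
        self.labelset_.setdefault(unknown_label, len(self.labelset_))

    def fit(self, labels):
        if not self.fixed_labelset_:
            for label in labels:
                self.labelset_.setdefault(label, len(self.labelset_))
            self.fixed_labelset_ = True
        return self

    def transform(self, labels):
        unk = self.labelset_[self.unknown_label]
        return [self.labelset_.get(label, unk) for label in labels]

    def fit_transform(self, labels):
        return self.fit(labels).transform(labels)


enc = LabelEncoderSketch()
train = enc.fit_transform(["elaboration", "attribution", "elaboration"])
test = enc.transform(["attribution", "contrast"])  # 'contrast' was never seen
print(train, test)
```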

educe.rst_dt.learning.doc_vectorizer.re_emit(feats, suff)

Re-emit feats with suff appended to each feature name
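
A minimal sketch of this idea, assuming feats is an iterable of (feature name, value) pairs; the actual representation used by educe may differ, and the suffix and feature names below are illustrative.

```python
def re_emit_sketch(feats, suff):
    """Yield each (name, value) pair with suff appended to the name."""
    for name, val in feats:
        yield (name + suff, val)


edu_feats = [("num_tokens", 7), ("first_word", "the")]
renamed = list(re_emit_sketch(edu_feats, "_EDU1"))
print(renamed)
```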

educe.rst_dt.learning.features module

Feature extraction library functions for RST_DT corpus

educe.rst_dt.learning.features.build_doc_preprocessor()

Build the preprocessor for feature extraction in each EDU of doc

educe.rst_dt.learning.features.build_edu_feature_extractor()

Build the feature extractor for single EDUs

educe.rst_dt.learning.features.build_pair_feature_extractor()

Build the feature extractor for pairs of EDUs

TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names

educe.rst_dt.learning.features.combine_features(feats_g, feats_d, feats_gd)

Generate features by taking a (linear) combination of features.

I suspect these do not have a great impact, if any, on results.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:

cf – combined features

Return type:

dict(feat_name, feat_val)

educe.rst_dt.learning.features.extract_pair_gap(doc, edu_info1, edu_info2)

Document tuple features

educe.rst_dt.learning.features.extract_pair_pos_tags(doc, edu_info1, edu_info2)

POS tag features on EDU pairs

educe.rst_dt.learning.features.extract_pair_raw_word(doc, edu_info1, edu_info2)

Raw word features on EDU pairs

educe.rst_dt.learning.features.extract_single_ptb_token_pos(doc, edu_info, para_info)

POS features on PTB tokens for the EDU

educe.rst_dt.learning.features.extract_single_ptb_token_word(doc, edu_info, para_info)

Word features on PTB tokens for the EDU

educe.rst_dt.learning.features.extract_single_raw_word(doc, edu_info, para_info)

Raw word features for the EDU

educe.rst_dt.learning.features.product_features(feats_g, feats_d, feats_gd)

Generate features by taking the product of features.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:

pf – product features

Return type:

dict(feat_name, feat_val)
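
The documentation does not say which features are multiplied or how. One plausible reading, shown here purely as an assumption, pairs the gov and dep values of every feature name the two EDUs share; the _pair suffix and the feature names are hypothetical, and feats_gd is accepted only to mirror the signature.

```python
def product_features_sketch(feats_g, feats_d, feats_gd):
    """Pair the gov and dep values of each feature both EDUs define."""
    pf = {}
    for name in sorted(set(feats_g) & set(feats_d)):
        pf[name + "_pair"] = (feats_g[name], feats_d[name])
    return pf  # feats_gd is unused in this sketch


gov = {"first_pos": "DT", "num_tokens": 7}
dep = {"first_pos": "IN", "last_word": "said"}
pf = product_features_sketch(gov, dep, {})
print(pf)
```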

educe.rst_dt.learning.features_dev module

Experimental features.

class educe.rst_dt.learning.features_dev.LecsieFeats(lecsie_data_dir)

Bases: object

Extract Lecsie features from each pair of EDUs

fit(edu_pairs, y=None)

Fit the feature extractor.

Currently a no-op.

Parameters:
  • edu_pairs (TODO) – TODO
  • y (TODO, optional) – TODO
Returns:

self – TODO

Return type:

TODO

transform(edu_pairs)

Extract lecsie features for pairs of EDUs.

This is a generator.

Parameters:edu_pairs (TODO) – TODO
Returns:res – TODO
Return type:TODO
educe.rst_dt.learning.features_dev.build_doc_preprocessor()

Build the preprocessor for feature extraction in each EDU of doc

educe.rst_dt.learning.features_dev.build_edu_feature_extractor()

Build the feature extractor for single EDUs

educe.rst_dt.learning.features_dev.build_pair_feature_extractor(lecsie_data_dir=None)

Build the feature extractor for pairs of EDUs

TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names

educe.rst_dt.learning.features_dev.combine_features(feats_g, feats_d, feats_gd)

Generate features by taking a (linear) combination of features.

I suspect these do not have a great impact, if any, on results.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:

cf – combined features

Return type:

dict(feat_name, feat_val)

educe.rst_dt.learning.features_dev.extract_pair_doc(doc, edu_info1, edu_info2, edu_info_bwn)

Document-level tuple features

educe.rst_dt.learning.features_dev.extract_pair_para(doc, edu_info1, edu_info2, edu_info_bwn)

Paragraph tuple features

educe.rst_dt.learning.features_dev.extract_pair_sent(doc, edu_info1, edu_info2, edu_info_bwn)

Sentence tuple features

educe.rst_dt.learning.features_dev.extract_pair_syntax(doc, edu_info1, edu_info2, edu_info_bwn)

Syntactic features for the pair of EDUs

educe.rst_dt.learning.features_dev.extract_single_brown(doc, edu_info, para_info)

Brown cluster features for the EDU

educe.rst_dt.learning.features_dev.extract_single_length(doc, edu_info, para_info)

Sentence features for the EDU

educe.rst_dt.learning.features_dev.extract_single_para(doc, edu_info, para_info)

Paragraph features for the EDU

educe.rst_dt.learning.features_dev.extract_single_pdtb_markers(doc, edu_info, para_info)

Features on the presence of PDTB discourse markers in the EDU

educe.rst_dt.learning.features_dev.extract_single_pos(doc, edu_info, para_info)

POS features for the EDU

educe.rst_dt.learning.features_dev.extract_single_sentence(doc, edu_info, para_info)

Sentence features for the EDU

educe.rst_dt.learning.features_dev.extract_single_syntax(doc, edu_info, para_info)

Syntactic features for the EDU

educe.rst_dt.learning.features_dev.extract_single_typo(doc, edu_info, para_info)

Typographical features for the EDU

educe.rst_dt.learning.features_dev.extract_single_word(doc, edu_info, para_info)

Word features for the EDU

educe.rst_dt.learning.features_dev.is_title_cased(tok_seq)

True if a sequence of tokens is title-cased

educe.rst_dt.learning.features_dev.is_upper_entire(tok_seq)

True if a sequence is fully upper-cased

educe.rst_dt.learning.features_dev.is_upper_init(tok_seq)

True if a sequence starts with two upper-cased tokens
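
The three typographical predicates above can be approximated with Python's built-in string methods. The sketch assumes tokens are plain strings and treats an empty sequence as False; educe's token type and edge-case behaviour may differ.

```python
def is_title_cased_sketch(tok_seq):
    """True if every token is title-cased."""
    return bool(tok_seq) and all(tok.istitle() for tok in tok_seq)


def is_upper_entire_sketch(tok_seq):
    """True if every token is fully upper-cased."""
    return bool(tok_seq) and all(tok.isupper() for tok in tok_seq)


def is_upper_init_sketch(tok_seq):
    """True if the sequence starts with two upper-cased tokens."""
    return len(tok_seq) >= 2 and all(tok.isupper() for tok in tok_seq[:2])


print(is_title_cased_sketch(["Wall", "Street"]),
      is_upper_entire_sketch(["NEW", "YORK"]),
      is_upper_init_sketch(["NEW", "YORK", "times"]))
```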

educe.rst_dt.learning.features_dev.product_features(feats_g, feats_d, feats_gd)

Generate features by taking the product of features.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:

pf – product features

Return type:

dict(feat_name, feat_val)

educe.rst_dt.learning.features_dev.split_feature_space(feats_g, feats_d, feats_gd, keep_original=False, split_criterion='dir')

Split feature space on a criterion.

Currently supported criteria are:
  • ‘dir’: directionality of attachment,
  • ‘sent’: intra/inter-sentential,
  • ‘dir_sent’: directionality + intra/inter-sentential.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
  • keep_original (boolean, default=False) – whether to keep or replace the original features with the derived split features
  • split_criterion (string) – feature(s) on which to split the feature space, options are ‘dir’ for directionality of attachment, ‘sent’ for intra/inter sentential, ‘dir_sent’ for their conjunction
Returns:

feats_g, feats_d, feats_gd – dicts of features with their copies

Return type:

(dict(feat_name, feat_val))

Notes

This function should probably be generalized and moved to a more relevant place.
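
Splitting a feature space typically means copying each feature under a name that encodes the value of the split criterion, so that a linear model can learn separate weights per case. The sketch below follows that idea; the attach_right and same_sentence keys in feats_gd and the suffix format are hypothetical, and educe may determine the criterion values differently.

```python
def split_feature_space_sketch(feats_g, feats_d, feats_gd,
                               keep_original=False, split_criterion="dir"):
    """Copy every feature under a name suffixed with the split value."""
    parts = []
    if split_criterion in ("dir", "dir_sent"):
        # Hypothetical edge feature giving the direction of attachment.
        parts.append("right" if feats_gd["attach_right"] else "left")
    if split_criterion in ("sent", "dir_sent"):
        # Hypothetical edge feature: are both EDUs in the same sentence?
        parts.append("intra" if feats_gd["same_sentence"] else "inter")
    if not parts:
        raise ValueError("unknown split criterion: " + split_criterion)
    suffix = "_" + "_".join(parts)

    def split(feats):
        out = dict(feats) if keep_original else {}
        out.update({name + suffix: val for name, val in feats.items()})
        return out

    return split(feats_g), split(feats_d), split(feats_gd)


gov, dep = {"pos": "NN"}, {"pos": "VB"}
edge = {"attach_right": True, "same_sentence": False}
g, d, gd = split_feature_space_sketch(gov, dep, edge,
                                      split_criterion="dir_sent")
g_keep, _, _ = split_feature_space_sketch(gov, dep, edge, keep_original=True)
print(g, d, g_keep)
```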

educe.rst_dt.learning.features_dev.token_filter_li2014(token)

Token filter defined in Li et al.’s parser.

This filter only applies to tagged tokens.

educe.rst_dt.learning.features_li2014 module

Partial re-implementation of the feature extraction procedure used in [li2014text] for discourse dependency parsing on the RST-DT corpus.

[li2014text] Li, S., Wang, L., Cao, Z., & Li, W. (2014). Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 25-35). http://www.aclweb.org/anthology/P/P14/P14-1003.pdf

educe.rst_dt.learning.features_li2014.build_doc_preprocessor()

Build the preprocessor for feature extraction in each EDU of doc

educe.rst_dt.learning.features_li2014.build_edu_feature_extractor()

Build the feature extractor for single EDUs

educe.rst_dt.learning.features_li2014.build_pair_feature_extractor()

Build the feature extractor for pairs of EDUs

TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names

educe.rst_dt.learning.features_li2014.combine_features(feats_g, feats_d, feats_gd)

Generate features by taking a (linear) combination of features.

I suspect these do not have a great impact, if any, on results.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:

cf – combined features

Return type:

dict(feat_name, feat_val)

educe.rst_dt.learning.features_li2014.extract_pair_length(doc, edu_info1, edu_info2)

Sentence tuple features

educe.rst_dt.learning.features_li2014.extract_pair_para(doc, edu_info1, edu_info2)

Paragraph tuple features

educe.rst_dt.learning.features_li2014.extract_pair_pos(doc, edu_info1, edu_info2)

POS tuple features

educe.rst_dt.learning.features_li2014.extract_pair_sent(doc, edu_info1, edu_info2)

Sentence tuple features

educe.rst_dt.learning.features_li2014.extract_pair_word(doc, edu_info1, edu_info2)

Word tuple features

educe.rst_dt.learning.features_li2014.extract_single_length(doc, edu_info, para_info)

Sentence features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_para(doc, edu_info, para_info)

Paragraph features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_pos(doc, edu_info, para_info)

POS features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_sentence(doc, edu_info, para_info)

Sentence features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_syntax(doc, edu_info, para_info)

Syntactic features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_word(doc, edu_info, para_info)

Word features for the EDU

educe.rst_dt.learning.features_li2014.get_syntactic_labels(doc, edu_info)

Syntactic labels for this EDU

educe.rst_dt.learning.features_li2014.product_features(feats_g, feats_d, feats_gd)

Generate features by taking the product of features.

Parameters:
  • feats_g (dict(feat_name, feat_val)) – features of the gov EDU
  • feats_d (dict(feat_name, feat_val)) – features of the dep EDU
  • feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:

pf – product features

Return type:

dict(feat_name, feat_val)

educe.rst_dt.learning.features_li2014.token_filter_li2014(token)

Token filter defined in Li et al.’s parser.

This filter only applies to tagged tokens.