educe.rst_dt.learning package¶

Submodules¶

educe.rst_dt.learning.args module¶

Command line options for learning commands

class educe.rst_dt.learning.args.FeatureSetAction(option_strings, dest, nargs=None, **kwargs)¶

Bases: argparse.Action

Select the desired feature set
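As an illustration of the general mechanism, a custom argparse.Action can map a command-line value to a feature set. The sketch below is not the educe implementation: the class name `FeatureSetChoice`, the `FEATURE_SETS` registry, and the `--feature-set` option are hypothetical names chosen for this example.

```python
import argparse

# Hypothetical registry: the real feature-set names are not listed on this page.
FEATURE_SETS = {"dev": "features_dev", "li2014": "features_li2014"}


class FeatureSetChoice(argparse.Action):
    """Illustrative stand-in for an Action that selects a feature set."""

    def __call__(self, parser, namespace, values, option_string=None):
        if values not in FEATURE_SETS:
            parser.error("unknown feature set: {}".format(values))
        # Store the resolved module name under the option's destination.
        setattr(namespace, self.dest, FEATURE_SETS[values])


parser = argparse.ArgumentParser()
parser.add_argument("--feature-set", action=FeatureSetChoice,
                    default="features_dev")
args = parser.parse_args(["--feature-set", "li2014"])
```

After parsing, `args.feature_set` holds the resolved name, so downstream code does not need to know which command-line spelling was used.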

educe.rst_dt.learning.args.add_usual_input_args(parser)¶: Augment a subcommand argparser with typical input arguments. Sometimes your subcommand may require slightly different input arguments, in which case, just don’t call this function.

educe.rst_dt.learning.base module¶

Basics for feature extraction

class educe.rst_dt.learning.base.DocumentPlusPreprocessor(token_filter=None, word2clust=None)¶

Bases: object

Preprocessor for feature extraction on a DocumentPlus

This pre-processor currently does not explicitly impute missing values, but it probably should eventually. As the ultimate output is features in a sparse format, the current strategy amounts to imputing missing values as 0, which is most certainly not optimal.
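The effect described above can be seen in a small, library-free sketch. The helper names and the two feature names (`num_tokens`, `sentence_id`) are made up for the example; the point is only that a sparse row cannot distinguish a missing feature from one observed as 0.

```python
def to_sparse_row(feats, vocabulary):
    """Keep only (column, value) entries for features that are present."""
    return {vocabulary[name]: val
            for name, val in feats.items() if name in vocabulary}


def to_dense_row(sparse_row, n_cols):
    """Expand a sparse row; absent columns read back as 0.0."""
    return [sparse_row.get(j, 0.0) for j in range(n_cols)]


vocabulary = {"num_tokens": 0, "sentence_id": 1}

# sentence_id was observed and is genuinely 0
observed_zero = to_dense_row(
    to_sparse_row({"num_tokens": 7.0, "sentence_id": 0.0}, vocabulary), 2)
# sentence_id is missing altogether
missing = to_dense_row(
    to_sparse_row({"num_tokens": 7.0}, vocabulary), 2)
```

Both rows come out identical, which is why implicit zero-imputation can be misleading for features where 0 is a meaningful value.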

preprocess(doc, strict=False)¶

Preprocess a document and output basic features for each EDU.

Parameters:	doc (DocumentPlus) – Document to be processed.
Returns:	edu_infos (list of dict of features) – List of basic features for each EDU; each feature is a pair (basic_feat_name, basic_feat_val).
	para_infos (list of dict of features) – List of basic features for each paragraph; each feature is a pair (basic_feat_name, basic_feat_val).

exception educe.rst_dt.learning.base.FeatureExtractionException(msg)¶

Bases: exceptions.Exception

Exceptions related to RST trees not looking like we would expect them to

educe.rst_dt.learning.base.edu_feature(wrapped)¶: Lift a function from edu -> feature to single_function_input -> feature

educe.rst_dt.learning.base.edu_pair_feature(wrapped)¶: Lift a function from (edu, edu) -> f to pair_function_input -> f

educe.rst_dt.learning.base.lowest_common_parent(treepositions)¶

Find tree position of the lowest common parent of a list of nodes.

Parameters:	treepositions (`list` of tree positions) – see nltk.tree.Tree.treepositions()
Returns:	tpos_parent – Tree position of the lowest common parent to all the given tree positions.
Return type:	tree position
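Since NLTK tree positions are tuples of child indices, one way to picture this computation is as a longest-common-prefix search. The function below is a sketch under that assumption, not the library code; note that in this sketch a single position is returned unchanged, i.e. a node counts as its own lowest common parent.

```python
def lowest_common_parent_sketch(treepositions):
    """Return the longest common prefix of a list of tree positions."""
    if not treepositions:
        return None
    shortest = min(treepositions, key=len)
    for i, step in enumerate(shortest):
        # Stop at the first depth where the paths from the root diverge.
        if any(tpos[i] != step for tpos in treepositions):
            return tuple(shortest[:i])
    return tuple(shortest)
```

For example, the positions (0, 1, 0), (0, 1, 1) and (0, 2) share only the prefix (0,), so the node at (0,) dominates all three.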

educe.rst_dt.learning.base.on_first_bigram(wrapped)¶: Lift a function from a -> string to [a] -> string. The function is applied to up to the first two elements of the list and the results are concatenated. Returns None if the list is empty.

educe.rst_dt.learning.base.on_first_unigram(wrapped)¶: Lift a function from a -> b to [a] -> b taking the first item or returning None if empty list

educe.rst_dt.learning.base.on_last_bigram(wrapped)¶: Lift a function from a -> string to [a] -> string. The function is applied to up to the last two elements of the list and the results are concatenated. Returns None if the list is empty.

educe.rst_dt.learning.base.on_last_unigram(wrapped)¶: Lift a function from a -> b to [a] -> b taking the last item or returning None if empty list
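The four lifting decorators above follow the same pattern. A minimal sketch of two of them is given here, with `_sketch` suffixes to mark them as illustrations; the "_" joiner used for bigrams is an assumption, since the descriptions only say the results are concatenated.

```python
from functools import wraps


def on_first_unigram_sketch(wrapped):
    """Lift a -> b to [a] -> b, using the first item (None if empty)."""
    @wraps(wrapped)
    def inner(items):
        return wrapped(items[0]) if items else None
    return inner


def on_last_bigram_sketch(wrapped):
    """Lift a -> str to [a] -> str over up to the last two items."""
    @wraps(wrapped)
    def inner(items):
        if not items:
            return None
        # The "_" joiner is an assumption made for this example.
        return "_".join(wrapped(x) for x in items[-2:])
    return inner


@on_first_unigram_sketch
def first_lower(word):
    return word.lower()


@on_last_bigram_sketch
def last_two_upper(word):
    return word.upper()
```

With these definitions, `first_lower(["The", "cat"])` applies the wrapped function to "The" only, and `last_two_upper(["a", "b", "c"])` applies it to "b" and "c" before joining.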

educe.rst_dt.learning.doc_vectorizer module¶

This submodule implements document vectorizers

class educe.rst_dt.learning.doc_vectorizer.DocumentCountVectorizer(instance_generator, feature_set, lecsie_data_dir=None, max_df=1.0, min_df=1, max_features=None, vocabulary=None, separator='=', split_feat_space=None)¶

Bases: object

Fancy vectorizer for the RST-DT treebank.

See sklearn.feature_extraction.text.CountVectorizer for reference.

build_analyzer()¶: Return a callable that extracts feature vectors from a doc

decode(doc)¶

Decode the input into a DocumentPlus.

Currently a no-op except for type checking.

Parameters:	doc (educe.rst_dt.document_plus.DocumentPlus) – Rich representation of the document.
Returns:	doc – Rich representation of the document.
Return type:	educe.rst_dt.document_plus.DocumentPlus

fit(raw_documents, y=None)¶: Learn a vocabulary dictionary of all features from the documents

fit_transform(raw_documents, y=None)¶: Learn the vocabulary dictionary and generate a feature matrix per document.

transform(raw_documents)¶

Transform documents to a feature matrix.

Generate a feature matrix, one row per instance.

Parameters:	raw_documents (TODO) – TODO
Yields:	row ((row, (tgt, src))) – Feature vector for the next instance.
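For readers unfamiliar with the CountVectorizer-style fit/transform protocol, the following sketch shows the idea on plain (name, value) pairs. It assumes, by analogy with scikit-learn's DictVectorizer, that the `separator` parameter joins a feature name and a string value into a one-hot column name; all function and feature names here are hypothetical and the real class may differ.

```python
def one_hot_name(name, val, separator="="):
    """Turn a string-valued feature into a binary one, DictVectorizer-style."""
    if isinstance(val, str):
        return "{}{}{}".format(name, separator, val), 1
    return name, val


def fit_vocabulary(instances, separator="="):
    """Assign a column index to each feature name seen during fitting."""
    vocabulary = {}
    for feats in instances:
        for name, val in feats:
            col, _ = one_hot_name(name, val, separator)
            vocabulary.setdefault(col, len(vocabulary))
    return vocabulary


def transform_instance(feats, vocabulary, separator="="):
    """Map one instance to a sparse row {column index: value}."""
    row = {}
    for name, val in feats:
        col, num = one_hot_name(name, val, separator)
        if col in vocabulary:  # features unseen at fit time are dropped
            row[vocabulary[col]] = num
    return row


train = [[("pos_first", "DT"), ("num_tokens", 5)],
         [("pos_first", "NN"), ("num_tokens", 3)]]
vocab = fit_vocabulary(train)
row = transform_instance(
    [("pos_first", "NN"), ("num_tokens", 4), ("pos_first", "VB")], vocab)
```

The value "VB" was never seen during fitting, so it contributes nothing to the row.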

class educe.rst_dt.learning.doc_vectorizer.DocumentLabelExtractor(instance_generator, ordered_pairs=True, unknown_label='__UNK__', labelset=None)¶

Bases: object

Label extractor for the RST-DT treebank.

fixed_labelset_¶: boolean – True if the labelset has been fixed, i.e. self has been fit.

labelset_¶: dict – A mapping of labels to indices.

build_analyzer()¶: Return a callable that extracts feature vectors from a doc

decode(doc)¶

Decode the input into a DocumentPlus.

Currently a no-op if doc is a DocumentPlus; raises an exception otherwise.

Parameters:	doc (DocumentPlus) – Rich representation of the document.
Returns:	doc – Rich representation of doc.
Return type:	DocumentPlus

fit(raw_documents)¶: Learn a labelset from the documents

fit_transform(raw_documents)¶

Learn the label encoder and return a vector of labels

There is one label per instance extracted from raw_documents.

transform(raw_documents)¶: Transform documents to a label vector
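The roles of `labelset_`, `fixed_labelset_` and `unknown_label` can be sketched with a tiny encoder over plain label strings. This is an assumption-laden illustration: the class name is hypothetical, it works on labels rather than documents, and reserving index 0 for the unknown label is a choice made for the example.

```python
class LabelEncoderSketch:
    """Toy label encoder with a reserved index for unknown labels."""

    def __init__(self, unknown_label="__UNK__"):
        self.unknown_label = unknown_label
        # Index 0 for the unknown label is an assumption of this sketch.
        self.labelset_ = {unknown_label: 0}
        self.fixed_labelset_ = False

    def fit(self, labels):
        for lbl in labels:
            self.labelset_.setdefault(lbl, len(self.labelset_))
        self.fixed_labelset_ = True
        return self

    def transform(self, labels):
        unk = self.labelset_[self.unknown_label]
        # Labels unseen at fit time fall back to the unknown index.
        return [self.labelset_.get(lbl, unk) for lbl in labels]


enc = LabelEncoderSketch().fit(["elaboration", "attribution", "elaboration"])
encoded = enc.transform(["attribution", "contrast"])
```

Here "contrast" was not seen during fitting, so it is mapped to the index of the unknown label.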

educe.rst_dt.learning.doc_vectorizer.re_emit(feats, suff)¶: Re-emit feats with suff appended to each feature name
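A minimal sketch of this behaviour, assuming `feats` is an iterable of (name, value) pairs as in the preprocessor output described earlier; the `_sketch` suffix and the sample feature names are hypothetical.

```python
def re_emit_sketch(feats, suff):
    """Yield each (name, value) pair with suff appended to the name."""
    for name, val in feats:
        yield (name + suff, val)


renamed = list(re_emit_sketch([("pos", "NN"), ("len", 3)], "_gov"))
```

Written as a generator, the function can be chained with other feature emitters without building intermediate lists.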

educe.rst_dt.learning.features module¶

Feature extraction library functions for RST_DT corpus

educe.rst_dt.learning.features.build_doc_preprocessor()¶: Build the preprocessor for feature extraction in each EDU of doc

educe.rst_dt.learning.features.build_edu_feature_extractor()¶: Build the feature extractor for single EDUs

educe.rst_dt.learning.features.build_pair_feature_extractor()¶

Build the feature extractor for pairs of EDUs

TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names

educe.rst_dt.learning.features.combine_features(feats_g, feats_d, feats_gd)¶

Generate features by taking a (linear) combination of features.

I suspect these do not have a great impact, if any, on results.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:	cf – combined features
Return type:	dict(feat_name, feat_val)

educe.rst_dt.learning.features.extract_pair_gap(doc, edu_info1, edu_info2)¶: Document tuple features

educe.rst_dt.learning.features.extract_pair_pos_tags(doc, edu_info1, edu_info2)¶: POS tag features on EDU pairs

educe.rst_dt.learning.features.extract_pair_raw_word(doc, edu_info1, edu_info2)¶: raw word features on EDU pairs

educe.rst_dt.learning.features.extract_single_ptb_token_pos(doc, edu_info, para_info)¶: POS features on PTB tokens for the EDU

educe.rst_dt.learning.features.extract_single_ptb_token_word(doc, edu_info, para_info)¶: word features on PTB tokens for the EDU

educe.rst_dt.learning.features.extract_single_raw_word(doc, edu_info, para_info)¶: raw word features for the EDU

educe.rst_dt.learning.features.product_features(feats_g, feats_d, feats_gd)¶

Generate features by taking the product of features.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:	pf – product features
Return type:	dict(feat_name, feat_val)

educe.rst_dt.learning.features_dev module¶

Experimental features.

class educe.rst_dt.learning.features_dev.LecsieFeats(lecsie_data_dir)¶

Bases: object

Extract Lecsie features from each pair of EDUs

fit(edu_pairs, y=None)¶

Fit the feature extractor.

Currently a no-op.

Parameters:	edu_pairs (TODO) – TODO
	y (TODO, optional) – TODO
Returns:	self – TODO
Return type:	TODO

transform(edu_pairs)¶

Extract lecsie features for pairs of EDUs.

This is a generator.

Parameters:	edu_pairs (TODO) – TODO
Returns:	res – TODO
Return type:	TODO

educe.rst_dt.learning.features_dev.build_doc_preprocessor()¶: Build the preprocessor for feature extraction in each EDU of doc

educe.rst_dt.learning.features_dev.build_edu_feature_extractor()¶: Build the feature extractor for single EDUs

educe.rst_dt.learning.features_dev.build_pair_feature_extractor(lecsie_data_dir=None)¶

Build the feature extractor for pairs of EDUs

TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names

educe.rst_dt.learning.features_dev.combine_features(feats_g, feats_d, feats_gd)¶

Generate features by taking a (linear) combination of features.

I suspect these do not have a great impact, if any, on results.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:	cf – combined features
Return type:	dict(feat_name, feat_val)

educe.rst_dt.learning.features_dev.extract_pair_doc(doc, edu_info1, edu_info2, edu_info_bwn)¶: Document-level tuple features

educe.rst_dt.learning.features_dev.extract_pair_para(doc, edu_info1, edu_info2, edu_info_bwn)¶: Paragraph tuple features

educe.rst_dt.learning.features_dev.extract_pair_sent(doc, edu_info1, edu_info2, edu_info_bwn)¶: Sentence tuple features

educe.rst_dt.learning.features_dev.extract_pair_syntax(doc, edu_info1, edu_info2, edu_info_bwn)¶: syntactic features for the pair of EDUs

educe.rst_dt.learning.features_dev.extract_single_brown(doc, edu_info, para_info)¶: Brown cluster features for the EDU

educe.rst_dt.learning.features_dev.extract_single_length(doc, edu_info, para_info)¶: Length features for the EDU

educe.rst_dt.learning.features_dev.extract_single_para(doc, edu_info, para_info)¶: paragraph features for the EDU

educe.rst_dt.learning.features_dev.extract_single_pdtb_markers(doc, edu_info, para_info)¶: Features on the presence of PDTB discourse markers in the EDU

educe.rst_dt.learning.features_dev.extract_single_pos(doc, edu_info, para_info)¶: POS features for the EDU

educe.rst_dt.learning.features_dev.extract_single_sentence(doc, edu_info, para_info)¶: Sentence features for the EDU

educe.rst_dt.learning.features_dev.extract_single_syntax(doc, edu_info, para_info)¶: syntactic features for the EDU

educe.rst_dt.learning.features_dev.extract_single_typo(doc, edu_info, para_info)¶: typographical features for the EDU

educe.rst_dt.learning.features_dev.extract_single_word(doc, edu_info, para_info)¶: word features for the EDU

educe.rst_dt.learning.features_dev.is_title_cased(tok_seq)¶: True if a sequence of tokens is title-cased

educe.rst_dt.learning.features_dev.is_upper_entire(tok_seq)¶: True if a sequence is fully upper-cased

educe.rst_dt.learning.features_dev.is_upper_init(tok_seq)¶: True if a sequence starts with two upper-cased tokens
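The three typographical predicates above can be approximated on plain strings as follows. This sketch assumes `tok_seq` is a list of word strings and treats an empty sequence as False; the actual functions may operate on token objects and handle edge cases differently.

```python
def is_title_cased_sketch(words):
    """True if every word is title-cased (empty sequence gives False)."""
    return bool(words) and all(w.istitle() for w in words)


def is_upper_entire_sketch(words):
    """True if every word is fully upper-cased."""
    return bool(words) and all(w.isupper() for w in words)


def is_upper_init_sketch(words):
    """True if the sequence starts with two upper-cased words."""
    return len(words) >= 2 and all(w.isupper() for w in words[:2])
```

For instance, ["NEW", "YORK", "stocks"] is not fully upper-cased but does start with two upper-cased words.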

educe.rst_dt.learning.features_dev.product_features(feats_g, feats_d, feats_gd)¶

Generate features by taking the product of features.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:	pf – product features
Return type:	dict(feat_name, feat_val)

educe.rst_dt.learning.features_dev.split_feature_space(feats_g, feats_d, feats_gd, keep_original=False, split_criterion='dir')¶

Split feature space on a criterion.

Currently supported criteria are:

* ‘dir’: directionality of attachment,
* ‘sent’: intra/inter-sentential,
* ‘dir_sent’: directionality + intra/inter-sentential.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
	keep_original (boolean, default=False) – whether to keep or replace the original features with the derived split features
	split_criterion (string) – feature(s) on which to split the feature space, options are ‘dir’ for directionality of attachment, ‘sent’ for intra/inter sentential, ‘dir_sent’ for their conjunction
Returns:	feats_g, feats_d, feats_gd – dicts of features with their copies
Return type:	(dict(feat_name, feat_val))

Notes

This function should probably be generalized and moved to a more relevant place.
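To make the idea of splitting a feature space concrete, the sketch below copies every feature of one dict under a name suffixed with the value of the split criterion. It is a simplified illustration on a single dict, not the library function: the helper name, the "_" naming scheme and the criterion values "left"/"right" are assumptions.

```python
def split_on_criterion(feats, criterion_value, keep_original=False):
    """Copy each feature under a name suffixed with the criterion value."""
    split = {"{}_{}".format(name, criterion_value): val
             for name, val in feats.items()}
    if keep_original:
        # Keep the unsplit features alongside their split copies.
        split.update(feats)
    return split


replaced = split_on_criterion({"dist": 2}, "right")
kept = split_on_criterion({"dist": 2}, "right", keep_original=True)
```

A linear model trained on the split features can then learn separate weights for, say, leftward and rightward attachments, while `keep_original=True` lets it also share a common weight.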

educe.rst_dt.learning.features_dev.token_filter_li2014(token)¶

Token filter defined in Li et al.’s parser.

This filter only applies to tagged tokens.

educe.rst_dt.learning.features_li2014 module¶

Partial re-implementation of the feature extraction procedure used in [li2014text] for discourse dependency parsing on the RST-DT corpus.

[li2014text] Li, S., Wang, L., Cao, Z., & Li, W. (2014). Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 25-35). http://www.aclweb.org/anthology/P/P14/P14-1003.pdf

educe.rst_dt.learning.features_li2014.build_doc_preprocessor()¶: Build the preprocessor for feature extraction in each EDU of doc

educe.rst_dt.learning.features_li2014.build_edu_feature_extractor()¶: Build the feature extractor for single EDUs

educe.rst_dt.learning.features_li2014.build_pair_feature_extractor()¶

Build the feature extractor for pairs of EDUs

TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names

educe.rst_dt.learning.features_li2014.combine_features(feats_g, feats_d, feats_gd)¶

Generate features by taking a (linear) combination of features.

I suspect these do not have a great impact, if any, on results.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:	cf – combined features
Return type:	dict(feat_name, feat_val)

educe.rst_dt.learning.features_li2014.extract_pair_length(doc, edu_info1, edu_info2)¶: Length tuple features

educe.rst_dt.learning.features_li2014.extract_pair_para(doc, edu_info1, edu_info2)¶: Paragraph tuple features

educe.rst_dt.learning.features_li2014.extract_pair_pos(doc, edu_info1, edu_info2)¶: POS tuple features

educe.rst_dt.learning.features_li2014.extract_pair_sent(doc, edu_info1, edu_info2)¶: Sentence tuple features

educe.rst_dt.learning.features_li2014.extract_pair_word(doc, edu_info1, edu_info2)¶: word tuple features

educe.rst_dt.learning.features_li2014.extract_single_length(doc, edu_info, para_info)¶: Length features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_para(doc, edu_info, para_info)¶: paragraph features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_pos(doc, edu_info, para_info)¶: POS features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_sentence(doc, edu_info, para_info)¶: Sentence features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_syntax(doc, edu_info, para_info)¶: syntactic features for the EDU

educe.rst_dt.learning.features_li2014.extract_single_word(doc, edu_info, para_info)¶: word features for the EDU

educe.rst_dt.learning.features_li2014.get_syntactic_labels(doc, edu_info)¶: Syntactic labels for this EDU

educe.rst_dt.learning.features_li2014.product_features(feats_g, feats_d, feats_gd)¶

Generate features by taking the product of features.

Parameters:	feats_g (dict(feat_name, feat_val)) – features of the gov EDU
	feats_d (dict(feat_name, feat_val)) – features of the dep EDU
	feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns:	pf – product features
Return type:	dict(feat_name, feat_val)

educe.rst_dt.learning.features_li2014.token_filter_li2014(token)¶

Token filter defined in Li et al.’s parser.

This filter only applies to tagged tokens.