educe.rst_dt.learning package
Submodules
educe.rst_dt.learning.args module
Command line options for learning commands
-
class
educe.rst_dt.learning.args.
FeatureSetAction
(option_strings, dest, nargs=None, **kwargs)¶ Bases:
argparse.Action
Select the desired feature set
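As a rough illustration of how an argparse.Action subclass can select a feature set, here is a minimal sketch. The option name `--feature_set`, the `FEATURE_SETS` mapping and the class name are all hypothetical; the real FeatureSetAction may store something quite different on the namespace.

```python
import argparse

# Hypothetical mapping from a command-line name to a feature module name.
FEATURE_SETS = {'dev': 'features_dev', 'li2014': 'features_li2014'}


class FeatureSetActionSketch(argparse.Action):
    """Illustrative stand-in, not the library's FeatureSetAction."""

    def __call__(self, parser, namespace, values, option_string=None):
        if values not in FEATURE_SETS:
            parser.error('unknown feature set: {}'.format(values))
        # Store the resolved feature set under the option's destination.
        setattr(namespace, self.dest, FEATURE_SETS[values])


parser = argparse.ArgumentParser()
parser.add_argument('--feature_set', action=FeatureSetActionSketch,
                    default=FEATURE_SETS['dev'])
args = parser.parse_args(['--feature_set', 'li2014'])
default_args = parser.parse_args([])
```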
-
educe.rst_dt.learning.args.
add_usual_input_args
(parser) Augment a subcommand argparser with the typical input arguments. If your subcommand requires slightly different input arguments, just don’t call this function.
educe.rst_dt.learning.base module
Basics for feature extraction
-
class
educe.rst_dt.learning.base.
DocumentPlusPreprocessor
(token_filter=None, word2clust=None)¶ Bases:
object
Preprocessor for feature extraction on a DocumentPlus
This pre-processor currently does not explicitly impute missing values, but it probably should eventually. As the ultimate output is features in a sparse format, the current strategy amounts to imputing missing values as 0, which is most certainly not optimal.
-
preprocess
(doc, strict=False) Preprocess a document and output basic features for each EDU.
Parameters: doc (DocumentPlus) – Document to be processed.
Returns: - edu_infos (list of dict of features) – List of basic features for each EDU; each feature is a pair (basic_feat_name, basic_feat_val).
- para_infos (list of dict of features) – List of basic features for each paragraph; each feature is a pair (basic_feat_name, basic_feat_val).
-
-
exception
educe.rst_dt.learning.base.
FeatureExtractionException
(msg)¶ Bases:
exceptions.Exception
Exception raised when an RST tree does not have the expected structure
-
educe.rst_dt.learning.base.
edu_feature
(wrapped)¶ Lift a function from edu -> feature to single_function_input -> feature
-
educe.rst_dt.learning.base.
edu_pair_feature
(wrapped) Lift a function from (edu, edu) -> f to pair_function_input -> f
-
educe.rst_dt.learning.base.
lowest_common_parent
(treepositions)¶ Find tree position of the lowest common parent of a list of nodes.
Parameters: treepositions (list of tree positions) – see nltk.tree.Tree.treepositions()
Returns: tpos_parent – Tree position of the lowest common parent to all the given tree positions.
Return type: tree position
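The idea can be sketched as a longest-common-prefix computation, assuming tree positions are tuples of child indices as produced by nltk.tree.Tree.treepositions(). The function name below is hypothetical, and the library's handling of edge cases (for instance when one position is an ancestor of another) may differ.

```python
def lowest_common_parent_sketch(treepositions):
    """Longest common prefix of a list of tree positions (tuples of ints)."""
    if not treepositions:
        return None
    prefix = []
    # zip(*...) walks all positions in parallel, one tree depth at a time,
    # and stops at the depth of the shallowest position.
    for steps in zip(*treepositions):
        if all(step == steps[0] for step in steps):
            prefix.append(steps[0])
        else:
            break
    return tuple(prefix)


example = lowest_common_parent_sketch([(0, 1, 0), (0, 1, 1), (0, 2)])
```

Here the three positions only share their first step, so `example` is the position `(0,)`.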
-
educe.rst_dt.learning.base.
on_first_bigram
(wrapped) Lift a function from a -> string to [a] -> string. The function is applied to the first two elements of the list (or fewer if the list is shorter) and the results are concatenated. Returns None if the list is empty.
-
educe.rst_dt.learning.base.
on_first_unigram
(wrapped) Lift a function from a -> b to [a] -> b, taking the first item, or returning None if the list is empty.
-
educe.rst_dt.learning.base.
on_last_bigram
(wrapped) Lift a function from a -> string to [a] -> string. The function is applied to the last two elements of the list (or fewer if the list is shorter) and the results are concatenated. Returns None if the list is empty.
-
educe.rst_dt.learning.base.
on_last_unigram
(wrapped) Lift a function from a -> b to [a] -> b, taking the last item, or returning None if the list is empty.
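The four lifting decorators above follow one pattern, which the following sketch illustrates for two of them. The `_sketch` names are hypothetical, and joining the bigram results with a single space is an assumption: the documentation only says the results are concatenated.

```python
import functools

SEP = ' '  # assumed separator; the source does not say how results are joined


def on_first_unigram_sketch(wrapped):
    """Lift a -> b to [a] -> b using the first item (None if empty)."""
    @functools.wraps(wrapped)
    def inner(items):
        return wrapped(items[0]) if items else None
    return inner


def on_last_bigram_sketch(wrapped):
    """Lift a -> string to [a] -> string using up to the last two items."""
    @functools.wraps(wrapped)
    def inner(items):
        if not items:
            return None
        return SEP.join(wrapped(item) for item in items[-2:])
    return inner


def lower(word):
    """Toy feature function on a single item."""
    return word.lower()


first = on_first_unigram_sketch(lower)(['The', 'Cat', 'Sat'])
last_two = on_last_bigram_sketch(lower)(['The', 'Cat', 'Sat'])
```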
educe.rst_dt.learning.doc_vectorizer module
This submodule implements document vectorizers
-
class
educe.rst_dt.learning.doc_vectorizer.
DocumentCountVectorizer
(instance_generator, feature_set, lecsie_data_dir=None, max_df=1.0, min_df=1, max_features=None, vocabulary=None, separator='=', split_feat_space=None)¶ Bases:
object
Fancy vectorizer for the RST-DT treebank.
See sklearn.feature_extraction.text.CountVectorizer for reference.
-
build_analyzer
()¶ Return a callable that extracts feature vectors from a doc
-
decode
(doc)¶ Decode the input into a DocumentPlus.
Currently a no-op except for type checking.
Parameters: doc (educe.rst_dt.document_plus.DocumentPlus) – Rich representation of the document.
Returns: doc – Rich representation of the document.
Return type: educe.rst_dt.document_plus.DocumentPlus
-
fit
(raw_documents, y=None)¶ Learn a vocabulary dictionary of all features from the documents
-
fit_transform
(raw_documents, y=None)¶ Learn the vocabulary dictionary and generate a feature matrix per document.
-
transform
(raw_documents)¶ Transform documents to a feature matrix.
Generate a feature matrix, one row per instance.
Parameters: raw_documents (TODO) – TODO
Yields: row ((row, (tgt, src))) – Feature vector for the next instance.
-
-
class
educe.rst_dt.learning.doc_vectorizer.
DocumentLabelExtractor
(instance_generator, ordered_pairs=True, unknown_label='__UNK__', labelset=None)¶ Bases:
object
Label extractor for the RST-DT treebank.
-
fixed_labelset_
¶ boolean – True if the labelset has been fixed, i.e. self has been fit.
-
labelset_
¶ dict – A mapping of labels to indices.
-
build_analyzer
()¶ Return a callable that extracts feature vectors from a doc
-
decode
(doc) Decode the input into a DocumentPlus.
Currently a no-op if doc is a DocumentPlus; raises an exception otherwise.
Parameters: doc (DocumentPlus) – Rich representation of the document.
Returns: doc – Rich representation of doc.
Return type: DocumentPlus
-
fit
(raw_documents)¶ Learn a labelset from the documents
-
fit_transform
(raw_documents)¶ Learn the label encoder and return a vector of labels
There is one label per instance extracted from raw_documents.
-
transform
(raw_documents)¶ Transform documents to a label vector
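The interplay between labelset_, fixed_labelset_ and unknown_label can be pictured with the following sketch. It works on plain label strings rather than documents, and both the class name and the choice of index 0 for the unknown label are assumptions made for illustration.

```python
class LabelEncoderSketch(object):
    """Illustrative label encoder; not the library's DocumentLabelExtractor."""

    def __init__(self, unknown_label='__UNK__'):
        self.unknown_label = unknown_label
        # Assumption: the unknown label is reserved index 0.
        self.labelset_ = {unknown_label: 0}
        self.fixed_labelset_ = False

    def fit(self, labels):
        """Learn a mapping from labels to indices."""
        for label in labels:
            self.labelset_.setdefault(label, len(self.labelset_))
        self.fixed_labelset_ = True
        return self

    def transform(self, labels):
        """Map labels to indices; unseen labels map to the unknown label."""
        unk = self.labelset_[self.unknown_label]
        return [self.labelset_.get(label, unk) for label in labels]


enc = LabelEncoderSketch().fit(['elaboration', 'attribution', 'elaboration'])
encoded = enc.transform(['attribution', 'contrast'])
```

In this sketch, 'contrast' was never seen during fit, so it is encoded with the index of the unknown label.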
-
-
educe.rst_dt.learning.doc_vectorizer.
re_emit
(feats, suff)¶ Re-emit feats with suff appended to each feature name
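A minimal sketch of this behaviour, assuming feats is an iterable of (feature name, feature value) pairs and that the function is a generator; the name `re_emit_sketch` and the sample suffix are hypothetical.

```python
def re_emit_sketch(feats, suff):
    """Yield each (name, value) pair with suff appended to the name."""
    for name, value in feats:
        yield (name + suff, value)


renamed = list(re_emit_sketch([('len', 3), ('pos', 'NN')], '_EDU1'))
```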
educe.rst_dt.learning.features module
Feature extraction library functions for the RST-DT corpus
-
educe.rst_dt.learning.features.
build_doc_preprocessor
()¶ Build the preprocessor for feature extraction in each EDU of doc
-
educe.rst_dt.learning.features.
build_edu_feature_extractor
()¶ Build the feature extractor for single EDUs
-
educe.rst_dt.learning.features.
build_pair_feature_extractor
()¶ Build the feature extractor for pairs of EDUs
TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names
-
educe.rst_dt.learning.features.
combine_features
(feats_g, feats_d, feats_gd)¶ Generate features by taking a (linear) combination of features.
I suspect these do not have a great impact, if any, on results.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns: cf – combined features
Return type: dict(feat_name, feat_val)
-
educe.rst_dt.learning.features.
extract_pair_gap
(doc, edu_info1, edu_info2)¶ Document tuple features
POS tag features on EDU pairs
-
educe.rst_dt.learning.features.
extract_pair_raw_word
(doc, edu_info1, edu_info2)¶ raw word features on EDU pairs
-
educe.rst_dt.learning.features.
extract_single_ptb_token_pos
(doc, edu_info, para_info)¶ POS features on PTB tokens for the EDU
-
educe.rst_dt.learning.features.
extract_single_ptb_token_word
(doc, edu_info, para_info)¶ word features on PTB tokens for the EDU
-
educe.rst_dt.learning.features.
extract_single_raw_word
(doc, edu_info, para_info)¶ raw word features for the EDU
-
educe.rst_dt.learning.features.
product_features
(feats_g, feats_d, feats_gd)¶ Generate features by taking the product of features.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns: pf – product features
Return type: dict(feat_name, feat_val)
educe.rst_dt.learning.features_dev module
Experimental features.
-
class
educe.rst_dt.learning.features_dev.
LecsieFeats
(lecsie_data_dir)¶ Bases:
object
Extract Lecsie features from each pair of EDUs
-
fit
(edu_pairs, y=None)¶ Fit the feature extractor.
Currently a no-op.
Parameters: - edu_pairs (TODO) – TODO
- y (TODO, optional) – TODO
Returns: self – TODO
Return type: TODO
-
transform
(edu_pairs)¶ Extract lecsie features for pairs of EDUs.
This is a generator.
Parameters: edu_pairs (TODO) – TODO
Returns: res – TODO
Return type: TODO
-
-
educe.rst_dt.learning.features_dev.
build_doc_preprocessor
()¶ Build the preprocessor for feature extraction in each EDU of doc
-
educe.rst_dt.learning.features_dev.
build_edu_feature_extractor
()¶ Build the feature extractor for single EDUs
-
educe.rst_dt.learning.features_dev.
build_pair_feature_extractor
(lecsie_data_dir=None)¶ Build the feature extractor for pairs of EDUs
TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names
-
educe.rst_dt.learning.features_dev.
combine_features
(feats_g, feats_d, feats_gd)¶ Generate features by taking a (linear) combination of features.
I suspect these do not have a great impact, if any, on results.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns: cf – combined features
Return type: dict(feat_name, feat_val)
-
educe.rst_dt.learning.features_dev.
extract_pair_doc
(doc, edu_info1, edu_info2, edu_info_bwn)¶ Document-level tuple features
-
educe.rst_dt.learning.features_dev.
extract_pair_para
(doc, edu_info1, edu_info2, edu_info_bwn)¶ Paragraph tuple features
-
educe.rst_dt.learning.features_dev.
extract_pair_sent
(doc, edu_info1, edu_info2, edu_info_bwn)¶ Sentence tuple features
-
educe.rst_dt.learning.features_dev.
extract_pair_syntax
(doc, edu_info1, edu_info2, edu_info_bwn)¶ syntactic features for the pair of EDUs
-
educe.rst_dt.learning.features_dev.
extract_single_brown
(doc, edu_info, para_info)¶ Brown cluster features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_length
(doc, edu_info, para_info) Length features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_para
(doc, edu_info, para_info)¶ paragraph features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_pdtb_markers
(doc, edu_info, para_info)¶ Features on the presence of PDTB discourse markers in the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_pos
(doc, edu_info, para_info)¶ POS features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_sentence
(doc, edu_info, para_info)¶ Sentence features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_syntax
(doc, edu_info, para_info)¶ syntactic features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_typo
(doc, edu_info, para_info)¶ typographical features for the EDU
-
educe.rst_dt.learning.features_dev.
extract_single_word
(doc, edu_info, para_info)¶ word features for the EDU
-
educe.rst_dt.learning.features_dev.
is_title_cased
(tok_seq)¶ True if a sequence of tokens is title-cased
-
educe.rst_dt.learning.features_dev.
is_upper_entire
(tok_seq)¶ True if a sequence is fully upper-cased
-
educe.rst_dt.learning.features_dev.
is_upper_init
(tok_seq)¶ True if a sequence starts with two upper-cased tokens
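The three typographical predicates above could look roughly like this, assuming tok_seq is a list of plain strings (the library may pass token objects instead). The `_sketch` names are hypothetical, and treating an empty sequence as False is an assumption.

```python
def is_title_cased_sketch(tok_seq):
    """True if every token is title-cased."""
    return bool(tok_seq) and all(tok.istitle() for tok in tok_seq)


def is_upper_entire_sketch(tok_seq):
    """True if every token is upper-cased."""
    return bool(tok_seq) and all(tok.isupper() for tok in tok_seq)


def is_upper_init_sketch(tok_seq):
    """True if the sequence starts with two upper-cased tokens."""
    return len(tok_seq) >= 2 and all(tok.isupper() for tok in tok_seq[:2])


title = is_title_cased_sketch(['New', 'York'])
shouting = is_upper_entire_sketch(['NEW', 'YORK'])
dateline = is_upper_init_sketch(['WASHINGTON', 'DC', 'said'])
```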
-
educe.rst_dt.learning.features_dev.
product_features
(feats_g, feats_d, feats_gd)¶ Generate features by taking the product of features.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns: pf – product features
Return type: dict(feat_name, feat_val)
-
educe.rst_dt.learning.features_dev.
split_feature_space
(feats_g, feats_d, feats_gd, keep_original=False, split_criterion='dir')¶ Split feature space on a criterion.
Currently supported criteria are:
- ‘dir’: directionality of attachment,
- ‘sent’: intra/inter-sentential,
- ‘dir_sent’: directionality + intra/inter-sentential.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
- keep_original (boolean, default=False) – whether to keep or replace the original features with the derived split features
- split_criterion (string) – feature(s) on which to split the feature space, options are ‘dir’ for directionality of attachment, ‘sent’ for intra/inter sentential, ‘dir_sent’ for their conjunction
Returns: feats_g, feats_d, feats_gd – dicts of features with their copies
Return type: (dict(feat_name, feat_val))
Notes
This function should probably be generalized and moved to a more relevant place.
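One way to picture splitting a feature space is to copy each feature under a name that embeds the value of the criterion, so that a linear model can learn separate weights per case. The sketch below applies this to a single feature dict; the helper name, the naming scheme and the way the criterion value is passed in are assumptions, not the library's implementation.

```python
def split_on_criterion_sketch(feats, crit_val, keep_original=False):
    """Copy each feature under a name suffixed with the criterion value.

    feats is a dict(feat_name, feat_val); crit_val is the value taken by
    the split criterion for this instance (e.g. 'left' or 'right' for 'dir').
    """
    split = {'{}_{}'.format(name, crit_val): val
             for name, val in feats.items()}
    if keep_original:
        # Keep the unsplit features alongside their suffixed copies.
        split.update(feats)
    return split


replaced = split_on_criterion_sketch({'len': 3}, 'right')
kept = split_on_criterion_sketch({'len': 3}, 'right', keep_original=True)
```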
-
educe.rst_dt.learning.features_dev.
token_filter_li2014
(token)¶ Token filter defined in Li et al.’s parser.
This filter only applies to tagged tokens.
educe.rst_dt.learning.features_li2014 module
Partial re-implementation of the feature extraction procedure used in [li2014text] for discourse dependency parsing on the RST-DT corpus.
[li2014text] Li, S., Wang, L., Cao, Z., & Li, W. (2014). Text-level discourse dependency parsing. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Vol. 1, pp. 25-35). http://www.aclweb.org/anthology/P/P14/P14-1003.pdf
-
educe.rst_dt.learning.features_li2014.
build_doc_preprocessor
()¶ Build the preprocessor for feature extraction in each EDU of doc
-
educe.rst_dt.learning.features_li2014.
build_edu_feature_extractor
()¶ Build the feature extractor for single EDUs
-
educe.rst_dt.learning.features_li2014.
build_pair_feature_extractor
()¶ Build the feature extractor for pairs of EDUs
TODO: properly emit features on single EDUs; they are already stored in sf_cache, but under (slightly) different names
-
educe.rst_dt.learning.features_li2014.
combine_features
(feats_g, feats_d, feats_gd)¶ Generate features by taking a (linear) combination of features.
I suspect these do not have a great impact, if any, on results.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns: cf – combined features
Return type: dict(feat_name, feat_val)
-
educe.rst_dt.learning.features_li2014.
extract_pair_length
(doc, edu_info1, edu_info2) Length tuple features
-
educe.rst_dt.learning.features_li2014.
extract_pair_para
(doc, edu_info1, edu_info2)¶ Paragraph tuple features
-
educe.rst_dt.learning.features_li2014.
extract_pair_pos
(doc, edu_info1, edu_info2)¶ POS tuple features
-
educe.rst_dt.learning.features_li2014.
extract_pair_sent
(doc, edu_info1, edu_info2)¶ Sentence tuple features
-
educe.rst_dt.learning.features_li2014.
extract_pair_word
(doc, edu_info1, edu_info2)¶ word tuple features
-
educe.rst_dt.learning.features_li2014.
extract_single_length
(doc, edu_info, para_info) Length features for the EDU
-
educe.rst_dt.learning.features_li2014.
extract_single_para
(doc, edu_info, para_info)¶ paragraph features for the EDU
-
educe.rst_dt.learning.features_li2014.
extract_single_pos
(doc, edu_info, para_info)¶ POS features for the EDU
-
educe.rst_dt.learning.features_li2014.
extract_single_sentence
(doc, edu_info, para_info)¶ Sentence features for the EDU
-
educe.rst_dt.learning.features_li2014.
extract_single_syntax
(doc, edu_info, para_info)¶ syntactic features for the EDU
-
educe.rst_dt.learning.features_li2014.
extract_single_word
(doc, edu_info, para_info)¶ word features for the EDU
-
educe.rst_dt.learning.features_li2014.
get_syntactic_labels
(doc, edu_info)¶ Syntactic labels for this EDU
-
educe.rst_dt.learning.features_li2014.
product_features
(feats_g, feats_d, feats_gd)¶ Generate features by taking the product of features.
Parameters: - feats_g (dict(feat_name, feat_val)) – features of the gov EDU
- feats_d (dict(feat_name, feat_val)) – features of the dep EDU
- feats_gd (dict(feat_name, feat_val)) – features of the (gov, dep) edge
Returns: pf – product features
Return type: dict(feat_name, feat_val)
-
educe.rst_dt.learning.features_li2014.
token_filter_li2014
(token)¶ Token filter defined in Li et al.’s parser.
This filter only applies to tagged tokens.