educe.stac.learning package¶
Helpers for machine-learning tasks
Submodules¶
educe.stac.learning.addressee module¶
EDU addressee prediction
-
educe.stac.learning.addressee.
guess_addressees_for_edu
(contexts, players, edu)¶ return a set of possible addressees for the given EDU or None if unclear
At the moment, the basis for our guesses is very crude: we simply guess that we have an addressee if the EDU ends or starts with their name.
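The name-at-the-edge heuristic described above can be sketched on plain strings. This is an illustration only, not the educe implementation: the function `guess_addressees`, its two-argument signature, and the punctuation stripping are assumptions made for the example (the real function also takes `contexts` and works on EDU objects).

```python
def guess_addressees(players, text):
    """Return the set of players named at the start or end of text, or None.

    Hypothetical simplification: `players` is a set of names and `text`
    is the raw EDU text.
    """
    # Assumption: strip common punctuation and ignore case when comparing.
    words = [w.strip(".,!?:;").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return None
    edges = {words[0], words[-1]}
    found = {p for p in players if p.lower() in edges}
    # An empty set means we could not tell, so report None instead.
    return found or None
```

Returning `None` rather than an empty set mirrors the "None if unclear" contract stated in the description.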
-
educe.stac.learning.addressee.
is_emoticon
(token)¶ True if the token is tagged as an emoticon
-
educe.stac.learning.addressee.
is_preposition
(token)¶ True if the token is tagged as a preposition
-
educe.stac.learning.addressee.
is_punct
(token)¶ True if the token is tagged as punctuation
-
educe.stac.learning.addressee.
is_verb
(token)¶ True if the token is tagged as a verb
educe.stac.learning.doc_vectorizer module¶
This submodule implements document vectorizers
-
class
educe.stac.learning.doc_vectorizer.
DialogueActVectorizer
(instance_generator, labels)¶ Bases:
object
Dialogue act extractor for the STAC corpus.
-
transform
(raw_documents)¶ Learn the label encoder and return a vector of labels
There is one label per instance extracted from raw_documents.
Parameters: raw_documents (list of educe.stac.fusion.Dialogue) – List of dialogues. Yields: inst_lbl (int) – (Integer) label for the next instance.
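As a rough sketch of the "one integer label per instance" behaviour, the toy class below maps string labels to integer codes and yields them lazily. The class `ToyLabelVectorizer`, the `(instance, label)` pair format, and the zero-based numbering are all assumptions for illustration; the actual encoding used by educe may differ.

```python
class ToyLabelVectorizer:
    """Toy stand-in that maps string labels to integer codes."""

    def __init__(self, instance_generator, labels):
        self.instance_generator = instance_generator
        # Assumption: codes follow the order of `labels`, starting at 0.
        self.labelset_ = {label: code for code, label in enumerate(labels)}

    def transform(self, raw_documents):
        """Yield one integer label per instance found in raw_documents."""
        for doc in raw_documents:
            for _instance, label in self.instance_generator(doc):
                yield self.labelset_[label]


# Hypothetical documents: each is a list of (instance, label) pairs,
# so the built-in `iter` is enough as an instance generator.
docs = [[("edu1", "Offer"), ("edu2", "Accept")], [("edu3", "Offer")]]
vectorizer = ToyLabelVectorizer(iter, ["Offer", "Accept"])
codes = list(vectorizer.transform(docs))
```

Because `transform` is a generator, labels are produced one at a time, which matches the "Yields" wording in the parameter description.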
-
-
class
educe.stac.learning.doc_vectorizer.
LabelVectorizer
(instance_generator, labels, zero=False)¶ Bases:
object
Label extractor for the STAC corpus.
-
transform
(raw_documents)¶ Learn the label encoder and return a vector of labels
There is one label per instance extracted from raw_documents.
Parameters: raw_documents (list of ?) – Raw documents. Yields: inst_lbl (int) – (Integer) label for the next instance.
-
educe.stac.learning.features module¶
Feature extraction library functions for STAC corpora. The feature extraction script (rel-info) is a lightweight frontend to this library
-
exception
educe.stac.learning.features.
CorpusConsistencyException
(msg)¶ Bases:
exceptions.Exception
Exceptions which arise if one of our expectations about the corpus data is violated; in short, weird things we don’t know how to handle. We should avoid using this for things which are definitely bugs in the code, and not just weird things in the corpus we didn’t know how to handle.
-
class
educe.stac.learning.features.
DocEnv
(inputs, current, sf_cache)¶ Bases:
tuple
-
current
¶ Alias for field number 1
-
inputs
¶ Alias for field number 0
-
sf_cache
¶ Alias for field number 2
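The "Alias for field number N" entries are the standard docstrings of named tuple fields: each attribute is just a named view of a tuple position. A minimal sketch with placeholder values (the values themselves are assumptions; only the field names come from the signature above):

```python
from collections import namedtuple

# Same field order as the documented signature.
DocEnv = namedtuple("DocEnv", ["inputs", "current", "sf_cache"])

# Placeholder values, purely for illustration.
env = DocEnv(inputs="all-inputs", current="one-document", sf_cache={})

# Access by name and by position refer to the same object.
same_current = env.current is env[1]
same_cache = env.sf_cache is env[2]
```

The same reading applies to the other tuple-based classes in this module, such as DocumentPlus, EduGap, FeatureInput and VerbNetEntry.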
-
-
class
educe.stac.learning.features.
DocumentPlus
(key, doc, unitdoc, players, parses)¶ Bases:
tuple
-
doc
¶ Alias for field number 1
-
key
¶ Alias for field number 0
-
parses
¶ Alias for field number 4
-
players
¶ Alias for field number 3
-
unitdoc
¶ Alias for field number 2
-
-
class
educe.stac.learning.features.
EduGap
(sf_cache, inner_edus, turns_between)¶ Bases:
tuple
-
inner_edus
¶ Alias for field number 1
-
sf_cache
¶ Alias for field number 0
-
turns_between
¶ Alias for field number 2
-
-
class
educe.stac.learning.features.
FeatureCache
(inputs, current)¶ Bases:
dict
Cache for single edu features. Retrieving an item from the cache lazily computes/memoises the single EDU features for it.
-
expire
(edu)¶ Remove an edu from the cache if it’s in there
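One common way to get a dict that computes and memoises values on first access is the `__missing__` hook. The sketch below shows that technique under stated assumptions: `LazyFeatureCache` and its `compute` argument are hypothetical, and the real FeatureCache is constructed from `inputs` and `current` instead.

```python
class LazyFeatureCache(dict):
    """Dict that computes a value the first time a key is looked up."""

    def __init__(self, compute):
        super().__init__()
        # Hypothetical: a function from an EDU to its feature vector.
        self._compute = compute

    def __missing__(self, edu):
        # Called by dict.__getitem__ only when the key is absent.
        value = self._compute(edu)
        self[edu] = value
        return value

    def expire(self, edu):
        """Remove an EDU from the cache if it is in there."""
        self.pop(edu, None)


calls = []
cache = LazyFeatureCache(lambda edu: calls.append(edu) or {"id": edu})
first = cache["edu1"]
second = cache["edu1"]  # served from the cache, no recomputation
```

After `cache.expire("edu1")`, the next lookup recomputes the features, which is useful when the underlying annotation has changed.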
-
-
class
educe.stac.learning.features.
FeatureInput
(corpus, postags, parses, lexicons, pdtb_lex, verbnet_entries, inquirer_lex)¶ Bases:
tuple
-
corpus
¶ Alias for field number 0
-
inquirer_lex
¶ Alias for field number 6
-
lexicons
¶ Alias for field number 3
-
parses
¶ Alias for field number 2
-
pdtb_lex
¶ Alias for field number 4
-
postags
¶ Alias for field number 1
-
verbnet_entries
¶ Alias for field number 5
-
-
class
educe.stac.learning.features.
InquirerLexKeyGroup
(lexicon)¶ Bases:
educe.learning.keys.KeyGroup
One feature per Inquirer lexicon class
-
fill
(current, edu, target=None)¶ See SingleEduSubgroup
-
classmethod
key_prefix
()¶ All feature keys in this lexicon should start with this string
-
mk_field
(entry)¶ From verb class to feature key
-
mk_fields
()¶ Feature name for each relation in the lexicon
-
-
class
educe.stac.learning.features.
LexKeyGroup
(lexicon)¶ Bases:
educe.learning.keys.KeyGroup
The idea here is to provide a feature per lexical class in the lexicon entry
-
fill
(current, edu, target=None)¶ See SingleEduSubgroup
-
key_prefix
()¶ Common CSV header name prefix to all columns based on this particular lexicon
-
mk_field
(cname, subclass=None)¶ For a given lexical class, return the name of its feature in the CSV file
-
mk_fields
()¶ CSV field names for each entry/class in the lexicon
-
-
class
educe.stac.learning.features.
LexWrapper
(key, filename, classes)¶ Bases:
object
Configuration options for a given lexicon: where to find it, what to call it, what sorts of results to return
-
read
(lexdir)¶ Read and store the lexicon as a mapping from words to their classes
-
-
class
educe.stac.learning.features.
MergedLexKeyGroup
(inputs)¶ Bases:
educe.learning.keys.MergedKeyGroup
Single-EDU features based on lexical lookup.
-
fill
(current, edu, target=None)¶ See SingleEduSubgroup
-
-
class
educe.stac.learning.features.
PairKeys
(inputs, sf_cache=None)¶ Bases:
educe.learning.keys.MergedKeyGroup
Features for pairs of EDUs
-
fill
(current, edu1, edu2, target=None)¶ See PairSubgroup
-
one_hot_values_gen
(suffix='')¶
-
-
class
educe.stac.learning.features.
PairSubgroup
(description, keys)¶ Bases:
educe.learning.keys.KeyGroup
Abstract keygroup for subgroups of the merged PairKeys. We use these subgroup classes to help provide modularity, to capture the idea that the bits of code that define a set of related feature vector keys should go with the bits of code that also fill them out
-
fill
(current, edu1, edu2, target=None)¶ Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged key group, you may find it desirable to fill out the merged group instead)
-
-
class
educe.stac.learning.features.
PairSubgroup_Gap
(sf_cache)¶ Bases:
educe.stac.learning.features.PairSubgroup
Features related to the combined surrounding context of the two EDUs
-
fill
(current, edu1, edu2, target=None)¶
-
-
class
educe.stac.learning.features.
PairSubgroup_Tuple
(inputs, sf_cache)¶ Bases:
educe.stac.learning.features.PairSubgroup
artificial tuple features
-
fill
(current, edu1, edu2, target=None)¶
-
-
class
educe.stac.learning.features.
PdtbLexKeyGroup
(lexicon)¶ Bases:
educe.learning.keys.KeyGroup
One feature per PDTB marker lexicon class
-
fill
(current, edu, target=None)¶ See SingleEduSubgroup
-
classmethod
key_prefix
()¶ All feature keys in this lexicon should start with this string
-
mk_field
(rel)¶ From relation name to feature key
-
mk_fields
()¶ Feature name for each relation in the lexicon
-
-
class
educe.stac.learning.features.
SingleEduKeys
(inputs)¶ Bases:
educe.learning.keys.MergedKeyGroup
Features for a single EDU
-
fill
(current, edu, target=None)¶ See SingleEduSubgroup.fill
-
-
class
educe.stac.learning.features.
SingleEduSubgroup
(description, keys)¶ Bases:
educe.learning.keys.KeyGroup
Abstract keygroup for subgroups of the merged SingleEduKeys. We use these subgroup classes to help provide modularity, to capture the idea that the bits of code that define a set of related feature vector keys should go with the bits of code that also fill them out
-
fill
(current, edu, target=None)¶ Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged key group, you may find it desirable to fill out the merged group instead)
This defaults to _magic_fill if you don’t implement it.
-
-
class
educe.stac.learning.features.
SingleEduSubgroup_Chat
¶ Bases:
educe.stac.learning.features.SingleEduSubgroup
Single-EDU features based on the EDU’s relationship with the chat structure (eg turns, dialogues).
-
class
educe.stac.learning.features.
SingleEduSubgroup_Parser
¶ Bases:
educe.stac.learning.features.SingleEduSubgroup
Single-EDU features that come out of a syntactic parser.
-
class
educe.stac.learning.features.
SingleEduSubgroup_Punct
¶ Bases:
educe.stac.learning.features.SingleEduSubgroup
punctuation features
-
class
educe.stac.learning.features.
SingleEduSubgroup_Token
¶ Bases:
educe.stac.learning.features.SingleEduSubgroup
word/token-based features
-
class
educe.stac.learning.features.
VerbNetEntry
(classname, lemmas)¶ Bases:
tuple
-
classname
¶ Alias for field number 0
-
lemmas
¶ Alias for field number 1
-
-
class
educe.stac.learning.features.
VerbNetLexKeyGroup
(ventries)¶ Bases:
educe.learning.keys.KeyGroup
One feature per VerbNet lexicon class
-
fill
(current, edu, target=None)¶ See SingleEduSubgroup
-
classmethod
key_prefix
()¶ All feature keys in this lexicon should start with this string
-
mk_field
(ventry)¶ From verb class to feature key
-
mk_fields
()¶ Feature name for each relation in the lexicon
-
-
educe.stac.learning.features.
clean_chat_word
(token)¶ Given a word and its postag (educe PosTag representation) return a somewhat tidied up version of the word.
- Runs of the same letter longer than three are shortened to length three
- Letters are lowercased
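The two tidying rules can be sketched with a regular expression on a plain string. This is an assumption-laden illustration: `clean_word` is a hypothetical helper, and the real function receives an educe PosTag token rather than a string.

```python
import re

# A letter (any word character that is not a digit or underscore)
# followed by three or more repeats of itself, i.e. a run of 4+.
_LONG_RUN = re.compile(r"([^\W\d_])\1{3,}")


def clean_word(word):
    """Lowercase a word and cap runs of a repeated letter at three."""
    return _LONG_RUN.sub(r"\1\1\1", word.lower())
```

Lowercasing first means that mixed-case runs such as "NoOoOo" are also collapsed, since the backreference compares characters exactly.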
-
educe.stac.learning.features.
clean_dialogue_act
(act)¶ Knock out temporary markers used during corpus annotation
-
educe.stac.learning.features.
dialogue_act_pairs
(current, cache, edu1, edu2)¶ tuple of dialogue acts for both EDUs
-
educe.stac.learning.features.
edu_position_in_turn
(_, edu)¶ relative position of the EDU in the turn
-
educe.stac.learning.features.
edu_text_feature
(wrapped)¶ Lift a text based feature into a standard single EDU one
(String -> a) -> ((Current, Edu) -> a)
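The type signature describes a decorator: it takes a function on text and returns a function on a (current document, EDU) pair. A minimal sketch, assuming a hypothetical `Doc` with a `text` string and a hypothetical `Edu` holding character offsets (educe's own document and EDU classes are richer than this):

```python
from collections import namedtuple
from functools import wraps

# Hypothetical stand-ins for the current document and an EDU.
Doc = namedtuple("Doc", ["text"])
Edu = namedtuple("Edu", ["start", "end"])


def lift_text_feature(wrapped):
    """Turn a text -> value function into a (current, edu) -> value one."""
    @wraps(wrapped)
    def inner(current, edu):
        # Assumption: the EDU text is the document text at the EDU's span.
        return wrapped(current.text[edu.start:edu.end])
    return inner


@lift_text_feature
def ends_with_qmark(text):
    return text.rstrip().endswith("?")


doc = Doc("anyone got wood? nope")
question = ends_with_qmark(doc, Edu(0, 16))
answer = ends_with_qmark(doc, Edu(17, 21))
```

Keeping the text-level function separate makes it easy to test on plain strings, with the lifting step handling the document lookup.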
-
educe.stac.learning.features.
emoticons
(tokens)¶ Given some tokens, return just those which are emoticons
-
educe.stac.learning.features.
enclosed_lemmas
(span, parses)¶ Given a span and a list of parses, return any lemmas that are within that span
-
educe.stac.learning.features.
enclosed_trees
(span, trees)¶ Return the biggest (sub)trees in trees that are enclosed in the span
-
educe.stac.learning.features.
ends_with_bang
(current, edu)¶ if the EDU text ends with ‘!’
-
educe.stac.learning.features.
ends_with_qmark
(current, edu)¶ if the EDU text ends with ‘?’
-
educe.stac.learning.features.
extract_pair_features
(inputs, stage)¶ Extraction for all relevant pairs in a document (generator)
-
educe.stac.learning.features.
extract_single_features
(inputs, stage)¶ Return a dictionary for each EDU
-
educe.stac.learning.features.
feat_annotator
(current, edu1, edu2)¶ annotator for the subdoc
-
educe.stac.learning.features.
feat_end
(_, edu)¶ text span end
-
educe.stac.learning.features.
feat_has_emoticons
(_, edu)¶ if the EDU has emoticon-tagged tokens
-
educe.stac.learning.features.
feat_id
(_, edu)¶ some sort of unique identifier for the EDU
-
educe.stac.learning.features.
feat_is_emoticon_only
(_, edu)¶ if the EDU consists solely of an emoticon
-
educe.stac.learning.features.
feat_start
(_, edu)¶ text span start
-
educe.stac.learning.features.
get_players
(inputs)¶ Return a dictionary mapping each document to the set of players in that document
-
educe.stac.learning.features.
has_FOR_np
(current, edu)¶ if the EDU has the pattern IN(for).. NP
-
educe.stac.learning.features.
has_correction_star
(current, edu)¶ if the EDU begins with a ‘*’ but does not contain others
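In chat, a leading asterisk usually marks a correction of an earlier message (for example "*wheat"), whereas paired asterisks mark emphasis or actions. The check can be sketched on plain text as follows; `starts_with_lone_star` is a hypothetical name, and the real feature takes `(current, edu)` rather than a string.

```python
def starts_with_lone_star(text):
    """True if text begins with '*' and contains no other '*'."""
    return text.startswith("*") and "*" not in text[1:]
```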
-
educe.stac.learning.features.
has_inner_question
(current, gap, _edu1, _edu2)¶ if there is an intervening EDU that is a question
-
educe.stac.learning.features.
has_one_of_words
(sought, tokens, norm=<function <lambda>>)¶ Given a set of words and a collection of tokens, return True if the tokens contain a word that matches one of the desired words, modulo some minor normalisations like lowercasing.
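A sketch of this kind of normalised membership test, using plain strings as tokens. The name `contains_any_word` and the lowercasing default are assumptions for the example; in educe the tokens are tagged token objects rather than strings.

```python
def contains_any_word(sought, tokens, norm=lambda word: word.lower()):
    """True if any token matches one of the sought words after norm."""
    # Normalise both sides so that "Sorry" matches "sorry".
    normalised = {norm(word) for word in sought}
    return any(norm(token) in normalised for token in tokens)
```

Passing a different `norm` (for instance one that also strips punctuation) changes what counts as a match without touching the search logic.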
-
educe.stac.learning.features.
has_pdtb_markers
(markers, tokens)¶ Given a sequence of tagged tokens, return True if any of the given PDTB markers appears within the tokens
-
educe.stac.learning.features.
has_player_name_exact
(current, edu)¶ if the EDU text has a player name in it
-
educe.stac.learning.features.
has_player_name_fuzzy
(current, edu)¶ if the EDU has a word that sounds like a player name
-
educe.stac.learning.features.
is_just_emoticon
(tokens)¶ Return true if a sequence of tokens consists of a single emoticon
-
educe.stac.learning.features.
is_nplike
(anno)¶ is some sort of NP annotation from a parser
-
educe.stac.learning.features.
is_question
(current, edu)¶ if the EDU is (or contains) a question
-
educe.stac.learning.features.
is_question_pairs
(current, cache, edu1, edu2)¶ boolean tuple: if each EDU is a question
-
educe.stac.learning.features.
lemma_subject
(*args, **kwargs)¶ the lemma corresponding to the subject of this EDU
-
educe.stac.learning.features.
lexical_markers
(lclass, tokens)¶ Given a dictionary (words to categories) and a text span, return all the categories of words that appear in that span.
Note that for now we are doing our own white-space based tokenisation, but it could make sense to use a different source of tokens instead
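The lookup can be sketched with whitespace tokenisation, as the note mentions. The function `lexical_categories`, the lowercasing step, and the shape of the lexicon (word to set of categories) are assumptions for illustration.

```python
def lexical_categories(lexicon, text):
    """Return every category whose words occur in the text."""
    # Whitespace tokenisation; lowercasing is an added assumption.
    words = text.lower().split()
    return {cat for word in words for cat in lexicon.get(word, ())}


# Hypothetical lexicon mapping words to sets of categories.
lexicon = {"sorry": {"apology"}, "thanks": {"gratitude", "politeness"}}
found = lexical_categories(lexicon, "Thanks but sorry")
```

A limitation of whitespace splitting is that "sorry!" does not match "sorry", which is one reason a proper token source could be preferable.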
-
educe.stac.learning.features.
map_topdown
(good, prunable, trees)¶ Do topdown search on all these trees, concatenate results.
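One plausible reading of a top-down search with `good` and `prunable` predicates is: return a subtree as soon as it is good, stop descending when it is prunable, and otherwise recurse into its children. The sketch below implements that reading on a hypothetical `Node` type; the helper `topdown` and the exact semantics are assumptions, not the educe code.

```python
from collections import namedtuple

# Hypothetical tree node: a label plus a list of child nodes.
Node = namedtuple("Node", ["label", "children"])


def topdown(good, prunable, tree):
    """Return the topmost good subtrees, skipping prunable branches."""
    if good(tree):
        return [tree]
    if prunable(tree):
        return []
    return map_topdown(good, prunable, tree.children)


def map_topdown(good, prunable, trees):
    """Search every tree top-down and concatenate the results."""
    results = []
    for tree in trees:
        results.extend(topdown(good, prunable, tree))
    return results


tree = Node("S", [
    Node("NP", [Node("NP", [])]),
    Node("VP", [Node("PP", [Node("NP", [])])]),
])
hits = map_topdown(lambda t: t.label == "NP",
                   lambda t: t.label == "PP",
                   [tree])
```

Here only the outer NP is returned: the nested NP is below a match, and the NP under PP sits in a pruned branch.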
-
educe.stac.learning.features.
mk_env
(inputs, people, key)¶ Pre-process and bundle up a representation of the current document
-
educe.stac.learning.features.
mk_envs
(inputs, stage)¶ Generate an environment for each document in the corpus within the given stage.
The environment pools together all the information we have on a single document
-
educe.stac.learning.features.
mk_high_level_dialogues
(inputs, stage)¶ Generate all relevant EDU pairs for a document (generator)
-
educe.stac.learning.features.
mk_is_interesting
(args, single)¶ Return a function that filters corpus keys to pick out the ones we specified on the command line
We have two cases here: for pair extraction, we just want to grab the units and if possible the discourse stage. In live mode, there won’t be a discourse stage, but that’s fine because we can just fall back on units.
For single extraction (dialogue acts), we’ll also want to grab the units stage and fall back to unannotated when in live mode. This is made a bit trickier by the fact that unannotated does not have an annotator, so we have to accommodate that.
Phew.
It’s a bit specific to feature extraction in that here we are trying
-
educe.stac.learning.features.
num_edus_between
(_current, gap, _edu1, _edu2)¶ number of intervening EDUs (0 if adjacent)
-
educe.stac.learning.features.
num_nonling_tstars_between
(_current, gap, _edu1, _edu2)¶ number of non-linguistic turn-stars between EDUs
-
educe.stac.learning.features.
num_speakers_between
(_current, gap, _edu1, _edu2)¶ number of distinct speakers in intervening EDUs
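The counting behind features like num_edus_between and num_speakers_between can be sketched on a flat list of speakers, one per EDU in document order. The function `gap_counts` and this list representation are assumptions; the real features read these values from an EduGap.

```python
def gap_counts(speakers, i, j):
    """Count the EDUs and distinct speakers strictly between i and j."""
    lo, hi = sorted((i, j))
    # EDUs strictly between the two positions; empty if adjacent.
    inner = speakers[lo + 1:hi]
    return {
        "num_edus_between": len(inner),
        "num_speakers_between": len(set(inner)),
    }


speakers = ["amy", "bob", "cat", "bob", "amy"]
far = gap_counts(speakers, 0, 4)
adjacent = gap_counts(speakers, 1, 2)
```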
-
educe.stac.learning.features.
num_tokens
(_, edu)¶ length of this EDU in tokens
-
educe.stac.learning.features.
player_addresees
(edu)¶ The set of people spoken to during an edu annotation. This excludes known non-players, like ‘All’, or ‘?’, or ‘Please choose...’.
-
educe.stac.learning.features.
players_for_doc
(corpus, kdoc)¶ Return the set of speakers/addressees associated with a document.
In STAC, documents are semi-arbitrarily cut into sub-documents for technical and possibly ergonomic reasons, ie. meaningless as far as we are concerned. So to find all speakers, we would have to search all the subdocuments of a single document.
(Corpus, String) -> Set String
-
educe.stac.learning.features.
position_in_dialogue
(_, edu)¶ relative position of the turn in the dialogue
-
educe.stac.learning.features.
position_in_game
(_, edu)¶ relative position of the turn in the game
-
educe.stac.learning.features.
position_of_speaker_first_turn
(edu)¶ Given an EDU context, determine the position of the first turn by that EDU’s speaker relative to other turns in that dialogue.
-
educe.stac.learning.features.
read_corpus_inputs
(args)¶ Read and filter the part of the corpus we want features for
-
educe.stac.learning.features.
read_pdtb_lexicon
(args)¶ Read and return the local PDTB discourse marker lexicon.
-
educe.stac.learning.features.
real_dialogue_act
(edu)¶ Given an EDU in the ‘discourse’ stage of the corpus, return its dialogue act from the ‘units’ stage
-
educe.stac.learning.features.
relation_dict
(doc, quiet=False)¶ Return the relations instances from a document in the form of an id pair to label dictionary
If there is more than one relation between a pair of EDUs we pick one of them arbitrarily and ignore the others
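A minimal sketch of building such an id-pair-to-label dictionary from hypothetical `(source_id, target_id, label)` triples. Keeping the first relation seen for a pair is an assumption made here; the description only says the choice is arbitrary.

```python
def relation_pairs(relations):
    """Map (source_id, target_id) to a single relation label."""
    result = {}
    for source_id, target_id, label in relations:
        # Assumption: the first label seen for a pair wins.
        result.setdefault((source_id, target_id), label)
    return result


relations = [
    ("e1", "e2", "Question-answer_pair"),
    ("e2", "e3", "Acknowledgement"),
    ("e1", "e2", "Elaboration"),  # duplicate pair, ignored
]
pairs = relation_pairs(relations)
```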
-
educe.stac.learning.features.
same_speaker
(current, _, edu1, edu2)¶ if both EDUs have the same speaker
-
educe.stac.learning.features.
same_turn
(current, _, edu1, edu2)¶ if both EDUs are in the same turn
-
educe.stac.learning.features.
speaker_already_spoken_in_dialogue
(_, edu)¶ if the speaker for this EDU is the same as that of a previous turn in the dialogue
-
educe.stac.learning.features.
speaker_id
(_, edu)¶ Get the speaker ID
-
educe.stac.learning.features.
speaker_started_the_dialogue
(_, edu)¶ if the speaker for this EDU is the same as that of the first turn in the dialogue
-
educe.stac.learning.features.
speakers_first_turn_in_dialogue
(_, edu)¶ position in the dialogue of the turn in which the speaker for this EDU first spoke
-
educe.stac.learning.features.
strip_cdus
(corpus, mode)¶ For all documents in a corpus, remove any CDUs and relink the document according to the desired mode. This mutates the corpus.
-
educe.stac.learning.features.
subject_lemmas
(span, trees)¶ Given a span and a list of dependency trees, return any lemmas which are marked as being some subject in that span
-
educe.stac.learning.features.
turn_follows_gap
(_, edu)¶ if the EDU turn number is > 1 + previous turn
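The comparison can be sketched on a list of turn numbers in document order. The helper `follows_gap` is hypothetical, and treating the first turn as never following a gap is an assumption; the real feature reads turn numbers from the EDU's context.

```python
def follows_gap(turn_numbers, index):
    """True if the turn at index skips at least one turn number."""
    if index == 0:
        # Assumption: the first turn has no predecessor, so no gap.
        return False
    return turn_numbers[index] > 1 + turn_numbers[index - 1]
```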
-
educe.stac.learning.features.
type_text
(wrapped)¶ Given a feature that emits text, clean up its output so that it works with a wide variety of CSV parsers
(a -> String) -> (a -> String)
-
educe.stac.learning.features.
word_first
(*args, **kwargs)¶ the first word in this EDU
-
educe.stac.learning.features.
word_last
(*args, **kwargs)¶ the last word in this EDU