educe.stac.learning package

Helpers for machine-learning tasks

Submodules

educe.stac.learning.addressee module

EDU addressee prediction

educe.stac.learning.addressee.guess_addressees_for_edu(contexts, players, edu)

Return a set of possible addressees for the given EDU, or None if unclear

At the moment, the basis for our guesses is very crude: we simply guess that we have an addressee if the EDU starts or ends with a player’s name
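As a rough illustration of that name-at-edge heuristic, here is a hypothetical sketch (the function name, punctuation handling, and return convention are assumptions, not the actual educe implementation):

```python
# Illustrative sketch only: guess addressees by checking whether a
# player's name appears as the first or last word of the EDU text.
def guess_addressees(edu_text, players):
    """Return the set of players whose name starts or ends the EDU
    text, or None if no such name is found."""
    words = edu_text.strip().rstrip('?!.,:;').split()
    if not words:
        return None
    # compare against the first word (minus a vocative comma/colon)
    # and the last word, case-insensitively
    edges = {words[0].rstrip(',:').lower(), words[-1].lower()}
    found = {p for p in players if p.lower() in edges}
    return found or None
```

For example, `guess_addressees("Tomm, do you want wood?", {"Tomm", "Sara"})` would pick out Tomm, while an EDU mentioning no player at either edge yields None.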

educe.stac.learning.addressee.is_emoticon(token)

True if the token is tagged as an emoticon

educe.stac.learning.addressee.is_preposition(token)

True if the token is tagged as a preposition

educe.stac.learning.addressee.is_punct(token)

True if the token is tagged as punctuation

educe.stac.learning.addressee.is_verb(token)

True if the token is tagged as a verb

educe.stac.learning.doc_vectorizer module

This submodule implements document vectorizers

class educe.stac.learning.doc_vectorizer.DialogueActVectorizer(instance_generator, labels)

Bases: object

Dialogue act extractor for the STAC corpus.

transform(raw_documents)

Learn the label encoder and return a vector of labels

There is one label per instance extracted from raw_documents.

Parameters: raw_documents (list of educe.stac.fusion.Dialogue) – List of dialogues.
Yields: inst_lbl (int) – Label for the next instance.
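To make the transform() contract concrete, here is a minimal toy analogue. It assumes labels are numbered from 1 in the order given and that the instance generator yields one act per instance; it is not the educe class itself:

```python
# Toy analogue of a dialogue-act vectorizer (illustrative only):
# transform() yields one integer label per extracted instance.
class ToyActVectorizer:
    def __init__(self, instance_generator, labels):
        self.instance_generator = instance_generator
        # assumption: labels are numbered from 1 in the order given
        self.labelset = {lbl: i for i, lbl in enumerate(labels, start=1)}

    def transform(self, raw_documents):
        for doc in raw_documents:
            for act in self.instance_generator(doc):
                yield self.labelset[act]
```

With labels `['Offer', 'Accept']` and a generator that yields each act in a document, a document containing Offer, Accept, Offer transforms to 1, 2, 1.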
class educe.stac.learning.doc_vectorizer.LabelVectorizer(instance_generator, labels, zero=False)

Bases: object

Label extractor for the STAC corpus.

transform(raw_documents)

Learn the label encoder and return a vector of labels

There is one label per instance extracted from raw_documents.

Parameters: raw_documents (list of ?) – Raw documents.
Yields: inst_lbl (int) – Label for the next instance.

educe.stac.learning.features module

Feature extraction library functions for STAC corpora. The feature extraction script (rel-info) is a lightweight frontend to this library

exception educe.stac.learning.features.CorpusConsistencyException(msg)

Bases: exceptions.Exception

Exceptions which arise when one of our expectations about the corpus data is violated: in short, weird things we don’t know how to handle. We should avoid using this for things which are definitely bugs in the code, reserving it for oddities in the corpus itself that we did not anticipate.

class educe.stac.learning.features.DocEnv(inputs, current, sf_cache)

Bases: tuple

current

Alias for field number 1

inputs

Alias for field number 0

sf_cache

Alias for field number 2

class educe.stac.learning.features.DocumentPlus(key, doc, unitdoc, players, parses)

Bases: tuple

doc

Alias for field number 1

key

Alias for field number 0

parses

Alias for field number 4

players

Alias for field number 3

unitdoc

Alias for field number 2

class educe.stac.learning.features.EduGap(sf_cache, inner_edus, turns_between)

Bases: tuple

inner_edus

Alias for field number 1

sf_cache

Alias for field number 0

turns_between

Alias for field number 2

class educe.stac.learning.features.FeatureCache(inputs, current)

Bases: dict

Cache for single edu features. Retrieving an item from the cache lazily computes/memoises the single EDU features for it.

expire(edu)

Remove an edu from the cache if it’s in there
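The lazy compute-and-memoise behaviour described above can be sketched with `dict.__missing__` (an illustrative stand-in; the real class computes single-EDU feature vectors rather than arbitrary values):

```python
# Lazy memoising dict: the first access computes and stores a value,
# later accesses reuse it; expire() forces recomputation next time.
class LazyCache(dict):
    def __init__(self, compute):
        super().__init__()
        self.compute = compute

    def __missing__(self, key):
        # called only when the key is absent; memoise the result
        value = self.compute(key)
        self[key] = value
        return value

    def expire(self, key):
        """Remove a key from the cache if it's in there"""
        self.pop(key, None)
```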

class educe.stac.learning.features.FeatureInput(corpus, postags, parses, lexicons, pdtb_lex, verbnet_entries, inquirer_lex)

Bases: tuple

corpus

Alias for field number 0

inquirer_lex

Alias for field number 6

lexicons

Alias for field number 3

parses

Alias for field number 2

pdtb_lex

Alias for field number 4

postags

Alias for field number 1

verbnet_entries

Alias for field number 5

class educe.stac.learning.features.InquirerLexKeyGroup(lexicon)

Bases: educe.learning.keys.KeyGroup

One feature per Inquirer lexicon class

fill(current, edu, target=None)

See SingleEduSubgroup

classmethod key_prefix()

All feature keys in this lexicon should start with this string

mk_field(entry)

From verb class to feature key

mk_fields()

Feature name for each relation in the lexicon

class educe.stac.learning.features.LexKeyGroup(lexicon)

Bases: educe.learning.keys.KeyGroup

The idea here is to provide a feature per lexical class in the lexicon entry

fill(current, edu, target=None)

See SingleEduSubgroup

key_prefix()

Common CSV header name prefix to all columns based on this particular lexicon

mk_field(cname, subclass=None)

For a given lexical class, return the name of its feature in the CSV file

mk_fields()

CSV field names for each entry/class in the lexicon

class educe.stac.learning.features.LexWrapper(key, filename, classes)

Bases: object

Configuration options for a given lexicon: where to find it, what to call it, what sorts of results to return

read(lexdir)

Read and store the lexicon as a mapping from words to their classes

class educe.stac.learning.features.MergedLexKeyGroup(inputs)

Bases: educe.learning.keys.MergedKeyGroup

Single-EDU features based on lexical lookup.

fill(current, edu, target=None)

See SingleEduSubgroup

class educe.stac.learning.features.PairKeys(inputs, sf_cache=None)

Bases: educe.learning.keys.MergedKeyGroup

Features for pairs of EDUs

fill(current, edu1, edu2, target=None)

See PairSubgroup

one_hot_values_gen(suffix='')
class educe.stac.learning.features.PairSubgroup(description, keys)

Bases: educe.learning.keys.KeyGroup

Abstract keygroup for subgroups of the merged PairKeys. These subgroup classes provide modularity: the code that defines a set of related feature vector keys lives alongside the code that fills them out

fill(current, edu1, edu2, target=None)

Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged key group, you may find it desirable to fill out the merged group instead)

class educe.stac.learning.features.PairSubgroup_Gap(sf_cache)

Bases: educe.stac.learning.features.PairSubgroup

Features related to the combined surrounding context of the two EDUs

fill(current, edu1, edu2, target=None)
class educe.stac.learning.features.PairSubgroup_Tuple(inputs, sf_cache)

Bases: educe.stac.learning.features.PairSubgroup

artificial tuple features

fill(current, edu1, edu2, target=None)
class educe.stac.learning.features.PdtbLexKeyGroup(lexicon)

Bases: educe.learning.keys.KeyGroup

One feature per PDTB marker lexicon class

fill(current, edu, target=None)

See SingleEduSubgroup

classmethod key_prefix()

All feature keys in this lexicon should start with this string

mk_field(rel)

From relation name to feature key

mk_fields()

Feature name for each relation in the lexicon

class educe.stac.learning.features.SingleEduKeys(inputs)

Bases: educe.learning.keys.MergedKeyGroup

Features for a single EDU

fill(current, edu, target=None)

See SingleEduSubgroup.fill

class educe.stac.learning.features.SingleEduSubgroup(description, keys)

Bases: educe.learning.keys.KeyGroup

Abstract keygroup for subgroups of the merged SingleEduKeys. These subgroup classes provide modularity: the code that defines a set of related feature vector keys lives alongside the code that fills them out

fill(current, edu, target=None)

Fill out a vector’s features (if the vector is None, then we just fill out this group; but in the case of a merged key group, you may find it desirable to fill out the merged group instead)

This defaults to _magic_fill if you don’t implement it.

class educe.stac.learning.features.SingleEduSubgroup_Chat

Bases: educe.stac.learning.features.SingleEduSubgroup

Single-EDU features based on the EDU’s relationship with the chat structure (eg turns, dialogues).

class educe.stac.learning.features.SingleEduSubgroup_Parser

Bases: educe.stac.learning.features.SingleEduSubgroup

Single-EDU features that come out of a syntactic parser.

class educe.stac.learning.features.SingleEduSubgroup_Punct

Bases: educe.stac.learning.features.SingleEduSubgroup

punctuation features

class educe.stac.learning.features.SingleEduSubgroup_Token

Bases: educe.stac.learning.features.SingleEduSubgroup

word/token-based features

class educe.stac.learning.features.VerbNetEntry(classname, lemmas)

Bases: tuple

classname

Alias for field number 0

lemmas

Alias for field number 1

class educe.stac.learning.features.VerbNetLexKeyGroup(ventries)

Bases: educe.learning.keys.KeyGroup

One feature per VerbNet lexicon class

fill(current, edu, target=None)

See SingleEduSubgroup

classmethod key_prefix()

All feature keys in this lexicon should start with this string

mk_field(ventry)

From verb class to feature key

mk_fields()

Feature name for each relation in the lexicon

educe.stac.learning.features.clean_chat_word(token)

Given a word and its POS tag (educe PosTag representation), return a somewhat tidied-up version of the word.

  • Runs of the same letter longer than three are shortened to exactly three letters
  • The word is lower-cased
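These two rules can be sketched in a couple of lines (a hypothetical helper under the stated assumptions, not the educe function itself, which also receives the POS tag):

```python
import re

def tidy_chat_word(word):
    # lower-case, then cap any run of a repeated character at three:
    # (.)\1{3,} matches a character followed by 3+ copies of itself
    return re.sub(r'(.)\1{3,}', r'\1\1\1', word.lower())
```

So a chat-style `"Nooooo"` becomes `"nooo"`, while runs of length three or less are left alone.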
educe.stac.learning.features.clean_dialogue_act(act)

Knock out temporary markers used during corpus annotation

educe.stac.learning.features.dialogue_act_pairs(current, cache, edu1, edu2)

tuple of dialogue acts for both EDUs

educe.stac.learning.features.edu_position_in_turn(_, edu)

relative position of the EDU in the turn

educe.stac.learning.features.edu_text_feature(wrapped)

Lift a text based feature into a standard single EDU one

(String -> a) ->
((Current, Edu) -> a)
educe.stac.learning.features.emoticons(tokens)

Given some tokens, return just those which are emoticons

educe.stac.learning.features.enclosed_lemmas(span, parses)

Given a span and a list of parses, return any lemmas that are within that span

educe.stac.learning.features.enclosed_trees(span, trees)

Return the biggest (sub)trees in trees that are enclosed in the span

educe.stac.learning.features.ends_with_bang(current, edu)

if the EDU text ends with ‘!’

educe.stac.learning.features.ends_with_qmark(current, edu)

if the EDU text ends with ‘?’

educe.stac.learning.features.extract_pair_features(inputs, stage)

Extraction for all relevant pairs in a document (generator)

educe.stac.learning.features.extract_single_features(inputs, stage)

Return a dictionary for each EDU

educe.stac.learning.features.feat_annotator(current, edu1, edu2)

annotator for the subdoc

educe.stac.learning.features.feat_end(_, edu)

text span end

educe.stac.learning.features.feat_has_emoticons(_, edu)

if the EDU has emoticon-tagged tokens

educe.stac.learning.features.feat_id(_, edu)

some sort of unique identifier for the EDU

educe.stac.learning.features.feat_is_emoticon_only(_, edu)

if the EDU consists solely of an emoticon

educe.stac.learning.features.feat_start(_, edu)

text span start

educe.stac.learning.features.get_players(inputs)

Return a dictionary mapping each document to the set of players in that document

educe.stac.learning.features.has_FOR_np(current, edu)

if the EDU has the pattern IN(for).. NP

educe.stac.learning.features.has_correction_star(current, edu)

if the EDU begins with a ‘*’ but does not contain others

educe.stac.learning.features.has_inner_question(current, gap, _edu1, _edu2)

if there is an intervening EDU that is a question

educe.stac.learning.features.has_one_of_words(sought, tokens, norm=<function <lambda>>)

Given a set of sought words and a collection of tokens, return True if any of the tokens matches one of the sought words, modulo some minor normalisations like lowercasing.
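One plausible reading of that contract, sketched below (the default normalisation and the exact semantics are assumptions):

```python
# Membership test after per-token normalisation (illustrative sketch):
# True iff some normalised token is in the sought set.
def has_one_of_words(sought, tokens, norm=lambda w: w.lower()):
    return any(norm(tok) in sought for tok in tokens)
```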

educe.stac.learning.features.has_pdtb_markers(markers, tokens)

Given a sequence of tagged tokens, return True if any of the given PDTB markers appears within the tokens

educe.stac.learning.features.has_player_name_exact(current, edu)

if the EDU text has a player name in it

educe.stac.learning.features.has_player_name_fuzzy(current, edu)

if the EDU has a word that sounds like a player name

educe.stac.learning.features.is_just_emoticon(tokens)

Return true if a sequence of tokens consists of a single emoticon

educe.stac.learning.features.is_nplike(anno)

is some sort of NP annotation from a parser

educe.stac.learning.features.is_question(current, edu)

if the EDU is (or contains) a question

educe.stac.learning.features.is_question_pairs(current, cache, edu1, edu2)

boolean tuple: if each EDU is a question

educe.stac.learning.features.lemma_subject(*args, **kwargs)

the lemma corresponding to the subject of this EDU

educe.stac.learning.features.lexical_markers(lclass, tokens)

Given a dictionary (words to categories) and a sequence of tokens, return all the categories of words that appear among those tokens.

Note that for now we are doing our own white-space based tokenisation, but it could make sense to use a different source of tokens instead
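A minimal sketch of that lookup, assuming the dictionary maps lower-cased words to single categories (the real lexicons may be richer):

```python
# Collect the lexical class of every token found in the dictionary
# (illustrative sketch of the lookup described above).
def lexical_markers(word_classes, tokens):
    return {word_classes[t.lower()] for t in tokens
            if t.lower() in word_classes}
```

For instance, with `{'but': 'contrast', 'so': 'result'}`, the whitespace tokens of "yes but so what" yield the categories contrast and result.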

educe.stac.learning.features.map_topdown(good, prunable, trees)

Do topdown search on all these trees, concatenate results.

educe.stac.learning.features.mk_env(inputs, people, key)

Pre-process and bundle up a representation of the current document

educe.stac.learning.features.mk_envs(inputs, stage)

Generate an environment for each document in the corpus within the given stage.

The environment pools together all the information we have on a single document

educe.stac.learning.features.mk_high_level_dialogues(inputs, stage)

Generate all relevant EDU pairs for a document (generator)

educe.stac.learning.features.mk_is_interesting(args, single)

Return a function that filters corpus keys to pick out the ones we specified on the command line

We have two cases here: for pair extraction, we just want to grab the units and if possible the discourse stage. In live mode, there won’t be a discourse stage, but that’s fine because we can just fall back on units.

For single extraction (dialogue acts), we’ll also want to grab the units stage and fall back to unannotated when in live mode. This is made a bit trickier by the fact that unannotated does not have an annotator, so we have to accommodate that.

Phew.

educe.stac.learning.features.num_edus_between(_current, gap, _edu1, _edu2)

number of intervening EDUs (0 if adjacent)

educe.stac.learning.features.num_nonling_tstars_between(_current, gap, _edu1, _edu2)

number of non-linguistic turn-stars between EDUs

educe.stac.learning.features.num_speakers_between(_current, gap, _edu1, _edu2)

number of distinct speakers in intervening EDUs

educe.stac.learning.features.num_tokens(_, edu)

length of this EDU in tokens

educe.stac.learning.features.player_addresees(edu)

The set of people spoken to during an EDU annotation. This excludes known non-players, like ‘All’, or ‘?’, or ‘Please choose...’.

educe.stac.learning.features.players_for_doc(corpus, kdoc)

Return the set of speakers/addressees associated with a document.

In STAC, documents are semi-arbitrarily cut into subdocuments for technical and possibly ergonomic reasons, i.e. the cuts are meaningless as far as we are concerned. So to find all speakers, we would have to search all the subdocuments of a single document.

(Corpus, String) -> Set String
educe.stac.learning.features.position_in_dialogue(_, edu)

relative position of the turn in the dialogue

educe.stac.learning.features.position_in_game(_, edu)

relative position of the turn in the game

educe.stac.learning.features.position_of_speaker_first_turn(edu)

Given an EDU context, determine the position of the first turn by that EDU’s speaker relative to other turns in that dialogue.

educe.stac.learning.features.read_corpus_inputs(args)

Read and filter the part of the corpus we want features for

educe.stac.learning.features.read_pdtb_lexicon(args)

Read and return the local PDTB discourse marker lexicon.

educe.stac.learning.features.real_dialogue_act(edu)

Given an EDU in the ‘discourse’ stage of the corpus, return its dialogue act from the ‘units’ stage

educe.stac.learning.features.relation_dict(doc, quiet=False)

Return the relations instances from a document in the form of an id pair to label dictionary

If there is more than one relation between a pair of EDUs, we pick one of them arbitrarily and ignore the others
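The arbitrary-choice behaviour can be sketched as keeping the first relation seen per EDU pair (the triple shape for relation instances is a hypothetical simplification):

```python
# Build an (id, id) -> label mapping, keeping the first relation seen
# for each pair and ignoring any later ones (illustrative sketch).
def relations_by_pair(relations):
    result = {}
    for src, tgt, label in relations:
        result.setdefault((src, tgt), label)
    return result
```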

educe.stac.learning.features.same_speaker(current, _, edu1, edu2)

if both EDUs have the same speaker

educe.stac.learning.features.same_turn(current, _, edu1, edu2)

if both EDUs are in the same turn

educe.stac.learning.features.speaker_already_spoken_in_dialogue(_, edu)

if the speaker for this EDU is the same as that of a previous turn in the dialogue

educe.stac.learning.features.speaker_id(_, edu)

Get the speaker ID

educe.stac.learning.features.speaker_started_the_dialogue(_, edu)

if the speaker for this EDU is the same as that of the first turn in the dialogue

educe.stac.learning.features.speakers_first_turn_in_dialogue(_, edu)

position in the dialogue of the turn in which the speaker for this EDU first spoke

educe.stac.learning.features.strip_cdus(corpus, mode)

For all documents in a corpus, remove any CDUs and relink the document according to the desired mode. This mutates the corpus.

educe.stac.learning.features.subject_lemmas(span, trees)

Given a span and a list of dependency trees, return any lemmas which are marked as being some subject in that span

educe.stac.learning.features.turn_follows_gap(_, edu)

if this EDU’s turn number is more than one greater than that of the previous turn

educe.stac.learning.features.type_text(wrapped)

Given a feature that emits text, clean its output up so that it works with a wide variety of CSV parsers

(a -> String) ->
(a -> String)
educe.stac.learning.features.word_first(*args, **kwargs)

the first word in this EDU

educe.stac.learning.features.word_last(*args, **kwargs)

the last word in this EDU