educe.stac.lexicon package

Submodules

educe.stac.lexicon.markers module

API on discourse markers (lexicon I/O mostly)

class educe.stac.lexicon.markers.LexConn(infile, version='2', stop=set([u'\xe0', u'ou', u'en', u'pour', u'et']))
get_by_form(form)
get_by_id(id)
get_by_lemma(lemma)
class educe.stac.lexicon.markers.Marker(elmt, version='2', stop=set([u'\xe0', u'ou', u'en', u'pour', u'et']))

Wrapper class for a discourse marker read from Lexconn, version 1 or 2.

Should include at least id and cat (grammatical category). Version 1 has type (coord/subord); version 2 has grammatical host and lemma.

get_forms()
get_lemma()
get_relations()

educe.stac.lexicon.pdtb_markers module

Lexicon of discourse markers.

Cheap and cheerful phrasal lexicon format used in the STAC project. Maps sequences of multiword expressions to the relations they mark:

    as ; explanation explanation* background
    as a result ; result result*
    for example ; elaboration
    if:then ; conditional
    on the one hand:on the other hand

One entry per line. Some expressions are split, like “on the one hand X, on the other hand Y”; we model this by working with sequences of expressions rather than single expressions. Phrases can be associated with 0 to N relations, interpreted as a disjunction. If \wedge (LaTeX for the “logical and” operator) appears, it is ignored.
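As a rough illustration of the format described above, the following sketch splits one lexicon line into its expressions and relations. The function `parse_marker_line` is a hypothetical helper written for this example, not part of educe. It assumes that `;` separates the marker from its relations and that `:` separates the components of a split expression, as the sample entries suggest.

```python
def parse_marker_line(line):
    """Parse one lexicon line into (expressions, relations).

    Hypothetical helper for illustration only; not the educe implementation.
    """
    # Everything before the first ';' is the marker; the rest lists relations.
    marker_part, _, relation_part = line.partition(";")
    # Assumption: ':' separates the components of a split expression.
    expressions = [tuple(part.split()) for part in marker_part.split(":")]
    # Relations are whitespace-separated; the LaTeX \wedge token is dropped.
    relations = [rel for rel in relation_part.split() if rel != "\\wedge"]
    return expressions, relations


sample = [
    "as ; explanation explanation* background",
    "if:then ; conditional",
    "on the one hand:on the other hand",
]
parsed = [parse_marker_line(line) for line in sample]
```

An entry with no `;` at all, such as the last one, simply yields an empty relation list, which matches the "0 to N relations" reading.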

class educe.stac.lexicon.pdtb_markers.Marker(exprs)

Bases: object

A marker here is a sort of template consisting of multiword expressions and holes, e.g. “on the one hand, XXX, on the other hand YYY”. We represent this as a sequence of Multiword objects.

classmethod any_appears_in(markers, words, sep='#####')

Return True if any of the given markers appears in the word sequence.

See appears_in for details.

appears_in(words, sep='#####')

Given a sequence of words, return True if this marker appears in that sequence.

We use a very liberal definition here. In particular, if the marker has more than one component (on the one hand X, on the other hand Y), we merely check that all components appear, without caring what order they appear in.

Note that this abuses the Python string matching functionality, and assumes that the separator substring never appears in the tokens.
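The separator-based matching described above can be sketched as follows. This is a minimal standalone illustration of the general technique, not necessarily how educe implements it. The names `component_appears` and `marker_appears` are hypothetical, and components are represented as plain tuples of tokens rather than Multiword objects.

```python
def component_appears(component, words, sep="#####"):
    """True if the token sequence `component` occurs contiguously in `words`."""
    # Wrap both strings in separators so a match must align on token
    # boundaries ("as" must not match inside "has").
    sentence = sep + sep.join(words) + sep
    pattern = sep + sep.join(component) + sep
    return pattern in sentence


def marker_appears(components, words, sep="#####"):
    """True if every component occurs somewhere in `words`, in any order."""
    return all(component_appears(c, words, sep) for c in components)


words = "on the other hand it rains but on the one hand it is warm".split()
split_marker = [("on", "the", "one", "hand"), ("on", "the", "other", "hand")]
found = marker_appears(split_marker, words)
```

Here `found` is true even though the two components occur in the "wrong" order, which is the liberal behaviour the description mentions. The approach breaks down if a token ever contains the separator string.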

class educe.stac.lexicon.pdtb_markers.Multiword(words)

Bases: object

A sequence of tokens representing a multiword expression.

educe.stac.lexicon.pdtb_markers.load_pdtb_markers_lexicon(filename)

Load the lexicon of discourse markers from the PDTB.

Parameters: filename (str) – Path to the lexicon.
Returns: markers – Discourse markers and the relations they signal.
Return type: dict(Marker, list(string))

educe.stac.lexicon.pdtb_markers.read_lexicon(filename)

Load the lexicon of discourse markers from the PDTB, by relation.

This calls load_pdtb_markers_lexicon but inverts the indexing to map each relation to its possible discourse markers.

Note that, as an effect of this inversion, discourse markers whose set of relations is left empty in the lexicon (possibly because they are too ambiguous?) are absent from the inverted index.

Parameters: filename (str) – Path to the lexicon.
Returns: relations – Relations and their signalling discourse markers.
Return type: dict(string, frozenset(Marker))
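To make the inversion concrete, here is a small sketch under stated assumptions: plain strings stand in for Marker objects, and `invert_index` is a hypothetical name, not the library's function. It shows why markers with an empty relation list vanish from the inverted index.

```python
from collections import defaultdict


def invert_index(marker_to_relations):
    """Map each relation to the frozenset of markers that signal it."""
    relation_to_markers = defaultdict(set)
    for marker, relations in marker_to_relations.items():
        # A marker with no relations never enters this loop body,
        # so it leaves no trace in the inverted index.
        for relation in relations:
            relation_to_markers[relation].add(marker)
    return {rel: frozenset(ms) for rel, ms in relation_to_markers.items()}


# Strings used as stand-ins for Marker objects (assumption for illustration).
markers = {
    "as": ["explanation", "background"],
    "because": ["explanation"],
    "on the one hand": [],
}
inverted = invert_index(markers)
```

In this example `inverted` has keys for "explanation" and "background" only, and "on the one hand" appears in none of the sets.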

educe.stac.lexicon.wordclass module

Cheap and cheerful lexicon format used in the STAC project. One entry per line, blanks ignored. Each entry associates

  • some word with
  • some kind of category (we call this a “lexical class”)
  • an optional part of speech (?? if unknown)
  • an optional subcategory (blank if none)

Here’s an example with all four fields:

    purchase:VBEchange:VB:receivable
    acquire:VBEchange:VB:receivable
    give:VBEchange:VB:givable

and one without the notion of subclass:

    ought:modal:MD:
    except:negation:??:

class educe.stac.lexicon.wordclass.LexClass

Bases: educe.stac.lexicon.wordclass.LexClass

Groups together information for a single lexical class. Our assumption here is that a word belongs to at most one subclass.

classmethod freeze(other)

A frozen copy of a lex class

just_subclasses()

Any subclasses associated with this lexical class

just_words()

Any words associated with this lexical class

classmethod new_writable_instance()

A brand new (empty) lex class

class educe.stac.lexicon.wordclass.LexEntry

Bases: educe.stac.lexicon.wordclass.LexEntry

A single entry in the lexicon

classmethod read_entries(items)

Return a list of LexEntry given an iterable of entry strings, e.g. the stream for the lines in a file. Blank entries are ignored.

classmethod read_entry(line)

Return a LexEntry given the string corresponding to an entry, or raise an exception if we can’t parse it.
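The entry format from the module description (word, lexical class, part of speech, subclass, separated by colons) can be parsed along these lines. This is a sketch only: `Entry`, `parse_entry` and `parse_entries` are hypothetical stand-ins, the field names are assumptions, and raising ValueError is a choice made for this example since the documentation does not say which exception the library raises.

```python
from collections import namedtuple

# Hypothetical stand-in for LexEntry; the field names are assumptions.
Entry = namedtuple("Entry", ["word", "lex_class", "pos", "subclass"])


def parse_entry(line):
    """Parse 'word:class:pos:subclass' into an Entry, or raise ValueError."""
    fields = line.strip().split(":")
    if len(fields) != 4:
        raise ValueError("expected 4 colon-separated fields: %r" % line)
    word, lex_class, pos, subclass = fields
    # '??' marks an unknown part of speech; a blank subclass means none.
    return Entry(word, lex_class, None if pos == "??" else pos, subclass or None)


def parse_entries(lines):
    """Parse every non-blank line of an iterable of entry strings."""
    return [parse_entry(line) for line in lines if line.strip()]


entries = parse_entries(["purchase:VBEchange:VB:receivable", "", "except:negation:??:"])
```

Mapping `??` and the blank subclass to None is a convenience of this sketch; the library may keep the raw strings instead.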

class educe.stac.lexicon.wordclass.Lexicon

Bases: educe.stac.lexicon.wordclass.Lexicon

All entries in a wordclass lexicon along with some helpers for convenient access

Parameters:
  • word_to_subclass (Dict String (Dict String String)) – class to word to subclass nested dict
  • subclasses_to_words (Dict String (Set String)) – class to subclass (to words)
dump()

Print a lexicon’s contents to stdout

classmethod read_file(filename)

Read the lexical entries in the file of the given name and return a Lexicon

:: FilePath -> IO Lexicon