educe.stac.lexicon package
Submodules
educe.stac.lexicon.markers module
API on discourse markers (lexicon I/O mostly)
- class educe.stac.lexicon.markers.LexConn(infile, version='2', stop=set([u'\xe0', u'ou', u'en', u'pour', u'et']))
  - get_by_form(form)
  - get_by_id(id)
  - get_by_lemma(lemma)
- class educe.stac.lexicon.markers.Marker(elmt, version='2', stop=set([u'\xe0', u'ou', u'en', u'pour', u'et']))
  Wrapper class for a discourse marker read from Lexconn, version 1 or 2.
  It should include at least id and cat (grammatical category). Version 1 has type (coord/subord); version 2 has grammatical host and lemma.
  - get_forms()
  - get_lemma()
  - get_relations()
educe.stac.lexicon.pdtb_markers module
Lexicon of discourse markers.
Cheap and cheerful phrasal lexicon format used in the STAC project. Maps sequences of multiword expressions to relations they mark
    as ; explanation explanation* background
    as a result ; result result*
    for example ; elaboration
    if:then ; conditional
    on the one hand:on the other hand
One entry per line. Sometimes you have split expressions, like “on the one hand X, on the other hand Y” (we model this by saying that we are working with sequences of expressions, rather than single expressions). Phrases can be associated with 0 to N relations, interpreted as a disjunction; if \wedge (LaTeX for the “logical and” operator) appears, it is ignored.
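The format described above can be read with a few string splits. The following is a minimal sketch, not the educe implementation: `parse_marker_line` is a hypothetical helper, and it assumes (from the example entries) that `;` separates the phrase from its relations, that `:` separates the components of a split expression, and that the ignored conjunction token is written `\wedge`.

```python
def parse_marker_line(line):
    """Parse one lexicon line into (expressions, relations).

    Hypothetical helper for illustration only; not part of educe.
    """
    if ";" in line:
        phrase_part, relation_part = line.split(";", 1)
    else:
        # Entries may carry no relations at all
        phrase_part, relation_part = line, ""
    # Assumption: a colon separates the components of a split expression;
    # each component becomes a tuple of tokens
    expressions = tuple(tuple(part.split()) for part in phrase_part.split(":"))
    # Relations are whitespace-separated; the conjunction token is dropped
    # (assumed here to be spelled "\wedge")
    relations = [rel for rel in relation_part.split() if rel != "\\wedge"]
    return expressions, relations


SAMPLE = """\
as ; explanation explanation* background
as a result ; result result*
for example ; elaboration
if:then ; conditional
on the one hand:on the other hand
"""

entries = [parse_marker_line(line) for line in SAMPLE.splitlines() if line.strip()]
```

In this sketch the last sample entry yields two expressions and an empty relation list, which matches the "0 to N relations" wording.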
- class educe.stac.lexicon.pdtb_markers.Marker(exprs)
  Bases: object
  A marker here is a sort of template consisting of multiword expressions and holes, e.g. “on the one hand, XXX, on the other hand YYY”. We represent this as a sequence of Multiword objects.
  - classmethod any_appears_in(markers, words, sep='#####')
    Return True if any of the given markers appears in the word sequence.
    See appears_in for details.
  - appears_in(words, sep='#####')
    Given a sequence of words, return True if this marker appears in that sequence.
    We use a very liberal definition here. In particular, if the marker has more than one component (on the one hand X, on the other hand Y), we merely check that all components appear, without caring what order they appear in.
    Note that this abuses the Python string matching functionality, and assumes that the separator substring never appears in the tokens.
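The separator trick can be illustrated as follows. This is a sketch under stated assumptions rather than the library's code: `components_appear_in` is a hypothetical stand-alone function, markers are modelled as plain lists of token tuples, and no case normalisation is applied.

```python
def components_appear_in(components, words, sep="#####"):
    """Return True if every component occurs in `words` as a contiguous run.

    Hypothetical illustration of matching via joined strings; the order in
    which the components occur is deliberately not checked.
    """
    # Wrap the joined tokens in separators so that a match can only start
    # and end on token boundaries
    sentence = sep + sep.join(words) + sep
    for component in components:
        needle = sep + sep.join(component) + sep
        if needle not in sentence:
            return False
    return True


marker = [("on", "the", "one", "hand"), ("on", "the", "other", "hand")]
# The components occur in reverse order here, which the liberal check accepts
words = "on the other hand it rains , on the one hand it shines".split()
found = components_appear_in(marker, words)
```

The approach only works if no token contains the separator string, which is why an unlikely value such as '#####' is used.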
-
classmethod
- class educe.stac.lexicon.pdtb_markers.Multiword(words)
  Bases: object
  A sequence of tokens representing a multiword expression.
- educe.stac.lexicon.pdtb_markers.load_pdtb_markers_lexicon(filename)
  Load the lexicon of discourse markers from the PDTB.
  Parameters: filename (str) – Path to the lexicon.
  Returns: markers – Discourse markers and the relations they signal.
  Return type: dict(Marker, list(string))
- educe.stac.lexicon.pdtb_markers.read_lexicon(filename)
  Load the lexicon of discourse markers from the PDTB, by relation.
  This calls load_pdtb_markers_lexicon but inverts the indexing to map each relation to its possible discourse markers.
  Note that, as an effect of this inversion, discourse markers whose set of relations is left empty in the lexicon (possibly because they are too ambiguous?) are absent from the inverted index.
  Parameters: filename (str) – Path to the lexicon.
  Returns: relations – Relations and their signalling discourse markers.
  Return type: dict(string, frozenset(Marker))
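The inversion, and why markers with no relations disappear, can be sketched in a few lines. This is an illustration only: `invert_index` is a hypothetical name, and plain strings stand in for Marker objects.

```python
from collections import defaultdict


def invert_index(markers_to_relations):
    """Map each relation to the frozenset of markers that signal it.

    Hypothetical sketch; markers with an empty relation list never enter
    the inner loop, so they are absent from the result.
    """
    relations_to_markers = defaultdict(set)
    for marker, relations in markers_to_relations.items():
        for relation in relations:
            relations_to_markers[relation].add(marker)
    return {rel: frozenset(ms) for rel, ms in relations_to_markers.items()}


# Strings stand in for Marker objects in this example
lexicon = {
    "as": ["explanation", "background"],
    "because": ["explanation"],
    "on the one hand": [],  # no relations listed
}
inverted = invert_index(lexicon)
```

Here "on the one hand" does not occur anywhere in `inverted`, mirroring the note above.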
educe.stac.lexicon.wordclass module
Cheap and cheerful lexicon format used in the STAC project. One entry per line, blanks ignored. Each entry associates
- some word with
- some kind of category (we call this a “lexical class”)
- an optional part of speech (?? if unknown)
- an optional subcategory (blank if none)
Here’s an example with all four fields:
    purchase:VBEchange:VB:receivable
    acquire:VBEchange:VB:receivable
    give:VBEchange:VB:givable
and one without the notion of subclass:
    ought:modal:MD:
    except:negation:??:
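To make the four-field layout concrete, here is a minimal parsing sketch. It is not the LexEntry implementation: `Entry`, `parse_entry` and `parse_entries` are hypothetical names, and mapping `??` and blank fields to None is a choice made for this example.

```python
from collections import namedtuple

# Hypothetical record type for illustration; field order follows the examples
Entry = namedtuple("Entry", "word lex_class pos subclass")


def parse_entry(line):
    """Parse 'word:class:pos:subclass' into an Entry, or raise ValueError."""
    fields = line.strip().split(":")
    if len(fields) != 4:
        raise ValueError("expected 4 colon-separated fields: %r" % line)
    word, lex_class, pos, subclass = fields
    # Assumption for this sketch: unknown POS ('??') and a blank subclass
    # are both represented as None
    return Entry(word, lex_class, None if pos == "??" else pos, subclass or None)


def parse_entries(lines):
    """Parse every non-blank line; blank lines are skipped."""
    return [parse_entry(line) for line in lines if line.strip()]


sample = ["purchase:VBEchange:VB:receivable", "", "except:negation:??:"]
parsed = parse_entries(sample)
```

Raising on malformed lines follows the documented behaviour of read_entry only in spirit; the exception type used here is an assumption.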
- class educe.stac.lexicon.wordclass.LexClass
  Bases: educe.stac.lexicon.wordclass.LexClass
  Grouping together information for a single lexical class. Our assumption here is that a word belongs to at most one subclass.
  - classmethod freeze(other)
    A frozen copy of a lex class
  - just_subclasses()
    Any subclasses associated with this lexical class
  - just_words()
    Any words associated with this lexical class
  - classmethod new_writable_instance()
    A brand new (empty) lex class
-
classmethod
- class educe.stac.lexicon.wordclass.LexEntry
  Bases: educe.stac.lexicon.wordclass.LexEntry
  A single entry in the lexicon.
  - classmethod read_entries(items)
    Return a list of LexEntry given an iterable of entry strings, e.g. the stream for the lines in a file. Blank entries are ignored.
  - classmethod read_entry(line)
    Return a LexEntry given the string corresponding to an entry, or raise an exception if we can’t parse it.
-
classmethod
- class educe.stac.lexicon.wordclass.Lexicon
  Bases: educe.stac.lexicon.wordclass.Lexicon
  All entries in a wordclass lexicon, along with some helpers for convenient access.
  Parameters:
  - word_to_subclass (Dict String (Dict String String)) – class to word to subclass nested dict
  - subclasses_to_words (Dict String (Set String)) – class to subclass (to words)
  - dump()
    Print a lexicon’s contents to stdout
  - classmethod read_file(filename)
    Read the lexical entries in the file of the given name and return a Lexicon
    :: FilePath -> IO Lexicon