educe.pdtb package

Conventions specific to the Penn Discourse Treebank (PDTB) project

Submodules

educe.pdtb.corpus module

PDTB Corpus management (re-exported by educe.pdtb)

class educe.pdtb.corpus.Reader(corpusdir)

Bases: educe.corpus.Reader

See educe.corpus.Reader for details

files(doc_glob=None)
Parameters: doc_glob (str, optional) – Glob expression for document (folder) names; if None, it uses the wildcard '*/*' for folder names and file basenames.

slurp_subcorpus(cfiles, verbose=False)

See educe.rst_dt.parse for a description of RSTTree
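To illustrate how a glob such as the doc_glob parameter of files can select documents, here is a minimal self-contained sketch. It does not use educe: list_documents is a hypothetical stand-in, and it assumes that documents sit in section folders (for example 23/wsj_2321.pdtb), that the default pattern is '*/*', and that the .pdtb extension is appended to the glob.

```python
import glob
import os
import tempfile


def list_documents(corpusdir, doc_glob=None):
    """Hypothetical stand-in for Reader.files (not educe's code).

    Assumption: doc_glob is matched against 'folder/basename' and the
    '.pdtb' extension is appended before globbing.
    """
    if doc_glob is None:
        doc_glob = "*/*"  # any folder, any file basename
    pattern = os.path.join(corpusdir, doc_glob + ".pdtb")
    return sorted(os.path.relpath(p, corpusdir) for p in glob.glob(pattern))


def make_toy_corpus(root):
    """Create a tiny directory tree of empty .pdtb files for the demo."""
    docs = [("00", "wsj_0003"), ("23", "wsj_2321"), ("23", "wsj_2322")]
    for section, name in docs:
        os.makedirs(os.path.join(root, section), exist_ok=True)
        open(os.path.join(root, section, name + ".pdtb"), "w").close()


with tempfile.TemporaryDirectory() as tmp:
    make_toy_corpus(tmp)
    everything = list_documents(tmp)                  # default wildcard
    section_23 = list_documents(tmp, doc_glob="23/*")  # one section only
    print(everything)
    print(section_23)
```

Restricting the glob to one section is a convenient way to experiment on a small part of the corpus before loading all of it.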

educe.pdtb.corpus.id_to_path(k)

Given a fleshed out FileId (none of the fields are None), return a filepath for it following Penn Discourse Treebank conventions.

You will likely want to add your own filename extensions to this path
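As a rough illustration of the mapping described above, the sketch below turns a key into a section-relative path and then appends an extension. It is not educe's implementation: FileId here is a local stand-in namedtuple, sketch_id_to_path is a hypothetical helper, the field values are placeholders, and the layout (a two-digit section folder taken from the document name) is an assumption based on the usual PDTB directory structure.

```python
import os
from collections import namedtuple

# Local stand-in for educe.corpus.FileId so the sketch runs without educe.
FileId = namedtuple("FileId", ["doc", "subdoc", "stage", "annotator"])


def sketch_id_to_path(key):
    """Map a key with doc='wsj_2321' to '23/wsj_2321' (hypothetical helper).

    Assumption: the section folder is the first two digits that follow
    the 'wsj_' prefix of the document name.
    """
    section = key.doc[4:6]
    return os.path.join(section, key.doc)


# Placeholder field values; only 'doc' matters for this sketch.
key = FileId(doc="wsj_2321", subdoc="0", stage="discourse", annotator="pdtb")

# The returned path has no extension, so the caller adds one.
path = sketch_id_to_path(key) + ".pdtb"
print(path)
```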

educe.pdtb.corpus.mk_key(doc)

Return an corpus key for a given document name

educe.pdtb.parse module

Standalone parser for PDTB files.

The function parse takes a single .pdtb file and returns a list of Relation, with the following subtypes:

Relation          selection      features          sup?
----------------  -------------  ----------------  ----
ExplicitRelation  Selection      attr, 1 connhead  Y
ImplicitRelation  InferenceSite  attr, 2 conn      Y
AltLexRelation    Selection      attr, 2 semclass  Y
EntityRelation    InferenceSite  none              N
NoRelation        InferenceSite  none              N

These relation subtypes are stitched together (and inherit members) from two or three components:

  • arguments: always arg1 and arg2; but in some cases, the arguments can have supplementary information
  • selection: see either Selection or InferenceSite
  • some features (see e.g. ExplicitRelationFeatures)

The simplest way to get to grips with this may be to try the parse function on some sample relations and print the resulting objects.
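The idea of a relation type stitched together from several components can be sketched with plain multiple inheritance. The classes below are toy stand-ins written for this example (the Toy* names are hypothetical, and arguments are plain strings rather than Arg objects); they only mirror the constructor signatures and base-class order listed in this module and are not educe's code.

```python
class ToyItem:
    """Stand-in for PdtbItem: a common base class."""


class ToySelection(ToyItem):
    def __init__(self, span, gorn, text):
        self.span = span
        self.gorn = gorn
        self.text = text


class ToyExplicitFeatures(ToyItem):
    def __init__(self, attribution, connhead):
        self.attribution = attribution
        self.connhead = connhead


class ToyRelation(ToyItem):
    def __init__(self, args):
        # A relation always has exactly two arguments
        self.arg1, self.arg2 = args


class ToyExplicitRelation(ToySelection, ToyExplicitFeatures, ToyRelation):
    """Inherits the members of all three components."""

    def __init__(self, selection, features, args):
        # Initialise each component explicitly from the object passed in
        ToySelection.__init__(self, selection.span, selection.gorn,
                              selection.text)
        ToyExplicitFeatures.__init__(self, features.attribution,
                                     features.connhead)
        ToyRelation.__init__(self, args)


rel = ToyExplicitRelation(
    ToySelection(span=[(9, 12)], gorn=[(0, 1)], text="but"),
    ToyExplicitFeatures(attribution=None, connhead="but"),
    ("first argument text", "second argument text"),
)

# Members from every component are available directly on the relation
print(rel.text, rel.connhead, rel.arg1, rel.arg2)
```

With this layout, code that only cares about the selection or only about the arguments can treat any relation as an instance of the corresponding component class.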

class educe.pdtb.parse.AltLexRelation(selection, features, args)

Bases: educe.pdtb.parse.Selection, educe.pdtb.parse.AltLexRelationFeatures, educe.pdtb.parse.Relation

class educe.pdtb.parse.AltLexRelationFeatures(attribution, semclass1, semclass2)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.Arg(selection, attribution=None, sup=None)

Bases: educe.pdtb.parse.Selection

class educe.pdtb.parse.Attribution(source, type, polarity, determinacy, selection=None)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.Connective(text, semclass1, semclass2=None)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.EntityRelation(infsite, args)

Bases: educe.pdtb.parse.InferenceSite, educe.pdtb.parse.Relation

class educe.pdtb.parse.ExplicitRelation(selection, features, args)

Bases: educe.pdtb.parse.Selection, educe.pdtb.parse.ExplicitRelationFeatures, educe.pdtb.parse.Relation

class educe.pdtb.parse.ExplicitRelationFeatures(attribution, connhead)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.GornAddress(parts)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.ImplicitRelation(infsite, features, args)

Bases: educe.pdtb.parse.InferenceSite, educe.pdtb.parse.ImplicitRelationFeatures, educe.pdtb.parse.Relation

class educe.pdtb.parse.ImplicitRelationFeatures(attribution, connective1, connective2=None)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.InferenceSite(strpos, sentnum)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.NoRelation(infsite, args)

Bases: educe.pdtb.parse.InferenceSite, educe.pdtb.parse.Relation

class educe.pdtb.parse.PdtbItem

Bases: object

class educe.pdtb.parse.Relation(args)

Bases: educe.pdtb.parse.PdtbItem

arg1

TODO – TODO

arg2

TODO – TODO

class educe.pdtb.parse.Selection(span, gorn, text)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.SemClass(klass)

Bases: educe.pdtb.parse.PdtbItem

class educe.pdtb.parse.Sup(selection)

Bases: educe.pdtb.parse.Selection

educe.pdtb.parse.parse(path)

Retrieve the list of relations found in a single .pdtb file.

Parameters: path (str) – Path to the .pdtb file (?)
Returns: relations – List of relations found.
Return type: list of Relation

educe.pdtb.parse.parse_relation(s)

Parse a single relation or raise a ParseException.

educe.pdtb.parse.split_relations(s)

educe.pdtb.pdtbx module

PDTB in an ad hoc (educe-grown) XML format. It is unfortunately not a standard, just a little homegrown language using XML syntax. I’ll call it pdtbx. There is no reason it can’t be used outside of educe.

Informal DTD:

  • SpanList is attribute spanList in PDTB string convention
  • GornAddressList is attribute gornList in PDTB string convention
  • SemClass is attribute semclass1 (and optional attribute semclass2)
    in PDTB string convention
  • text in <text> elements with usual XML escaping conventions
  • args in <arg> elements in order (arg1 before arg2)
  • implicitRelations can have multiple connectives

educe.pdtb.pdtbx.Relation_xml(itm)

educe.pdtb.pdtbx.Relations_xml(itms)

educe.pdtb.pdtbx.read_Relation(node)

educe.pdtb.pdtbx.read_Relations(node)

educe.pdtb.pdtbx.read_pdtbx_file(filename)

educe.pdtb.pdtbx.write_pdtbx_file(filename, relations)
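The conventions in the informal DTD can be illustrated with the standard library alone. The sketch below is not the pdtbx module: the helper functions and the explicitRelation element name are assumptions made for this example, and only the spanList and gornList attributes and the <arg> and <text> elements come from the description above. It assumes the PDTB string convention writes spans as start..end and Gorn addresses as comma-separated integers, with semicolons between items.

```python
import xml.etree.ElementTree as ET


def spans_to_str(spans):
    """[(14, 30), (35, 40)] -> '14..30;35..40'"""
    return ";".join("{}..{}".format(start, end) for start, end in spans)


def str_to_spans(text):
    """'14..30;35..40' -> [(14, 30), (35, 40)]"""
    return [tuple(int(x) for x in part.split(".."))
            for part in text.split(";")]


def gorn_to_str(addresses):
    """[(0, 1), (0, 2, 1)] -> '0,1;0,2,1'"""
    return ";".join(",".join(str(i) for i in addr) for addr in addresses)


def arg_element(spans, gorn, text):
    """Build one <arg> element with a nested <text> element."""
    node = ET.Element("arg", spanList=spans_to_str(spans),
                      gornList=gorn_to_str(gorn))
    ET.SubElement(node, "text").text = text
    return node


# The relation element name is a guess; args are appended in order
relation = ET.Element("explicitRelation")
relation.append(arg_element([(0, 8)], [(0, 0)], "Rates rose"))
relation.append(arg_element([(14, 30), (35, 40)], [(0, 1), (0, 2, 1)],
                            "stocks fell & bonds <held>"))

# Serialise, then read back; special characters are escaped on output
xml_text = ET.tostring(relation, encoding="unicode")
parsed = ET.fromstring(xml_text)
args = parsed.findall("arg")
print(xml_text)
print(str_to_spans(args[1].get("spanList")))
```

Keeping spans and Gorn addresses as attribute strings means the XML stays close to the original PDTB notation, at the cost of a small decoding step when reading.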

educe.pdtb.ptb module

Alignment with the Penn Treebank

educe.pdtb.ptb.parse_trees(corpus, k, ptb)

Given a PDTB document and an NLTK PTB reader, return the PTB trees.

Note that a future version of this function will try to educify the trees as well, but for now things will be fairly rudimentary

educe.pdtb.ptb.reader(corpus_dir)

An instantiated NLTK BracketedParseCorpusReader for the PTB section relevant to the PDTB corpus.

Note that the path you give to this will probably end with something like parsed/mrg/wsj
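As a sketch of how a PDTB document name might be lined up with its PTB file, the helpers below build an NLTK file identifier and load the trees. This is not educe's code: ptb_fileid and load_trees are hypothetical names, the section/name.mrg layout under the corpus directory is an assumption, and the file pattern passed to the reader is illustrative. NLTK is imported lazily, so the path helper works without it.

```python
def ptb_fileid(doc):
    """Map 'wsj_2321' to the file identifier '23/wsj_2321.mrg'.

    Assumption: files are grouped in two-digit section folders; NLTK file
    identifiers use '/' as the separator on every platform.
    """
    return "{}/{}.mrg".format(doc[4:6], doc)


def load_trees(corpus_dir, doc):
    """Return the parsed sentences for one document (requires NLTK).

    corpus_dir is expected to end with something like parsed/mrg/wsj.
    """
    from nltk.corpus.reader import BracketedParseCorpusReader

    # Illustrative pattern: any .mrg file inside a two-digit section folder
    ptb = BracketedParseCorpusReader(corpus_dir, r"\d\d/wsj_\d+\.mrg")
    return ptb.parsed_sents(ptb_fileid(doc))


fileid = ptb_fileid("wsj_2321")
print(fileid)
```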