class typing.Dict[str,] = None) → None


This DatasetReader is designed to read the English OntoNotes v5.0 data in the format used by the CoNLL 2011/2012 shared tasks. To use this reader, you must follow the instructions provided here (v12 release), which will allow you to download the CoNLL-style annotations for the OntoNotes v5.0 release (LDC2013T19.tgz, obtained from LDC).

Once you have run the scripts on the extracted data, you will have a folder structured as follows:

── data
   ├── development
       └── data
           └── english
               └── annotations
                   ├── bc
                   ├── bn
                   ├── mz
                   ├── nw
                   ├── pt
                   ├── tc
                   └── wb
   ├── test
       └── data
           └── english
               └── annotations
                   ├── bc
                   ├── bn
                   ├── mz
                   ├── nw
                   ├── pt
                   ├── tc
                   └── wb
   └── train
       └── data
           └── english
               └── annotations
                   ├── bc
                   ├── bn
                   ├── mz
                   ├── nw
                   ├── pt
                   ├── tc
                   └── wb

The file path provided to this class can then be any of the train, test or development directories (or the top-level data directory, if you are not using the splits).
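As a rough illustration of why any of these directories works as a file path, the sketch below walks a directory recursively and collects annotation files. The helper name iter_annotation_files and the "gold_conll" file suffix are assumptions made for this example; neither is stated in this documentation, so adjust the suffix to match the files on your machine.

```python
import os
from typing import Iterator


def iter_annotation_files(root: str, suffix: str = "gold_conll") -> Iterator[str]:
    """Yield paths of annotation files found anywhere below ``root``.

    ``suffix`` is an assumed naming convention, not something this
    documentation specifies.
    """
    for directory, _subdirs, filenames in os.walk(root):
        # Sort so that the order within a directory is deterministic.
        for name in sorted(filenames):
            if name.endswith(suffix):
                yield os.path.join(directory, name)
```

Because the walk is recursive, passing the train directory yields only the training files, while passing the top-level data directory yields the files of all three splits.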

The data has the following format, ordered by column.

1 Document ID : str
This is a variation on the document filename
2 Part number : int
Some files are divided into multiple parts, numbered 000, 001, 002, etc.
3 Word number : int
This is the word index of the word in that sentence.
4 Word : str
This is the token as segmented/tokenized in the Treebank. Initially the *_skel files contain the placeholder [WORD], which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release.
5 POS Tag : str
This is the Penn Treebank style part of speech. When parse information is missing, all parts of speech except the one for which there is some sense or proposition annotation are marked with an XX tag. The verb is marked with just a VERB tag.
6 Parse bit: str
This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the “([pos] [word])” string (or leaf) and concatenating the items in the rows of that column. When the parse information is missing, the first word of a sentence is tagged as (TOP* and the last word is tagged as *) and all intermediate words are tagged with a *.
7 Predicate lemma: str
The predicate lemma is given for the rows for which we have semantic role information or word sense information. All other rows are marked with a “-”.
8 Predicate Frameset ID: int
The PropBank frameset ID of the predicate in Column 7.
9 Word sense: float
This is the word sense of the word in Column 4.
10 Speaker/Author: str
This is the speaker or author name where available, mostly in Broadcast Conversation and Web Log data. When not available, the rows are marked with a “-”.
11 Named Entities: str
This column identifies the spans representing various named entities. For documents which do not have named entity annotation, each line is represented with an *.
12+ Predicate Arguments: str
There is one column of predicate argument structure information for each predicate mentioned in Column 7. If there are no predicates tagged in a sentence, this is a single column with all rows marked with an *.
-1 Co-reference: str
Co-reference chain information encoded in a parenthesis structure. For documents that do not have co-reference annotations, each line is represented with a “-”.
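
The column layout above can be sketched in a few lines of Python. This is a minimal illustration, not the reader's own parsing code: the function names parse_row and rebuild_parse and the three sample rows are invented for this example. The frameset ID and word sense are kept as strings, on the assumption that unannotated rows carry a placeholder rather than a number. rebuild_parse follows the description of Column 6: each asterisk is replaced by the "([pos] [word])" leaf and the pieces are concatenated.

```python
from typing import Any, Dict, Iterable


def parse_row(line: str) -> Dict[str, Any]:
    """Split one whitespace-separated annotation row into named columns."""
    fields = line.split()
    # 11 fixed columns plus the final co-reference column.
    if len(fields) < 12:
        raise ValueError("expected at least 12 columns, got %d" % len(fields))
    return {
        "document_id": fields[0],
        "part_number": int(fields[1]),
        "word_number": int(fields[2]),
        "word": fields[3],
        "pos_tag": fields[4],
        "parse_bit": fields[5],
        "predicate_lemma": fields[6],
        "predicate_frameset_id": fields[7],  # kept as str (assumption)
        "word_sense": fields[8],             # kept as str (assumption)
        "speaker": fields[9],
        "named_entities": fields[10],
        # Columns 12+ : one entry per predicate in the sentence.
        "predicate_arguments": fields[11:-1],
        # The last column always holds the co-reference information.
        "coreference": fields[-1],
    }


def rebuild_parse(rows: Iterable[Dict[str, Any]]) -> str:
    """Rebuild a sentence's bracketed parse from its parse bits."""
    pieces = []
    for row in rows:
        leaf = "(%s %s)" % (row["pos_tag"], row["word"])
        # Each parse bit contains one asterisk standing in for the leaf.
        pieces.append(row["parse_bit"].replace("*", leaf, 1))
    return "".join(pieces)


# Invented sample rows, for illustration only.
example_lines = [
    "bc/example/00/example_0000 0 0 Dogs NNS (TOP(S(NP*) - - - - * (ARG0*) -",
    "bc/example/00/example_0000 0 1 bark VBP (VP*) bark 01 1 - * (V*) -",
    "bc/example/00/example_0000 0 2 . . *)) - - - - * * -",
]
example_rows = [parse_row(line) for line in example_lines]
print(rebuild_parse(example_rows))
# prints: (TOP(S(NP(NNS Dogs))(VP(VBP bark))(. .)))
```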

token_indexers : Dict[str, TokenIndexer], optional

Defines how the tokens are indexed. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.


A Dataset of Instances for Semantic Role Labelling.

classmethod from_params(params: allennlp.common.params.Params) →
read(file_path: str)

Reads the data found at file_path and returns a Dataset.

text_to_instance(tokens: typing.List[], verb_label: typing.List[int], tags: typing.List[str] = None) →

We take pre-tokenized input here, along with a verb label. The verb label should be a one-hot binary vector, the same length as the tokens, indicating the position of the verb to find arguments for.
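
To make the expected shape of the verb label concrete, here is a minimal sketch that builds one from a token list and the position of the verb. The helper one_hot_verb_label is hypothetical and is not part of this class.

```python
from typing import List


def one_hot_verb_label(tokens: List[str], verb_index: int) -> List[int]:
    """Return a binary vector with a single 1 at the verb's position."""
    if not 0 <= verb_index < len(tokens):
        raise IndexError("verb_index %d is outside the sentence" % verb_index)
    # Same length as the tokens: 1 marks the verb, 0 everything else.
    return [1 if i == verb_index else 0 for i in range(len(tokens))]


tokens = ["Dogs", "bark", "."]
verb_label = one_hot_verb_label(tokens, verb_index=1)
print(verb_label)  # prints: [0, 1, 0]
```

A sentence with several verbs would get one such vector, and therefore one call to text_to_instance, per verb.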