allennlp.data.dataset_readers.conll2003

class allennlp.data.dataset_readers.conll2003.Conll2003DatasetReader(token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, tag_label: str = 'ner', feature_labels: typing.Sequence[str] = ()) → None

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads instances from a pre-tokenized file where each line is in the following format:

WORD POS-TAG CHUNK-TAG NER-TAG

with a blank line indicating the end of each sentence and a "-DOCSTART- -X- -X- O" line marking the boundary between articles, and converts it into a Dataset suitable for sequence tagging.
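
For illustration, a well-formed fragment of such a file might look like this (the sentence is the standard example from the CoNLL-2003 shared task):

    -DOCSTART- -X- -X- O

    U.N. NNP I-NP I-ORG
    official NN I-NP O
    Ekeus NNP I-NP I-PER
    heads VBZ I-VP O
    for IN I-PP O
    Baghdad NNP I-NP I-LOC
    . . O O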

Each Instance contains the words in the "tokens" TextField. The values corresponding to tag_label will be loaded into the "tags" SequenceLabelField. If you also specify any feature_labels (you probably shouldn't), the corresponding values will be loaded into their own SequenceLabelFields.

This dataset reader ignores the "article" divisions and simply treats each sentence as an independent Instance. (Technically, the reader splits sentences on any combination of blank lines and "DOCSTART" lines; in particular, it does the right thing on well-formed inputs.)

Parameters:

token_indexers : ``Dict[str, TokenIndexer]``, optional (default=``{"tokens": SingleIdTokenIndexer()}``)

We use this to define the input representation for the text. See TokenIndexer.

tag_label : ``str``, optional (default=``ner``)

Specify ``ner``, ``pos``, or ``chunk`` to have the corresponding tags loaded into the "tags" field of each instance.

feature_labels : ``Sequence[str]``, optional (default=``()``)

These labels will be loaded as features into the corresponding instance fields: ``pos`` -> ``pos_tags``, ``chunk`` -> ``chunk_tags``, and ``ner`` -> ``ner_tags``. Each field gets its own namespace: ``pos_labels``, ``chunk_labels``, or ``ner_labels``. If you want to use one of these tag sets as a feature in your model, it should be specified here.
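
For example, the following sketch constructs a reader that loads NER tags as the target labels and POS tags as an extra feature (only the constructor arguments are taken from this page; everything else is illustrative):

    from allennlp.data.dataset_readers import Conll2003DatasetReader

    # "tags" will hold the NER labels; "pos_tags" will hold the POS labels
    # as a separate SequenceLabelField in the "pos_labels" namespace.
    reader = Conll2003DatasetReader(tag_label="ner", feature_labels=["pos"])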

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.dataset_readers.conll2003.Conll2003DatasetReader
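
A minimal sketch of the equivalent configuration-driven construction (the parameter values are illustrative):

    from allennlp.common.params import Params
    from allennlp.data.dataset_readers import Conll2003DatasetReader

    params = Params({"tag_label": "ner", "feature_labels": ["pos"]})
    reader = Conll2003DatasetReader.from_params(params)
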
read(file_path)

Reads instances from the file at ``file_path`` and returns them as a Dataset.
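
A short sketch of reading a file, assuming the returned Dataset exposes its instances as a list (the path is a placeholder):

    from allennlp.data.dataset_readers import Conll2003DatasetReader

    reader = Conll2003DatasetReader()
    dataset = reader.read("/path/to/eng.train")  # placeholder path

    # Each sentence in the file becomes one Instance with a "tokens" field
    # and (by default) a "tags" field holding the NER labels.
    for instance in dataset.instances:
        print(instance.fields["tokens"], instance.fields["tags"])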

text_to_instance(tokens: typing.List[allennlp.data.tokenizers.token.Token]) → allennlp.data.instance.Instance

We take pre-tokenized input here, because we don’t have a tokenizer in this class.
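
A minimal sketch of calling it directly (the sentence and its tokenization are arbitrary):

    from allennlp.data.dataset_readers import Conll2003DatasetReader
    from allennlp.data.tokenizers.token import Token

    reader = Conll2003DatasetReader()
    # Wrap already-tokenized words in Token objects. No tags are attached,
    # so the resulting Instance contains only the "tokens" TextField.
    words = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
    instance = reader.text_to_instance([Token(w) for w in words])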