allennlp.data.fields

A Field is some piece of a data instance that ends up as an array in a model.

class allennlp.data.fields.field.Field[source]

Bases: typing.Generic

A Field is some piece of a data instance that ends up as an array in a model (either as an input or an output). Data instances are just collections of fields.

Fields go through up to two steps of processing: (1) tokenized fields are converted into token ids, and (2) fields containing token ids (or any other numeric data) are padded (if necessary) and converted into data arrays. The Field API has methods for both of these steps, though some concrete Field classes may not need them: if your field doesn’t have any strings that need indexing, you don’t need to implement count_vocab_items or index. By default, these methods do nothing.

Once a vocabulary is computed and all fields are indexed, we will determine padding lengths, then intelligently batch together instances and pad them into actual arrays.
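The two processing steps can be illustrated with a minimal sketch. It does not import AllenNLP: ToyTagField, the plain-dict vocabulary, and the use of Python lists in place of numpy arrays are all assumptions made for illustration, and only the method names mirror the Field API described here.

```python
from collections import defaultdict
from typing import Dict, List


class ToyTagField:
    """Hypothetical stand-in for a Field holding string tags (not an AllenNLP class)."""

    def __init__(self, tags: List[str], namespace: str = "tags") -> None:
        self.tags = tags
        self.namespace = namespace
        self.indexed: List[int] = []

    def count_vocab_items(self, counter: Dict[str, Dict[str, int]]) -> None:
        # Count every string under this field's namespace.
        for tag in self.tags:
            counter[self.namespace][tag] += 1

    def index(self, vocab: Dict[str, Dict[str, int]]) -> None:
        # Modifies the field in place; returns nothing.
        self.indexed = [vocab[self.namespace][tag] for tag in self.tags]

    def get_padding_lengths(self) -> Dict[str, int]:
        return {"num_tokens": len(self.indexed)}

    def as_array(self, padding_lengths: Dict[str, int]) -> List[int]:
        # Pad with 0 (or truncate) to the requested length.
        target = padding_lengths["num_tokens"]
        return (self.indexed + [0] * target)[:target]


fields = [ToyTagField(["B", "I", "O"]), ToyTagField(["O"])]

# Step 1: count strings, build a toy vocabulary (id 0 is reserved for padding), index.
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
for field in fields:
    field.count_vocab_items(counter)
vocab = {
    namespace: {item: i + 1 for i, item in enumerate(sorted(items))}
    for namespace, items in counter.items()
}
for field in fields:
    field.index(vocab)

# Step 2: find the longest field in the batch and pad everything to that length.
max_len = max(field.get_padding_lengths()["num_tokens"] for field in fields)
batch = [field.as_array({"num_tokens": max_len}) for field in fields]
```

Here batch holds one padded row per field, all of the same length.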

as_array(padding_lengths: typing.Dict[str, int]) → DataArray[source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

classmethod batch_arrays(array_list: typing.List[DataArray]) → DataArray[source]

Takes the output of Field.as_array() from a list of Instances and merges it into one batched array for this Field. The default implementation here in the base class handles cases where as_array returns a single numpy array per instance, or a dictionary of single arrays. If your subclass returns something other than this, you need to override this method.
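The two cases the default implementation is described as handling can be sketched roughly as follows. This is an illustration only: nested Python lists stand in for numpy arrays, and the free function batch_arrays below is a hypothetical stand-in, not the library's code.

```python
from typing import Any, Dict, List, Union

# Stand-in for the per-instance output of as_array(): one "array" or a dict of them.
ToyDataArray = Union[List[Any], Dict[str, List[Any]]]


def batch_arrays(array_list: List[ToyDataArray]) -> ToyDataArray:
    """Merge per-instance outputs into one batched structure."""
    first = array_list[0]
    if isinstance(first, dict):
        # Dictionary case: batch each key separately.
        return {key: [arrays[key] for arrays in array_list] for key in first}
    # Single-array case: stack the instances along a new leading dimension.
    return list(array_list)


single = batch_arrays([[1, 2], [3, 0]])
keyed = batch_arrays([
    {"tokens": [1, 2], "token_characters": [[5], [6]]},
    {"tokens": [3, 0], "token_characters": [[7], [0]]},
])
```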

count_vocab_items(counter: typing.Dict[str, typing.Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.
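A small sketch of the counter's shape, assuming it is built as a nested defaultdict (the construction is an assumption; the text above only fixes the two-level key structure):

```python
from collections import defaultdict
from typing import Dict

# First key: namespace. Second key: the vocabulary item. Value: its count.
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

for word in ["a", "cat", "a"]:
    counter["tokens"][word] += 1
    for character in word:
        counter["token_characters"][character] += 1

counter["labels"]["entailment"] += 1

# "a" as a word and "a" as a character are counted in separate namespaces.
word_count = counter["tokens"]["a"]
character_count = counter["token_characters"]["a"]
```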

empty_field() → allennlp.data.fields.field.Field[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().
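Taking the per-key maximum over a batch might look like the sketch below. The helper name max_padding_lengths is hypothetical, and the key 'num_token_characters' is used only as an example of a second padding key.

```python
from collections import defaultdict
from typing import Dict, List


def max_padding_lengths(all_lengths: List[Dict[str, int]]) -> Dict[str, int]:
    """Combine the get_padding_lengths() output of several fields, keeping the max per key."""
    merged: Dict[str, int] = defaultdict(int)
    for lengths in all_lengths:
        for key, value in lengths.items():
            merged[key] = max(merged[key], value)
    return dict(merged)


batch_lengths = max_padding_lengths([
    {"num_tokens": 13},
    {"num_tokens": 7, "num_token_characters": 9},
])
```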

index(vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object in place; it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.index_field.IndexField(index: int, sequence_field: allennlp.data.fields.sequence_field.SequenceField) → None[source]

Bases: allennlp.data.fields.field.Field

An IndexField is an index into a SequenceField, as might be used for representing a correct answer option in a list, or a span begin and span end position in a passage, for example. Because it’s an index into a SequenceField, we take one of those as input and use it to compute padding lengths.

Parameters:

index : int

The index of the answer in the SequenceField. This is typically the “correct answer” in some classification decision over the sequence, like where an answer span starts in SQuAD, or which answer option is correct in a multiple choice question. A value of -1 means there is no label, which can be used for padding or other purposes.

sequence_field : SequenceField

A field containing the sequence that this IndexField is a pointer into.
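As a rough illustration of the idea, the sketch below uses a hypothetical ToyIndexField over a plain list of tokens. Returning an index of -1 from empty_field is an assumption based on the "no label" convention described above, not a statement about the library's implementation.

```python
from typing import List


class ToyIndexField:
    """Hypothetical stand-in: an integer pointer into some sequence."""

    def __init__(self, index: int, sequence: List[str]) -> None:
        self.index = index
        self.sequence = sequence

    def as_array(self) -> List[int]:
        # A single integer needs no padding of its own.
        return [self.index]

    def empty_field(self) -> "ToyIndexField":
        # Assumption: -1 marks "no label", so it serves as the padding value.
        return ToyIndexField(-1, self.sequence)


passage = ["The", "cat", "sat", "down"]
span_start = ToyIndexField(1, passage)
span_end = ToyIndexField(2, passage)

# The two indices together select an answer span (end inclusive).
answer = passage[span_start.index : span_end.index + 1]
```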

as_array(padding_lengths: typing.Dict[str, int]) → numpy.ndarray[source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

empty_field()[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

class allennlp.data.fields.label_field.LabelField(label: typing.Union[str, int], label_namespace: str = 'labels', skip_indexing: bool = False) → None[source]

Bases: allennlp.data.fields.field.Field

A LabelField is a categorical label of some kind, where the labels are either strings of text or 0-indexed integers (if you wish to skip indexing by passing skip_indexing=True). If the labels need indexing, we will use a Vocabulary to convert the string labels into integers.

This field will get converted into an integer index representing the class label.

Parameters:

label : Union[str, int]

label_namespace : str, optional (default="labels")

The namespace to use for converting label strings into integers. We map label strings to integers for you (e.g., “entailment” and “contradiction” get converted to 0, 1, ...), and this namespace tells the Vocabulary object which mapping from strings to integers to use (so “entailment” as a label doesn’t get the same integer id as “entailment” as a word). If you have multiple different label fields in your data, you should make sure you use different namespaces for each one, always using the suffix “labels” (e.g., “passage_labels” and “question_labels”).

skip_indexing : bool, optional (default=False)

If your labels are 0-indexed integers, you can pass in this flag, and we’ll skip the indexing step. If this is False and your labels are not strings, this throws a ConfigurationError.
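The indexing rule can be sketched as a small function. Everything here is illustrative: index_label and the label_to_id dictionary are hypothetical, and ValueError stands in for the ConfigurationError mentioned above.

```python
from typing import Dict, Union


def index_label(
    label: Union[str, int],
    label_to_id: Dict[str, int],
    skip_indexing: bool = False,
) -> int:
    """Return the integer class id for a label."""
    if skip_indexing:
        # The caller promises the label is already a 0-indexed integer.
        return int(label)
    if not isinstance(label, str):
        # Stand-in for the ConfigurationError described in the documentation.
        raise ValueError("Non-string labels require skip_indexing=True")
    return label_to_id[label]


label_to_id = {"entailment": 0, "contradiction": 1}
string_id = index_label("contradiction", label_to_id)
integer_id = index_label(2, label_to_id, skip_indexing=True)
```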

as_array(padding_lengths: typing.Dict[str, int]) → numpy.ndarray[source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

count_vocab_items(counter: typing.Dict[str, typing.Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field()[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object in place; it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

class allennlp.data.fields.list_field.ListField(field_list: typing.List[allennlp.data.fields.field.Field]) → None[source]

Bases: allennlp.data.fields.sequence_field.SequenceField

A ListField is a list of other fields. You would use this to represent, e.g., a list of answer options that are themselves TextFields.

This field will get converted into a tensor that has one more mode than the items in the list. If this is a list of TextFields that have shape (num_words, num_characters), this ListField will output a tensor of shape (num_sentences, num_words, num_characters).

Parameters:

field_list : List[Field]

A list of Field objects to be concatenated into a single input tensor. All of the contained Field objects must be of the same type.
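The extra dimension can be seen in a minimal sketch. To keep it short, each inner item is a 1-D list of word ids with shape (num_words,), so the result has shape (num_sentences, num_words). The function pad_list_field is hypothetical, and nested lists stand in for numpy arrays.

```python
from typing import List


def pad_list_field(
    sentences: List[List[int]], num_sentences: int, num_words: int
) -> List[List[int]]:
    """Pad each inner item to num_words, then pad the list itself to num_sentences."""
    padded = [(sentence + [0] * num_words)[:num_words] for sentence in sentences]
    # Missing list entries are filled with all-padding rows,
    # playing the role that empty_field() plays for real fields.
    padded += [[0] * num_words for _ in range(num_sentences - len(padded))]
    return padded


answer_options = [[4, 5, 6], [7]]
array = pad_list_field(answer_options, num_sentences=3, num_words=3)
shape = (len(array), len(array[0]))
```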

as_array(padding_lengths: typing.Dict[str, int]) → DataArray[source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

count_vocab_items(counter: typing.Dict[str, typing.Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field()[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object in place; it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

sequence_length() → int[source]

How many elements are there in this sequence?

class allennlp.data.fields.metadata_field.MetadataField(metadata: typing.Any) → None[source]

Bases: allennlp.data.fields.field.Field

A MetadataField is a Field that does not get converted into arrays. It just carries side information that might be needed later on, for computing some third-party metric, or outputting debugging information, or whatever else you need. We use this in the BiDAF model, for instance, to keep track of question IDs and passage token offsets, so we can more easily use the official evaluation script to compute metrics.

We don’t try to do any kind of smart combination of this field for batched input - when you use this Field in a model, you’ll get a list of metadata objects, one for each instance in the batch.

Note that if you use this field, you are required to include metadata in the field name used as a key in Instance. Otherwise we won’t know to treat the output of this field specially in arrays_to_variables().

Parameters:

metadata : Any

Some object containing the metadata that you want to store. It’s likely that you’ll want this to be a dictionary, but it could be anything you want.
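How metadata might travel through batching is sketched below. The helper split_batch and the dictionary-based instances are hypothetical; the only behaviour taken from the description above is that fields whose key contains "metadata" come out as a plain list with one entry per instance.

```python
from typing import Any, Dict, List, Tuple


def split_batch(
    instances: List[Dict[str, Any]]
) -> Tuple[Dict[str, List[Any]], Dict[str, List[Any]]]:
    """Separate metadata columns from columns that would become arrays."""
    metadata: Dict[str, List[Any]] = {}
    arrays: Dict[str, List[Any]] = {}
    for name in instances[0]:
        column = [instance[name] for instance in instances]
        if "metadata" in name:
            # No smart combination: keep one metadata object per instance.
            metadata[name] = column
        else:
            arrays[name] = column
    return metadata, arrays


instances = [
    {"question_metadata": {"id": "q1", "offsets": [(0, 3)]}, "label": 0},
    {"question_metadata": {"id": "q2", "offsets": [(0, 5)]}, "label": 1},
]
metadata, arrays = split_batch(instances)
question_ids = [item["id"] for item in metadata["question_metadata"]]
```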

as_array(padding_lengths: typing.Dict[str, int]) → DataArray[source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

classmethod batch_arrays(array_list: typing.List[DataArray]) → DataArray[source]

Takes the output of Field.as_array() from a list of Instances and merges it into one batched array for this Field. The default implementation here in the base class handles cases where as_array returns a single numpy array per instance, or a dictionary of single arrays. If your subclass returns something other than this, you need to override this method.

empty_field() → allennlp.data.fields.metadata_field.MetadataField[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

class allennlp.data.fields.sequence_field.SequenceField[source]

Bases: allennlp.data.fields.field.Field

A SequenceField represents a sequence of things. This class just adds a method onto Field: sequence_length(). It exists so that SequenceLabelField, IndexField and other similar Fields can have a single type to require, with a consistent API, whether they are pointing to words in a TextField, items in a ListField, or something else.
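The value of a shared type can be shown with a small sketch. All class names below are hypothetical stand-ins that only mirror the sequence_length() contract; the bounds check in points_inside is an illustration of what a dependent field could do, not something the library is claimed to do.

```python
from abc import ABC, abstractmethod
from typing import List


class ToySequenceField(ABC):
    """Hypothetical base type: anything with a length that can be pointed into."""

    @abstractmethod
    def sequence_length(self) -> int:
        """How many elements are there in this sequence?"""


class ToyTextField(ToySequenceField):
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens

    def sequence_length(self) -> int:
        return len(self.tokens)


class ToyListField(ToySequenceField):
    def __init__(self, fields: List[ToySequenceField]) -> None:
        self.fields = fields

    def sequence_length(self) -> int:
        return len(self.fields)


def points_inside(index: int, sequence: ToySequenceField) -> bool:
    # Works the same way for words in a text and for items in a list.
    return 0 <= index < sequence.sequence_length()


options = ToyListField([ToyTextField(["yes"]), ToyTextField(["no", "way"])])
```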

sequence_length() → int[source]

How many elements are there in this sequence?

class allennlp.data.fields.sequence_label_field.SequenceLabelField(labels: typing.Union[typing.List[str], typing.List[int]], sequence_field: allennlp.data.fields.sequence_field.SequenceField, label_namespace: str = 'labels') → None[source]

Bases: allennlp.data.fields.field.Field

A SequenceLabelField assigns a categorical label to each element in a SequenceField. Because it’s a labeling of some other field, we take that field as input here, and we use it to determine our padding and other things.

This field will get converted into a list of integer class ids, representing the correct class for each element in the sequence.

Parameters:

labels : Union[List[str], List[int]]

A sequence of categorical labels, encoded as strings or integers. These could be POS tags like [NN, JJ, ...], BIO tags like [B-PERS, I-PERS, O, O, ...], or any other categorical tag sequence. If the labels are encoded as integers, they will not be indexed using a vocab.

sequence_field : SequenceField

A field containing the sequence that this SequenceLabelField is labeling. Most often, this is a TextField, for tagging individual tokens in a sentence.

label_namespace : str, optional (default="labels")

The namespace to use for converting tag strings into integers. We convert tag strings to integers for you, and this parameter tells the Vocabulary object which mapping from strings to integers to use (so that “O” as a tag doesn’t get the same id as “O” as a word).
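A sketch of how a tag sequence could be indexed and padded to the length of the sequence it labels. The function index_and_pad, the tag_to_id mapping, and the padding id of 0 are assumptions for illustration; integer labels are passed through unindexed, as described above.

```python
from typing import Dict, List, Union


def index_and_pad(
    labels: List[Union[str, int]],
    tag_to_id: Dict[str, int],
    num_tokens: int,
    padding_id: int = 0,
) -> List[int]:
    """Convert string tags to ids, leave integer labels as they are, then pad."""
    ids = [tag_to_id[label] if isinstance(label, str) else label for label in labels]
    return (ids + [padding_id] * num_tokens)[:num_tokens]


tag_to_id = {"O": 1, "B-PERS": 2, "I-PERS": 3}
from_strings = index_and_pad(["B-PERS", "I-PERS", "O"], tag_to_id, num_tokens=5)
from_integers = index_and_pad([2, 3, 1], tag_to_id, num_tokens=5)
```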

as_array(padding_lengths: typing.Dict[str, int]) → numpy.ndarray[source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

count_vocab_items(counter: typing.Dict[str, typing.Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field()[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object in place; it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A TextField represents a string of text, the kind that you might want to represent with standard word vectors, or pass through an LSTM.

class allennlp.data.fields.text_field.TextField(tokens: typing.List[allennlp.data.tokenizers.token.Token], token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer]) → None[source]

Bases: allennlp.data.fields.sequence_field.SequenceField

This Field represents a list of string tokens. Before constructing this object, you need to tokenize raw strings using a Tokenizer.

Because string tokens can be represented as indexed arrays in a number of ways, we also take a dictionary of TokenIndexer objects that will be used to convert the tokens into indices. Each TokenIndexer could represent each token as a single ID, or a list of character IDs, or something else.

This field will get converted into a dictionary of arrays, one for each TokenIndexer. A SingleIdTokenIndexer produces an array of shape (num_tokens,), while a TokenCharactersIndexer produces an array of shape (num_tokens, num_characters).
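The two array shapes can be sketched with toy indexers. The functions single_id and token_characters are hypothetical stand-ins for the indexers named above, the vocabularies are built on the spot, and nested lists stand in for numpy arrays.

```python
from typing import Dict, List

tokens = ["the", "cats"]

# Toy vocabularies; id 0 is reserved for padding.
word_vocab: Dict[str, int] = {word: i + 1 for i, word in enumerate(sorted(set(tokens)))}
char_vocab: Dict[str, int] = {
    char: i + 1 for i, char in enumerate(sorted(set("".join(tokens))))
}


def single_id(words: List[str]) -> List[int]:
    """One id per token: shape (num_tokens,)."""
    return [word_vocab[word] for word in words]


def token_characters(words: List[str], num_characters: int) -> List[List[int]]:
    """One padded row of character ids per token: shape (num_tokens, num_characters)."""
    return [
        ([char_vocab[char] for char in word] + [0] * num_characters)[:num_characters]
        for word in words
    ]


# One entry per indexer, keyed by the name the indexer was registered under.
arrays = {
    "tokens": single_id(tokens),
    "token_characters": token_characters(tokens, num_characters=4),
}
```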

as_array(padding_lengths: typing.Dict[str, int]) → typing.Dict[str, numpy.ndarray][source]

Given a set of specified padding lengths, pad the data in this field and return it as a numpy array of the correct shape, or as a dictionary of such arrays when there are several related arrays for this field (e.g., a TextField might have a word array and a characters-per-word array).

count_vocab_items(counter: typing.Dict[str, typing.Dict[str, int]])[source]

If there are strings in this field that need to be converted into integers through a Vocabulary, here is where we count them, to determine which tokens are in or out of the vocabulary.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

A note on this counter: because Fields can represent conceptually different things, we separate the vocabulary items by namespaces. This way, we can use a single shared mechanism to handle all mappings from strings to integers in all fields, while keeping words in a TextField from sharing the same ids with labels in a LabelField (e.g., “entailment” or “contradiction” are labels in an entailment task).

Additionally, a single Field might want to use multiple namespaces - TextFields can be represented as a combination of word ids and character ids, and you don’t want words and characters to share the same vocabulary - “a” as a word should get a different id from “a” as a character, and the vocabulary sizes of words and characters are very different.

Because of this, the first key in the counter object is a namespace, like “tokens”, “token_characters”, “tags”, or “labels”, and the second key is the actual vocabulary item.

empty_field()[source]

So that ListField can pad the number of fields in a list (e.g., the number of answer option TextFields), we need a representation of an empty field of each type. This returns that. This will only ever be called when we’re to the point of calling as_array(), so you don’t need to worry about get_padding_lengths, count_vocab_items, etc., being called on this empty field.

We make this an instance method instead of a static method so that if there is any state in the Field, we can copy it over (e.g., the token indexers in TextField).

get_padding_lengths() → typing.Dict[str, int][source]

If there are things in this field that need padding, note them here. In order to pad a batch of instances, we get all of the lengths from the batch, take the max, and pad everything to that length (or use a pre-specified maximum length). The return value is a dictionary mapping keys to lengths, like {'num_tokens': 13}.

This is always called after index().

index(vocab: allennlp.data.vocabulary.Vocabulary)[source]

Given a Vocabulary, converts all strings in this field into (typically) integers. This modifies the Field object in place; it does not return anything.

If your Field does not have any strings that need to be converted into indices, you do not need to implement this method.

sequence_length() → int[source]

How many elements are there in this sequence?