allennlp.data.dataset

A Dataset represents a collection of data suitable for feeding into a model. For example, when you train a model, you will likely have a training dataset and a validation dataset.

class allennlp.data.dataset.Dataset(instances: typing.List[allennlp.data.instance.Instance]) → None[source]

Bases: object

A collection of Instance objects. The Instances have Fields, and the fields could be in an indexed or unindexed state; the Dataset has methods for indexing the data and for converting the data into arrays.
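For orientation, here is a minimal sketch of constructing a Dataset from a handful of Instances. The import paths and the TextField, LabelField, Token, and SingleIdTokenIndexer constructors shown are assumptions about the surrounding library and may differ across versions:

    # Sketch only: import paths and constructor signatures below are
    # assumptions and may not match every version of the library.
    from allennlp.data.dataset import Dataset
    from allennlp.data.instance import Instance
    from allennlp.data.fields import TextField, LabelField
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token

    token_indexers = {"tokens": SingleIdTokenIndexer()}

    def make_instance(question: str, answer: str) -> Instance:
        # Each Instance maps field names to Field objects.
        question_field = TextField([Token(t) for t in question.split()], token_indexers)
        return Instance({"question": question_field, "answer": LabelField(answer)})

    dataset = Dataset([
        make_instance("what is padding", "definition"),
        make_instance("why pad at all", "motivation"),
    ])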

as_array_dict(padding_lengths: typing.Dict[str, typing.Dict[str, int]] = None, verbose: bool = True) → typing.Dict[str, typing.Union[numpy.ndarray, typing.Dict[str, numpy.ndarray]]][source]

This method converts this Dataset into a set of numpy arrays that can be passed through a model. For the numpy arrays to be valid, all Instances in this dataset need to be padded to the same lengths wherever padding is necessary, so we do that first; we then combine the arrays for each field of each instance into a set of batched arrays, one per field.

Parameters:

padding_lengths : Dict[str, Dict[str, int]]

If a key is present in this dictionary with a non-None value, we will pad to that length instead of the length calculated from the data. This lets you, e.g., set a maximum value for sentence length if you want to throw out long sequences.

Entries in this dictionary are keyed first by field name (e.g., “question”), then by padding key (e.g., “num_tokens”).

verbose : bool, optional (default=True)

Should we output logging information when we’re doing this padding? If the dataset is large, this is nice to have, because padding a large dataset could take a long time. But if you’re doing this inside of a data generator, having all of this output per batch is a bit obnoxious.

Returns:

data_arrays : Dict[str, DataArray]

A dictionary of data arrays, keyed by field name, suitable for passing as input to a model. This is a batch of instances, so, e.g., if the instances have a “question” field and an “answer” field, the “question” fields for all of the instances will be grouped together into a single array, and the “answer” fields for all instances will be similarly grouped in a parallel set of arrays, for batched computation. Additionally, for TextFields, the value of the dictionary key is no longer a single array, but another dictionary mapping TokenIndexer keys to arrays. The number of elements in this sub-dictionary therefore corresponds to the number of TokenIndexers used to index the Field.
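A hedged usage sketch, continuing from the objects in the sketch above; the "question" field, the "num_tokens" padding key, and the "tokens" indexer name are illustrative rather than required:

    # Assumes `dataset` has already been indexed with a Vocabulary
    # (see index_instances below).
    padding_lengths = dataset.get_padding_lengths()
    # Optionally cap a length instead of padding to the data's maximum.
    padding_lengths["question"]["num_tokens"] = min(
        padding_lengths["question"]["num_tokens"], 100)

    arrays = dataset.as_array_dict(padding_lengths, verbose=False)
    # arrays["question"] -> {"tokens": array of shape (num_instances, num_tokens)},
    #                       one sub-entry per TokenIndexer on the TextField.
    # arrays["answer"]   -> a single batched array for the label field.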

get_padding_lengths() → typing.Dict[str, typing.Dict[str, int]][source]

Gets the maximum padding lengths from all Instances in this dataset. Each Instance has multiple Fields, and each Field could have multiple things that need padding. We look at all fields in all instances, and find the max values for each (field_name, padding_key) pair, returning them in a dictionary.

This can then be used to convert this dataset into arrays of consistent length, or to set model parameters, etc.
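For example, on a dataset whose instances have already been indexed, the returned dictionary might look like the following (the field names and the "num_tokens" padding key are illustrative; the exact keys depend on the fields and TokenIndexers used):

    lengths = dataset.get_padding_lengths()
    # e.g. {"question": {"num_tokens": 7}, "answer": {}}
    # i.e. keyed first by field name, then by that field's padding keys.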

index_instances(vocab: allennlp.data.vocabulary.Vocabulary)[source]

Converts all UnindexedFields in all Instances in this Dataset into IndexedFields. This modifies the current object; it does not return a new one.
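A hedged sketch of the indexing step; the Vocabulary.from_dataset constructor used here is an assumption, and your version may expose a different way of building a vocabulary:

    # Assumes `dataset` was built as in the earlier sketch.
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary.from_dataset(dataset)   # assumed constructor
    dataset.index_instances(vocab)             # in place: fields become indexed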

truncate(max_instances: int)[source]

If there are more instances than max_instances in this dataset, we truncate the instances to the first max_instances. This modifies the current object and returns nothing.
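For instance (reading the instances attribute back is an assumption about how the constructor stores its input):

    dataset.truncate(1000)
    # The dataset now holds at most the first 1000 instances.
    assert len(dataset.instances) <= 1000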