# allennlp.data.iterators

The various DataIterator subclasses can be used to iterate over datasets with different batching and padding schemes.

class allennlp.data.iterators.data_iterator.DataIterator[source]

An abstract DataIterator class. DataIterators must implement __call__, which yields batched examples.

default_implementation = 'bucket'
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.iterators.data_iterator.DataIterator[source]
get_num_batches(instances: typing.Iterable[allennlp.data.instance.Instance]) → int[source]

Returns the number of batches that dataset will be split into. This can be useful if you want to track progress through the batches using the generator produced by __call__.

index_with(vocab: allennlp.data.vocabulary.Vocabulary)[source]
vocab = None
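For a fixed batch size, the batch count computed by get_num_batches is just a ceiling division. A minimal sketch of that arithmetic (illustrative only, not AllenNLP's actual implementation):

```python
import math

def get_num_batches(num_instances: int, batch_size: int) -> int:
    # Number of batches a dataset of num_instances will be split into,
    # when every batch (except possibly the last) holds batch_size instances.
    return math.ceil(num_instances / batch_size)

get_num_batches(100, 32)  # -> 4 (three full batches of 32, plus one of 4)
```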
class allennlp.data.iterators.adaptive_iterator.AdaptiveIterator(adaptive_memory_usage_constant: float, padding_memory_scaling: typing.Callable[typing.Dict[str, typing.Dict[str, int]], float], maximum_batch_size: int = 10000, biggest_batch_first: bool = False, batch_size: int = None, sorting_keys: typing.List[typing.Tuple[str, str]] = None, padding_noise: float = 0.2, instances_per_epoch: int = None, max_instances_in_memory: int = None) → None[source]

An AdaptiveIterator is a DataIterator that varies the batch size to try to optimize GPU memory usage. Because padding is done dynamically, we can use larger batches when padding lengths are smaller, maximizing our usage of the GPU. This is intended only for use with very large models that only barely fit on the GPU - if your model is small enough that you can easily fit a reasonable batch size on the GPU for your biggest instances, you probably should just use a BucketIterator. This is also still largely experimental, because it interacts with the learning rate in odd ways, and we haven’t yet implemented good algorithms to modify the learning rate based on batch size, etc.

In order for this to work correctly, you need to do two things:

1. Provide the padding_memory_scaling function, which gives a big-O bound on memory usage given padding lengths. For instance, if you have two TextFields whose sentence lengths require padding, this might simply be |sentence1| * |sentence2|.
2. Tune the adaptive_memory_usage_constant parameter for your particular model and GPU. While tuning this, set biggest_batch_first to True, which will bypass the adaptive grouping step and use the batching of a BucketIterator, returning the biggest batch first. You want to find the largest batch size for which this largest batch actually fits on the GPU without running out of memory. TODO(mattg): make this happen automatically somehow.
Parameters:

- adaptive_memory_usage_constant : float, required.
  Only relevant when adaptive grouping is in use (i.e., biggest_batch_first is False). This is a manually-tuned parameter, specific to a particular model architecture and amount of GPU memory (e.g., if you change the number of hidden layers in your model, this number will need to change). The recommended way to tune it is to (1) use a fixed batch size, with biggest_batch_first set to True, and find the maximum batch size you can handle on your biggest instances without running out of memory; then (2) turn adaptive grouping back on and set this parameter so that you get that batch size for your biggest instances. If you set the log level to DEBUG in scripts/run_model.py, you can see the batch sizes that are computed.
- padding_memory_scaling : Callable[[Dict[str, Dict[str, int]]], float], required.
  This function is used for computing the adaptive batch sizes. We assume that memory usage is a function that looks like $M = b \cdot O(p) \cdot c$, where $M$ is the memory usage, $b$ is the batch size, $c$ is some constant that depends on how much GPU memory you have and various model hyperparameters, and $O(p)$ is a function outlining how memory usage asymptotically varies with the padding lengths. Our approach is to let the user effectively set $\frac{M}{c}$ using the adaptive_memory_usage_constant above. This function specifies $O(p)$, so we can solve for the batch size $b$. The more specific you get in specifying $O(p)$ in this function, the better a job we can do in optimizing memory usage.
- maximum_batch_size : int, optional (default = 10000)
  An upper bound on the adaptive batch size, so that you never create batches larger than this even if you have enough memory to handle them on your GPU. You might choose to do this to keep smaller batches because you like the noisier gradient estimates that come from smaller batches, for instance.
- biggest_batch_first : bool, optional (default = False)
  See BucketIterator. If this is True, we bypass the adaptive grouping step, so you can tune the adaptive_memory_usage_constant.
- batch_size : int, optional (default = None)
  Only used when biggest_batch_first is True, for tuning adaptive_memory_usage_constant.
- sorting_keys : List[Tuple[str, str]]
  See BucketIterator.
- padding_noise : float
  See BucketIterator.
- instances_per_epoch : int, optional (default = None)
  See BasicIterator.
- max_instances_in_memory : int, optional (default = None)
  See BasicIterator.
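Putting the two pieces together: with memory modeled as $M = b \cdot O(p) \cdot c$, the user supplies $\frac{M}{c}$ (the adaptive_memory_usage_constant) and padding_memory_scaling supplies $O(p)$, so the batch size falls out as $b = \frac{M/c}{O(p)}$. A minimal sketch of that arithmetic — the function name and the clamping are illustrative, not AllenNLP's actual code:

```python
def adaptive_batch_size(adaptive_memory_usage_constant: float,
                        padding_memory_scaling,
                        padding_lengths: dict,
                        maximum_batch_size: int = 10000) -> int:
    # Memory is modeled as M = b * O(p) * c; the user supplies M / c,
    # padding_memory_scaling supplies O(p), so we solve for batch size b.
    big_o = padding_memory_scaling(padding_lengths)
    batch_size = int(adaptive_memory_usage_constant / big_o)
    # Clamp to [1, maximum_batch_size].
    return max(1, min(batch_size, maximum_batch_size))

# Suppose memory scales as |sentence1| * |sentence2|:
scaling = lambda p: p["sentence1"]["num_tokens"] * p["sentence2"]["num_tokens"]
lengths = {"sentence1": {"num_tokens": 20}, "sentence2": {"num_tokens": 30}}
adaptive_batch_size(120000.0, scaling, lengths)  # -> 200 (120000 / 600)
```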
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.iterators.adaptive_iterator.AdaptiveIterator[source]
get_num_batches(instances: typing.Iterable[allennlp.data.instance.Instance]) → int[source]

This is a non-trivial operation with an AdaptiveIterator, and it’s only approximate, because the actual number of batches constructed depends on the padding noise. Call this sparingly.

class allennlp.data.iterators.basic_iterator.BasicIterator(batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None) → None[source]

A very basic iterator, which takes a dataset, creates fixed sized batches, and pads all of the instances in a batch to the maximum lengths of the relevant fields within that batch.

Parameters:

- batch_size : int, optional (default = 32)
  The size of each batch of instances yielded when calling the iterator.
- instances_per_epoch : int, optional (default = None)
  If specified, each epoch will consist of precisely this many instances. If not specified, each epoch will consist of a single pass through the dataset.
- max_instances_in_memory : int, optional (default = None)
  If specified, the iterator will load this many instances at a time into an in-memory list and then produce batches from one such list at a time. This could be useful if your instances are read lazily from disk.
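The core of this fixed-size batching is easy to sketch in plain Python (illustrative only; the real iterator also handles indexing, padding, and tensorization). The last group may be smaller than batch_size, and because the input is consumed lazily, the same chunking idea supports max_instances_in_memory:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def lazy_groups_of(iterable: Iterable[T], group_size: int) -> Iterator[List[T]]:
    # Yield successive fixed-size lists from any (possibly lazy) iterable;
    # the final group may be smaller than group_size.
    group: List[T] = []
    for item in iterable:
        group.append(item)
        if len(group) == group_size:
            yield group
            group = []
    if group:
        yield group

list(lazy_groups_of(range(7), 3))  # -> [[0, 1, 2], [3, 4, 5], [6]]
```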
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.iterators.basic_iterator.BasicIterator[source]
get_num_batches(instances: typing.Iterable[allennlp.data.instance.Instance]) → int[source]

Returns the number of batches that dataset will be split into. This can be useful if you want to track progress through the batches using the generator produced by __call__.

class allennlp.data.iterators.bucket_iterator.BucketIterator(sorting_keys: typing.List[typing.Tuple[str, str]], padding_noise: float = 0.1, biggest_batch_first: bool = False, batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None) → None[source]

An iterator which, by default, pads batches with respect to the maximum input lengths per batch. Additionally, you can provide a list of field names and padding keys that the dataset will be sorted by before batching; inputs of similar length are then batched together, making computation more efficient because less time is wasted on padded elements of the batch.

Parameters:

- sorting_keys : List[Tuple[str, str]]
  To bucket inputs into batches, we want to group the instances by padding length, so that we minimize the amount of padding necessary per batch. In order to do this, we need to know which fields need what type of padding, and in what order. For example, [("sentence1", "num_tokens"), ("sentence2", "num_tokens"), ("sentence1", "num_token_characters")] would sort a dataset first by the “num_tokens” of the “sentence1” field, then by the “num_tokens” of the “sentence2” field, and finally by the “num_token_characters” of the “sentence1” field. TODO(mattg): we should have some documentation somewhere that gives the standard padding keys used by different fields.
- padding_noise : float, optional (default = 0.1)
  When sorting by padding length, we add a bit of noise to the lengths, so that the sorting isn’t deterministic. This parameter determines how much noise we add, as a percentage of the actual padding value for each instance.
- biggest_batch_first : bool, optional (default = False)
  This is largely for testing, to see how large of a batch you can safely use with your GPU. It lets you try out the largest batch that you have in the data first, so that if you’re going to run out of memory, you know it early, instead of waiting through the whole epoch to find out at the end that you’re going to crash. Note that if you specify max_instances_in_memory, the first batch will only be the biggest from among the first “max instances in memory” instances.
- batch_size : int, optional (default = 32)
  The size of each batch of instances yielded when calling the iterator.
- instances_per_epoch : int, optional (default = None)
  See BasicIterator.
- max_instances_in_memory : int, optional (default = None)
  See BasicIterator.
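The bucketing step amounts to sorting by noisy padding lengths. A minimal sketch of the idea, operating directly on per-instance padding-length dicts rather than real Instance objects (the function names are illustrative, not AllenNLP's exact API):

```python
import random
from typing import Dict, List, Tuple

def add_noise_to_value(value: float, noise_param: float) -> float:
    # Perturb a padding length by up to +/- (noise_param * value),
    # so the sort is not fully deterministic.
    noise = value * noise_param
    return value + random.uniform(-noise, noise)

def sort_by_padding(padding_lengths: List[Dict[str, Dict[str, int]]],
                    sorting_keys: List[Tuple[str, str]],
                    padding_noise: float = 0.1) -> List[Dict[str, Dict[str, int]]]:
    # Sort instances by their (noisy) padding lengths, compared in the
    # order given by sorting_keys, so similar sizes end up adjacent.
    def sort_key(lengths: Dict[str, Dict[str, int]]) -> List[float]:
        return [add_noise_to_value(lengths[field][key], padding_noise)
                for field, key in sorting_keys]
    return sorted(padding_lengths, key=sort_key)

lengths = [{"sentence1": {"num_tokens": n}} for n in (9, 2, 5)]
sort_by_padding(lengths, [("sentence1", "num_tokens")], padding_noise=0.0)
# -> instances ordered by num_tokens: 2, 5, 9
```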
classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.iterators.bucket_iterator.BucketIterator[source]
class allennlp.data.iterators.epoch_tracking_bucket_iterator.EpochTrackingBucketIterator(sorting_keys: typing.List[typing.Tuple[str, str]], padding_noise: float = 0.1, biggest_batch_first: bool = False, batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None) → None[source]

This is essentially an allennlp.data.iterators.BucketIterator with one difference: it keeps track of the epoch number, and adds that as an additional meta field to each instance, so that Model.forward has access to this information. We do this by keeping track of epochs globally, and incrementing them whenever the iterator is called. However, the iterator is called for both training and validation sets, so we keep a dict of epoch numbers, with one key per dataset.

Parameters: See BucketIterator.
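The per-dataset epoch bookkeeping described above can be sketched as follows. The class name, the use of id(dataset) as the dict key, and the yielded dict shape are all illustrative assumptions, not the real implementation:

```python
from collections import defaultdict
from typing import Any, Dict, Iterator, List

class EpochTracker:
    # Keep one epoch counter per dataset, so calling the iterator on the
    # validation set does not advance the training set's epoch number.
    def __init__(self) -> None:
        self._epochs: Dict[int, int] = defaultdict(int)

    def __call__(self, dataset: List[Any]) -> Iterator[Dict[str, Any]]:
        epoch = self._epochs[id(dataset)]
        for instance in dataset:
            # Attach the current epoch number as an extra "meta" field.
            yield {"instance": instance, "epoch_num": epoch}
        # One full pass completed: bump this dataset's counter.
        self._epochs[id(dataset)] += 1
```

Iterating twice over the same dataset yields epoch_num 0 and then 1, while a different dataset (e.g. validation) starts from 0 with its own counter.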