allennlp.data.iterators.bucket_iterator

BucketIterator

BucketIterator(self, sorting_keys: List[Tuple[str, str]] = None, padding_noise: float = 0.1, biggest_batch_first: bool = False, batch_size: int = 32, instances_per_epoch: int = None, max_instances_in_memory: int = None, cache_instances: bool = False, track_epoch: bool = False, maximum_samples_per_batch: Tuple[str, int] = None, skip_smaller_batches: bool = False) -> None

An iterator which, by default, pads batches with respect to the maximum input lengths per batch. Additionally, you can provide a list of field names and padding keys which the dataset will be sorted by before batching, so that inputs with similar lengths end up in the same batch, making computation more efficient (less time is wasted on padded elements of the batch).
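For example, here is a minimal, self-contained usage sketch; the toy sentences, the "tokens" field name, and the batch size are illustrative assumptions rather than part of this API reference:

```python
from allennlp.data import Instance, Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.iterators import BucketIterator
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

# Toy instances with a single "tokens" TextField of varying lengths.
indexers = {"tokens": SingleIdTokenIndexer()}
instances = [
    Instance({"tokens": TextField([Token(w) for w in sentence.split()], indexers)})
    for sentence in ["a short one", "a somewhat longer sentence here", "tiny"]
]
vocab = Vocabulary.from_instances(instances)

# Sort by the number of tokens in the "tokens" field before batching, so that
# instances of similar length land in the same batch and padding is minimized.
iterator = BucketIterator(sorting_keys=[("tokens", "num_tokens")], batch_size=2)
iterator.index_with(vocab)

for tensor_dict in iterator(instances, num_epochs=1):
    # Each batch is a dict of tensors padded to that batch's maximum lengths.
    print(tensor_dict["tokens"]["tokens"].shape)
```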

Parameters

  • sorting_keys : List[Tuple[str, str]], optional
    To bucket inputs into batches, we want to group the instances by padding length, so that we minimize the amount of padding necessary per batch. In order to do this, we need to know which fields need what type of padding, and in what order.

    Specifying the right keys for this is a bit cryptic, so if this is not given we try to auto-detect the right keys by iterating once through the data up front, reading all of the padding keys and seeing which one has the longest length. We use that one for padding. This should give reasonable results in most cases.

    When you need to specify this yourself, you can create an instance from your dataset and call Instance.get_padding_lengths() to see a list of all keys used in your data (see the sketch after this list). You should give one or more of those as the sorting keys here.

  • padding_noise : float, optional (default = 0.1)
    When sorting by padding length, we add a bit of noise to the lengths, so that the sorting isn't deterministic. This parameter determines how much noise we add, as a percentage of the actual padding value for each instance.

  • biggest_batch_first : bool, optional (default = False)
    This is largely for testing, to see how large a batch you can safely use with your GPU. It lets you try the largest batch in your data first, so that if you're going to run out of memory, you know it early, instead of waiting through the whole epoch to find out at the end that you're going to crash.

    Note that if you specify max_instances_in_memory, the first batch will only be the biggest from among the first "max instances in memory" instances.

  • batch_size : int, optional (default = 32)
    The size of each batch of instances yielded when calling the iterator.

  • instances_per_epoch : int, optional (default = None)
    See BasicIterator.

  • max_instances_in_memory : int, optional (default = None)
    See BasicIterator.

  • maximum_samples_per_batch : Tuple[str, int], optional (default = None)
    See BasicIterator.

  • skip_smaller_batches : bool, optional (default = False)
    When the number of data samples is not divisible by batch_size, some batches might be smaller than batch_size. If set to True, those smaller batches will be discarded.
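As referenced in the sorting_keys item above, here is a minimal sketch of how you might discover valid sorting keys, continuing with the toy instances and vocab from the earlier example (fields must be indexed before padding lengths are available):

```python
# Inspect the (field_name, padding_key) pairs exposed by your data, so you
# can choose valid sorting_keys for the BucketIterator.
instance = instances[0]
instance.index_fields(vocab)  # padding lengths are only defined after indexing
print(instance.get_padding_lengths())
# e.g. {'tokens': {'num_tokens': 3, ...}} -> ("tokens", "num_tokens") is a valid key
```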