MultiprocessDatasetReader(base_reader: allennlp.data.dataset_readers.dataset_reader.DatasetReader, num_workers: int, epochs_per_read: int = 1, output_queue_size: int = 1000)¶
Wraps another dataset reader and uses it to read from multiple input files using multiple processes. Note that in this case the file_path passed to read() should be a glob, and that the dataset reader will return instances from all files matching the glob.
The order the files are processed in is a function of Numpy's random state, up to non-determinism caused by using multiple worker processes. This can be avoided by setting Numpy's random seed before calling read().
- base_reader: ``DatasetReader``
  Each process will use this dataset reader to read zero or more files.
- num_workers: ``int``
  How many data-reading processes to run simultaneously.
- epochs_per_read: ``int``, (optional, default=1)
  Normally a call to DatasetReader.read() returns a single epoch worth of instances, and your DataIterator handles iteration over multiple epochs. However, in the multiple-process case, it's possible that you'd want finished workers to continue on to the next epoch even while others are still finishing the previous epoch. Passing in a value larger than 1 allows that to happen.
- output_queue_size: ``int``, (optional, default=1000)
  The size of the queue on which read instances are placed to be yielded. You might need to increase this if you're generating instances too quickly.
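To make the worker/queue interaction concrete, here is a simplified sketch of the pattern described above using only the standard library. It is not AllenNLP's actual implementation; the names `multiprocess_read` and `_worker` are hypothetical, and a plain callable stands in for the wrapped base reader.

```python
# Sketch only: workers pull file paths from a shared input queue, read
# instances with the wrapped reader function, and put them on a bounded
# output queue (here sized like output_queue_size, default 1000).
import glob
import multiprocessing as mp

# Use the fork start method so the reader callable need not be picklable.
_ctx = mp.get_context("fork")


def _worker(read_fn, input_queue, output_queue):
    """Read files until a None sentinel arrives, forwarding instances."""
    while True:
        file_path = input_queue.get()
        if file_path is None:  # sentinel: no more files for this worker
            break
        for instance in read_fn(file_path):
            output_queue.put(instance)
    output_queue.put(None)  # signal that this worker is finished


def multiprocess_read(read_fn, file_glob, num_workers, output_queue_size=1000):
    """Yield instances from every file matching ``file_glob``."""
    input_queue = _ctx.Queue()
    output_queue = _ctx.Queue(output_queue_size)
    for path in glob.glob(file_glob):
        input_queue.put(path)
    for _ in range(num_workers):
        input_queue.put(None)  # one sentinel per worker
    workers = [
        _ctx.Process(target=_worker, args=(read_fn, input_queue, output_queue))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    finished = 0
    while finished < num_workers:
        item = output_queue.get()
        if item is None:
            finished += 1
        else:
            yield item
    for w in workers:
        w.join()
```

As in the real reader, instance order across files is non-deterministic because workers interleave their output on the shared queue.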
read(self, file_path: str) → Iterable[allennlp.data.instance.Instance]¶
Returns an Iterable containing all the instances in the specified dataset.
If self.lazy is False, this calls self._read(), ensures that the result is a list, then returns the resulting list.
If self.lazy is True, this returns an object whose __iter__ method calls self._read() each iteration. In this case your implementation of _read() must also be lazy (that is, not load all instances into memory at once), otherwise you will get a ConfigurationError.
In either case, the returned Iterable can be iterated over multiple times. It's unlikely you want to override this function, but if you do, your result should likewise be repeatedly iterable.
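The lazy/non-lazy contract above can be sketched in a few lines of plain Python. This is a minimal illustration, not AllenNLP's code: `SketchReader` and `_LazyInstances` are hypothetical names, and the "instances" are just stripped lines from a file.

```python
# Sketch of the read() contract: lazy mode returns an object whose
# __iter__ re-invokes _read(), so the result is repeatedly iterable
# without ever holding all instances in memory at once.
class _LazyInstances:
    """Iterable that calls its generator factory anew on each iteration."""

    def __init__(self, instance_generator):
        self.instance_generator = instance_generator

    def __iter__(self):
        return iter(self.instance_generator())


class SketchReader:
    def __init__(self, lazy=False):
        self.lazy = lazy

    def _read(self, file_path):
        # Stand-in for a real _read(): lazily yield one "instance" per line.
        with open(file_path) as f:
            for line in f:
                yield line.strip()

    def read(self, file_path):
        if self.lazy:
            # Each iteration over the result calls _read() again.
            return _LazyInstances(lambda: self._read(file_path))
        # Non-lazy: materialize everything into a list once.
        return list(self._read(file_path))
```

Both branches return something that can be iterated over multiple times, which is exactly why a non-lazy `_read()` result is forced into a list rather than returned as a one-shot generator.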
text_to_instance(self, *args, **kwargs) → allennlp.data.instance.Instance¶
Just delegates to the base reader's text_to_instance().
multiprocessing.log_to_stderr causes some output in the logs even when we don't use this dataset reader. This is a small hack to instantiate the stderr logger lazily, only when it's needed (which is only when using the MultiprocessDatasetReader).
info(message: str) → None¶