allennlp.data.dataset_readers.multiprocess_dataset_reader

class allennlp.data.dataset_readers.multiprocess_dataset_reader.MultiprocessDatasetReader(base_reader: allennlp.data.dataset_readers.dataset_reader.DatasetReader, num_workers: int, epochs_per_read: int = 1, output_queue_size: int = 1000) → None

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Wraps another dataset reader and uses it to read from multiple input files using multiple processes. Note that in this case the file_path passed to read() should be a glob, and that the dataset reader will return instances from all files matching the glob.
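
For example, a minimal usage sketch; the shard path and the choice of SequenceTaggingDatasetReader as the base reader are illustrative assumptions, not part of this API:

    from allennlp.data.dataset_readers.multiprocess_dataset_reader import MultiprocessDatasetReader
    from allennlp.data.dataset_readers.sequence_tagging import SequenceTaggingDatasetReader

    # Any DatasetReader can serve as the base reader; a lazy one avoids
    # loading an entire shard before yielding instances from it.
    base_reader = SequenceTaggingDatasetReader(lazy=True)
    reader = MultiprocessDatasetReader(base_reader=base_reader, num_workers=4)

    # file_path is a glob: instances are drawn from every matching shard,
    # in whatever order the worker processes produce them.
    for instance in reader.read("/data/shards/train-*.tsv"):
        ...  # hand instances to a DataIterator / trainer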

Parameters:
base_reader : DatasetReader

Each process will use this dataset reader to read zero or more files.

num_workers : int

How many data-reading processes to run simultaneously.

epochs_per_read : int, (optional, default=1)

Normally a call to DatasetReader.read() returns a single epoch’s worth of instances, and your DataIterator handles iteration over multiple epochs. In the multiple-process case, however, you may want workers that finish early to continue on to the next epoch while others are still finishing the previous one. Passing a value larger than 1 allows that to happen.

output_queue_size : int, (optional, default=1000)

The size of the queue on which read instances are placed to be yielded. You might need to increase this if the workers generate instances faster than they are consumed (see the configuration sketch below).
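
The reader can also be built from parameters; a minimal sketch, assuming the usual registration name "multiprocess" and an illustrative base reader and sizes:

    from allennlp.common import Params
    from allennlp.data.dataset_readers.dataset_reader import DatasetReader

    # "multiprocess" is assumed to be this reader's registered name; the
    # nested base_reader config and the numbers below are illustrative.
    reader = DatasetReader.from_params(Params({
        "type": "multiprocess",
        "base_reader": {"type": "sequence_tagging"},
        "num_workers": 4,
        "epochs_per_read": 2,       # let finished workers start the next epoch
        "output_queue_size": 2000,  # raise if workers outpace the consumer
    }))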

read(file_path: str) → typing.Iterable[allennlp.data.instance.Instance]
text_to_instance(*args, **kwargs) → allennlp.data.instance.Instance

Just delegates to the base reader’s text_to_instance().