allennlp.data.dataset_readers.seq2seq

class allennlp.data.dataset_readers.seq2seq.Seq2SeqDatasetReader(source_tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, target_tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, source_token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, target_token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None) → None

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Read a tsv file containing paired sequences, and create a dataset suitable for a SimpleSeq2Seq model, or any model with a matching API.

Expected format for each input line: <source_sequence_string>\t<target_sequence_string> (the two sequences are separated by a single tab character).
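As a rough illustration of this line format, the following plain-Python sketch splits one tab-separated line into its source and target strings. The helper name split_pair is hypothetical and not part of AllenNLP; the reader's own parsing and error handling may differ.

```python
from typing import Tuple


def split_pair(line: str) -> Tuple[str, str]:
    """Split one TSV line into (source, target). Hypothetical helper."""
    # Drop the trailing newline, then split on the tab separator.
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 2:
        raise ValueError(
            f"expected 2 tab-separated fields, got {len(parts)}: {line!r}"
        )
    return parts[0], parts[1]


source, target = split_pair("the cat sat\tle chat est assis\n")
```

Rejecting lines that do not contain exactly two fields is a design choice made for this sketch only, so that malformed rows surface early instead of being silently skipped.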

The output of read is a list of Instances with the fields:
source_tokens: TextField and target_tokens: TextField
Parameters:

source_tokenizer : Tokenizer, optional

Tokenizer to use to split the input sequences into words or other kinds of tokens. Defaults to WordTokenizer().

target_tokenizer : Tokenizer, optional

Tokenizer to use to split the output sequences (during training) into words or other kinds of tokens. Defaults to source_tokenizer.

source_token_indexers : Dict[str, TokenIndexer], optional

Indexers used to define input (source side) token representations. Defaults to {"tokens": SingleIdTokenIndexer()}.

target_token_indexers : Dict[str, TokenIndexer], optional

Indexers used to define output (target side) token representations. Defaults to source_token_indexers.
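The fallback chain described by these four parameters can be sketched in plain Python. This is a sketch under stated assumptions, not the library's constructor: resolve_defaults is a hypothetical function, str.split stands in for WordTokenizer(), and the string "single_id" stands in for SingleIdTokenIndexer().

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in type: a "tokenizer" is any callable from str to tokens.
Tokenizer = Callable[[str], List[str]]


def resolve_defaults(
    source_tokenizer: Optional[Tokenizer] = None,
    target_tokenizer: Optional[Tokenizer] = None,
    source_token_indexers: Optional[Dict[str, str]] = None,
    target_token_indexers: Optional[Dict[str, str]] = None,
) -> Tuple[Tokenizer, Tokenizer, Dict[str, str], Dict[str, str]]:
    """Apply the documented defaults; every name here is illustrative."""
    if source_tokenizer is None:
        source_tokenizer = str.split  # stand-in for WordTokenizer()
    if target_tokenizer is None:
        target_tokenizer = source_tokenizer  # target falls back to source
    if source_token_indexers is None:
        source_token_indexers = {"tokens": "single_id"}  # stand-in indexer
    if target_token_indexers is None:
        target_token_indexers = source_token_indexers  # falls back to source
    return (
        source_tokenizer,
        target_tokenizer,
        source_token_indexers,
        target_token_indexers,
    )


src_tok, tgt_tok, src_idx, tgt_idx = resolve_defaults()
```

With no arguments, both sides end up sharing the same tokenizer and the same indexer dictionary. Passing only a target-side value overrides that side and leaves the source-side default in place.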

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.dataset_readers.seq2seq.Seq2SeqDatasetReader

read(file_path)

Reads the data at file_path and returns it as a Dataset.

text_to_instance(source_string: str, target_string: str = None) → allennlp.data.instance.Instance

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.
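The code sharing described above can be illustrated with a toy reader written in plain Python. Everything in this sketch is hypothetical: ToyPairReader and read_lines are not AllenNLP names, instances are plain dictionaries rather than Instance objects, whitespace splitting stands in for a real tokenizer, and omitting target_tokens when no target is given is an assumption about how the optional argument behaves.

```python
from typing import Dict, Iterable, List, Optional


class ToyPairReader:
    """Hypothetical stand-in for a dataset reader; instances are dicts."""

    def text_to_instance(
        self, source_string: str, target_string: Optional[str] = None
    ) -> Dict[str, List[str]]:
        # One processing path, used for both training and prediction.
        instance = {"source_tokens": source_string.split()}
        if target_string is not None:
            # Targets are only available at training time.
            instance["target_tokens"] = target_string.split()
        return instance

    def read_lines(self, lines: Iterable[str]) -> List[Dict[str, List[str]]]:
        instances = []
        for line in lines:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            source, target = line.split("\t")
            # Reuse text_to_instance so training data is processed the
            # same way as text seen at prediction time.
            instances.append(self.text_to_instance(source, target))
        return instances


reader = ToyPairReader()
training = reader.read_lines(["a b\tx y\n"])
prediction = reader.text_to_instance("a b")
```

Because read_lines delegates to text_to_instance, the source side of a training instance and of a prediction-time instance is built by the same code, which is the consistency the documentation is describing.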