allennlp.data.dataset_readers.reading_comprehension

Reading comprehension is loosely defined as follows: given a question and a passage of text that contains the answer, answer the question.

These submodules contain dataset readers for reading comprehension datasets.

class allennlp.data.dataset_readers.reading_comprehension.squad.SquadReader(tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None) → None[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads a JSON-formatted SQuAD file and returns a Dataset where the Instances have four fields: question, a TextField, passage, another TextField, and span_start and span_end, both IndexFields into the passage TextField. We also add a MetadataField that stores the instance’s ID, the original passage text, gold answer strings, and token offsets into the original passage, accessible as metadata['id'], metadata['original_passage'], metadata['answer_texts'] and metadata['token_offsets']. This is so that we can more easily use the official SQuAD evaluation script to get metrics.

Parameters:

tokenizer : Tokenizer, optional (default=WordTokenizer())

We use this Tokenizer for both the question and the passage. See Tokenizer. Default is WordTokenizer().

token_indexers : Dict[str, TokenIndexer], optional

We similarly use this for both the question and the passage. See TokenIndexer. Default is {"tokens": SingleIdTokenIndexer()}.
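
For illustration, here is a minimal usage sketch with these defaults spelled out explicitly; the file path is a placeholder for a SQuAD-format JSON file you have downloaded yourself, not something shipped with the library.

    from allennlp.data.dataset_readers.reading_comprehension.squad import SquadReader
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import WordTokenizer

    # Equivalent to SquadReader() with its defaults; shown explicitly for clarity.
    reader = SquadReader(tokenizer=WordTokenizer(),
                         token_indexers={"tokens": SingleIdTokenIndexer()})
    # Placeholder path to a local SQuAD-format file.
    dataset = reader.read("/path/to/train-v1.1.json")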

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.dataset_readers.reading_comprehension.squad.SquadReader[source]
read(file_path: str)[source]

Actually reads some data from the file_path and returns a Dataset.

text_to_instance(question_text: str, passage_text: str, char_spans: typing.List[typing.Tuple[int, int]] = None, answer_texts: typing.List[str] = None, passage_tokens: typing.List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance[source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.
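
As a hedged sketch of the serving-time use described above, a Predictor-style caller can hand raw strings to text_to_instance; the question and passage strings below are invented, and since no answer spans are given, the resulting Instance carries no gold span fields.

    from allennlp.data.dataset_readers.reading_comprehension.squad import SquadReader

    reader = SquadReader()
    # Raw strings in, an un-indexed Instance out; no char_spans means no gold span fields.
    instance = reader.text_to_instance(
        question_text="Who wrote the play?",
        passage_text="The play was written by William Shakespeare in the early 1600s.")
    print(instance.fields.keys())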

class allennlp.data.dataset_readers.reading_comprehension.triviaqa.TriviaQaReader(base_tarball_path: str, unfiltered_tarball_path: str = None, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None) → None[source]

Bases: allennlp.data.dataset_readers.dataset_reader.DatasetReader

Reads the TriviaQA dataset into a Dataset containing Instances with four fields: question (a TextField), passage (another TextField), span_start, and span_end (both IndexFields).

TriviaQA is split up into several JSON files defining the questions, and a lot of text files containing crawled web documents. We read these from a gzipped tarball, to avoid keeping millions of individual files on a filesystem.

Because we need to read both train and validation files from the same tarball, we take the tarball itself as a constructor parameter, and take the question file as the argument to read. This means that you should give the path to the tarball in the dataset_reader parameters in your experiment configuration file, and something like "wikipedia-train.json" for the train_data_path and validation_data_path.
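
The same setup can be sketched in plain Python instead of a configuration file; the tarball path below is a placeholder for the archive downloaded from the TriviaQA website, and the argument to read() is just the name of a question file inside the tarball, as described above.

    from allennlp.data.dataset_readers.reading_comprehension.triviaqa import TriviaQaReader

    # base_tarball_path plays the role of the dataset_reader parameter mentioned above.
    reader = TriviaQaReader(base_tarball_path="/path/to/triviaqa-rc.tar.gz")
    # The question file name stands in for train_data_path / validation_data_path.
    dataset = reader.read("wikipedia-train.json")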

Parameters:

base_tarball_path : str

This is the path to the main tar.gz file you can download from the TriviaQA website, with directories evidence and qa.

unfiltered_tarball_path : str, optional

This is the path to the “unfiltered” TriviaQA data that you can download from the TriviaQA website, containing just question JSON files that point to evidence files in the base tarball.

tokenizer : Tokenizer, optional

We’ll use this tokenizer on questions and evidence passages, defaulting to WordTokenizer if none is provided.

token_indexers : Dict[str, TokenIndexer], optional

Determines how both the question and the evidence passages are represented as arrays. See TokenIndexer. Default is to have a single word ID for every token.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.dataset_readers.reading_comprehension.triviaqa.TriviaQaReader[source]
pick_paragraphs(evidence_files: typing.List[typing.List[str]], question: str = None, answer_texts: typing.List[str] = None) → typing.List[str][source]

Given a list of evidence documents, return a list of paragraphs to use as training examples. Each paragraph returned will be made into one training example.

To aid in picking the best paragraph, you can also optionally pass the question text or the answer strings. Note, though, that if you actually use the answer strings for picking the paragraph on the dev or test sets, that’s likely cheating, depending on how you’ve defined the task.
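
A hedged sketch of calling pick_paragraphs directly is below; the tarball path is a dummy placeholder (assumed here to be stored but not opened at construction time), and the evidence documents are invented lists of paragraph strings.

    from allennlp.data.dataset_readers.reading_comprehension.triviaqa import TriviaQaReader

    # Placeholder path; assumed not to be read until read() is called.
    reader = TriviaQaReader(base_tarball_path="/path/to/triviaqa-rc.tar.gz")
    evidence_files = [["First paragraph of document one.", "Second paragraph of document one."],
                      ["Only paragraph of document two."]]
    # question and answer_texts are optional hints for picking better paragraphs.
    paragraphs = reader.pick_paragraphs(evidence_files,
                                        question="Who is mentioned in document one?",
                                        answer_texts=["document one"])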

read(file_path: str)[source]

Actually reads some data from the file_path and returns a Dataset.

text_to_instance(question_text: str, passage_text: str, token_spans: typing.List[typing.Tuple[int, int]] = None, answer_texts: typing.List[str] = None, question_tokens: typing.List[allennlp.data.tokenizers.token.Token] = None, passage_tokens: typing.List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance[source]

Does whatever tokenization or processing is necessary to go from textual input to an Instance. The primary intended use for this is with a Predictor, which gets text input as a JSON object and needs to process it to be input to a model.

The intent here is to share code between read() and what happens at model serving time, or any other time you want to make a prediction from new data. We need to process the data in the same way it was done at training time. Allowing the DatasetReader to process new text lets us accomplish this, as we can just call DatasetReader.text_to_instance when serving predictions.

The input type here is rather vaguely specified, unfortunately. The Predictor will have to make some assumptions about the kind of DatasetReader that it’s using, in order to pass it the right information.

Utilities for reading comprehension dataset readers.

allennlp.data.dataset_readers.reading_comprehension.util.char_span_to_token_span(token_offsets: typing.List[typing.Tuple[int, int]], character_span: typing.Tuple[int, int]) → typing.Tuple[typing.Tuple[int, int], bool][source]

Converts a character span from a passage into the corresponding token span in the tokenized version of the passage. If you pass in a character span that does not correspond to complete tokens in the tokenized version, we’ll do our best, but the behavior is officially undefined. We return an error flag in this case, and have some debug logging so you can figure out the cause of this issue (in SQuAD, these are mostly either tokenization problems or annotation problems; there’s a fair amount of both).

The basic outline of this method is to find the token span that has the same offsets as the input character span. If the tokenizer tokenized the passage correctly and has matching offsets, this is easy. We try to be a little smart about cases where they don’t match exactly, but mostly just find the closest thing we can.

The returned (begin, end) indices are inclusive for both begin and end. So, for example, (2, 2) is the one word span beginning at token index 2, (3, 4) is the two-word span beginning at token index 3, and so on.

Returns:

token_span : Tuple[int, int]

Inclusive span start and end token indices that match as closely as possible to the input character spans.

error : bool

Whether there was a mismatch between the token span and the input character span. If this is True, it means there was an error in either the tokenization or the annotated character span.
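
To make the inclusive-index convention concrete, here is a small sketch, assuming the (start, exclusive-end) character-offset convention that SquadReader uses for both token offsets and character spans; the sentence and offsets are invented.

    from allennlp.data.dataset_readers.reading_comprehension import util

    # Character offsets for the tokens of "The quick brown" (exclusive end offsets).
    token_offsets = [(0, 3), (4, 9), (10, 15)]
    # Character span covering "quick brown".
    token_span, error = util.char_span_to_token_span(token_offsets, (4, 15))
    # Expected: token_span == (1, 2) (inclusive token indices) and error == False.
    print(token_span, error)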

allennlp.data.dataset_readers.reading_comprehension.util.find_valid_answer_spans(passage_tokens: typing.List[allennlp.data.tokenizers.token.Token], answer_texts: typing.List[str]) → typing.List[typing.Tuple[int, int]][source]

Finds a list of token spans in passage_tokens that match the given answer_texts. This tries to find all spans that would evaluate to correct given the SQuAD and TriviaQA official evaluation scripts, which do some normalization of the input text.

Note that this could return duplicate spans! The caller is expected to be able to handle possible duplicates (as already happens in the SQuAD dev set, for instance).
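
As a small, hedged usage sketch, one can tokenize a passage with the default WordTokenizer and look up spans for an answer string; the passage and answer are made up, and the exact spans returned depend on the tokenization.

    from allennlp.data.dataset_readers.reading_comprehension import util
    from allennlp.data.tokenizers import WordTokenizer

    passage = "Oxygen was discovered in 1774. Oxygen is vital for respiration."
    passage_tokens = WordTokenizer().tokenize(passage)
    # Both occurrences of "Oxygen" should show up as inclusive (start, end) token spans.
    spans = util.find_valid_answer_spans(passage_tokens, ["Oxygen"])
    print(spans)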

allennlp.data.dataset_readers.reading_comprehension.util.make_reading_comprehension_instance(question_tokens: typing.List[allennlp.data.tokenizers.token.Token], passage_tokens: typing.List[allennlp.data.tokenizers.token.Token], token_indexers: typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer], passage_text: str, token_spans: typing.List[typing.Tuple[int, int]] = None, answer_texts: typing.List[str] = None, additional_metadata: typing.Dict[str, typing.Any] = None) → allennlp.data.instance.Instance[source]

Converts a question, a passage, and an optional answer (or answers) to an Instance for use in a reading comprehension model.

Creates an Instance with at least these fields: question and passage, both TextFields; and metadata, a MetadataField. Additionally, if both answer_texts and token_spans are given, the Instance has span_start and span_end fields, which are both IndexFields.

Parameters:

question_tokens : List[Token]

An already-tokenized question.

passage_tokens : List[Token]

An already-tokenized passage that contains the answer to the given question.

token_indexers : Dict[str, TokenIndexer]

Determines how the question and passage TextFields will be converted into tensors that get input to a model. See TokenIndexer.

passage_text : str

The original passage text. We need this so that we can recover the actual span from the original passage that the model predicts as the answer to the question. This is used in official evaluation scripts.

token_spans : List[Tuple[int, int]], optional

Indices into passage_tokens to use as the answer to the question for training. This is a list because there might be several possible correct answer spans in the passage. Currently, we just select the most frequent span in this list (e.g., SQuAD has multiple annotations on the dev set; this will select the span that the most annotators gave as correct).

answer_texts : List[str], optional

All valid answer strings for the given question. In SQuAD, e.g., the training set has exactly one answer per question, but the dev and test sets have several. TriviaQA has many possible answers, which are the aliases for the known correct entity. This is put into the metadata for use with official evaluation scripts, but not used anywhere else.

additional_metadata : Dict[str, Any], optional

The constructed metadata field will by default contain original_passage, token_offsets, and answer_texts keys. If you want any other metadata to be associated with each instance, you can pass that in here. This dictionary will get added to the metadata dictionary we already construct.
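
Putting the parameters together, here is a hedged sketch of building an Instance directly from pre-tokenized text; the passage, question, answer, and token span are invented, and the token index of "1889" assumes the default word tokenization shown.

    from allennlp.data.dataset_readers.reading_comprehension import util
    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import WordTokenizer

    tokenizer = WordTokenizer()
    passage_text = "The Eiffel Tower was completed in 1889."
    passage_tokens = tokenizer.tokenize(passage_text)
    question_tokens = tokenizer.tokenize("When was the Eiffel Tower completed?")

    instance = util.make_reading_comprehension_instance(
        question_tokens,
        passage_tokens,
        {"tokens": SingleIdTokenIndexer()},
        passage_text,
        token_spans=[(6, 6)],              # assumed inclusive token span for "1889"
        answer_texts=["1889"],
        additional_metadata={"id": "example-1"})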

allennlp.data.dataset_readers.reading_comprehension.util.normalize_text(text: str) → str[source]

Performs a normalization that is very similar to that done by the normalization functions in SQuAD and TriviaQA.

This involves splitting and rejoining the text, and could be a somewhat expensive operation.
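
A one-line usage sketch (the exact output depends on the normalization details, so none is asserted here):

    from allennlp.data.dataset_readers.reading_comprehension import util

    # Returns a normalized string suitable for comparing predicted and gold answers.
    print(util.normalize_text("The Eiffel Tower was completed in 1889."))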