class allennlp.data.dataset_readers.semantic_parsing.wikitables.wikitables.WikiTablesDatasetReader(lazy: bool = False, tables_directory: str = None, dpd_output_directory: str = None, max_dpd_logical_forms: int = 10, sort_dpd_logical_forms: bool = True, max_dpd_tries: int = 20, keep_if_no_dpd: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, question_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, table_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, use_table_for_vocab: bool = False, linking_feature_extractors: List[str] = None, include_table_metadata: bool = False, max_table_tokens: int = None, output_agendas: bool = False)

This DatasetReader takes WikiTableQuestions *.examples files and converts them into Instances suitable for use with the WikiTablesSemanticParser. This reader also accepts pre-processed JSONL files produced by scripts/preprocess_wikitables.py. Processing the *.examples files directly is quite slow, because the reader has to read many individual table files, run NER on all of the tables, and convert logical forms into action sequences, so we recommend you run the pre-processing script first. Having the same reader handle both file types allows you to train with a pre-processed file without having to change your model configuration in order to serve a demo from the trained model.

The *.examples files have pointers in them to two other files: a file that contains an associated table for each question, and a file that has pre-computed, possible logical forms. Because of how the DatasetReader API works, we need to take base directories for those other files in the constructor.

If you are training, initialize the dataset reader with the path to the tables directory and the path to the directory where DPD output is stored. At test time, you can either provide existing table filenames or, if your question is about a new table, provide the content of the table as a dict (see TableQuestionKnowledgeGraph.read_from_json() for the expected format). If you are doing the former, you still need to provide a tables_directory path here.

For training, we assume you are reading in data/*.examples files, and you have access to the output from Dynamic Programming on Denotations (DPD) on the training dataset.
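The difference between the training and demo setups described above can be summarized as two sets of constructor arguments. This is only a sketch: the paths are hypothetical placeholders, and the dictionaries are plain Python so the snippet runs without AllenNLP installed. The keyword names come from the constructor signature above.

```python
# Hypothetical local paths; substitute wherever your copy of the data lives.
TABLES_DIRECTORY = "/path/to/WikiTableQuestions"
DPD_OUTPUT_DIRECTORY = "/path/to/dpd_output"

# Training: both the tables directory and the DPD output directory are needed.
training_kwargs = {
    "tables_directory": TABLES_DIRECTORY,
    "dpd_output_directory": DPD_OUTPUT_DIRECTORY,
    "max_dpd_logical_forms": 10,
    "keep_if_no_dpd": False,
}

# Demo that refers to existing table files: tables_directory is still needed,
# but DPD output is not.
demo_existing_table_kwargs = {"tables_directory": TABLES_DIRECTORY}

# Demo where the table content is passed in directly: neither path is needed.
demo_new_table_kwargs = {}

# With AllenNLP installed, these would be passed along the lines of
# WikiTablesDatasetReader(**training_kwargs).
```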

We lowercase the question and all table text, because the questions in the data are typically all lowercase. This ensures that questions submitted to any live demo you put up match the data the model was trained on. Lowercasing the table text makes matching the lowercased question text easier.

Parameters
lazy : bool (optional, default=False)

Passed to DatasetReader. If this is True, training will start sooner, but will take longer per batch.

tables_directory : str, optional

Prefix for the path to the directory in which the tables reside. For example, *.examples files contain paths like csv/204-csv/590.csv; tables_directory is the directory that contains the csv directory. This is only optional for Predictors (i.e., in a demo), where you’re only calling text_to_instance().

dpd_output_directory : str, optional

Directory that contains all the gzipped dpd output files. We assume the filenames match the example IDs (e.g.: nt-0.gz). This is required for training a model, but not required for prediction.
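The filename convention can be sketched as follows. This is a minimal illustration, not the reader's own code: read_dpd_output is a hypothetical helper, and it assumes each gzipped file holds one logical form per line.

```python
import gzip
import os
from typing import List, Optional


def read_dpd_output(dpd_output_directory: str, example_id: str) -> Optional[List[str]]:
    """Hypothetical helper: load the DPD logical forms for one example ID.

    Assumes the file is named <example_id>.gz and contains one logical form
    per line. Returns None when no DPD output exists for the example.
    """
    path = os.path.join(dpd_output_directory, example_id + ".gz")
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt", encoding="utf-8") as dpd_file:
        # Drop blank lines and surrounding whitespace.
        return [line.strip() for line in dpd_file if line.strip()]
```

Returning None for a missing file lets the caller decide whether to skip the example or keep it without logical forms.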

max_dpd_logical_forms : int, optional (default=10)

We will use the first max_dpd_logical_forms logical forms as our target label. Only applicable if dpd_output_directory is given.

sort_dpd_logical_forms : bool, optional (default=True)

If True, we will sort the logical forms in the DPD output by length before selecting the first max_dpd_logical_forms. This makes the data loading quite a bit slower, but results in better training data.

max_dpd_tries : int, optional (default=20)

Sometimes DPD makes bad choices about logical forms and gives us forms that we can’t parse (most of the time these are very unlikely logical forms, because, e.g., it hallucinates a date or number from the table that’s not in the question). But we don’t want to spend our time trying to parse thousands of bad logical forms. We will try to parse only the first max_dpd_tries logical forms before giving up. This also speeds up data loading time, because we don’t go through the entire DPD file if it’s huge (unless we’re sorting the logical forms). Only applicable if dpd_output_directory is given.
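The interaction between sorting, the number of tries, and the number of logical forms kept can be sketched in a few lines. This is an illustration of the behaviour described by these three parameters, not the reader's actual code; select_logical_forms and the try_parse callback are hypothetical names.

```python
from typing import Callable, List


def select_logical_forms(dpd_lines: List[str],
                         try_parse: Callable[[str], bool],
                         max_logical_forms: int = 10,
                         sort_by_length: bool = True,
                         max_tries: int = 20) -> List[str]:
    """Hypothetical helper: pick the logical forms to use as targets.

    try_parse stands in for whatever parser is used; it should return True
    when the logical form parses successfully.
    """
    # Optionally prefer shorter logical forms (sorted() is stable, so ties
    # keep their original order).
    candidates = sorted(dpd_lines, key=len) if sort_by_length else list(dpd_lines)
    kept: List[str] = []
    # Only look at the first max_tries candidates, parseable or not.
    for logical_form in candidates[:max_tries]:
        if not try_parse(logical_form):
            continue
        kept.append(logical_form)
        if len(kept) >= max_logical_forms:
            break
    return kept
```

Note that sorting requires every line to be available up front, which is why it prevents the early stop that max_tries otherwise allows when reading a large file.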

keep_if_no_dpd : bool, optional (default=False)

If True, we will keep instances we read that don’t have DPD output. If you want to compute denotation accuracy on the full dataset, you should set this to True. Otherwise, your accuracy numbers will only reflect the subset of the data that has DPD output.

tokenizer : Tokenizer, optional

Tokenizer to use for the questions. Will default to WordTokenizer() with spaCy’s tagger enabled, as we use lemma matches as features for entity linking.

question_token_indexers : Dict[str, TokenIndexer], optional

Token indexers for questions. Will default to {"tokens": SingleIdTokenIndexer()}.

table_token_indexers : Dict[str, TokenIndexer], optional

Token indexers for table entities. Will default to question_token_indexers (though you very likely want to use something different for these, as you can’t rely on having an embedding for every table entity at test time).

use_table_for_vocab : bool (optional, default=False)

If True, we will include table cell text in vocabulary creation. The original parser did not do this, because the same table can appear multiple times, messing with vocab counts, and making you include lots of rare entities in your vocab.

linking_feature_extractors : List[str], optional

The list of feature extractors to use in the KnowledgeGraphField when computing entity linking features. See that class for more information. By default, we will use all available feature extractors.

include_table_metadata : bool (optional, default=False)

This is necessary for pre-processing the data. We output a JSONL file that has all of the information necessary for reading each instance, including the table contents themselves. This flag tells the reader to include a table_metadata field that gets read by the pre-processing script.

max_table_tokens : int, optional

If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.
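One way to turn a total token budget into per-cell truncation is to search for the largest per-cell cap whose total fits the budget. The sketch below illustrates that idea on cells that are already tokenized; it is an assumption about how such a budget could be applied, not necessarily what the reader does internally, and truncate_to_budget is a hypothetical helper.

```python
from typing import List


def truncate_to_budget(cells: List[List[str]], max_table_tokens: int) -> List[List[str]]:
    """Hypothetical helper: cap every cell at the largest length whose
    total token count stays within max_table_tokens."""
    cap = max((len(cell) for cell in cells), default=0)
    # Lower the cap until the truncated table fits the budget.
    while cap > 0 and sum(min(len(cell), cap) for cell in cells) > max_table_tokens:
        cap -= 1
    return [cell[:cap] for cell in cells]
```

Only cells longer than the cap lose tokens, so short cells are untouched while cells with very long text are shortened first.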

output_agendas : bool (optional, default=False)

Whether to output agenda fields. This needs to be True if you want to train a coverage-based parser.

text_to_instance(self, question: str, table_lines: List[str], example_lisp_string: str = None, dpd_output: List[str] = None, tokenized_question: List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance

Reads text inputs and makes an instance. The WikiTableQuestions dataset provides tables as TSV files, which we use for training.

Parameters
question : str

Input question

table_lines : List[str]

The table content itself, as a list of rows. See TableQuestionKnowledgeGraph.read_from_lines for the expected format.

example_lisp_string : str, optional

The original (lisp-formatted) example string in the WikiTableQuestions dataset. This comes directly from the .examples file provided with the dataset. We pass this to SEMPRE for evaluating logical forms during training. It isn’t otherwise used for anything.

dpd_output : List[str], optional

List of logical forms, produced by dynamic programming on denotations. Not required during test.

tokenized_question : List[Token], optional

If you have already tokenized the question, you can pass that in here, so we don’t duplicate that work. You might, for example, do batch processing on the questions in the whole dataset, then pass the result in here.

allennlp.data.dataset_readers.semantic_parsing.wikitables.util.parse_example_line(lisp_string: str) → Dict
Training data in WikiTableQuestions comes with examples in the form of lisp strings in the format:

    (example (id <example-id>)
             (utterance <question>)
             (context (graph tables.TableKnowledgeGraph <table-filename>))
             (targetValue (list (description <answer1>) (description <answer2>) …)))

We parse such strings and return the parsed information here.
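As a rough illustration of the format, the sketch below pulls the four pieces out of one example line with regular expressions. It is not the library's implementation: the function name sketch_parse_example_line, the dictionary keys, the assumption that the utterance is double-quoted, and the sample line are all illustrative.

```python
import re
from typing import Dict, List, Union


def sketch_parse_example_line(lisp_string: str) -> Dict[str, Union[str, List[str]]]:
    """Hypothetical stand-in for parse_example_line; key names are assumptions."""
    example_id = re.search(r'\(id\s+([^\s)]+)\)', lisp_string)
    # Assumes the utterance is wrapped in double quotes.
    question = re.search(r'\(utterance\s+"(.*?)"\)', lisp_string)
    table = re.search(r'\(graph\s+tables\.TableKnowledgeGraph\s+([^\s)]+)\)', lisp_string)
    if not (example_id and question and table):
        raise ValueError("Line does not look like a WikiTableQuestions example")
    # Answers may or may not be quoted; answers containing ")" are not handled.
    answers = re.findall(r'\(description\s+"?(.*?)"?\)', lisp_string)
    return {
        "id": example_id.group(1),
        "question": question.group(1),
        "table_filename": table.group(1),
        "target_values": answers,
    }


# Made-up sample line that follows the format described above.
sample_line = ('(example (id nt-0) (utterance "which team won in 2004?") '
               '(context (graph tables.TableKnowledgeGraph csv/204-csv/590.csv)) '
               '(targetValue (list (description "Team A") (description "Team B"))))')
parsed = sketch_parse_example_line(sample_line)
```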