Reader for WikitableQuestions (https://github.com/ppasupat/WikiTableQuestions/releases/tag/v1.0.2).
WikiTablesDatasetReader(lazy: bool = False, tables_directory: str = None, dpd_output_directory: str = None, max_dpd_logical_forms: int = 10, sort_dpd_logical_forms: bool = True, max_dpd_tries: int = 20, keep_if_no_dpd: bool = False, tokenizer: allennlp.data.tokenizers.tokenizer.Tokenizer = None, question_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, table_token_indexers: Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer] = None, use_table_for_vocab: bool = False, linking_feature_extractors: List[str] = None, include_table_metadata: bool = False, max_table_tokens: int = None, output_agendas: bool = False)¶
This reader takes WikiTableQuestions *.examples files and converts them into Instances suitable for use with the WikiTablesSemanticParser. This reader also accepts pre-processed JSONL files produced by scripts/preprocess_wikitables.py. Processing the example files (reading a bunch of individual table files, running NER on all of the tables, and converting logical forms into action sequences) is quite slow, so we recommend you run the pre-processing script. Having the same reader handle both file types allows you to train with a pre-processed file without having to change your model configuration in order to serve a demo from the trained model.
The *.examples files have pointers in them to two other files: a file that contains an associated table for each question, and a file that has pre-computed, possible logical forms. Because of how the DatasetReader API works, we need to take base directories for those other files in the constructor.
We initialize the dataset reader with paths to the tables directory and, if you are training, to the directory where DPD output is stored. At test time, you can either provide existing table filenames or, if your question is about a new table, provide the content of the table as a dict (see TableQuestionKnowledgeGraph.read_from_json() for the expected format). If you are doing the former, you still need to provide a tables_directory in the constructor.
For training, we assume you are reading in data/*.examples files, and that you have access to the output from Dynamic Programming on Denotations (DPD) on the training dataset.
We lowercase the question and all table text, because the questions in the data are typically all lowercase anyway. This means that any live demo you put up will see questions that match the data the model was trained on, and lowercasing the table text makes it easier to match against the lowercased question text.
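As a rough illustration of the training setup just described, the reader can be constructed and used along these lines. This is a minimal sketch: the paths are placeholders for your local copies of the dataset and DPD output, and the import path is an assumption about your AllenNLP version.

```python
# Minimal sketch of training-time usage (paths are placeholders).
from allennlp.data.dataset_readers import WikiTablesDatasetReader

reader = WikiTablesDatasetReader(
    # Directory containing the dataset's csv/ directory of table files.
    tables_directory="/path/to/WikiTableQuestions/",
    # Directory of gzipped DPD output, with one file per example id (e.g. nt-0.gz).
    dpd_output_directory="/path/to/dpd_output/",
)

# Reads a data/*.examples file (or a pre-processed .jsonl file) into Instances.
instances = reader.read("/path/to/WikiTableQuestions/data/random-split-1-train.examples")
```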
- lazy : bool, optional (default=False)
Passed to DatasetReader. If this is True, training will start sooner, but will take longer per batch.
- tables_directory : str, optional
Prefix for the path to the directory in which the tables reside. For example, *.examples files contain paths like csv/204-csv/590.csv; this is the directory that contains the csv directory. This is only optional for Predictors (i.e., in a demo), where you're only calling text_to_instance.
- dpd_output_directory : str, optional
Directory that contains all the gzipped DPD output files. We assume the filenames match the example IDs (e.g., nt-0.gz). This is required for training a model, but not required for prediction.
- max_dpd_logical_forms : int, optional (default=10)
We will use the first max_dpd_logical_forms logical forms as our target label. Only applicable if dpd_output_directory is given.
- sort_dpd_logical_forms : bool, optional (default=True)
If True, we will sort the logical forms in the DPD output by length before selecting the first max_dpd_logical_forms. This makes the data loading quite a bit slower, but results in better training data.
- max_dpd_tries : int, optional (default=20)
Sometimes DPD makes bad choices about logical forms and gives us forms that we can't parse (most of the time these are very unlikely logical forms, because, e.g., it hallucinates a date or number from the table that's not in the question). We don't want to spend our time trying to parse thousands of bad logical forms, so we will try to parse only the first max_dpd_tries logical forms before giving up. This also speeds up data loading, because we don't go through the entire DPD file if it's huge (unless we're sorting the logical forms). Only applicable if dpd_output_directory is given.
- keep_if_no_dpd : bool, optional (default=False)
If True, we will keep instances we read that don't have DPD output. If you want to compute denotation accuracy on the full dataset, you should set this to True. Otherwise, your accuracy numbers will only reflect the subset of the data that has DPD output.
- tokenizer : Tokenizer, optional
Tokenizer to use for the questions. Will default to WordTokenizer() with spaCy's tagger enabled, as we use lemma matches as features for entity linking.
- question_token_indexers : Dict[str, TokenIndexer], optional
Token indexers for questions. Will default to {"tokens": SingleIdTokenIndexer()}.
- table_token_indexers : Dict[str, TokenIndexer], optional
Token indexers for table entities. Will default to question_token_indexers (though you very likely want to use something different for these, as you can't rely on having an embedding for every table entity at test time).
- use_table_for_vocab : bool, optional (default=False)
If True, we will include table cell text in vocabulary creation. The original parser did not do this, because the same table can appear multiple times, messing with vocab counts and making you include lots of rare entities in your vocab.
- linking_feature_extractors : List[str], optional
The list of feature extractors to use in the KnowledgeGraphField when computing entity linking features. See that class for more information. By default, we will use all available feature extractors.
- include_table_metadata : bool, optional (default=False)
This is necessary for pre-processing the data. We output a jsonl file that has all of the information necessary for reading each instance, which includes the table contents itself. This flag tells the reader to include a table_metadata field that gets read by the pre-processing script.
- max_table_tokens : int, optional
If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.
- output_agendas : bool, optional (default=False)
Should we output agenda fields? This needs to be True if you want to train a coverage-based parser. A sketch showing several of these optional arguments in use follows this list.
text_to_instance(question: str, table_lines: List[str], example_lisp_string: str = None, dpd_output: List[str] = None, tokenized_question: List[allennlp.data.tokenizers.token.Token] = None) → allennlp.data.instance.Instance¶
Reads text inputs and makes an instance. The WikitableQuestions dataset provides tables as TSV files, which we use for training. A sketch of calling this method directly appears after the parameter list below.
- question : str
The input question.
- table_lines : List[str]
The table content itself, as a list of rows. See TableQuestionKnowledgeGraph.read_from_lines for the expected format.
- example_lisp_string : str, optional
The original (lisp-formatted) example string in the WikiTableQuestions dataset. This comes directly from the .examples file provided with the dataset. We pass this to SEMPRE for evaluating logical forms during training; it isn't otherwise used for anything.
- dpd_output : List[str], optional
List of logical forms, produced by dynamic programming on denotations. Not required during test.
- tokenized_question : List[Token], optional
If you have already tokenized the question, you can pass that in here, so we don't duplicate that work. You might, for example, do batch processing on the questions in the whole dataset, then pass the result in here.
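For prediction or a demo, where you only call text_to_instance, usage might look like the following minimal sketch. The question and table rows here are invented toy data; see TableQuestionKnowledgeGraph.read_from_lines for the authoritative table format, and treat the import path as an assumption about your AllenNLP version.

```python
# Sketch of demo-time usage: no DPD output, just a question and table content.
from allennlp.data.dataset_readers import WikiTablesDatasetReader

reader = WikiTablesDatasetReader()  # tables_directory is not needed when passing table content

# Toy table in the dataset's TSV layout: a tab-separated header row, then one row per line.
table_lines = [
    "Year\tCity\tCountry",
    "1896\tAthens\tGreece",
    "1900\tParis\tFrance",
]

instance = reader.text_to_instance(
    question="in which city were the 1900 games held?",
    table_lines=table_lines,
)
```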
parse_example_line(lisp_string: str) → Dict¶
Training data in WikitableQuestions comes with examples in the form of lisp strings, in the format:

(example (id <example-id>)
  (utterance <question>)
  (context (graph tables.TableKnowledgeGraph <table-filename>))
  (targetValue (list (description <answer1>) (description <answer2>) ...)))

We parse such strings and return the parsed information here.
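As an illustration, here is a sketch of calling this on one made-up example line. The id, utterance, table filename, and target value are all illustrative, the import location of the function varies across AllenNLP versions, and the exact keys of the returned dict should be checked against the implementation rather than taken from this sketch.

```python
# Assumption: parse_example_line can be imported from the module that defines
# WikiTablesDatasetReader; adjust the import to wherever your version puts it.
from allennlp.data.dataset_readers.wikitables import parse_example_line

# A made-up example line in the format shown above.
line = ('(example (id nt-0) (utterance "what was the last year?") '
        '(context (graph tables.TableKnowledgeGraph csv/204-csv/590.csv)) '
        '(targetValue (list (description "2004"))))')

parsed = parse_example_line(line)
# We would expect the returned dict to expose at least the example id, the
# question utterance, and the table filename, roughly:
# {'id': 'nt-0',
#  'question': 'what was the last year?',
#  'table_filename': 'csv/204-csv/590.csv'}
```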