Reader for WikitableQuestions

class WikiTablesDatasetReader(lazy: bool = False, tables_directory: str = None, dpd_output_directory: str = None, max_dpd_logical_forms: int = 10, sort_dpd_logical_forms: bool = True, max_dpd_tries: int = 20, keep_if_no_dpd: bool = False, tokenizer: Tokenizer = None, question_token_indexers: Dict[str, TokenIndexer] = None, table_token_indexers: Dict[str, TokenIndexer] = None, use_table_for_vocab: bool = False, linking_feature_extractors: List[str] = None, include_table_metadata: bool = False, max_table_tokens: int = None, output_agendas: bool = False)[source]


This DatasetReader takes WikiTableQuestions *.examples files and converts them into Instances suitable for use with the WikiTablesSemanticParser. This reader also accepts pre-processed JSONL files produced by a script in scripts/. Processing the example files (reading a bunch of individual table files, running NER on all of the tables, and converting logical forms into action sequences) is quite slow, so we recommend you run the pre-processing script. Having the same reader handle both file types lets you train with a pre-processed file but serve a demo from the trained model without changing your model configuration.
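For concreteness, a training configuration for this reader might look roughly like the fragment below. The parameter names come from the constructor above; the registered reader name ("wikitables") and the paths are assumptions for illustration, not values from this document:

```json
{
  "dataset_reader": {
    "type": "wikitables",
    "tables_directory": "/data/WikiTableQuestions/",
    "dpd_output_directory": "/data/dpd_output/",
    "max_dpd_logical_forms": 10,
    "keep_if_no_dpd": false
  }
}
```

At prediction time you would drop dpd_output_directory and keep tables_directory (or pass table content directly to text_to_instance()).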

The *.examples files have pointers in them to two other files: a file that contains an associated table for each question, and a file that has pre-computed, possible logical forms. Because of how the DatasetReader API works, we need to take base directories for those other files in the constructor.

We initialize the dataset reader with paths to the tables directory and the directory where DPD output is stored if you are training. While testing, you can either provide existing table filenames or if your question is about a new table, provide the content of the table as a dict (See TableQuestionKnowledgeGraph.read_from_json() for the expected format). If you are doing the former, you still need to provide a tables_directory path here.

For training, we assume you are reading in data/*.examples files, and you have access to the output from Dynamic Programming on Denotations (DPD) on the training dataset.

We lowercase the question and all table text, because the questions in the data are typically all lowercase, anyway. This makes it so that any live demo that you put up will have questions that match the data this was trained on. Lowercasing the table text makes matching the lowercased question text easier.

lazy : bool, optional (default=False)

Passed to DatasetReader. If this is True, training will start sooner, but will take longer per batch.

tables_directory : str, optional

Prefix for the path to the directory in which the tables reside. For example, *.examples files contain paths like csv/204-csv/590.csv; this is the directory that contains that csv directory. This is only optional for Predictors (i.e., in a demo), where you’re only calling text_to_instance().
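The relationship between tables_directory and the relative paths in the *.examples files is simple path joining; a minimal sketch:

```python
import os

def resolve_table_path(tables_directory: str, relative_path: str) -> str:
    # *.examples files reference tables with relative paths like
    # "csv/204-csv/590.csv"; tables_directory is the prefix that
    # contains that "csv" directory.
    return os.path.join(tables_directory, relative_path)

resolved = resolve_table_path("/data/WikiTableQuestions", "csv/204-csv/590.csv")
```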

dpd_output_directory : str, optional

Directory that contains all the gzipped dpd output files. We assume the filenames match the example IDs (e.g.: nt-0.gz). This is required for training a model, but not required for prediction.
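Locating and reading one of these gzipped files is straightforward; a hedged sketch (the function name is hypothetical, and returning None for a missing file is one possible policy for the keep_if_no_dpd case):

```python
import gzip
import os
from typing import List, Optional

def read_dpd_logical_forms(dpd_output_directory: str, example_id: str) -> Optional[List[str]]:
    """Read the gzipped DPD output for one example.

    Filenames are assumed to match example ids (e.g. id "nt-0" maps to
    "nt-0.gz"), per the reader's convention. Returns None if no DPD
    output exists for this example.
    """
    path = os.path.join(dpd_output_directory, example_id + ".gz")
    if not os.path.exists(path):
        return None
    with gzip.open(path, "rt") as dpd_file:
        return [line.strip() for line in dpd_file if line.strip()]
```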

max_dpd_logical_forms : int, optional (default=10)

We will use the first max_dpd_logical_forms logical forms as our target label. Only applicable if dpd_output_directory is given.

sort_dpd_logical_forms : bool, optional (default=True)

If True, we will sort the logical forms in the DPD output by length before selecting the first max_dpd_logical_forms. This makes the data loading quite a bit slower, but results in better training data.

max_dpd_tries : int, optional (default=20)

Sometimes DPD makes bad choices and gives us logical forms that we can’t parse (most of the time these are very unlikely logical forms, because, e.g., it hallucinates a date or number from the table that’s not in the question). We don’t want to spend our time trying to parse thousands of bad logical forms, so we only try to parse the first max_dpd_tries logical forms before giving up. This also speeds up data loading, because we don’t go through the entire DPD file if it’s huge (unless we’re sorting the logical forms). Only applicable if dpd_output_directory is given.
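The interaction of sort_dpd_logical_forms, max_dpd_tries, and max_dpd_logical_forms can be sketched as below. This is a simplified sketch, not the reader's actual code; try_parse stands in for whatever converts a logical form into an action sequence, returning None on failure:

```python
from typing import Callable, List, Optional

def select_logical_forms(dpd_lines: List[str],
                         try_parse: Callable[[str], Optional[str]],
                         max_dpd_logical_forms: int = 10,
                         sort_dpd_logical_forms: bool = True,
                         max_dpd_tries: int = 20) -> List[str]:
    # Optionally sort by length so shorter (typically more plausible)
    # forms come first. This requires reading the whole DPD file, which
    # is what makes data loading slower when sorting is on.
    if sort_dpd_logical_forms:
        dpd_lines = sorted(dpd_lines, key=len)
    selected = []
    # Only attempt to parse the first max_dpd_tries candidates; some
    # DPD output is unparseable and we don't want to grind through
    # thousands of bad logical forms.
    for line in dpd_lines[:max_dpd_tries]:
        parsed = try_parse(line)
        if parsed is not None:
            selected.append(parsed)
        if len(selected) >= max_dpd_logical_forms:
            break
    return selected
```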

keep_if_no_dpd : bool, optional (default=False)

If True, we will keep instances we read that don’t have DPD output. If you want to compute denotation accuracy on the full dataset, you should set this to True. Otherwise, your accuracy numbers will only reflect the subset of the data that has DPD output.

tokenizer : Tokenizer, optional

Tokenizer to use for the questions. Will default to WordTokenizer() with Spacy’s tagger enabled, as we use lemma matches as features for entity linking.

question_token_indexers : Dict[str, TokenIndexer], optional

Token indexers for questions. Will default to {"tokens": SingleIdTokenIndexer()}.

table_token_indexers : Dict[str, TokenIndexer], optional

Token indexers for table entities. Will default to question_token_indexers (though you very likely want to use something different for these, as you can’t rely on having an embedding for every table entity at test time).

use_table_for_vocab : bool, optional (default=False)

If True, we will include table cell text in vocabulary creation. The original parser did not do this, because the same table can appear multiple times, messing with vocab counts, and making you include lots of rare entities in your vocab.

linking_feature_extractors : List[str], optional

The list of feature extractors to use in the KnowledgeGraphField when computing entity linking features. See that class for more information. By default, we will use all available feature extractors.

include_table_metadata : bool, optional (default=False)

This is necessary for pre-processing the data. We output a jsonl file that has all of the information necessary for reading each instance, which includes the table contents itself. This flag tells the reader to include a table_metadata field that gets read by the pre-processing script.

max_table_tokens : int, optional

If given, we will only keep this number of total table tokens. This bounds the memory usage of the table representations, truncating cells with really long text. We specify a total number of tokens, not a max cell text length, because the number of table entities varies.
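One way to bound total table tokens is to give each cell an equal share of the budget; the sketch below uses that policy, which is a hypothetical choice for illustration, not necessarily the one this reader implements:

```python
from typing import List

def truncate_table_tokens(cell_tokens: List[List[str]],
                          max_table_tokens: int) -> List[List[str]]:
    """Cap the *total* number of table tokens by shortening long cells.

    We bound total tokens rather than per-cell length because the number
    of table entities varies from table to table.
    """
    total = sum(len(tokens) for tokens in cell_tokens)
    if total <= max_table_tokens:
        return cell_tokens
    # Split the budget evenly across cells, keeping at least one token
    # per cell so no entity disappears entirely.
    per_cell = max(1, max_table_tokens // len(cell_tokens))
    return [tokens[:per_cell] for tokens in cell_tokens]
```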

output_agendas : bool, optional (default=False)

Should we output agenda fields? This must be True if you want to train a coverage-based parser.

text_to_instance(self, question: str, table_lines: List[str], example_lisp_string: str = None, dpd_output: List[str] = None, tokenized_question: List[Token] = None) → Instance[source]

Reads text inputs and makes an instance. WikitableQuestions dataset provides tables as TSV files, which we use for training.


question : str

The input question.


table_lines : List[str]

The table content itself, as a list of rows. See TableQuestionKnowledgeGraph.read_from_lines for the expected format.

example_lisp_string : str, optional

The original (lisp-formatted) example string in the WikiTableQuestions dataset. This comes directly from the .examples file provided with the dataset. We pass this to SEMPRE for evaluating logical forms during training. It isn’t otherwise used for anything.

dpd_output : List[str], optional

List of logical forms, produced by dynamic programming on denotations. Not required during test.

tokenized_question : List[Token], optional

If you have already tokenized the question, you can pass that in here, so we don’t duplicate that work. You might, for example, do batch processing on the questions in the whole dataset, then pass the result in here.

Training data in WikitableQuestions comes with examples in the form of lisp strings in the format:
    (example (id <example-id>)
             (utterance <question>)
             (context (graph tables.TableKnowledgeGraph <table-filename>))
             (targetValue (list (description <answer1>) (description <answer2>) ...)))

We parse such strings and return the parsed information here.
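The format above can be picked apart with a few regular expressions. A minimal sketch, assuming utterances and descriptions are double-quoted in the raw files and ignoring nested or malformed input; the function name and returned keys are illustrative, not the reader's actual API:

```python
import re
from typing import Dict

def parse_example_line(lisp_string: str) -> Dict:
    """Pull the id, utterance, table filename, and target values out of
    one WikiTableQuestions example line."""
    example_id = re.search(r'\(id ([^)]+)\)', lisp_string).group(1)
    question = re.search(r'\(utterance "([^"]+)"\)', lisp_string).group(1)
    table_filename = re.search(
        r'\(context \(graph tables\.TableKnowledgeGraph ([^)]+)\)\)',
        lisp_string).group(1)
    # targetValue holds one (description ...) entry per answer.
    target_values = re.findall(r'\(description "([^"]*)"\)', lisp_string)
    return {"id": example_id,
            "question": question,
            "table_filename": table_filename,
            "target_values": target_values}
```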