allennlp.semparse.contexts

A KnowledgeGraph is a graphical representation of some structured knowledge source: say a table, figure or an explicit knowledge base.

class allennlp.semparse.contexts.knowledge_graph.KnowledgeGraph(entities: typing.Set[str], neighbors: typing.Dict[str, typing.List[str]], entity_text: typing.Dict[str, str] = None) → None[source]

Bases: object

A KnowledgeGraph represents a collection of entities and their relationships.

The KnowledgeGraph currently stores (untyped) neighborhood information and text representations of each entity (if there is any).

The knowledge base itself can be a table (like in WikitableQuestions), a figure (like in NLVR) or some other structured knowledge source. This abstract class needs to be inherited for implementing the functionality appropriate for a given KB.

All of the parameters listed below are stored as public attributes.

Parameters:
entities : Set[str]

The string identifiers of the entities in this knowledge graph. We sort this set and store it as a list. The sorting is so that we get a guaranteed consistent ordering across separate runs of the code.

neighbors : Dict[str, List[str]]

A mapping from string identifiers to other string identifiers, denoting which entities are neighbors in the graph.

entity_text : Dict[str, str]

If you have additional text associated with each entity (other than its string identifier), you can store that here. This might be, e.g., the text in a table cell, or the description of a wikipedia entity.
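As a rough illustration of the parameters above, the following stand-in class stores the three arguments as public attributes and sorts the entity set into a list so that the ordering is reproducible. It is not the library's implementation: the class name SketchKnowledgeGraph and the entity identifiers are assumptions made for this example.

```python
from typing import Dict, List, Optional, Set


class SketchKnowledgeGraph:
    """Hypothetical stand-in that mirrors the documented parameters."""

    def __init__(self,
                 entities: Set[str],
                 neighbors: Dict[str, List[str]],
                 entity_text: Optional[Dict[str, str]] = None) -> None:
        # A set has no guaranteed iteration order across runs, so sort it once.
        self.entities: List[str] = sorted(entities)
        self.neighbors = neighbors
        self.entity_text = entity_text


# Made-up identifiers; the real library uses its own naming scheme.
graph = SketchKnowledgeGraph(
    entities={"column:nation", "cell:usa", "cell:china"},
    neighbors={
        "column:nation": ["cell:usa", "cell:china"],
        "cell:usa": ["column:nation"],
        "cell:china": ["column:nation"],
    },
    entity_text={"column:nation": "Nation", "cell:usa": "USA", "cell:china": "China"},
)
print(graph.entities)
```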

class allennlp.semparse.contexts.table_question_knowledge_graph.TableQuestionKnowledgeGraph(entities: typing.Set[str], neighbors: typing.Dict[str, typing.List[str]], entity_text: typing.Dict[str, str], question_tokens: typing.List[allennlp.data.tokenizers.token.Token]) → None[source]

Bases: allennlp.semparse.contexts.knowledge_graph.KnowledgeGraph

A TableQuestionKnowledgeGraph represents the linkable entities in a table and a question about the table. The linkable entities in a table are the cells and the columns of the table, and the linkable entities from the question are the numbers in the question. We use the question to define our space of allowable numbers, because there are infinitely many numbers that we could include in our action space, and we really don’t want to do that. Additionally, we have a method that returns the set of entities in the graph that are relevant to the question, and we keep the question for this method. See get_linked_agenda_items for more information.

To represent the table as a graph, we make each cell and column a node in the graph, and consider a column’s neighbors to be all cells in that column (and thus each cell has just one neighbor - the column it belongs to). This is a rather simplistic view of the table. For example, we don’t store the order of rows.

We represent numbers as standalone nodes in the graph, without any neighbors.

Additionally, when we encounter cells that can be split, we create fb:part.[something] entities, also without any neighbors.

cell_part_regex = re.compile(',\\s|\\n|/')
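
To see what this pattern splits on, the sketch below applies the same regular expression to a few cell strings. The helper split_cell and the sample cells are assumptions for illustration; how the library turns the resulting parts into fb:part entities is not shown here.

```python
import re

# Same pattern as the documented attribute: comma + whitespace, newline, or "/".
cell_part_regex = re.compile(',\\s|\\n|/')


def split_cell(cell_text):
    """Return the non-empty parts of a cell, or [] if nothing was split off."""
    parts = [part.strip() for part in cell_part_regex.split(cell_text)]
    parts = [part for part in parts if part]
    return parts if len(parts) > 1 else []


print(split_cell("New York, USA/Canada"))
print(split_cell("USA"))
```
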
get_linked_agenda_items() → typing.List[str][source]

Returns entities that can be linked to spans in the question and that should be in the agenda, for training a coverage-based semantic parser. This method essentially performs heuristic entity linking to provide weak supervision for a learning-to-search parser.

classmethod read_from_file(filename: str, question: typing.List[allennlp.data.tokenizers.token.Token]) → allennlp.semparse.contexts.table_question_knowledge_graph.TableQuestionKnowledgeGraph[source]

We read tables formatted as TSV files here. We assume the first line in the file is a tab separated list of column headers, and all subsequent lines are content rows. For example if the TSV file is:

Nation	Olympics	Medals
USA	1896	8
China	1932	9

we read “Nation”, “Olympics” and “Medals” as column headers, “USA” and “China” as cells under the “Nation” column and so on.
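
A minimal sketch of this reading step, combined with the graph layout described for this class (each column neighbors its cells, each cell neighbors its column), might look as follows. The function name and the "column:"/"cell:" identifier prefixes are hypothetical; the library's own entity names and normalization are not reproduced.

```python
from typing import Dict, List


def table_lines_to_neighbors(lines: List[str]) -> Dict[str, List[str]]:
    """Parse TSV lines (header first) into a column <-> cell neighbor map."""
    header, *rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    neighbors: Dict[str, List[str]] = {}
    for column_index, column_name in enumerate(header):
        column_id = f"column:{column_name}"  # hypothetical identifier scheme
        neighbors.setdefault(column_id, [])
        for row in rows:
            cell_id = f"cell:{row[column_index]}"
            if cell_id not in neighbors[column_id]:
                neighbors[column_id].append(cell_id)
            cell_neighbors = neighbors.setdefault(cell_id, [])
            if column_id not in cell_neighbors:
                cell_neighbors.append(column_id)
    return neighbors


lines = ["Nation\tOlympics\tMedals", "USA\t1896\t8", "China\t1932\t9"]
neighbors = table_lines_to_neighbors(lines)
print(neighbors["column:Nation"])
```

Note that in this sketch a cell whose text appears in two columns would end up with two neighbors, so it only approximates the one-neighbor-per-cell description.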

classmethod read_from_json(json_object: typing.Dict[str, typing.Any]) → allennlp.semparse.contexts.table_question_knowledge_graph.TableQuestionKnowledgeGraph[source]

We read tables formatted as JSON objects (dicts) here. This is useful when you are reading data from a demo. The expected format is:

{"question": [token1, token2, ...],
 "columns": [column1, column2, ...],
 "cells": [[row1_cell1, row1_cell2, ...],
           [row2_cell1, row2_cell2, ...],
           ... ]}
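
For concreteness, here is the small Olympics table from read_from_file written in this JSON layout, with a basic shape check. The question tokens are invented for the example, and rows_match_columns is a hypothetical helper, not part of the library.

```python
import json

table_json = {
    "question": ["how", "many", "medals", "did", "china", "win", "?"],
    "columns": ["Nation", "Olympics", "Medals"],
    "cells": [["USA", "1896", "8"],
              ["China", "1932", "9"]],
}


def rows_match_columns(json_object):
    """Check that every row has exactly one cell per column."""
    width = len(json_object["columns"])
    return all(len(row) == width for row in json_object["cells"])


serialized = json.dumps(table_json)
print(rows_match_columns(json.loads(serialized)))
```
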
classmethod read_from_lines(lines: typing.List[str], question: typing.List[allennlp.data.tokenizers.token.Token]) → allennlp.semparse.contexts.table_question_knowledge_graph.TableQuestionKnowledgeGraph[source]
class allennlp.semparse.contexts.table_question_context.TableQuestionContext(table_data: typing.List[typing.Dict[str, str]], column_types: typing.Dict[str, str], question_tokens: typing.List[allennlp.data.tokenizers.token.Token]) → None[source]

Bases: object

A barebones implementation similar to https://github.com/crazydonkey200/neural-symbolic-machines/blob/master/table/wtq/preprocess.py for extracting entities from a question given a table and typing its columns as <string> | <date> | <number>.

MAX_TOKENS_FOR_NUM_CELL = 2
get_entities_from_question() → typing.Tuple[typing.List[typing.Tuple[str, str]], typing.List[typing.Tuple[str, int]]][source]
get_table_knowledge_graph() → allennlp.semparse.contexts.knowledge_graph.KnowledgeGraph[source]
static normalize_string(string: str) → str[source]

These are the transformation rules used to normalize cell and column names in Sempre. See edu.stanford.nlp.sempre.tables.StringNormalizationUtils.characterNormalize and edu.stanford.nlp.sempre.tables.TableTypeSystem.canonicalizeName. We reproduce those rules here to normalize and canonicalize cells and columns in the same way so that we can match them against constants in logical forms appropriately.

classmethod read_from_file(filename: str, question_tokens: typing.List[allennlp.data.tokenizers.token.Token]) → allennlp.semparse.contexts.table_question_context.TableQuestionContext[source]
classmethod read_from_lines(lines: typing.List[typing.List[str]], question_tokens: typing.List[allennlp.data.tokenizers.token.Token]) → allennlp.semparse.contexts.table_question_context.TableQuestionContext[source]
allennlp.semparse.contexts.atis_tables.am_map_match_to_query_value(match: str)[source]
allennlp.semparse.contexts.atis_tables.convert_to_string_list_value_dict(trigger_dict: typing.Dict[str, int]) → typing.Dict[str, typing.List[str]][source]
allennlp.semparse.contexts.atis_tables.digit_to_query_time(digit: str) → typing.List[int][source]

Given a digit in the utterance, return a list of the times that it corresponds to.

allennlp.semparse.contexts.atis_tables.get_approximate_times(times: typing.List[int]) → typing.List[int][source]

Given a list of times that follow a word such as “about”, we return a list of times that could appear in the query as a result of this. For example, if “about 7pm” appears in the utterance, then we also want to add 1830 and 1930.
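
The arithmetic in that example can be sketched as follows, assuming times are HHMM integers and that "approximately" means 30 minutes either side. The function name, the fixed 30-minute window, and the wrap-around at midnight are assumptions; the library function may behave differently in edge cases.

```python
from typing import List


def sketch_approximate_times(times: List[int], delta_minutes: int = 30) -> List[int]:
    """For each HHMM time, return the times delta_minutes before and after it."""
    approximate = []
    for time in times:
        hours, minutes = divmod(time, 100)
        total_minutes = hours * 60 + minutes
        for shift in (-delta_minutes, delta_minutes):
            shifted = (total_minutes + shift) % (24 * 60)  # wrap around midnight
            new_hours, new_minutes = divmod(shifted, 60)
            approximate.append(new_hours * 100 + new_minutes)
    return approximate


print(sketch_approximate_times([1900]))
```
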

allennlp.semparse.contexts.atis_tables.get_costs_from_utterance(utterance: str, tokenized_utterance: typing.List[allennlp.data.tokenizers.token.Token]) → typing.Dict[str, typing.List[int]][source]
allennlp.semparse.contexts.atis_tables.get_date_from_utterance(tokenized_utterance: typing.List[allennlp.data.tokenizers.token.Token], year: int = 1993) → typing.List[datetime.datetime][source]

When the year is not explicitly mentioned in the utterance, the query assumes that it is 1993 so we do the same here. If there is no mention of the month or day then we do not return any dates from the utterance.
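
The defaulting behaviour can be illustrated with a simplified sketch that works on plain string tokens rather than Token objects. It only recognises full month names and numeric days, never reads a year from the utterance, and requires both a month and a day; all of these are simplifying assumptions, as is the function name.

```python
from datetime import datetime
from typing import List

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def sketch_dates_from_tokens(tokens: List[str], year: int = 1993) -> List[datetime]:
    """Return at most one date built from the first month and day mentioned."""
    lowered = [token.lower() for token in tokens]
    month = next((MONTHS[token] for token in lowered if token in MONTHS), None)
    day = next((int(token) for token in lowered
                if token.isdigit() and 1 <= int(token) <= 31), None)
    if month is None or day is None:
        return []  # no usable date in the utterance
    try:
        return [datetime(year, month, day)]
    except ValueError:  # e.g. February 31
        return []


print(sketch_dates_from_tokens(["flights", "on", "june", "5"]))
```
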

allennlp.semparse.contexts.atis_tables.get_flight_numbers_from_utterance(utterance: str, tokenized_utterance: typing.List[allennlp.data.tokenizers.token.Token]) → typing.Dict[str, typing.List[int]][source]
allennlp.semparse.contexts.atis_tables.get_numbers_from_utterance(utterance: str, tokenized_utterance: typing.List[allennlp.data.tokenizers.token.Token]) → typing.Dict[str, typing.List[int]][source]

Given an utterance, this function finds all the numbers that are in the action space. Since we need to keep track of linking scores, we represent the numbers as a dictionary, where the keys are the string representations of the numbers and the values are lists of the token indices that trigger each number.
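
The shape of the returned dictionary can be shown with a small sketch over plain string tokens. It only picks up tokens made of digits; the library function also considers which numbers are in the action space, which is not modelled here. The function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def sketch_numbers_from_tokens(tokens: List[str]) -> Dict[str, List[int]]:
    """Map each digit-only token to the indices of the tokens that mention it."""
    number_to_indices = defaultdict(list)
    for index, token in enumerate(tokens):
        if token.isdigit():
            number_to_indices[token].append(index)
    return dict(number_to_indices)


print(sketch_numbers_from_tokens(["flight", "417", "costs", "200", "or", "417"]))
```
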

allennlp.semparse.contexts.atis_tables.get_time_range_end_from_utterance(utterance: str, tokenized_utterance: typing.List[allennlp.data.tokenizers.token.Token]) → typing.Dict[str, typing.List[int]][source]
allennlp.semparse.contexts.atis_tables.get_time_range_start_from_utterance(utterance: str, tokenized_utterance: typing.List[allennlp.data.tokenizers.token.Token]) → typing.Dict[str, typing.List[int]][source]
allennlp.semparse.contexts.atis_tables.get_times_from_utterance(utterance: str, char_offset_to_token_index: typing.Dict[int, int], indices_of_approximate_words: typing.Set[int]) → typing.Dict[str, typing.List[int]][source]

Given an utterance, we get the numbers that correspond to times and convert them to values that may appear in the query. For example: convert 7pm to 1900.
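
The conversion in that example can be sketched with a regular expression over the raw utterance. This simplified version returns only the HHMM values and ignores the token indices and approximate-word handling that the real function takes as arguments; the pattern and function name are assumptions.

```python
import re
from typing import List

# An hour followed by "am" or "pm", with an optional space in between.
TIME_PATTERN = re.compile(r"\b(\d{1,2})\s?(am|pm)\b")


def sketch_times(utterance: str) -> List[int]:
    """Convert mentions such as '7pm' into 24-hour HHMM integers."""
    times = []
    for match in TIME_PATTERN.finditer(utterance.lower()):
        hour = int(match.group(1)) % 12  # 12am -> 0, 12pm -> 0 before the shift
        if match.group(2) == "pm":
            hour += 12
        times.append(hour * 100)
    return times


print(sketch_times("flights after 7pm and before 11 pm"))
```
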

allennlp.semparse.contexts.atis_tables.get_trigger_dict(trigger_lists: typing.List[typing.List[str]], trigger_dicts: typing.List[typing.Dict[str, typing.List[str]]]) → typing.Dict[str, typing.List[str]][source]
allennlp.semparse.contexts.atis_tables.pm_map_match_to_query_value(match: str)[source]

An AtisSqlTableContext represents the SQL context in which an utterance appears for the Atis dataset, with the grammar and the valid actions.

class allennlp.semparse.contexts.atis_sql_table_context.AtisSqlTableContext(all_tables: typing.Dict[str, typing.List[str]] = None, tables_with_strings: typing.Dict[str, typing.List[str]] = None, database_file: str = None) → None[source]

Bases: object

An AtisSqlTableContext represents the SQL context with a grammar of SQL and the valid actions based on the schema of the tables that it represents.

Parameters:
all_tables : Dict[str, List[str]]

A dictionary representing the SQL tables in the dataset, the keys are the names of the tables that map to lists of the table’s column names.

tables_with_strings : Dict[str, List[str]]

A dictionary representing the SQL tables that we want to generate strings for. The keys are the names of the tables that map to lists of the table’s column names.

database_file : str, optional

The location of the sqlite database file. We query the sqlite database to find the strings that are allowed.

create_grammar_dict_and_strings() → typing.Tuple[typing.Dict[str, typing.List[str]], typing.List[typing.Tuple[str, str]]][source]
get_grammar_dictionary() → typing.Dict[str, typing.List[str]][source]
get_grammar_string()[source]
get_valid_actions() → typing.Dict[str, typing.List[str]][source]

A Text2SqlTableContext represents the SQL context in which an utterance appears for any of the text2sql datasets, with the grammar and the valid actions.

allennlp.semparse.contexts.text2sql_table_context.update_grammar_numbers_and_strings_with_variables(grammar_dictionary: typing.Dict[str, typing.List[str]], prelinked_entities: typing.Dict[str, typing.Dict[str, str]], columns: typing.Dict[str, allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]) → None[source]
allennlp.semparse.contexts.text2sql_table_context.update_grammar_to_be_variable_free(grammar_dictionary: typing.Dict[str, typing.List[str]])[source]

SQL is a predominantly variable-free language in terms of simple usage, in the sense that most queries do not create references to variables which are not already static tables in a dataset. However, it is possible to do this via derived tables. If we don’t require this functionality, we can tighten the grammar, because we don’t need to support aliased tables.

allennlp.semparse.contexts.text2sql_table_context.update_grammar_values_with_variables(grammar_dictionary: typing.Dict[str, typing.List[str]], prelinked_entities: typing.Dict[str, typing.Dict[str, str]]) → None[source]
allennlp.semparse.contexts.text2sql_table_context.update_grammar_with_global_values(grammar_dictionary: typing.Dict[str, typing.List[str]], dataset_name: str)[source]
allennlp.semparse.contexts.text2sql_table_context.update_grammar_with_table_values(grammar_dictionary: typing.Dict[str, typing.List[str]], schema: typing.Dict[str, typing.List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]], cursor: sqlite3.Cursor) → None[source]
allennlp.semparse.contexts.text2sql_table_context.update_grammar_with_tables(grammar_dictionary: typing.Dict[str, typing.List[str]], schema: typing.Dict[str, typing.List[allennlp.data.dataset_readers.dataset_utils.text2sql_utils.TableColumn]]) → None[source]
allennlp.semparse.contexts.text2sql_table_context.update_grammar_with_untyped_entities(grammar_dictionary: typing.Dict[str, typing.List[str]]) → None[source]

Variables can be treated as numbers or strings if their type can be inferred - however, that can be difficult, so instead, we can just treat them all as values and be a bit looser on the typing we allow in our grammar. Here we just remove all references to number and string from the grammar, replacing them with value.
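
The idea can be sketched on a toy grammar dictionary: every whole-word reference to number or string on a right-hand side becomes value, and the value rule drops the alternatives that would now point at itself. The toy grammar and the function name are made up for this example and are not the library's SQL grammar or implementation.

```python
import re
from typing import Dict, List

TYPED_REFERENCE = re.compile(r"\b(number|string)\b")


def sketch_untype_grammar(grammar_dictionary: Dict[str, List[str]]) -> None:
    """Replace references to number/string with value, modifying the dict in place."""
    for nonterminal, productions in list(grammar_dictionary.items()):
        rewritten: List[str] = []
        for production in productions:
            new_production = TYPED_REFERENCE.sub("value", production)
            if nonterminal == "value" and new_production == "value":
                continue  # drop "value = value" self-references
            if new_production not in rewritten:
                rewritten.append(new_production)
        grammar_dictionary[nonterminal] = rewritten


toy_grammar = {
    "expr": ["value", 'number ws "+" ws number'],
    "value": ["string", "number", "boolean"],
    "limit": ['("LIMIT" ws number)'],
}
sketch_untype_grammar(toy_grammar)
print(toy_grammar)
```
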

class allennlp.semparse.contexts.sql_context_utils.SqlVisitor(grammar: parsimonious.grammar.Grammar, keywords_to_uppercase: typing.List[str] = None) → None[source]

Bases: parsimonious.nodes.NodeVisitor

SqlVisitor performs a depth-first traversal of the AST. It takes the parse tree and gives us the action sequence that resulted in that parse. Since the visitor has mutable state, we define a new SqlVisitor for each query. To get the action sequence, we create a SqlVisitor and call parse on it, which returns a list of actions. For example:

sql_visitor = SqlVisitor(grammar)
action_sequence = sql_visitor.parse(query)

Importantly, this SqlVisitor skips over ws and wsp nodes, because they do not hold any meaning, and make an action sequence much longer than it needs to be.
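
To make the traversal concrete without depending on Parsimonious, the sketch below walks a hand-built tree depth-first, left to right, records one action per expanded node, and skips ws and wsp children. The Node dataclass, the tree, and the exact action string format are assumptions for illustration, not the library's data structures or output.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical parse-tree node: a rule name, matched text, and children."""
    name: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def sketch_action_sequence(root: Node) -> List[str]:
    actions: List[str] = []

    def visit(node: Node) -> None:
        # Whitespace rules carry no meaning, so they never appear in an action.
        kept = [child for child in node.children if child.name not in ("ws", "wsp")]
        if not kept:
            return  # a terminal: nothing was expanded here
        right_hand_side = ", ".join(
            child.name if child.children else f'"{child.text}"' for child in kept)
        actions.append(f"{node.name} -> [{right_hand_side}]")
        for child in kept:  # left to right
            visit(child)

    visit(root)
    return actions


tree = Node("query", children=[
    Node("keyword", text="SELECT"),
    Node("ws", text=" "),
    Node("column", children=[Node("name", text="city")]),
])
print(sketch_action_sequence(tree))
```
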

Parameters:
grammar : Grammar

A Grammar object that we use to parse the text.

keywords_to_uppercase : List[str], optional (default = None)

Keywords in the grammar to uppercase. In the case of sql, this might be SELECT, MAX etc.

add_action(node: parsimonious.nodes.Node) → None[source]

For each node, we accumulate the rules that generated its children in a list.

generic_visit(node: parsimonious.nodes.Node, visited_children: typing.List[NoneType]) → typing.List[str][source]

Default visitor method

Parameters:
  • node – The node we’re visiting
  • visited_children – The results of visiting the children of that node, in a list

I’m not sure there’s an implementation of this that makes sense across all (or even most) use cases, so we leave it to subclasses to implement for now.

visit(node)[source]

See the NodeVisitor visit method. This just changes the order in which we visit nonterminals from right-to-left to left-to-right.

allennlp.semparse.contexts.sql_context_utils.action_sequence_to_sql(action_sequences: typing.List[str]) → str[source]
allennlp.semparse.contexts.sql_context_utils.format_action(nonterminal: str, right_hand_side: str, is_string: bool = False, is_number: bool = False, keywords_to_uppercase: typing.List[str] = None) → str[source]

This function formats an action as it appears in models. It splits productions based on the special ws and wsp rules, which are used in grammars to denote whitespace, and then rejoins these tokens into a formatted, comma-separated list. Importantly, note that it does not split on spaces in the grammar string, because these might not correspond to spaces in the language the grammar recognises.

Parameters:
nonterminal : str, required.

The nonterminal in the action.

right_hand_side : str, required.

The right hand side of the action (i.e., the thing which is produced).

is_string : bool, optional (default = False).

Whether the production produces a string. If it does, it is formatted as nonterminal -> ['string']

is_number : bool, optional, (default = False).

Whether the production produces a number. If it does, it is formatted as nonterminal -> ['number']

keywords_to_uppercase : List[str], optional (default = None)

Keywords in the grammar to uppercase. In the case of sql, this might be SELECT, MAX etc.
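
The split-and-rejoin step described above can be sketched as follows. This covers only the general case: the is_string, is_number, and keywords_to_uppercase options are left out, and the splitting pattern and function name are assumptions rather than the library's own.

```python
import re

# Split on the whitespace rule names ws / wsp (and the spaces around them),
# but not on spaces that occur anywhere else in the production.
WHITESPACE_RULES = re.compile(r"\s*\bwsp?\b\s*")


def sketch_format_action(nonterminal: str, right_hand_side: str) -> str:
    """Format a production as 'nonterminal -> [child, child, ...]'."""
    children = [token for token in WHITESPACE_RULES.split(right_hand_side) if token]
    return f"{nonterminal} -> [{', '.join(children)}]"


print(sketch_format_action("expr", 'col_ref ws "IS NOT" wsp value'))
```

The quoted literal "IS NOT" keeps its internal space because only the ws and wsp rule names act as separators.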

allennlp.semparse.contexts.sql_context_utils.format_grammar_string(grammar_dictionary: typing.Dict[str, typing.List[str]]) → str[source]

Formats a dictionary of production rules into the string format expected by the Parsimonious Grammar class.
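
In Parsimonious grammar syntax a rule is written as name = alternative / alternative, one rule per line. A minimal sketch of such a formatter, under the assumption that each dictionary value lists the alternatives for its key, could be:

```python
from typing import Dict, List


def sketch_format_grammar_string(grammar_dictionary: Dict[str, List[str]]) -> str:
    """Join each rule's alternatives with ' / ' and put one rule per line."""
    return "\n".join(
        f"{nonterminal} = {' / '.join(alternatives)}"
        for nonterminal, alternatives in grammar_dictionary.items())


toy_grammar = {"statement": ["query", "update"], "query": ['"SELECT"']}
print(sketch_format_grammar_string(toy_grammar))
```

The function name and the toy grammar are hypothetical.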

allennlp.semparse.contexts.sql_context_utils.initialize_valid_actions(grammar: parsimonious.grammar.Grammar, keywords_to_uppercase: typing.List[str] = None) → typing.Dict[str, typing.List[str]][source]

We initialize the valid actions with the global actions. These include the valid actions that result from the grammar and also those that result from the tables provided. The keys represent the nonterminals in the grammar and the values are lists of the valid actions of that nonterminal.

class allennlp.semparse.contexts.quarel_utils.WorldTaggerExtractor(tagger_archive)[source]

Bases: object

get_world_entities(question: str, tokenized_question: typing.List[allennlp.data.tokenizers.token.Token] = None) → typing.Dict[str, typing.List[str]][source]
allennlp.semparse.contexts.quarel_utils.align_entities(extracted: typing.List[str], literals: typing.Dict[str, typing.Any], stemmer: nltk.stem.porter.PorterStemmer) → typing.List[str][source]

Use stemming to attempt alignment between extracted world and given world literals. If more words align to one world vs the other, it’s considered aligned.

allennlp.semparse.contexts.quarel_utils.delete_duplicates(expr: typing.List) → typing.List[source]
allennlp.semparse.contexts.quarel_utils.from_bio(tags: typing.List[str], target: str) → typing.List[typing.Tuple[int, int]][source]
allennlp.semparse.contexts.quarel_utils.from_entity_cues_string(cues_string: str) → typing.Dict[str, typing.List[str]][source]
allennlp.semparse.contexts.quarel_utils.from_qr_spec_string(qr_spec: str) → typing.List[typing.Dict[str, int]][source]
allennlp.semparse.contexts.quarel_utils.get_explanation(logical_form: str, world_extractions: typing.Dict[str, typing.Any], answer_index: int, world: allennlp.semparse.worlds.quarel_world.QuarelWorld) → typing.List[typing.Dict[str, typing.Any]][source]

Create explanation (as a list of header/content entries) for an answer

allennlp.semparse.contexts.quarel_utils.get_stem_overlaps(query: str, references: typing.List[str], stemmer: nltk.stem.porter.PorterStemmer) → typing.List[int][source]
allennlp.semparse.contexts.quarel_utils.get_words(string: str) → typing.List[str][source]
allennlp.semparse.contexts.quarel_utils.group_worlds(tags: typing.List[str], tokens: typing.List[str]) → typing.Dict[str, typing.List[str]][source]
allennlp.semparse.contexts.quarel_utils.nl_arg(arg: typing.Any, nl_world: typing.Dict[str, typing.Any]) → typing.Any[source]
allennlp.semparse.contexts.quarel_utils.nl_attr(attr: str) → str[source]
allennlp.semparse.contexts.quarel_utils.nl_dir(sign: int) → str[source]
allennlp.semparse.contexts.quarel_utils.nl_triple(triple: typing.List[str], nl_world: typing.Dict[str, typing.Any]) → str[source]
allennlp.semparse.contexts.quarel_utils.nl_world_string(world: typing.List[str]) → str[source]
allennlp.semparse.contexts.quarel_utils.split_question(question: str) → typing.List[str][source]
allennlp.semparse.contexts.quarel_utils.str_join(string_or_list: typing.Union[str, typing.List[str]], joiner: str, prefixes: str = '', postfixes: str = '') → str[source]
allennlp.semparse.contexts.quarel_utils.strip_entity_type(entity: str) → str[source]
allennlp.semparse.contexts.quarel_utils.to_camel(string: str) → str[source]
allennlp.semparse.contexts.quarel_utils.to_qr_spec_string(qr_coeff_sets: typing.List[typing.Dict[str, int]]) → str[source]
allennlp.semparse.contexts.quarel_utils.words_from_entity_string(entity: str) → str[source]