allennlp.data.token_indexers

A TokenIndexer determines how string tokens get represented as arrays of indices in a model.

class allennlp.data.token_indexers.token_indexer.TokenIndexer[source]

Bases: typing.Generic, allennlp.common.registrable.Registrable

A TokenIndexer determines how string tokens get represented as arrays of indices in a model. This class both converts strings into numerical values, with the help of a Vocabulary, and produces the actual arrays.

Tokens can be represented as single IDs (e.g., the word “cat” gets represented by the number 34), or as lists of character IDs (e.g., “cat” gets represented by the numbers [23, 10, 18]), or in some other way that you can come up with (e.g., if you have some structured input you want to represent in a special way in your data arrays, you can do that here).

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
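
For example, a minimal sketch of the counting step, assuming the 0.x API documented here (SingleIdTokenIndexer is the 'single_id' default implementation):

    from collections import defaultdict

    from allennlp.data import Token, Vocabulary
    from allennlp.data.token_indexers import SingleIdTokenIndexer

    # A counter mapping namespace -> vocabulary item -> count.
    counter = defaultdict(lambda: defaultdict(int))

    indexer = SingleIdTokenIndexer(namespace="tokens")
    for token in [Token("cat"), Token("sat"), Token("cat")]:
        indexer.count_vocab_items(token, counter)

    # counter["tokens"] is now {"cat": 2, "sat": 1}, and a Vocabulary
    # can be built directly from these counts.
    vocab = Vocabulary(counter=counter)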

default_implementation = 'single_id'
get_keys(index_name: str) → typing.List[str][source]

Return a list of the keys this indexer returns from tokens_to_indices().

get_padding_lengths(token: TokenType) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → TokenType[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[TokenType]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[TokenType]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.
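
A minimal sketch of padding with the single-ID indexer, where the padding token is the integer 0 and no supplemental lengths are needed:

    from allennlp.data.token_indexers import SingleIdTokenIndexer

    indexer = SingleIdTokenIndexer()
    # Pad a two-token sequence out to four entries.
    padded = indexer.pad_token_sequence(
        {"tokens": [3, 7]},
        desired_num_tokens={"tokens": 4},
        padding_lengths={},
    )
    # padded == {"tokens": [3, 7, 0, 0]}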

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[TokenType]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.
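
An end-to-end sketch with the single-ID representation; note that the returned dictionary is keyed by index_name (see get_keys() above):

    from allennlp.data import Token, Vocabulary
    from allennlp.data.token_indexers import SingleIdTokenIndexer

    vocab = Vocabulary()
    cat_id = vocab.add_token_to_namespace("cat", namespace="tokens")
    sat_id = vocab.add_token_to_namespace("sat", namespace="tokens")

    indexer = SingleIdTokenIndexer(namespace="tokens")
    indices = indexer.tokens_to_indices([Token("cat"), Token("sat")], vocab, "tokens")
    # indices == {"tokens": [cat_id, sat_id]}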

class allennlp.data.token_indexers.dep_label_indexer.DepLabelIndexer(namespace: str = 'dep_labels') → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their syntactic dependency label, as determined by the dep_ field on Token.

Parameters:
namespace : str, optional (default=``dep_labels``)

We will use this namespace in the Vocabulary to map strings to indices.
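
A sketch of typical usage; the dep_ field is normally filled in by spacy's dependency parser, so this assumes a spacy model is installed:

    from collections import defaultdict

    from allennlp.data import Vocabulary
    from allennlp.data.token_indexers import DepLabelIndexer
    from allennlp.data.tokenizers import WordTokenizer
    from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

    # parse=True makes the splitter run spacy's parser and set dep_.
    tokenizer = WordTokenizer(word_splitter=SpacyWordSplitter(parse=True))
    tokens = tokenizer.tokenize("The cat sat on the mat.")

    indexer = DepLabelIndexer()
    counter = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        indexer.count_vocab_items(token, counter)
    vocab = Vocabulary(counter=counter)

    # One id per token, drawn from its dependency label ("det", "nsubj", ...).
    indices = indexer.tokens_to_indices(tokens, vocab, "dep_labels")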

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[int]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.ner_tag_indexer.NerTagIndexer(namespace: str = 'ner_tags') → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their entity type (i.e., their NER tag), as determined by the ent_type_ field on Token.

Parameters:
namespace : str, optional (default=``ner_tags``)

We will use this namespace in the Vocabulary to map strings to indices.
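
Usage mirrors DepLabelIndexer, except the splitter must run spacy's NER so that ent_type_ is set. A sketch; tokens outside any entity have an empty entity type, which the indexer counts under a "NONE" entry:

    from allennlp.data.token_indexers import NerTagIndexer
    from allennlp.data.tokenizers import WordTokenizer
    from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

    tokenizer = WordTokenizer(word_splitter=SpacyWordSplitter(ner=True))
    tokens = tokenizer.tokenize("Paris is the capital of France.")

    indexer = NerTagIndexer()
    # "Paris" and "France" are indexed by their entity type (e.g. "GPE");
    # the remaining tokens all share the "NONE" entry.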

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[int]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.pos_tag_indexer.PosTagIndexer(namespace: str = 'pos_tags', coarse_tags: bool = False) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their part of speech tag, as determined by the pos_ or tag_ fields on Token (corresponding to spacy’s coarse-grained and fine-grained POS tags, respectively).

Parameters:
namespace : str, optional (default=``pos_tags``)

We will use this namespace in the Vocabulary to map strings to indices.

coarse_tags : bool, optional (default=``False``)

If True, we will use coarse POS tags instead of the default fine-grained POS tags.
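
A sketch contrasting the two tag granularities (requires a spacy model, since the pos_ and tag_ fields come from spacy):

    from allennlp.data.token_indexers import PosTagIndexer
    from allennlp.data.tokenizers import WordTokenizer
    from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter

    tokenizer = WordTokenizer(word_splitter=SpacyWordSplitter(pos_tags=True))
    tokens = tokenizer.tokenize("The cat sat.")

    fine = PosTagIndexer()                    # reads tag_: "DT", "NN", "VBD", ...
    coarse = PosTagIndexer(coarse_tags=True)  # reads pos_: "DET", "NOUN", "VERB", ...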

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[int]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer(namespace: str = 'tokens', lowercase_tokens: bool = False) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as single integers.

Parameters:
namespace : str, optional (default=``tokens``)

We will use this namespace in the Vocabulary to map strings to indices.

lowercase_tokens : bool, optional (default=``False``)

If True, we will call token.lower() before getting an index for the token from the vocabulary.
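
A sketch of the lowercasing option:

    from allennlp.data import Token, Vocabulary
    from allennlp.data.token_indexers import SingleIdTokenIndexer

    vocab = Vocabulary()
    cat_id = vocab.add_token_to_namespace("cat", namespace="tokens")

    indexer = SingleIdTokenIndexer(lowercase_tokens=True)
    # "Cat" is lowercased before the vocabulary lookup.
    indices = indexer.tokens_to_indices([Token("Cat")], vocab, "tokens")
    # indices == {"tokens": [cat_id]}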

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[int]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer(namespace: str = 'token_characters', character_tokenizer: allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer = CharacterTokenizer()) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as lists of character indices.

Parameters:
namespace : str, optional (default=``token_characters``)

We will use this namespace in the Vocabulary to map the characters in each token to indices.

character_tokenizer : CharacterTokenizer, optional (default=``CharacterTokenizer()``)

We use a CharacterTokenizer to handle splitting tokens into characters, as it has options for byte encoding and other things. The default here is to instantiate a CharacterTokenizer with its default parameters, which uses unicode characters and retains casing.
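
A sketch of character-level indexing; here the vocabulary items are individual characters, and each token maps to a list of character ids:

    from collections import defaultdict

    from allennlp.data import Token, Vocabulary
    from allennlp.data.token_indexers import TokenCharactersIndexer

    indexer = TokenCharactersIndexer()
    counter = defaultdict(lambda: defaultdict(int))
    indexer.count_vocab_items(Token("cat"), counter)
    # counter["token_characters"] == {"c": 1, "a": 1, "t": 1}

    vocab = Vocabulary(counter=counter)
    indices = indexer.tokens_to_indices([Token("cat")], vocab, "chars")
    # indices == {"chars": [[c_id, a_id, t_id]]}; get_padding_lengths()
    # then reports the per-token character width to pad each token to.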

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: typing.List[int]) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → typing.List[int][source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[typing.List[int]]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[typing.List[int]]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[typing.List[int]]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.elmo_indexer.ELMoCharacterMapper[source]

Bases: object

Maps individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here as a special case of the existing character indexers.

beginning_of_sentence_character = 256
beginning_of_sentence_characters = [258, 256, 259] + [260] * 47 (50 entries: begin-of-word, BOS, end-of-word, then padding)
beginning_of_word_character = 258
bos_token = '<S>'
static convert_word_to_char_ids(word: str) → typing.List[int][source]
end_of_sentence_character = 257
end_of_sentence_characters = [258, 257, 259] + [260] * 47 (50 entries: begin-of-word, EOS, end-of-word, then padding)
end_of_word_character = 259
eos_token = '</S>'
max_word_length = 50
padding_character = 260
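
A sketch of the mapping; the exact ids follow the constants above, plus a one-place offset applied by convert_word_to_char_ids() so that 0 stays free for masking:

    from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper

    char_ids = ELMoCharacterMapper.convert_word_to_char_ids("cat")
    len(char_ids)  # 50, i.e. max_word_length
    # The utf-8 bytes of the word sit between begin- and end-of-word
    # markers, with padding filling the rest:
    # [258+1, 99+1, 97+1, 116+1, 259+1, 260+1, 260+1, ...]
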
class allennlp.data.token_indexers.elmo_indexer.ELMoTokenCharactersIndexer(namespace: str = 'elmo_characters') → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Convert a token to an array of character ids to compute ELMo representations.

Parameters:
namespace : str, optional (default=``elmo_characters``)
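
A sketch of usage; the character mapping is fixed, so the vocabulary argument is never consulted and an empty one suffices:

    from allennlp.data import Token, Vocabulary
    from allennlp.data.token_indexers import ELMoTokenCharactersIndexer

    indexer = ELMoTokenCharactersIndexer()
    indices = indexer.tokens_to_indices([Token("cat")], Vocabulary(), "elmo")
    # indices["elmo"] holds one 50-entry character-id list per token.
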
count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: typing.List[int]) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → typing.List[int][source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[typing.List[int]]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[typing.List[int]]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[typing.List[int]]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.OpenaiTransformerBytePairIndexer(encoder: typing.Dict[str, int] = None, byte_pairs: typing.List[typing.Tuple[str, str]] = None, n_ctx: int = 512, model_path: str = None) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Generates the indices for the byte-pair encoding used by the OpenAI transformer language model: https://blog.openai.com/language-unsupervised/

This is unlike most of our TokenIndexers in that its indexing is not based on a Vocabulary but on a fixed set of mappings that are loaded by the constructor.
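
To illustrate the idea, here is a toy sketch of byte-pair encoding in general (not this class's exact algorithm; the merge table is made up): a word is split into symbols, and learned merge rules are applied in priority order until none apply, after which each piece maps to a fixed id:

    # Hypothetical merge rules, ordered by priority.
    merges = [("l", "o"), ("lo", "w")]

    def toy_bpe(word):
        symbols = list(word)
        for left, right in merges:
            i = 0
            while i < len(symbols) - 1:
                if (symbols[i], symbols[i + 1]) == (left, right):
                    # Merge the adjacent pair into a single symbol.
                    symbols[i:i + 2] = [left + right]
                else:
                    i += 1
        return symbols

    print(toy_bpe("lower"))  # ['low', 'e', 'r'] -- each piece maps to an id

byte_pair_encode() below additionally lowercases the token by default and marks its final symbol with an end-of-word suffix before applying the merges.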

byte_pair_encode(token: allennlp.data.tokenizers.token.Token, lowercase: bool = True) → typing.List[str][source]
count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: typing.Dict[str, typing.List[int]], desired_num_tokens: typing.Dict[str, int], padding_lengths: typing.Dict[str, int]) → typing.Dict[str, typing.List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: typing.List[allennlp.data.tokenizers.token.Token], _vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → typing.Dict[str, typing.List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.