allennlp.data.token_indexers

A TokenIndexer determines how string tokens get represented as arrays of indices in a model.

class allennlp.data.token_indexers.token_indexer.TokenIndexer[source]

Bases: typing.Generic, allennlp.common.registrable.Registrable

A TokenIndexer determines how string tokens get represented as arrays of indices in a model. With the help of a Vocabulary, this class both converts strings into numerical values and produces the actual arrays.

Tokens can be represented as single IDs (e.g., the word “cat” gets represented by the number 34), or as lists of character IDs (e.g., “cat” gets represented by the numbers [23, 10, 18]), or in some other way that you can come up with (e.g., if you have some structured input you want to represent in a special way in your data arrays, you can do that here).
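
As an illustrative sketch (the exact integer ids depend on the Vocabulary you build), here is the same token under a single-id representation and under a character-level representation, using two of the indexers documented below:

    from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    vocab.add_token_to_namespace("cat", namespace="tokens")
    for character in "cat":
        vocab.add_token_to_namespace(character, namespace="token_characters")

    tokens = [Token("cat")]

    # Single-id representation: one integer per token.
    single_id = SingleIdTokenIndexer(namespace="tokens")
    print(single_id.tokens_to_indices(tokens, vocab, "tokens"))
    # e.g. {'tokens': [2]}

    # Character representation: one list of integers per token.
    characters = TokenCharactersIndexer(namespace="token_characters")
    print(characters.tokens_to_indices(tokens, vocab, "token_chars"))
    # e.g. {'token_chars': [[2, 3, 4]]}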

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
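
As a hedged sketch, the counter is simply a nested dictionary keyed first by vocabulary namespace and then by vocabulary item; a defaultdict is a convenient way to accumulate counts:

    from collections import defaultdict

    from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer
    from allennlp.data.tokenizers import Token

    counter = defaultdict(lambda: defaultdict(int))
    SingleIdTokenIndexer().count_vocab_items(Token("cat"), counter)
    TokenCharactersIndexer().count_vocab_items(Token("cat"), counter)

    # counter now holds something like:
    # {'tokens': {'cat': 1}, 'token_characters': {'c': 1, 'a': 1, 't': 1}}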

default_implementation = 'single_id'
get_keys(index_name: str) → List[str][source]

Return a list of the keys this indexer returns from tokens_to_indices.

get_padding_lengths(token: TokenType) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → TokenType[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[TokenType]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[TokenType]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.
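
A minimal sketch of padding in the single-id case (the padding value shown is what SingleIdTokenIndexer happens to use; other indexers supply their own blank tokens):

    from allennlp.data.token_indexers import SingleIdTokenIndexer

    indexer = SingleIdTokenIndexer()
    padded = indexer.pad_token_sequence({"tokens": [4, 7, 11]},
                                        desired_num_tokens={"tokens": 5},
                                        padding_lengths={})
    # padded == {'tokens': [4, 7, 11, 0, 0]}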

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[TokenType]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.dep_label_indexer.DepLabelIndexer(namespace: str = 'dep_labels')[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their syntactic dependency label, as determined by the dep_ field on Token.

Parameters:
namespace : str, optional (default=``dep_labels``)

We will use this namespace in the Vocabulary to map strings to indices.
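
A small sketch of how this indexer reads the dep_ field; the labels are set by hand here, whereas in practice they would normally come from a spaCy-based tokenizer:

    from collections import defaultdict

    from allennlp.data.token_indexers import DepLabelIndexer
    from allennlp.data.tokenizers import Token

    indexer = DepLabelIndexer()
    tokens = [Token("cats", dep_="nsubj"), Token("sleep", dep_="ROOT")]

    counter = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        indexer.count_vocab_items(token, counter)
    # counter['dep_labels'] now looks like {'nsubj': 1, 'ROOT': 1}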

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.ner_tag_indexer.NerTagIndexer(namespace: str = 'ner_tokens')[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their entity type (i.e., their NER tag), as determined by the ent_type_ field on Token.

Parameters:
namespace : str, optional (default=``ner_tokens``)

We will use this namespace in the Vocabulary to map strings to indices.
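
A small sketch, analogous to the dependency-label indexer above but reading the ent_type_ field (the printed ids are illustrative and depend on the vocabulary):

    from allennlp.data.token_indexers import NerTagIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    vocab.add_token_to_namespace("PERSON", namespace="ner_tokens")

    indexer = NerTagIndexer()
    print(indexer.tokens_to_indices([Token("Alice", ent_type_="PERSON")], vocab, "ner"))
    # e.g. {'ner': [2]}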

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.pos_tag_indexer.PosTagIndexer(namespace: str = 'pos_tokens', coarse_tags: bool = False)[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their part-of-speech tag, as determined by the pos_ or tag_ fields on Token (corresponding to spaCy's coarse-grained and fine-grained POS tags, respectively).

Parameters:
namespace : str, optional (default=``pos_tokens``)

We will use this namespace in the Vocabulary to map strings to indices.

coarse_tags : bool, optional (default=``False``)

If True, we will use coarse POS tags instead of the default fine-grained POS tags.
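
A brief sketch contrasting the two modes; pos_ and tag_ are set by hand here, whereas spaCy would normally fill them in:

    from collections import defaultdict

    from allennlp.data.token_indexers import PosTagIndexer
    from allennlp.data.tokenizers import Token

    token = Token("cats", pos_="NOUN", tag_="NNS")

    fine_counter = defaultdict(lambda: defaultdict(int))
    PosTagIndexer().count_vocab_items(token, fine_counter)  # counts the fine-grained tag 'NNS'

    coarse_counter = defaultdict(lambda: defaultdict(int))
    PosTagIndexer(coarse_tags=True).count_vocab_items(token, coarse_counter)  # counts the coarse tag 'NOUN'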

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer(namespace: str = 'tokens', lowercase_tokens: bool = False, start_tokens: List[str] = None, end_tokens: List[str] = None)[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as single integers.

Parameters:
namespace : str, optional (default=``tokens``)

We will use this namespace in the Vocabulary to map strings to indices.

lowercase_tokens : bool, optional (default=``False``)

If True, we will call token.lower() before getting an index for the token from the vocabulary.

start_tokens : List[str], optional (default=``None``)

These are prepended to the tokens provided to tokens_to_indices.

end_tokens : List[str], optional (default=``None``)

These are appended to the tokens provided to tokens_to_indices.
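
A usage sketch showing the optional arguments together (the printed ids depend on the vocabulary; @start@ and @end@ are arbitrary marker strings chosen for this example):

    from allennlp.data.token_indexers import SingleIdTokenIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    for word in ["@start@", "@end@", "the", "cat"]:
        vocab.add_token_to_namespace(word, namespace="tokens")

    indexer = SingleIdTokenIndexer(lowercase_tokens=True,
                                   start_tokens=["@start@"],
                                   end_tokens=["@end@"])
    print(indexer.tokens_to_indices([Token("The"), Token("cat")], vocab, "tokens"))
    # e.g. {'tokens': [2, 4, 5, 3]}, i.e. @start@, the, cat, @end@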

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer(namespace: str = 'token_characters', character_tokenizer: allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer = <allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer object>, start_tokens: List[str] = None, end_tokens: List[str] = None, min_padding_length: int = 0)[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as lists of character indices.

Parameters:
namespace : str, optional (default=``token_characters``)

We will use this namespace in the Vocabulary to map the characters in each token to indices.

character_tokenizer : CharacterTokenizer, optional (default=``CharacterTokenizer()``)

We use a CharacterTokenizer to handle splitting tokens into characters, as it has options for byte encoding and other things. The default here is to instantiate a CharacterTokenizer with its default parameters, which uses unicode characters and retains casing.

start_tokens : List[str], optional (default=``None``)

These are prepended to the tokens provided to tokens_to_indices.

end_tokens : List[str], optional (default=``None``)

These are appended to the tokens provided to tokens_to_indices.

min_padding_length: ``int``, optional (default=``0``)

We use this value as the minimum length of padding. This is usually needed when the character indices are fed to a CnnEncoder; in that case, set it to the largest value in the encoder's ngram_filter_sizes.
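
A usage sketch; min_padding_length=5 here stands in for a hypothetical CnnEncoder whose largest ngram filter size is 5:

    from allennlp.data.token_indexers import TokenCharactersIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    for character in "cats":
        vocab.add_token_to_namespace(character, namespace="token_characters")

    indexer = TokenCharactersIndexer(min_padding_length=5)
    indices = indexer.tokens_to_indices([Token("cats")], vocab, "chars")
    # e.g. {'chars': [[2, 3, 4, 5]]}
    print(indexer.get_padding_lengths(indices["chars"][0]))
    # e.g. {'num_token_characters': 5}, i.e. at least min_padding_length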

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: List[int]) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → List[int][source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[List[int]]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[List[int]]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[List[int]]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.elmo_indexer.ELMoCharacterMapper(tokens_to_add: Dict[str, int] = None)[source]

Bases: object

Maps individual tokens to sequences of character ids, compatible with ELMo. To be consistent with previously trained models, we include it here as a special case alongside the existing character indexers.

Optional additional special tokens with designated character ids can be added via tokens_to_add.

beginning_of_sentence_character = 256
beginning_of_sentence_characters = [258, 256, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
beginning_of_word_character = 258
bos_token = '<S>'
convert_word_to_char_ids(word: str) → List[int][source]
end_of_sentence_character = 257
end_of_sentence_characters = [258, 257, 259, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260, 260]
end_of_word_character = 259
eos_token = '</S>'
max_word_length = 50
padding_character = 260
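
A small sketch of the mapper on its own; note that the returned ids are offset by one from the constants above so that id 0 can be reserved for masking:

    from allennlp.data.token_indexers.elmo_indexer import ELMoCharacterMapper

    mapper = ELMoCharacterMapper()
    char_ids = mapper.convert_word_to_char_ids("cat")
    print(len(char_ids))  # 50, i.e. max_word_length
    # The list contains the beginning-of-word marker, one id per UTF-8 byte
    # of "cat", the end-of-word marker, and then padding ids.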
class allennlp.data.token_indexers.elmo_indexer.ELMoTokenCharactersIndexer(namespace: str = 'elmo_characters', tokens_to_add: Dict[str, int] = None)[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Convert a token to an array of character ids to compute ELMo representations.

Parameters:
namespace : str, optional (default=``elmo_characters``)
tokens_to_add : Dict[str, int], optional (default=``None``)

If not None, this provides a mapping of special tokens to character ids. When using pre-trained models, the character ids must be less than 261, and we recommend using unused ids (e.g. 1-32).
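
A minimal sketch; the vocabulary argument is required by the TokenIndexer API but is not actually consulted, since the character ids are fixed:

    from allennlp.data.token_indexers import ELMoTokenCharactersIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    indexer = ELMoTokenCharactersIndexer()
    indices = indexer.tokens_to_indices([Token("cat"), Token("sat")], Vocabulary(), "elmo")
    # indices["elmo"] holds one 50-element list of character ids per token
    print(len(indices["elmo"]), len(indices["elmo"][0]))  # 2 50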

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: List[int]) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → List[int][source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[List[int]]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[List[int]]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[List[int]]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

class allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.OpenaiTransformerBytePairIndexer(encoder: Dict[str, int] = None, byte_pairs: List[Tuple[str, str]] = None, n_ctx: int = 512, model_path: str = None, namespace: str = 'openai_transformer', tokens_to_add: List[str] = None)[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

Generates the indices for the byte-pair encoding used by the OpenAI transformer language model: https://blog.openai.com/language-unsupervised/

This is unlike most of our TokenIndexers in that its indexing is not based on a Vocabulary but on a fixed set of mappings that are loaded by the constructor.

Note: we recommend using the OpenAISplitter tokenizer with this indexer, as it applies the same text normalization as the original implementation.

Note 2: when tokens_to_add is not None, be sure to set n_special=len(tokens_to_add) in OpenaiTransformer, otherwise behavior is undefined.

byte_pair_encode(token: allennlp.data.tokenizers.token.Token, lowercase: bool = True) → List[str][source]
count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_padding_lengths(token: int) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.

allennlp.data.token_indexers.openai_transformer_byte_pair_indexer.text_standardize(text)[source]

Apply text standardization following original implementation.
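
A hedged usage sketch (the precise normalization rules live in the function itself; this only shows where it fits in the pipeline):

    from allennlp.data.token_indexers.openai_transformer_byte_pair_indexer import text_standardize

    raw = "wait… what — really?"
    clean = text_standardize(raw)
    # Dashes and ellipses are normalized and whitespace is collapsed
    # before the text reaches the byte-pair tokenizer.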

class allennlp.data.token_indexers.wordpiece_indexer.PretrainedBertIndexer(pretrained_model: str, use_starting_offsets: bool = False, do_lowercase: bool = True, never_lowercase: List[str] = None, max_pieces: int = 512)[source]

Bases: allennlp.data.token_indexers.wordpiece_indexer.WordpieceIndexer

A TokenIndexer corresponding to a pretrained BERT model.

Parameters:
pretrained_model: ``str``

Either the name of the pretrained model to use (e.g. ‘bert-base-uncased’), or the path to the .txt file with its vocabulary.

If the name is a key in the list of pretrained models at https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/tokenization.py#L33 the corresponding path will be used; otherwise it will be interpreted as a path or URL.

use_starting_offsets: bool, optional (default: False)

By default, the “offsets” created by the token indexer correspond to the last wordpiece in each word. If use_starting_offsets is specified, they will instead correspond to the first wordpiece in each word.

do_lowercase: ``bool``, optional (default = True)

Whether to lowercase the tokens before converting to wordpiece ids.

never_lowercase: ``List[str]``, optional

Tokens that should never be lowercased. Default is [‘[UNK]’, ‘[SEP]’, ‘[PAD]’, ‘[CLS]’, ‘[MASK]’].

max_pieces: int, optional (default: 512)

The BERT embedder uses positional embeddings and so has a corresponding maximum length for its input ids. Currently any inputs longer than this will be truncated. If this behavior is undesirable to you, you should consider filtering them out in your dataset reader.
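
A minimal sketch; this downloads the bert-base-uncased wordpiece vocabulary on first use, and the exact set of output keys may vary slightly across versions:

    from allennlp.data.token_indexers import PretrainedBertIndexer
    from allennlp.data.tokenizers import Token
    from allennlp.data.vocabulary import Vocabulary

    indexer = PretrainedBertIndexer(pretrained_model="bert-base-uncased")
    output = indexer.tokens_to_indices([Token("The"), Token("cat")], Vocabulary(), "bert")
    # output typically contains wordpiece ids under "bert", word-to-wordpiece
    # offsets under "bert-offsets", and a token-level "mask" entry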

class allennlp.data.token_indexers.wordpiece_indexer.WordpieceIndexer(vocab: Dict[str, int], wordpiece_tokenizer: Callable[[str], List[str]], namespace: str = 'wordpiece', use_starting_offsets: bool = False, max_pieces: int = 512, do_lowercase: bool = False, never_lowercase: List[str] = None, start_tokens: List[str] = None, end_tokens: List[str] = None, separator_token: str = '[SEP]')[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

A token indexer that does the wordpiece-tokenization (e.g. for BERT embeddings). If you are using one of the pretrained BERT models, you’ll want to use the PretrainedBertIndexer subclass rather than this base class.

Parameters:
vocab : Dict[str, int]

The mapping {wordpiece -> id}. Note this is not an AllenNLP Vocabulary.

wordpiece_tokenizer : Callable[[str], List[str]]

A function that does the actual tokenization.

namespace : str, optional (default: “wordpiece”)

The namespace in the AllenNLP Vocabulary into which the wordpieces will be loaded.

use_starting_offsets : bool, optional (default: False)

By default, the “offsets” created by the token indexer correspond to the last wordpiece in each word. If use_starting_offsets is specified, they will instead correspond to the first wordpiece in each word.

max_pieces : int, optional (default: 512)

The BERT embedder uses positional embeddings and so has a corresponding maximum length for its input ids. Currently any inputs longer than this will be truncated. If this behavior is undesirable to you, you should consider filtering them out in your dataset reader.

do_lowercase : bool, optional (default=``False``)

Should we lowercase the provided tokens before getting the indices? You would need to do this if you are using an -uncased BERT model but your DatasetReader is not lowercasing tokens (which might be the case if you’re also using other embeddings based on cased tokens).

never_lowercase: ``List[str]``, optional

Tokens that should never be lowercased. Default is [‘[UNK]’, ‘[SEP]’, ‘[PAD]’, ‘[CLS]’, ‘[MASK]’].

start_tokens : List[str], optional (default=``None``)

These are prepended to the tokens provided to tokens_to_indices.

end_tokens : List[str], optional (default=``None``)

These are appended to the tokens provided to tokens_to_indices.

separator_token : str, optional (default=``[SEP]``)

This token marks the boundary between segments in the sequence.
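
To make the distinction from an AllenNLP Vocabulary concrete, here is a toy sketch with a hand-built wordpiece vocabulary and a trivial stand-in tokenizer; a real setup would use an actual wordpiece tokenizer, as PretrainedBertIndexer does:

    from allennlp.data.token_indexers import WordpieceIndexer

    # A hand-built {wordpiece -> id} mapping; note this is a plain dict,
    # not an allennlp.data.vocabulary.Vocabulary.
    wordpiece_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "cat": 4}

    def toy_wordpiece_tokenizer(text):
        # A trivial stand-in for a real wordpiece tokenizer.
        return [text if text in wordpiece_vocab else "[UNK]"]

    indexer = WordpieceIndexer(vocab=wordpiece_vocab,
                               wordpiece_tokenizer=toy_wordpiece_tokenizer,
                               namespace="wordpiece")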

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: Dict[str, Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

get_keys(index_name: str) → List[str][source]

We need to override this because the indexer generates multiple keys.

get_padding_lengths(token: int) → Dict[str, int][source]

This method returns a padding dictionary for the given token that specifies lengths for all arrays that need padding. For example, for single ID tokens the returned dictionary will be empty, but for a token characters representation, this will return the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by tokens_to_indices().

pad_token_sequence(tokens: Dict[str, List[int]], desired_num_tokens: Dict[str, int], padding_lengths: Dict[str, int]) → Dict[str, List[int]][source]

This method pads a list of tokens to desired_num_tokens and returns a padded copy of the input tokens. If the input token list is longer than desired_num_tokens then it will be truncated.

padding_lengths is used to provide supplemental padding parameters which are needed in some cases. For example, it contains the widths to pad characters to when doing character-level padding.

tokens_to_indices(tokens: List[allennlp.data.tokenizers.token.Token], vocabulary: allennlp.data.vocabulary.Vocabulary, index_name: str) → Dict[str, List[int]][source]

Takes a list of tokens and converts them to one or more sets of indices. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices.