allennlp.data.token_indexers

A TokenIndexer determines how string tokens get represented as arrays of indices in a model.

class allennlp.data.token_indexers.token_indexer.TokenIndexer[source]

Bases: typing.Generic, allennlp.common.registrable.Registrable

A TokenIndexer determines how string tokens get represented as arrays of indices in a model. This class both converts strings into numerical values, with the help of a Vocabulary, and produces the actual arrays.

Tokens can be represented as single IDs (e.g., the word “cat” gets represented by the number 34), or as lists of character IDs (e.g., “cat” gets represented by the numbers [23, 10, 18]), or in some other way that you can come up with (e.g., if you have some structured input you want to represent in a special way in your data arrays, you can do that here).
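For concreteness, here is a minimal sketch of the two most common representations, using the SingleIdTokenIndexer and TokenCharactersIndexer classes documented below (the numeric IDs shown are made up):

    from allennlp.data.token_indexers.single_id_token_indexer import SingleIdTokenIndexer
    from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer

    # Two views of the same text, typically used together in one dictionary:
    token_indexers = {
        "tokens": SingleIdTokenIndexer(),             # "cat" -> 34
        "token_characters": TokenCharactersIndexer()  # "cat" -> [23, 10, 18]
    }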

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
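A minimal sketch of how this gets used when building a vocabulary (the counter structure matches the type annotation above; the tokens are made up):

    from collections import defaultdict

    from allennlp.data.tokenizers.token import Token
    from allennlp.data.token_indexers.single_id_token_indexer import SingleIdTokenIndexer

    counter = defaultdict(lambda: defaultdict(int))
    indexer = SingleIdTokenIndexer(namespace="tokens")
    for token in [Token("cat"), Token("sat"), Token("cat")]:
        indexer.count_vocab_items(token, counter)
    # counter["tokens"] now holds {"cat": 2, "sat": 1}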

default_implementation = 'single_id'

classmethod dict_from_params(params: allennlp.common.params.Params) → typing.Dict[str, allennlp.data.token_indexers.token_indexer.TokenIndexer][source]

We typically use TokenIndexers in a dictionary, with each TokenIndexer getting a name. The specification for this in a Params object is typically {"name" -> {indexer_params}}. This method reads that whole set of parameters and returns a dictionary suitable for use in a TextField.

Because default values for token indexers are typically handled in the class that calls this method, by checking for None, we return None instead of an empty dictionary if the given params specify no token indexers.
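For example, a hypothetical parameter block like the following would produce a dictionary with two indexers ("single_id" comes from default_implementation above; "characters" is assumed to be the registered name of TokenCharactersIndexer):

    from allennlp.common.params import Params
    from allennlp.data.token_indexers.token_indexer import TokenIndexer

    params = Params({
        "tokens": {"type": "single_id", "lowercase_tokens": True},
        "token_characters": {"type": "characters"}
    })
    indexers = TokenIndexer.dict_from_params(params)
    # indexers["tokens"] is a SingleIdTokenIndexer;
    # indexers["token_characters"] is a TokenCharactersIndexer.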

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.token_indexers.token_indexer.TokenIndexer[source]

get_padding_lengths(token: TokenType) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token. For single-ID tokens, for example, this dictionary will be empty; for a token-characters representation, it will contain the number of characters in the token.
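A sketch of both cases (the "num_token_characters" key is an assumption about the TokenCharactersIndexer implementation):

    from allennlp.data.token_indexers.single_id_token_indexer import SingleIdTokenIndexer
    from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer

    SingleIdTokenIndexer().get_padding_lengths(34)
    # -> {}
    TokenCharactersIndexer().get_padding_lengths([23, 10, 18])
    # -> {"num_token_characters": 3}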

get_padding_token() → TokenType[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by token_to_indices().

pad_token_sequence(tokens: typing.List[TokenType], desired_num_tokens: int, padding_lengths: typing.Dict[str, int]) → typing.List[TokenType][source]

This method pads a list of tokens to desired_num_tokens, including any necessary internal padding using whatever lengths are relevant in padding_lengths, and returns a padded copy of the input list. If each token is a single ID, this just appends zeros to the sequence (or truncates it, if necessary). If each token is, e.g., a list of characters, this method will pad both the characters and the number of tokens.
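A sketch with single-ID tokens (made-up IDs; padding_lengths is empty because single IDs have no internal structure to pad):

    from allennlp.data.token_indexers.single_id_token_indexer import SingleIdTokenIndexer

    indexer = SingleIdTokenIndexer()
    indexer.pad_token_sequence([34, 12, 7], desired_num_tokens=5, padding_lengths={})
    # -> [34, 12, 7, 0, 0]
    indexer.pad_token_sequence([34, 12, 7], desired_num_tokens=2, padding_lengths={})
    # -> [34, 12]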

token_to_indices(token: allennlp.data.tokenizers.token.Token, vocabulary: allennlp.data.vocabulary.Vocabulary) → TokenType[source]

Takes a string token and converts it into indices in some fashion. This could be returning an ID for the token from the vocabulary, or splitting the token into characters and returning a list of IDs for each character from the vocabulary, or something else.
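A minimal sketch for the single-ID case, assuming Vocabulary.add_token_to_namespace returns the assigned index:

    from allennlp.data.tokenizers.token import Token
    from allennlp.data.token_indexers.single_id_token_indexer import SingleIdTokenIndexer
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    cat_id = vocab.add_token_to_namespace("cat", namespace="tokens")
    indexer = SingleIdTokenIndexer(namespace="tokens")
    assert indexer.token_to_indices(Token("cat"), vocab) == cat_id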

class allennlp.data.token_indexers.dep_label_indexer.DepLabelIndexer(namespace: str = 'dep_labels') → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their syntactic dependency label, as determined by the dep_ field on Token.

Parameters:

namespace : str, optional (default=``dep_labels``)

We will use this namespace in the Vocabulary to map strings to indices.
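A sketch of typical usage. The dep_ field is only populated when the tokens come from a pipeline that runs a dependency parser, so this assumes SpacyWordSplitter from allennlp.data.tokenizers.word_splitter and an installed spacy model:

    from allennlp.data.tokenizers.word_splitter import SpacyWordSplitter
    from allennlp.data.token_indexers.dep_label_indexer import DepLabelIndexer

    # parse=True asks spacy to run its dependency parser so dep_ is filled in.
    tokens = SpacyWordSplitter(parse=True).split_words("The cat sat.")
    indexer = DepLabelIndexer()
    # Each token is now represented by its label, e.g. "det", "nsubj", "ROOT".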

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.token_indexers.dep_label_indexer.DepLabelIndexer[source]

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token. For single-ID tokens, for example, this dictionary will be empty; for a token-characters representation, it will contain the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by token_to_indices().

pad_token_sequence(tokens: typing.List[int], desired_num_tokens: int, padding_lengths: typing.Dict[str, int]) → typing.List[int][source]

This method pads a list of tokens to desired_num_tokens, including any necessary internal padding using whatever lengths are relevant in padding_lengths, and returns a padded copy of the input list. If each token is a single ID, this just appends zeros to the sequence (or truncates it, if necessary). If each token is, e.g., a list of characters, this method will pad both the characters and the number of tokens.

token_to_indices(token: allennlp.data.tokenizers.token.Token, vocabulary: allennlp.data.vocabulary.Vocabulary) → int[source]

Takes a string token and converts it into indices in some fashion. This could be returning an ID for the token from the vocabulary, or splitting the token into characters and returning a list of IDs for each character from the vocabulary, or something else.

class allennlp.data.token_indexers.ner_tag_indexer.NerTagIndexer(namespace: str = 'ner_tags') → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their entity type (i.e., their NER tag), as determined by the ent_type_ field on Token.

Parameters:

namespace : str, optional (default=``ner_tags``)

We will use this namespace in the Vocabulary to map strings to indices.

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.token_indexers.ner_tag_indexer.NerTagIndexer[source]

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token. For single-ID tokens, for example, this dictionary will be empty; for a token-characters representation, it will contain the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by token_to_indices().

pad_token_sequence(tokens: typing.List[int], desired_num_tokens: int, padding_lengths: typing.Dict[str, int]) → typing.List[int][source]

This method pads a list of tokens to desired_num_tokens, including any necessary internal padding using whatever lengths are relevant in padding_lengths, and returns a padded copy of the input list. If each token is a single ID, this just appends zeros to the sequence (or truncates it, if necessary). If each token is, e.g., a list of characters, this method will pad both the characters and the number of tokens.

token_to_indices(token: allennlp.data.tokenizers.token.Token, vocabulary: allennlp.data.vocabulary.Vocabulary) → int[source]

Takes a string token and converts it into indices in some fashion. This could be returning an ID for the token from the vocabulary, or splitting the token into characters and returning a list of IDs for each character from the vocabulary, or something else.

class allennlp.data.token_indexers.pos_tag_indexer.PosTagIndexer(namespace: str = 'pos_tags', coarse_tags: bool = False) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens by their part-of-speech tag, as determined by the pos_ or tag_ field on Token (corresponding to spacy’s coarse-grained and fine-grained POS tags, respectively).

Parameters:

namespace : str, optional (default=``pos_tags``)

We will use this namespace in the Vocabulary to map strings to indices.

coarse_tags : bool, optional (default=``False``)

If True, we will use coarse POS tags instead of the default fine-grained POS tags.
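A sketch distinguishing the two settings (the tag strings follow spacy’s conventions):

    from allennlp.data.token_indexers.pos_tag_indexer import PosTagIndexer

    fine_indexer = PosTagIndexer()                    # reads Token.tag_, e.g. "NNS"
    coarse_indexer = PosTagIndexer(coarse_tags=True)  # reads Token.pos_, e.g. "NOUN"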

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.token_indexers.pos_tag_indexer.PosTagIndexer[source]

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token. For single-ID tokens, for example, this dictionary will be empty; for a token-characters representation, it will contain the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by token_to_indices().

pad_token_sequence(tokens: typing.List[int], desired_num_tokens: int, padding_lengths: typing.Dict[str, int]) → typing.List[int][source]

This method pads a list of tokens to desired_num_tokens, including any necessary internal padding using whatever lengths are relevant in padding_lengths, and returns a padded copy of the input list. If each token is a single ID, this just appends zeros to the sequence (or truncates it, if necessary). If each token is, e.g., a list of characters, this method will pad both the characters and the number of tokens.

token_to_indices(token: allennlp.data.tokenizers.token.Token, vocabulary: allennlp.data.vocabulary.Vocabulary) → int[source]

Takes a string token and converts it into indices in some fashion. This could be returning an ID for the token from the vocabulary, or splitting the token into characters and returning a list of IDs for each character from the vocabulary, or something else.

class allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer(namespace: str = 'tokens', lowercase_tokens: bool = False) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as single integers.

Parameters:

namespace : str, optional (default=``tokens``)

We will use this namespace in the Vocabulary to map strings to indices.

lowercase_tokens : bool, optional (default=``False``)

If True, we will call token.lower() before getting an index for the token from the vocabulary.
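A sketch of the lowercasing behavior (assuming Vocabulary.add_token_to_namespace returns the assigned index):

    from allennlp.data.tokenizers.token import Token
    from allennlp.data.token_indexers.single_id_token_indexer import SingleIdTokenIndexer
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    cat_id = vocab.add_token_to_namespace("cat", namespace="tokens")
    indexer = SingleIdTokenIndexer(lowercase_tokens=True)
    # "Cat" is lowercased before lookup, so it resolves to the id for "cat".
    assert indexer.token_to_indices(Token("Cat"), vocab) == cat_id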

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.token_indexers.single_id_token_indexer.SingleIdTokenIndexer[source]

get_padding_lengths(token: int) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token. For single-ID tokens, for example, this dictionary will be empty; for a token-characters representation, it will contain the number of characters in the token.

get_padding_token() → int[source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by token_to_indices().

pad_token_sequence(tokens: typing.List[int], desired_num_tokens: int, padding_lengths: typing.Dict[str, int]) → typing.List[int][source]

This method pads a list of tokens to desired_num_tokens, including any necessary internal padding using whatever lengths are relevant in padding_lengths, and returns a padded copy of the input list. If each token is a single ID, this just appends zeros to the sequence (or truncates it, if necessary). If each token is, e.g., a list of characters, this method will pad both the characters and the number of tokens.

token_to_indices(token: allennlp.data.tokenizers.token.Token, vocabulary: allennlp.data.vocabulary.Vocabulary) → int[source]

Takes a string token and converts it into indices in some fashion. This could be returning an ID for the token from the vocabulary, or splitting the token into characters and returning a list of IDs for each character from the vocabulary, or something else.

class allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer(namespace: str = 'token_characters', character_tokenizer: allennlp.data.tokenizers.character_tokenizer.CharacterTokenizer = CharacterTokenizer()) → None[source]

Bases: allennlp.data.token_indexers.token_indexer.TokenIndexer

This TokenIndexer represents tokens as lists of character indices.

Parameters:

namespace : str, optional (default=``token_characters``)

We will use this namespace in the Vocabulary to map the characters in each token to indices.

character_tokenizer : CharacterTokenizer, optional (default=``CharacterTokenizer()``)

We use a CharacterTokenizer to handle splitting tokens into characters, as it has options for byte encoding and other things. The default here is to instantiate a CharacterTokenizer with its default parameters, which uses unicode characters and retains casing.
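A minimal sketch (the character ids are whatever the Vocabulary happens to assign):

    from allennlp.data.tokenizers.token import Token
    from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer
    from allennlp.data.vocabulary import Vocabulary

    vocab = Vocabulary()
    for character in "cat":
        vocab.add_token_to_namespace(character, namespace="token_characters")
    indexer = TokenCharactersIndexer()
    indexer.token_to_indices(Token("cat"), vocab)
    # -> one id per character, e.g. [2, 3, 4]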

count_vocab_items(token: allennlp.data.tokenizers.token.Token, counter: typing.Dict[str, typing.Dict[str, int]])[source]

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.

classmethod from_params(params: allennlp.common.params.Params) → allennlp.data.token_indexers.token_characters_indexer.TokenCharactersIndexer[source]

Parameters:

namespace : str, optional (default=``token_characters``)

We will use this namespace in the Vocabulary to map the characters in each token to indices.

character_tokenizer : Params, optional (default=``Params({})``)

We use a CharacterTokenizer to handle splitting tokens into characters, as it has options for byte encoding and other things. These parameters get passed to the character tokenizer. The default is to use unicode characters and to retain casing.
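A hypothetical parameter block; byte_encoding is assumed to be one of the CharacterTokenizer options alluded to above:

    from allennlp.common.params import Params
    from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer

    indexer = TokenCharactersIndexer.from_params(Params({
        "namespace": "token_characters",
        "character_tokenizer": {"byte_encoding": "utf-8"}
    }))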

get_padding_lengths(token: typing.List[int]) → typing.Dict[str, int][source]

This method returns a padding dictionary for the given token. For single-ID tokens, for example, this dictionary will be empty; for a token-characters representation, it will contain the number of characters in the token.

get_padding_token() → typing.List[int][source]

When we need to add padding tokens, what should they look like? This method returns a “blank” token of whatever type is returned by token_to_indices().

pad_token_sequence(tokens: typing.List[typing.List[int]], desired_num_tokens: int, padding_lengths: typing.Dict[str, int]) → typing.List[typing.List[int]][source]

This method pads a list of tokens to desired_num_tokens, including any necessary internal padding using whatever lengths are relevant in padding_lengths, and returns a padded copy of the input list. If each token is a single ID, this just appends zeros to the sequence (or truncates it, if necessary). If each token is, e.g., a list of characters, this method will pad both the characters and the number of tokens.

token_to_indices(token: allennlp.data.tokenizers.token.Token, vocabulary: allennlp.data.vocabulary.Vocabulary) → typing.List[int][source]

Takes a string token and converts it into indices in some fashion. This could be returning an ID for the token from the vocabulary, or splitting the token into characters and returning a list of IDs for each character from the vocabulary, or something else.