allennlp.data.token_indexers.pretrained_transformer_mismatched_indexer#

PretrainedTransformerMismatchedIndexer#

PretrainedTransformerMismatchedIndexer(
    self,
    model_name: str,
    namespace: str = 'tags',
    max_length: int = None,
    **kwargs,
) -> None

Use this indexer when (for whatever reason) you are not using a corresponding PretrainedTransformerTokenizer on your input. We assume that you used a tokenizer that splits strings into words, while the transformer expects wordpieces as input. This indexer splits the words into wordpieces and flattens them out. You should use the corresponding PretrainedTransformerMismatchedEmbedder to embed these wordpieces and then pull out a single vector for each original word.

Parameters

  • model_name : str The name of the transformers model to use.
  • namespace : str, optional (default = 'tags') We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace. We use a somewhat confusing default value of tags so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
  • max_length : int, optional (default = None) If positive, split the document into segments of this many tokens (including special tokens) before feeding them into the embedder. The embedder embeds these segments independently and concatenates the results to get the original document representation. Should be set to the same value as the max_length option on the PretrainedTransformerMismatchedEmbedder.
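
The following is a minimal usage sketch, not part of the API reference itself; the model name, example sentence, and field names are illustrative assumptions. Word-level tokens go into a TextField with this indexer, and the resulting tensors are what a PretrainedTransformerMismatchedEmbedder configured with the same model_name and max_length expects.

from allennlp.data import Instance, Token, Vocabulary
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer

# Illustrative model name; any Hugging Face model name should work.
indexer = PretrainedTransformerMismatchedIndexer(
    model_name="bert-base-uncased",
    max_length=512,  # keep in sync with the embedder's max_length
)

# Word-level tokens, e.g. from whitespace splitting; the indexer splits each
# word into wordpieces internally and records the word-to-wordpiece offsets.
tokens = [Token(word) for word in "AllenNLP indexes words not wordpieces".split()]

field = TextField(tokens, token_indexers={"tokens": indexer})
instance = Instance({"text": field})

vocab = Vocabulary()
instance.index_fields(vocab)

padding_lengths = field.get_padding_lengths()
tensors = field.as_tensor(padding_lengths)
# tensors["tokens"] now holds the wordpiece IDs, offsets, and masks, ready to
# be fed to a PretrainedTransformerMismatchedEmbedder.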

as_padded_tensor_dict#

PretrainedTransformerMismatchedIndexer.as_padded_tensor_dict(
    self,
    tokens: Dict[str, List[Any]],
    padding_lengths: Dict[str, int],
) -> Dict[str, torch.Tensor]

This method pads a list of tokens given the input padding lengths (which could actually truncate things, depending on settings) and returns that padded list of input tokens as a Dict[str, torch.Tensor]. This is a dictionary because there should be one key per argument that the TokenEmbedder corresponding to this class expects in its forward() method (where the argument name in the TokenEmbedder needs to match the key in this dictionary).

The base class implements the case when all you want to do is create a padded LongTensor for every list in the tokens dictionary. If your TokenIndexer needs more complex logic than that, you need to override this method.
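
Continuing the sketch above (reusing indexer and vocab), a hedged illustration of this contract: get_padding_lengths supplies the per-key lengths, which a data loader would normally raise to the batch maximum before calling this method.

# The extra two positions stand in for a longer sequence elsewhere in the batch.
indexed = indexer.tokens_to_indices([Token(w) for w in "a short example".split()], vocab)

padding_lengths = indexer.get_padding_lengths(indexed)
padding_lengths = {key: length + 2 for key, length in padding_lengths.items()}

tensor_dict = indexer.as_padded_tensor_dict(indexed, padding_lengths)
# Every value is a torch.Tensor padded to the requested length, and every key
# name lines up with an argument of the corresponding embedder's forward().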

count_vocab_items#

PretrainedTransformerMismatchedIndexer.count_vocab_items(
    self,
    token: allennlp.data.tokenizers.token.Token,
    counter: Dict[str, Dict[str, int]],
)

The Vocabulary needs to assign indices to whatever strings we see in the training data (possibly doing some frequency filtering and using an OOV, or out of vocabulary, token). This method takes a token and a dictionary of counts and increments counts for whatever vocabulary items are present in the token. If this is a single token ID representation, the vocabulary item is likely the token itself. If this is a token characters representation, the vocabulary items are all of the characters in the token.
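
This class inherits that general contract; its own wordpiece vocabulary comes from the transformer's tokenizer rather than from counting (see the namespace note above). The sketch below therefore uses a plain SingleIdTokenIndexer purely to illustrate the counter structure that Vocabulary construction passes in; the namespace and example tokens are assumptions.

from collections import defaultdict
from typing import Dict

from allennlp.data import Token
from allennlp.data.token_indexers import SingleIdTokenIndexer

single_id = SingleIdTokenIndexer(namespace="tokens")
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

for token in [Token("the"), Token("cat"), Token("sat"), Token("the")]:
    single_id.count_vocab_items(token, counter)

# counter["tokens"] now maps each word to its frequency ({"the": 2, "cat": 1, "sat": 1});
# Vocabulary construction uses these counts for frequency filtering.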

get_empty_token_list#

PretrainedTransformerMismatchedIndexer.get_empty_token_list(
    self,
) -> Dict[str, List[Any]]

Returns an already indexed version of an empty token list. This is typically just an empty list for whatever keys are used in the indexer.
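
Reusing indexer from the first sketch, a quick check of that shape; the exact key names are an implementation detail, assumed here to mirror the keys produced by tokens_to_indices.

empty = indexer.get_empty_token_list()
# Expected to look like {"token_ids": [], "mask": [], "type_ids": [], "offsets": [], "wordpiece_mask": []}
assert all(value == [] for value in empty.values())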

tokens_to_indices#

PretrainedTransformerMismatchedIndexer.tokens_to_indices(
    self,
    tokens: List[allennlp.data.tokenizers.token.Token],
    vocabulary: allennlp.data.vocabulary.Vocabulary,
) -> Dict[str, List[Any]]

Takes a list of tokens and converts them to an IndexedTokenList. This could be just an ID for each token from the vocabulary. Or it could split each token into characters and return one ID per character. Or (for instance, in the case of byte-pair encoding) there might not be a clean mapping from individual tokens to indices, and the IndexedTokenList could be a complex data structure.
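
For this indexer in particular, the result is the more complex case described above: flattened wordpiece IDs plus the offsets needed to map them back to the original words. A minimal hedged sketch, with the model name and sentence as illustrative assumptions:

from allennlp.data import Token, Vocabulary
from allennlp.data.token_indexers import PretrainedTransformerMismatchedIndexer

indexer = PretrainedTransformerMismatchedIndexer(model_name="bert-base-uncased")
vocab = Vocabulary()

tokens = [Token(word) for word in "Tokenization mismatches are handled here".split()]
indexed = indexer.tokens_to_indices(tokens, vocab)

# token_ids: flattened wordpiece IDs, including special tokens.
# offsets: one (start, end) wordpiece span per original word, which the
# mismatched embedder uses to pool wordpiece vectors back into word vectors.
print(indexed["token_ids"])
print(indexed["offsets"])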