PretrainedTransformerIndexer(self, model_name:str, namespace:str='tags', **kwargs) -> None
This TokenIndexer assumes that Tokens already have their indexes in them (see
We still require
model_name because we want to build the AllenNLP vocabulary from the pretrained one.
This Indexer is only really appropriate to use if you've also used a
PretrainedTransformerTokenizer with the same model_name to tokenize your input. Otherwise you'll
have a mismatch between your tokens and your vocabulary, and many tokens will be mapped to UNK.
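To make the mismatch concrete, here is a minimal sketch in plain Python. It does not call AllenNLP or transformers; the toy vocabulary, the `lookup` helper, and the `[UNK]` string are hypothetical stand-ins chosen only for illustration.

```python
# Toy wordpiece-style vocabulary (hypothetical; a real pretrained vocabulary is far larger).
TOY_PRETRAINED_VOCAB = {"[UNK]": 0, "token": 1, "##izer": 2, "##s": 3, "match": 4}


def lookup(tokens, vocab=TOY_PRETRAINED_VOCAB, unk="[UNK]"):
    """Map each token string to its ID, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


# Pieces produced by a tokenizer that shares the vocabulary: every piece is found.
matched_ids = lookup(["token", "##izer", "##s", "match"])

# Words produced by an unrelated whitespace tokenizer: "tokenizers" is not an entry.
mismatched_ids = lookup(["tokenizers", "match"])

print(matched_ids)
print(mismatched_ids)
```

In the second call the whole word falls back to the unknown ID even though its pieces are all in the vocabulary, which is the failure mode described above.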
- model_name : str
  The name of the transformers model to use.
- namespace : str, optional (default=tags)
  We will add the tokens in the pytorch_transformer vocabulary to this vocabulary namespace. We use a somewhat confusing default value of tags so that we do not add padding or UNK tokens to this namespace, which would break on loading because we wouldn't find our default OOV token.
PretrainedTransformerIndexer.count_vocab_items(self, token:allennlp.data.tokenizers.token.Token, counter:Dict[str, Dict[str, int]])
The Vocabulary needs to assign indices to whatever strings we see in the training
data (possibly doing some frequency filtering and using an OOV, or out of vocabulary,
token). This method takes a token and a dictionary of counts and increments counts for
whatever vocabulary items are present in the token. If this is a single token ID
representation, the vocabulary item is likely the token itself. If this is a token
characters representation, the vocabulary items are all of the characters in the token.
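The two counting behaviours described above can be sketched in plain Python. This is an illustration of the general idea, not the code of this indexer; the helper names and the namespace strings `tokens` and `token_characters` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict


def count_single_id(text: str, counter: Dict[str, Dict[str, int]],
                    namespace: str = "tokens") -> None:
    """Single token ID representation: the vocabulary item is the token itself."""
    counter[namespace][text] += 1


def count_characters(text: str, counter: Dict[str, Dict[str, int]],
                     namespace: str = "token_characters") -> None:
    """Token characters representation: every character is a vocabulary item."""
    for character in text:
        counter[namespace][character] += 1


# Nested counter keyed by namespace, then by vocabulary item.
counter: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

for text in ["the", "cat", "the"]:
    count_single_id(text, counter)
    count_characters(text, counter)

print(dict(counter["tokens"]))
print(dict(counter["token_characters"]))
```

A vocabulary builder could then apply frequency filtering to these counts before assigning indices.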
PretrainedTransformerIndexer.tokens_to_indices(self, tokens:List[allennlp.data.tokenizers.token.Token], vocabulary:allennlp.data.vocabulary.Vocabulary) -> Dict[str, List[Any]]
Takes a list of tokens and converts them to an IndexedTokenList.
This could be just an ID for each token from the vocabulary.
Or it could split each token into characters and return one ID per character.
Or (for instance, in the case of byte-pair encoding) there might not be a clean
mapping from individual tokens to indices, and the
IndexedTokenList could be a complex data structure.
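As a rough sketch of the first two shapes mentioned above (one ID per token, and one ID per character), the following plain-Python example uses toy vocabularies. The function names, the dictionary keys, and the vocabularies are hypothetical and do not reproduce this indexer's implementation.

```python
from typing import Dict, List

OOV = "@@UNKNOWN@@"  # out-of-vocabulary marker used by the toy vocabularies below

# Toy vocabularies, assumed for illustration only.
WORD_VOCAB = {OOV: 1, "the": 2, "cat": 3}
CHAR_VOCAB = {OOV: 1, "t": 2, "h": 3, "e": 4, "c": 5, "a": 6}


def single_id_indices(tokens: List[str], vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """One ID per token, with unseen tokens mapped to the OOV ID."""
    return {"tokens": [vocab.get(token, vocab[OOV]) for token in tokens]}


def character_indices(tokens: List[str], vocab: Dict[str, int]) -> Dict[str, List[List[int]]]:
    """One list of character IDs per token."""
    return {
        "token_characters": [
            [vocab.get(character, vocab[OOV]) for character in token]
            for token in tokens
        ]
    }


word_level = single_id_indices(["the", "cat", "sat"], WORD_VOCAB)
char_level = character_indices(["the", "cat"], CHAR_VOCAB)

print(word_level)
print(char_level)
```

Both results are dictionaries from a key to a list, matching the `Dict[str, List[Any]]` return type in the signature above.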
PretrainedTransformerIndexer.get_empty_token_list(self) -> Dict[str, List[Any]]
Returns an already indexed version of an empty token list. This is typically just an
empty list for whatever keys are used in the indexer.
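A minimal sketch of that idea, assuming the indexer's output keys are known up front. The `empty_token_list` helper and the key names are hypothetical.

```python
from typing import Any, Dict, List


def empty_token_list(keys: List[str]) -> Dict[str, List[Any]]:
    """Build an 'already indexed' empty result: one fresh empty list per key."""
    return {key: [] for key in keys}


# Hypothetical keys; a real indexer defines its own.
empty = empty_token_list(["token_ids", "mask"])
print(empty)
```

Each key gets its own list object, so appending to one entry later does not affect the others.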