A TokenEmbedder is a Module that embeds one-hot-encoded tokens as vectors.

class allennlp.modules.token_embedders.token_embedder.TokenEmbedder[source]

Bases: torch.nn.modules.module.Module, allennlp.common.registrable.Registrable

A TokenEmbedder is a Module that takes as input a tensor with integer ids that have been output from a TokenIndexer and outputs a vector per token in the input. The input typically has shape (batch_size, num_tokens) or (batch_size, num_tokens, num_characters), and the output is of shape (batch_size, num_tokens, output_dim). The simplest TokenEmbedder is just an embedding layer, but for character-level input, it could also be some kind of character encoder.

We add a single method to the basic Module API: get_output_dim(). This lets us more easily compute output dimensions for the TextFieldEmbedder, which we might need when defining model parameters such as LSTMs or linear layers, which need to know their input dimension before the layers are called.

default_implementation = 'embedding'
get_output_dim() → int[source]

Returns the final output dimension that this TokenEmbedder uses to represent each token. This is not the shape of the returned tensor, but the last element of that shape.
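The contract described above can be illustrated with a minimal, framework-free sketch. ToyTokenEmbedder is a hypothetical stand-in written with plain Python lists, not the library class; it only shows how the input shape (batch_size, num_tokens) maps to (batch_size, num_tokens, output_dim) and why get_output_dim() is useful when sizing downstream layers.

```python
import random
from typing import List


class ToyTokenEmbedder:
    """Hypothetical stand-in for a TokenEmbedder: a plain lookup table."""

    def __init__(self, num_embeddings: int, output_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self._output_dim = output_dim
        # One vector of length output_dim per token id.
        self._table = [
            [rng.uniform(-1.0, 1.0) for _ in range(output_dim)]
            for _ in range(num_embeddings)
        ]

    def get_output_dim(self) -> int:
        # The size of the last axis of the output, not the full shape.
        return self._output_dim

    def __call__(self, token_ids: List[List[int]]) -> List[List[List[float]]]:
        # (batch_size, num_tokens) -> (batch_size, num_tokens, output_dim)
        return [[self._table[i] for i in sentence] for sentence in token_ids]


embedder = ToyTokenEmbedder(num_embeddings=10, output_dim=4)
batch = [[1, 2, 3], [4, 5, 0]]  # batch_size=2, num_tokens=3
embedded = embedder(batch)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))

# A downstream layer can be sized before any data has been seen.
downstream_input_size = embedder.get_output_dim()
```

Here shape[-1] and get_output_dim() agree, which is the property downstream modules rely on.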

class allennlp.modules.token_embedders.embedding.Embedding(num_embeddings: int, embedding_dim: int, projection_dim: int = None, weight: torch.FloatTensor = None, padding_index: int = None, trainable: bool = True, max_norm: float = None, norm_type: float = 2.0, scale_grad_by_freq: bool = False, sparse: bool = False, vocab_namespace: str = None, pretrained_file: str = None) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

A more featureful embedding module than the default in PyTorch. Adds the ability to:

  1. embed higher-order inputs
  2. pre-specify the weight matrix
  3. use a non-trainable embedding
  4. project the resultant embeddings to some other dimension (which only makes sense with non-trainable embeddings).
  5. build all of this easily from_params

Note that if you are using our data API and are trying to embed a TextField, you should use a TextFieldEmbedder instead of using this directly.

num_embeddings : int

Size of the dictionary of embeddings (vocabulary size).

embedding_dim : int

The size of each embedding vector.

projection_dim : int, (optional, default=None)

If given, we add a projection layer after the embedding layer. This really only makes sense if trainable is False.

weight : torch.FloatTensor, (optional, default=None)

A pre-initialised weight matrix for the embedding lookup, allowing the use of pretrained vectors.

padding_index : int, (optional, default=None)

If given, pads the output with zeros whenever it encounters the index.

trainable : bool, (optional, default=True)

Whether or not to optimize the embedding parameters.

max_norm : float, (optional, default=None)

If given, the embeddings are renormalized so that their norm never exceeds this value.

norm_type : float, (optional, default=2)

The p of the p-norm to compute for the max_norm option
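As a rough illustration of how max_norm and norm_type interact, the sketch below rescales a single vector whose p-norm exceeds the limit. The renormalize helper is hypothetical and operates on plain lists; it shows the idea only and is not the backend's implementation.

```python
from typing import List


def renormalize(vector: List[float], max_norm: float, norm_type: float = 2.0) -> List[float]:
    """Scale `vector` down so that its p-norm does not exceed `max_norm`."""
    norm = sum(abs(x) ** norm_type for x in vector) ** (1.0 / norm_type)
    if norm <= max_norm:
        # Already within the limit: leave the vector untouched.
        return list(vector)
    scale = max_norm / norm
    return [x * scale for x in vector]


clipped = renormalize([3.0, 4.0], max_norm=1.0)    # 2-norm is 5, so it is rescaled
unchanged = renormalize([0.3, 0.4], max_norm=1.0)  # 2-norm is 0.5, so it is kept
```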

scale_grad_by_freq : bool, (optional, default=False)

If True, gradients are scaled by the inverse of the frequency of the words in the mini-batch.

sparse : bool, (optional, default=False)

Whether or not the PyTorch backend should use a sparse representation of the embedding weight.

vocab_namespace : str, (optional, default=None)

In case of fine-tuning/transfer learning, the model’s embedding matrix needs to be extended according to the size of the extended vocabulary. To know how much to extend the embedding matrix, it is necessary to know which vocab_namespace was used to construct it in the original training. We store the vocab_namespace used during the original training as an attribute, so that it can be retrieved during fine-tuning.

An Embedding module.
extend_vocab(extended_vocab: allennlp.data.vocabulary.Vocabulary, vocab_namespace: str = None, extension_pretrained_file: str = None, model_path: str = None)[source]

Extends the embedding matrix according to the extended vocabulary. If extension_pretrained_file is available, it will be used to initialize the embeddings of the new words in the extended vocabulary; otherwise we check whether the _pretrained_file attribute is already available. If neither is available, the new embeddings are initialized with Xavier uniform.

extended_vocab : Vocabulary

Vocabulary extended from original vocabulary used to construct this Embedding.

vocab_namespace : str, (optional, default=None)

In case you know what vocab_namespace should be used for extension, you can pass it. If not passed, it will check if vocab_namespace used at the time of Embedding construction is available. If so, this namespace will be used or else extend_vocab will be a no-op.

extension_pretrained_file : str, (optional, default=None)

A file containing pretrained embeddings can be specified here. It can be the path to a local file or a URL of a (cached) remote file. See from_params of the Embedding class for format details.

model_path : str, (optional, default=None)

Path traversing the model attributes up to this embedding module, e.g. “_text_field_embedder.token_embedder_tokens”. This is only used to give a helpful error message when extend_vocab is implicitly called by fine-tune or any other command.
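The extension logic can be sketched on plain lists as follows. The extend_rows helper and the index-keyed pretrained dictionary are assumptions made for illustration (real pretrained files are keyed by word, and the library operates on tensors); the sketch only shows that existing rows are kept, pretrained vectors are preferred for new rows, and the remaining rows fall back to Xavier uniform initialization.

```python
import math
import random
from typing import Dict, List, Optional


def extend_rows(
    weight: List[List[float]],
    extended_size: int,
    pretrained: Optional[Dict[int, List[float]]] = None,
    seed: int = 0,
) -> List[List[float]]:
    """Return a copy of `weight` grown to `extended_size` rows."""
    extended = [list(row) for row in weight]
    num_new = extended_size - len(weight)
    if num_new <= 0:
        # Nothing to extend.
        return extended

    rng = random.Random(seed)
    embedding_dim = len(weight[0])
    # Xavier (Glorot) uniform bound for a block of shape (num_new, embedding_dim).
    bound = math.sqrt(6.0 / (num_new + embedding_dim))

    for index in range(len(weight), extended_size):
        if pretrained is not None and index in pretrained:
            # Prefer a pretrained vector when one is available.
            extended.append(list(pretrained[index]))
        else:
            extended.append([rng.uniform(-bound, bound) for _ in range(embedding_dim)])
    return extended


original = [[0.1, 0.2], [0.3, 0.4]]
extended = extend_rows(original, extended_size=4, pretrained={2: [9.0, 9.0]})
```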


Defines the computation performed at every call.

Should be overridden by all subclasses.


Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.embedding.Embedding[source]

We need the vocabulary here to know how many items we need to embed, and we look for a vocab_namespace key in the parameter dictionary to know which vocabulary to use. If you know beforehand exactly how many embeddings you need, or aren’t using a vocabulary mapping for the things getting embedded here, then you can pass in the num_embeddings key directly, and the vocabulary will be ignored.

In the configuration file, a file containing pretrained embeddings can be specified using the parameter "pretrained_file". It can be the path to a local file or a URL of a (cached) remote file. Two formats are supported:

  • hdf5 file - containing an embedding matrix in the form of a torch.Tensor;

  • text file - a UTF-8 encoded text file with space-separated fields:

    [word] [dim 1] [dim 2] ...

    The text file can optionally be compressed with gzip, bz2, lzma or zip. You can even select a single file inside an archive containing multiple files using a URI of the form:

        (archive_uri)#file_path_inside_archive

    where archive_uri can be a file system path or a URL.
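A minimal sketch of the archive URI convention, assuming the (archive_path_or_url)#file_path_inside_archive form documented for EmbeddingsTextFile. The names ArchiveURI, format_uri and parse_uri, as well as the example paths, are hypothetical; they are not the module's own helpers.

```python
import re
from typing import NamedTuple, Optional


class ArchiveURI(NamedTuple):
    main_file_uri: str
    path_inside_archive: Optional[str]


def format_uri(main_file_uri: str, path_inside_archive: Optional[str] = None) -> str:
    """Build "(archive)#inner_path", or return the plain URI if no inner path is given."""
    if path_inside_archive:
        return "({})#{}".format(main_file_uri, path_inside_archive)
    return main_file_uri


def parse_uri(uri: str) -> ArchiveURI:
    """Split "(archive)#inner_path" into its two parts; plain URIs have no inner path."""
    match = re.fullmatch(r"\((.*)\)#(.*)", uri)
    if match:
        return ArchiveURI(match.group(1), match.group(2))
    return ArchiveURI(uri, None)


# Hypothetical archive path and member name, used only for illustration.
uri = format_uri("/data/vectors.zip", "vectors.100d.txt")
parsed = parse_uri(uri)
plain = parse_uri("https://example.com/vectors.txt.gz")
```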

get_output_dim() → int[source]

Returns the final output dimension that this TokenEmbedder uses to represent each token. This is not the shape of the returned tensor, but the last element of that shape.

class allennlp.modules.token_embedders.embedding.EmbeddingsFileURI(main_file_uri, path_inside_archive)[source]

Bases: tuple


main_file_uri

Alias for field number 0

path_inside_archive

Alias for field number 1

class allennlp.modules.token_embedders.embedding.EmbeddingsTextFile(file_uri: str, encoding: str = 'utf-8', cache_dir: str = None) → None[source]

Bases: typing.Iterator

Utility class for opening embeddings text files. Handles various compression formats, as well as context management.

file_uri: str

It can be:

  • a file system path or a URL of an optionally compressed text file or a zip/tar archive containing a single file.
  • URI of the type (archive_path_or_url)#file_path_inside_archive if the text file is contained in a multi-file archive.
encoding: str
cache_dir: str
close() → None[source]
read() → str[source]
readline() → str[source]
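To make the text format concrete, the following sketch reads a gzip-compressed embeddings file in the "[word] [dim 1] [dim 2] ..." layout using only the standard library. The read_embeddings helper and the in-memory sample data are assumptions for illustration; this is not how EmbeddingsTextFile itself is implemented.

```python
import gzip
import io
from typing import Dict, Iterable, List


def read_embeddings(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse lines of the form "word dim1 dim2 ..." into a word -> vector mapping."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        fields = line.rstrip().split(" ")
        if len(fields) < 2:
            # Skip blank or malformed lines.
            continue
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Build a small gzip-compressed file in memory instead of touching the disk.
raw = "the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n".encode("utf-8")
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    vectors = read_embeddings(handle)
```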
allennlp.modules.token_embedders.embedding.format_embeddings_file_uri(main_file_path_or_url: str, path_inside_archive: typing.Union[str, NoneType] = None) → str[source]
allennlp.modules.token_embedders.embedding.parse_embeddings_file_uri(uri: str) → allennlp.modules.token_embedders.embedding.EmbeddingsFileURI[source]
class allennlp.modules.token_embedders.token_characters_encoder.TokenCharactersEncoder(embedding: allennlp.modules.token_embedders.embedding.Embedding, encoder: allennlp.modules.seq2vec_encoders.seq2vec_encoder.Seq2VecEncoder, dropout: float = 0.0) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

A TokenCharactersEncoder takes the output of a TokenCharactersIndexer, which is a tensor of shape (batch_size, num_tokens, num_characters), embeds the characters, runs a token-level encoder, and returns the result, which is a tensor of shape (batch_size, num_tokens, encoding_dim). We also optionally apply dropout after the token-level encoder.

We take the embedding and encoding modules as input, so this class is itself quite simple.
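The shape flow can be sketched with plain lists. In the sketch below, mean pooling over characters is a stand-in for the Seq2VecEncoder (an assumption chosen for brevity), the helper names are hypothetical, and padding and dropout are ignored.

```python
from typing import List


def embed_characters(char_ids: List[List[List[int]]], table: List[List[float]]):
    # (batch, tokens, chars) -> (batch, tokens, chars, embedding_dim)
    return [[[table[c] for c in token] for token in sentence] for sentence in char_ids]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    # Stand-in encoder: collapse the character axis by averaging.
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def encode_tokens(char_ids: List[List[List[int]]], table: List[List[float]]):
    # (batch, tokens, chars) -> (batch, tokens, encoding_dim)
    embedded = embed_characters(char_ids, table)
    return [[mean_pool(token) for token in sentence] for sentence in embedded]


table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]  # 3 characters, embedding_dim=2
char_ids = [[[1, 2], [2, 2]]]                 # batch_size=1, num_tokens=2, num_characters=2
encoded = encode_tokens(char_ids, table)
```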

forward(token_characters: torch.Tensor) → torch.Tensor[source]
classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.token_characters_encoder.TokenCharactersEncoder[source]
get_output_dim() → int[source]
class allennlp.modules.token_embedders.elmo_token_embedder.ElmoTokenEmbedder(options_file: str, weight_file: str, do_layer_norm: bool = False, dropout: float = 0.5, requires_grad: bool = False, projection_dim: int = None, vocab_to_cache: typing.List[str] = None, scalar_mix_parameters: typing.List[float] = None) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

Compute a single layer of ELMo representations.

This class serves as a convenience when you only want to use one layer of ELMo representations at the input of your network. It’s essentially a wrapper around Elmo(num_output_representations=1, ...)

options_file : str, required.

An ELMo JSON options file.

weight_file : str, required.

An ELMo hdf5 weight file.

do_layer_norm : bool, optional.

Should we apply layer normalization (passed to ScalarMix)?

dropout : float, optional.

The dropout value to be applied to the ELMo representations.

requires_grad : bool, optional

If True, compute gradient of ELMo parameters for fine tuning.

projection_dim : int, optional

If given, we will project the ELMo embedding down to this dimension. We recommend that you try using ELMo with a lot of dropout and no projection first, but we have found a few cases where projection helps (particularly where there is very limited training data).

vocab_to_cache : List[str], optional, (default = None).

A list of words to pre-compute and cache character convolutions for. If you use this option, the ElmoTokenEmbedder expects that you pass word indices of shape (batch_size, timesteps) to forward, instead of character indices. If you use this option and pass a word which wasn’t pre-cached, this will break.

scalar_mix_parameters : List[float], optional, (default=None)

If not None, use these scalar mix parameters to weight the representations produced by different layers. These mixing weights are not updated during training.
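A small sketch of how fixed scalar mix parameters weight the layers, assuming the usual scalar mix formulation in which the parameters are softmax-normalized and the weighted sum is multiplied by a scalar gamma. The scalar_mix function here is a hypothetical list-based illustration for a single token, not the library's ScalarMix module.

```python
import math
from typing import List


def scalar_mix(layers: List[List[float]], scalar_parameters: List[float], gamma: float = 1.0) -> List[float]:
    """Combine per-layer vectors for one token into a single vector."""
    # Softmax over the scalar parameters (shifted by the max for numerical stability).
    largest = max(scalar_parameters)
    exps = [math.exp(s - largest) for s in scalar_parameters]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(layers[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
        for d in range(dim)
    ]


layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three layers, embedding_dim=2
mixed = scalar_mix(layers, [0.0, 0.0, 0.0])     # equal parameters give a plain average
```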

forward(inputs: torch.Tensor, word_inputs: torch.Tensor = None) → torch.Tensor[source]
inputs : torch.Tensor

Shape (batch_size, timesteps, 50) of character ids representing the current batch.

word_inputs : torch.Tensor, optional.

If you passed a cached vocab, you can in addition pass a tensor of shape (batch_size, timesteps), which represents word ids which have been pre-cached.

The ELMo representations for the input sequence, shape (batch_size, timesteps, embedding_dim)
classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.elmo_token_embedder.ElmoTokenEmbedder[source]
get_output_dim() → int[source]
class allennlp.modules.token_embedders.openai_transformer_embedder.OpenaiTransformerEmbedder(transformer: allennlp.modules.openai_transformer.OpenaiTransformer, top_layer_only: bool = False) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

Takes a byte-pair representation of a batch of sentences (as produced by the OpenaiTransformerBytePairIndexer) and generates a ScalarMix of the corresponding contextual embeddings.

transformer : OpenaiTransformer, required.

The OpenaiTransformer module used for the embeddings.

top_layer_only : bool, optional (default = False)

If True, then only return the top layer instead of applying the scalar mix.

forward(inputs: torch.Tensor, offsets: torch.Tensor = None) → torch.Tensor[source]
inputs : torch.Tensor, required

A (batch_size, num_timesteps) tensor representing the byte-pair encodings for the current batch.

offsets : torch.Tensor, required

A (batch_size, max_sequence_length) tensor representing the word offsets for the current batch.


An embedding representation of the input sequence having shape (batch_size, sequence_length, embedding_dim)


The last dimension of the output, not the shape.

A TokenEmbedder which uses one of the BERT models to produce embeddings.

At its core it uses Hugging Face’s PyTorch implementation, so thanks to them!

class allennlp.modules.token_embedders.bert_token_embedder.BertEmbedder(bert_model: pytorch_pretrained_bert.modeling.BertModel, top_layer_only: bool = False) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

A TokenEmbedder that produces BERT embeddings for your tokens. Should be paired with a BertIndexer, which produces wordpiece ids.

Most likely you want to use PretrainedBertEmbedder for one of the named pretrained models, not this base class.

bert_model : BertModel

The BERT model being wrapped.

top_layer_only : bool, optional (default = False)

If True, then only return the top layer instead of applying the scalar mix.

forward(input_ids: torch.LongTensor, offsets: torch.LongTensor = None, token_type_ids: torch.LongTensor = None) → torch.Tensor[source]
input_ids : torch.LongTensor

The (batch_size, ..., max_sequence_length) tensor of wordpiece ids.

offsets : torch.LongTensor, optional

The BERT embeddings are one per wordpiece. However it’s possible/likely you might want one per original token. In that case, offsets represents the indices of the desired wordpiece for each original token. Depending on how your token indexer is configured, this could be the position of the last wordpiece for each token, or it could be the position of the first wordpiece for each token.

For example, if you had the sentence “Definitely not”, and if the corresponding wordpieces were [“Def”, “##in”, “##ite”, “##ly”, “not”], then the input_ids would be 5 wordpiece ids, and the “last wordpiece” offsets would be [3, 4]. If offsets are provided, the returned tensor will contain only the wordpiece embeddings at those positions, and (in particular) will contain one embedding per token. If offsets are not provided, the entire tensor of wordpiece embeddings will be returned.
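The offsets example above can be written out as a short sketch. The select_by_offsets helper and the dummy embedding values are hypothetical, and special tokens are left out so that the indices match the example in the text.

```python
from typing import List, Optional

wordpieces = ["Def", "##in", "##ite", "##ly", "not"]
# Dummy 2-dimensional embedding per wordpiece, only so positions are easy to recognise.
wordpiece_embeddings = [[float(i), float(i) + 0.5] for i in range(len(wordpieces))]


def select_by_offsets(
    embeddings: List[List[float]], offsets: Optional[List[int]] = None
) -> List[List[float]]:
    """Return one embedding per original token, or all wordpieces if no offsets are given."""
    if offsets is None:
        return embeddings
    return [embeddings[i] for i in offsets]


last_piece = select_by_offsets(wordpiece_embeddings, [3, 4])   # "##ly", "not"
first_piece = select_by_offsets(wordpiece_embeddings, [0, 4])  # "Def", "not"
all_pieces = select_by_offsets(wordpiece_embeddings)
```

With either offsets convention the result has one vector per original token (two here), while omitting offsets keeps all five wordpiece vectors.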

token_type_ids : torch.LongTensor, optional

If an input consists of two sentences (as in the BERT paper), tokens from the first sentence should have type 0 and tokens from the second sentence should have type 1. If you don’t provide this (the default BertIndexer doesn’t) then it’s assumed to be all 0s.

get_output_dim() → int[source]
class allennlp.modules.token_embedders.bert_token_embedder.PretrainedBertEmbedder(pretrained_model: str, requires_grad: bool = False, top_layer_only: bool = False) → None[source]

Bases: allennlp.modules.token_embedders.bert_token_embedder.BertEmbedder

pretrained_model : str

Either the name of the pretrained model to use (e.g. ‘bert-base-uncased’), or the path to the .tar.gz file with the model weights.

If the name is a key in the list of pretrained models, the corresponding path will be used; otherwise it will be interpreted as a path or URL.

requires_grad : bool, optional (default = False)

If True, compute gradient of BERT parameters for fine tuning.

top_layer_only : bool, optional (default = False)

If True, then only return the top layer instead of applying the scalar mix.

class allennlp.modules.token_embedders.language_model_token_embedder.LanguageModelTokenEmbedder(archive_file: str, dropout: float = None, bos_eos_tokens: typing.Tuple[str, str] = ('<S>', '</S>'), remove_bos_eos: bool = True, requires_grad: bool = False) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

Compute a single layer of representations from an (optionally bidirectional) language model. This is done by computing a learned scalar average of the layers from the LM. Typically the LM’s weights will be fixed, but they can be fine-tuned by setting requires_grad.

archive_file : str, required

An archive file, typically model.tar.gz, from a LanguageModel. The contextualizer used by the LM must satisfy two requirements:

  1. It must have a num_layers field.
  2. It must take a boolean return_all_layers parameter in its constructor.

See BidirectionalLanguageModelTransformer for their definitions.

dropout : float, optional.

The dropout value to be applied to the representations.

bos_eos_tokens : Tuple[str, str], optional (default = ("<S>", "</S>"))

These will be indexed and placed around the indexed tokens. Necessary if the language model was trained with them, but they were injected external to an indexer.

remove_bos_eos : bool, optional (default: True)

Typically the provided token indexes will be augmented with begin-sentence and end-sentence tokens. (Alternatively, you can pass bos_eos_tokens.) If this flag is True the corresponding embeddings will be removed from the return values.

Warning: This only removes a single start and single end token!
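The behaviour described by this flag, including the warning, can be sketched on a padded batch. The remove_sentence_boundaries helper is hypothetical and uses token strings in place of embedding vectors for readability; the real module works on tensors and masks.

```python
from typing import List, Sequence


def remove_sentence_boundaries(batch: Sequence[Sequence[str]], lengths: Sequence[int]) -> List[List[str]]:
    """Drop the first and last unpadded position of every sequence."""
    trimmed = []
    for sequence, length in zip(batch, lengths):
        # Positions 1 .. length-2 are the real tokens between the two boundaries.
        trimmed.append(list(sequence[1:length - 1]))
    return trimmed


batch = [
    ["<S>", "a", "b", "</S>", "<pad>"],
    ["<S>", "c", "</S>", "<pad>", "<pad>"],
]
trimmed = remove_sentence_boundaries(batch, lengths=[4, 3])

# If boundaries were added twice, one pair survives, as the warning says.
doubled = remove_sentence_boundaries([["<S>", "<S>", "a", "</S>", "</S>"]], lengths=[5])
```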

requires_grad : bool, optional (default: False)

If True, compute gradient of bidirectional language model parameters for fine tuning.

forward(inputs: torch.Tensor) → typing.Dict[str, torch.Tensor][source]
inputs : torch.Tensor

Shape (batch_size, timesteps, ...) of token ids representing the current batch. These must have been produced using the same indexer the LM was trained on.

The bidirectional language model representations for the input sequence, shape (batch_size, timesteps, embedding_dim)
get_output_dim() → int[source]
class allennlp.modules.token_embedders.bag_of_word_counts_token_embedder.BagOfWordCountsTokenEmbedder(vocab: allennlp.data.vocabulary.Vocabulary, vocab_namespace: str, projection_dim: int = None, ignore_oov: bool = False) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

Represents a sequence of tokens as a bag of (discrete) word ids, as it was done in the pre-neural days.

Each sequence gets a vector of length vocabulary size, where the i’th entry in the vector corresponds to the number of times the i’th token in the vocabulary appears in the sequence.

By default, we ignore padding tokens.
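A minimal sketch of the counting step for a single sequence. The bag_of_word_counts helper is hypothetical, and the ids used for padding (0) and OOV (1) are assumptions made for this example rather than values taken from the vocabulary.

```python
from collections import Counter
from typing import List, Optional


def bag_of_word_counts(
    token_ids: List[int],
    vocab_size: int,
    padding_index: int = 0,
    oov_index: Optional[int] = None,
) -> List[int]:
    """Count how often each vocabulary id occurs, skipping padding (and OOV if given)."""
    counts = Counter(
        i for i in token_ids if i != padding_index and i != oov_index
    )
    return [counts.get(i, 0) for i in range(vocab_size)]


sequence = [2, 3, 2, 1, 0, 0]  # two trailing padding ids
bag = bag_of_word_counts(sequence, vocab_size=5)
bag_no_oov = bag_of_word_counts(sequence, vocab_size=5, oov_index=1)
```

Applying this to every sequence in a batch gives a result of shape (batch_size, vocab_size), which an optional projection layer can then map to projection_dim.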

vocab : Vocabulary
vocab_namespace : str

Namespace of the vocabulary to embed.

projection_dim : int, optional (default = None)

If specified, the resulting bag-of-words representation is projected to this dimension.

ignore_oov : bool, optional (default = False)

If true, we ignore the OOV token.

forward(inputs: torch.Tensor) → torch.Tensor[source]
inputs : torch.Tensor

Shape (batch_size, timesteps, sequence_length) of word ids representing the current batch.

The bag-of-words representations for the input sequence, shape (batch_size, vocab_size)
classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.bag_of_word_counts_token_embedder.BagOfWordCountsTokenEmbedder[source]

We look for a vocab_namespace key in the parameter dictionary to know which vocabulary to use.