allennlp.modules.token_embedders

A TokenEmbedder is a Module that embeds one-hot-encoded tokens as vectors.

class allennlp.modules.token_embedders.token_embedder.TokenEmbedder[source]

Bases: torch.nn.modules.module.Module, allennlp.common.registrable.Registrable

A TokenEmbedder is a Module that takes as input a tensor with integer ids that have been output from a TokenIndexer and outputs a vector per token in the input. The input typically has shape (batch_size, num_tokens) or (batch_size, num_tokens, num_characters), and the output is of shape (batch_size, num_tokens, output_dim). The simplest TokenEmbedder is just an embedding layer, but for character-level input, it could also be some kind of character encoder.

We add a single method to the basic Module API: get_output_dim(). This lets us more easily compute output dimensions for the TextFieldEmbedder, which we might need when defining model parameters such as LSTMs or linear layers, which need to know their input dimension before the layers are called.

default_implementation = 'embedding'
get_output_dim() → int[source]

Returns the final output dimension that this TokenEmbedder uses to represent each token. This is not the shape of the returned tensor, but the last element of that shape.
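
To make the contract concrete, here is a minimal sketch of a custom TokenEmbedder. The class name BagOfCharactersEmbedder and the registry key "bag_of_characters" are hypothetical; the pattern of subclassing, implementing forward() and get_output_dim(), and registering is the one the built-in embedders below follow.

    # Hypothetical example; not part of the library.
    import torch
    from allennlp.modules.token_embedders import TokenEmbedder

    @TokenEmbedder.register("bag_of_characters")  # hypothetical registry name
    class BagOfCharactersEmbedder(TokenEmbedder):
        def __init__(self, num_characters: int, embedding_dim: int) -> None:
            super().__init__()
            self._embedding = torch.nn.Embedding(num_characters, embedding_dim)
            self._output_dim = embedding_dim

        def get_output_dim(self) -> int:
            # The last dimension of the tensor returned by forward().
            return self._output_dim

        def forward(self, inputs: torch.Tensor) -> torch.Tensor:
            # inputs: (batch_size, num_tokens, num_characters).
            # Embed each character and average over the character dimension,
            # producing one vector per token: (batch_size, num_tokens, embedding_dim).
            return self._embedding(inputs).mean(dim=2)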

class allennlp.modules.token_embedders.embedding.Embedding(num_embeddings: int, embedding_dim: int, projection_dim: int = None, weight: torch.FloatTensor = None, padding_index: int = None, trainable: bool = True, max_norm: float = None, norm_type: float = 2.0, scale_grad_by_freq: bool = False, sparse: bool = False) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

A more featureful embedding module than the default in PyTorch. Adds the ability to:

  1. embed higher-order inputs
  2. pre-specify the weight matrix
  3. use a non-trainable embedding
  4. project the resultant embeddings to some other dimension (which only makes sense with non-trainable embeddings).
  5. build all of this easily via from_params

Note that if you are using our data API and are trying to embed a TextField, you should use a TextFieldEmbedder instead of using this directly.

Parameters:
num_embeddings : int

Size of the dictionary of embeddings (vocabulary size).

embedding_dim : int

The size of each embedding vector.

projection_dim : int, (optional, default=None)

If given, we add a projection layer after the embedding layer. This really only makes sense if trainable is False.

weight : torch.FloatTensor, (optional, default=None)

A pre-initialised weight matrix for the embedding lookup, allowing the use of pretrained vectors.

padding_index : int, (optional, default=None)

If given, pads the output with zeros whenever it encounters this index.

trainable : bool, (optional, default=True)

Whether or not to optimize the embedding parameters.

max_norm : float, (optional, default=None)

If given, will renormalize the embeddings to always have a norm less than this.

norm_type : float, (optional, default=2)

The p of the p-norm to compute for the max_norm option.

scale_grad_by_freq : bool, (optional, default=False)

If given, this will scale gradients by the frequency of the words in the mini-batch.

sparse : bool, (optional, default=False)

Whether or not the PyTorch backend should use a sparse representation of the embedding weight.

Returns:
An Embedding module.
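
As a rough illustration, the module can be constructed directly with a pre-specified, non-trainable weight matrix and a projection; the sizes and the random weight below are placeholders for real pretrained vectors.

    import torch
    from allennlp.modules.token_embedders import Embedding

    pretrained_weight = torch.randn(10000, 300)    # stand-in for real pretrained vectors
    embedding = Embedding(num_embeddings=10000,
                          embedding_dim=300,
                          weight=pretrained_weight,
                          trainable=False,
                          projection_dim=100)

    token_ids = torch.randint(0, 10000, (32, 20))  # (batch_size, num_tokens)
    embedded = embedding(token_ids)                # (32, 20, 100) after the projection
    assert embedding.get_output_dim() == 100
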
forward(inputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance itself instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.

classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.embedding.Embedding[source]

We need the vocabulary here to know how many items we need to embed, and we look for a vocab_namespace key in the parameter dictionary to know which vocabulary to use. If you know beforehand exactly how many embeddings you need, or aren’t using a vocabulary mapping for the things getting embedded here, then you can pass in the num_embeddings key directly, and the vocabulary will be ignored.

In the configuration file, a file containing pretrained embeddings can be specified using the parameter "pretrained_file". It can be the path to a local file or a URL of a (cached) remote file. Two formats are supported:

  • hdf5 file - containing an embedding matrix in the form of a torch.Tensor;

  • text file - a utf-8 encoded text file with space-separated fields:

    [word] [dim 1] [dim 2] ...
    

    The text file can optionally be compressed with gzip, bz2, lzma or zip. You can even select a single file inside an archive containing multiple files using the URI format:

    "(archive_uri)#file_path_inside_the_archive"
    

    where archive_uri can be a file system path or a URL. For example:

    "(http://nlp.stanford.edu/data/glove.twitter.27B.zip)#glove.twitter.27B.200d.txt"
    
get_output_dim() → int[source]

Returns the final output dimension that this TokenEmbedder uses to represent each token. This is not the shape of the returned tensor, but the last element of that shape.

class allennlp.modules.token_embedders.embedding.EmbeddingsFileURI(main_file_uri, path_inside_archive)[source]

Bases: tuple

main_file_uri

Alias for field number 0

path_inside_archive

Alias for field number 1

class allennlp.modules.token_embedders.embedding.EmbeddingsTextFile(file_uri: str, encoding: str = 'utf-8', cache_dir: str = None) → None[source]

Bases: typing.Iterator

Utility class for opening embeddings text files. Handles various compression formats, as well as context management.

Parameters:
file_uri: str

It can be:

  • a file system path or a URL of an optionally compressed text file, or a zip/tar archive containing a single file.
  • URI of the type (archive_path_or_url)#file_path_inside_archive if the text file is contained in a multi-file archive.
encoding: str
cache_dir: str
DEFAULT_ENCODING = 'utf-8'
close() → None[source]
read() → str[source]
readline() → str[source]
allennlp.modules.token_embedders.embedding.format_embeddings_file_uri(main_file_path_or_url: str, path_inside_archive: typing.Union[str, NoneType] = None) → str[source]
allennlp.modules.token_embedders.embedding.parse_embeddings_file_uri(uri: str) → allennlp.modules.token_embedders.embedding.EmbeddingsFileURI[source]
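
These two helpers build and parse the "(archive_uri)#file_path_inside_the_archive" URIs described above; a small sketch (the file names are illustrative):

    from allennlp.modules.token_embedders.embedding import (
        format_embeddings_file_uri,
        parse_embeddings_file_uri,
    )

    uri = format_embeddings_file_uri("embeddings/glove.zip", "glove.6B.100d.txt")
    # uri == "(embeddings/glove.zip)#glove.6B.100d.txt"

    parsed = parse_embeddings_file_uri(uri)
    # parsed.main_file_uri == "embeddings/glove.zip"
    # parsed.path_inside_archive == "glove.6B.100d.txt"
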
class allennlp.modules.token_embedders.token_characters_encoder.TokenCharactersEncoder(embedding: allennlp.modules.token_embedders.embedding.Embedding, encoder: allennlp.modules.seq2vec_encoders.seq2vec_encoder.Seq2VecEncoder, dropout: float = 0.0) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

A TokenCharactersEncoder takes the output of a TokenCharactersIndexer, which is a tensor of shape (batch_size, num_tokens, num_characters), embeds the characters, runs a token-level encoder, and returns the result, which is a tensor of shape (batch_size, num_tokens, encoding_dim). We also optionally apply dropout after the token-level encoder.

We take the embedding and encoding modules as input, so this class is itself quite simple.
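
A rough sketch of wiring one up, assuming a CNN-based Seq2VecEncoder over characters (the vocabulary size, dimensions, and filter settings below are illustrative):

    import torch
    from allennlp.modules.seq2vec_encoders import CnnEncoder
    from allennlp.modules.token_embedders import Embedding, TokenCharactersEncoder

    char_embedding = Embedding(num_embeddings=262, embedding_dim=16)
    char_encoder = CnnEncoder(embedding_dim=16, num_filters=64, ngram_filter_sizes=(3,))
    encoder = TokenCharactersEncoder(char_embedding, char_encoder, dropout=0.2)

    # (batch_size, num_tokens, num_characters) character ids.
    character_ids = torch.randint(1, 262, (32, 20, 10))
    token_vectors = encoder(character_ids)  # (32, 20, 64)
    assert encoder.get_output_dim() == 64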

forward(token_characters: torch.Tensor) → torch.Tensor[source]
classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.token_characters_encoder.TokenCharactersEncoder[source]
get_output_dim() → int[source]
class allennlp.modules.token_embedders.elmo_token_embedder.ElmoTokenEmbedder(options_file: str, weight_file: str, do_layer_norm: bool = False, dropout: float = 0.5, requires_grad: bool = False, projection_dim: int = None, vocab_to_cache: typing.List[str] = None) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

Compute a single layer of ELMo representations.

This class serves as a convenience when you only want to use one layer of ELMo representations at the input of your network. It’s essentially a wrapper around Elmo(num_output_representations=1, ...)

Parameters:
options_file : str, required.

An ELMo JSON options file.

weight_file : str, required.

An ELMo hdf5 weight file.

do_layer_norm : bool, optional.

Should we apply layer normalization (passed to ScalarMix)?

dropout : float, optional.

The dropout value to be applied to the ELMo representations.

requires_grad : bool, optional

If True, compute gradient of ELMo parameters for fine tuning.

projection_dim : int, optional

If given, we will project the ELMo embedding down to this dimension. We recommend that you try using ELMo with a lot of dropout and no projection first, but we have found a few cases where projection helps (particularly where there is very limited training data).

vocab_to_cache : List[str], optional, (default = None).

A list of words to pre-compute and cache character convolutions for. If you use this option, the ElmoTokenEmbedder expects that you pass word indices of shape (batch_size, timesteps) to forward, instead of character indices. If you use this option and pass a word which wasn’t pre-cached, this will break.

forward(inputs: torch.Tensor, word_inputs: torch.Tensor = None) → torch.Tensor[source]
Parameters:
inputs: ``torch.Tensor``

Shape (batch_size, timesteps, 50) of character ids representing the current batch.

word_inputs : torch.Tensor, optional.

If you passed a cached vocab, you can additionally pass a tensor of shape (batch_size, timesteps) containing word ids that have been pre-cached.

Returns:
The ELMo representations for the input sequence, shape
``(batch_size, timesteps, embedding_dim)``
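
A hedged sketch of using the embedder on its own; the options and weight file paths are placeholders for real pretrained ELMo files, and batch_to_ids (from allennlp.modules.elmo) is used to produce the character-id input:

    from allennlp.modules.elmo import batch_to_ids
    from allennlp.modules.token_embedders import ElmoTokenEmbedder

    elmo_embedder = ElmoTokenEmbedder(options_file="/path/to/elmo_options.json",
                                      weight_file="/path/to/elmo_weights.hdf5",
                                      dropout=0.5)

    sentences = [["The", "cat", "sat", "."], ["Dogs", "bark", "."]]
    character_ids = batch_to_ids(sentences)    # (batch_size, timesteps, 50)
    embeddings = elmo_embedder(character_ids)  # (batch_size, timesteps, embedding_dim)
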
classmethod from_params(vocab: allennlp.data.vocabulary.Vocabulary, params: allennlp.common.params.Params) → allennlp.modules.token_embedders.elmo_token_embedder.ElmoTokenEmbedder[source]
get_output_dim()[source]
class allennlp.modules.token_embedders.openai_transformer_embedder.OpenaiTransformerEmbedder(transformer: allennlp.modules.openai_transformer.OpenaiTransformer) → None[source]

Bases: allennlp.modules.token_embedders.token_embedder.TokenEmbedder

Takes a byte-pair representation of a batch of sentences (as produced by the OpenaiTransformerBytePairIndexer) and generates a ScalarMix of the corresponding contextual embeddings.

Parameters:
transformer: ``OpenaiTransformer``, required.

The OpenaiTransformer module used for the embeddings.

forward(inputs: torch.Tensor, offsets: torch.Tensor) → torch.Tensor[source]
Parameters:
inputs: ``torch.Tensor``, required

A (batch_size, num_timesteps) tensor representing the byte-pair encodings for the current batch.

offsets: ``torch.Tensor``, required

A (batch_size, max_sequence_length) tensor representing the word offsets for the current batch.

Returns:
``[torch.Tensor]``

An embedding representation of the input sequence having shape (batch_size, sequence_length, embedding_dim)

get_output_dim()[source]

The last dimension of the output, not the shape.