allennlp.modules.seq2seq_encoders

Modules that transform a sequence of input vectors into a sequence of output vectors. Some are basic wrappers around existing PyTorch modules; others are AllenNLP modules.

The available Seq2Seq encoders are documented below.

class allennlp.modules.seq2seq_encoders.pytorch_seq2seq_wrapper.PytorchSeq2SeqWrapper(module: torch.nn.modules.module.Module, stateful: bool = False) → None[source]

Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder

PyTorch’s RNNs have two outputs: the hidden state for every time step, and the hidden state at the last time step for every layer. We just want the first one as a single output. This wrapper pulls out that output and adds a get_output_dim() method, which is useful if you want to, e.g., define a linear + softmax layer on top of this to get some distribution over a set of labels. The linear layer needs to know its input dimension before it is called, and you can get that from get_output_dim().

In order to be wrapped with this wrapper, a class must have the following members:

  • self.input_size: int
  • self.hidden_size: int
  • def forward(inputs: PackedSequence, hidden_state: torch.Tensor) -> Tuple[PackedSequence, torch.Tensor].
  • self.bidirectional: bool (optional)

This is what PyTorch’s RNNs look like; just make sure your class looks like those, and it should work.
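As a rough illustration of the member list above, the following sketch defines a hypothetical stand-in class (ToyRNN) and two hypothetical helpers (missing_members, output_dim) in plain Python. None of these names come from AllenNLP or PyTorch. The output_dim helper assumes that a bidirectional module concatenates both directions, so its per-token output is twice hidden_size.

```python
from typing import Any, List


class ToyRNN:
    """Hypothetical stand-in exposing the members listed above."""

    def __init__(self, input_size: int, hidden_size: int, bidirectional: bool = False) -> None:
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.bidirectional = bidirectional

    def forward(self, inputs: Any, hidden_state: Any = None) -> tuple:
        # A real module would return (PackedSequence, final_state);
        # this toy version just echoes its arguments.
        return inputs, hidden_state


def missing_members(module: Any) -> List[str]:
    """Return the names of required members that `module` lacks."""
    required = ["input_size", "hidden_size", "forward"]
    return [name for name in required if not hasattr(module, name)]


def output_dim(module: Any) -> int:
    # Assumption: both directions are concatenated, so a bidirectional
    # module produces vectors of size 2 * hidden_size.
    directions = 2 if getattr(module, "bidirectional", False) else 1
    return module.hidden_size * directions
```

A check like missing_members(candidate) == [] is only a duck-typing heuristic; it says nothing about whether forward accepts and returns the right types.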

Note that we require you to pass a binary mask of shape (batch_size, sequence_length) when you call this module, to avoid subtle bugs around masking. If you already have a PackedSequence you can pass None as the second parameter.
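To make the mask shape concrete, here is a minimal sketch using nested Python lists in place of tensors. The helper name lengths_from_mask is hypothetical; it only shows how a binary (batch_size, sequence_length) mask encodes the true length of each sequence.

```python
from typing import List


def lengths_from_mask(mask: List[List[int]]) -> List[int]:
    """Count the unmasked positions in each row of a binary mask."""
    return [sum(row) for row in mask]


# Shape (batch_size=3, sequence_length=4); 1 marks a real token, 0 marks padding.
mask = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
lengths = lengths_from_mask(mask)
```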

We support stateful RNNs, where the final state from each batch is used as the initial state for the subsequent batch; to enable this, pass stateful=True to the constructor.

forward(inputs: torch.Tensor, mask: torch.Tensor, hidden_state: torch.Tensor = None) → torch.Tensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim() → int[source]

Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

get_output_dim() → int[source]

Returns the dimension of each vector in the sequence output by this Seq2SeqEncoder. This is not the shape of the returned tensor, but the last element of that shape.

is_bidirectional() → bool[source]

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.
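The layout assumption for bidirectional encoders can be sketched as follows, with a plain list standing in for the final dimension of an output tensor. The function name split_directions is hypothetical.

```python
from typing import List, Tuple


def split_directions(vector: List[float]) -> Tuple[List[float], List[float]]:
    """Split one output vector into its forward and backward halves."""
    half = len(vector) // 2
    # First half: forward direction; second half: backward direction.
    return vector[:half], vector[half:]


forward_part, backward_part = split_directions([0.1, 0.2, 0.3, 0.4])
```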

class allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder(stateful: bool = False) → None[source]

Bases: allennlp.modules.encoder_base._EncoderBase, allennlp.common.registrable.Registrable

A Seq2SeqEncoder is a Module that takes as input a sequence of vectors and returns a modified sequence of vectors. Input shape: (batch_size, sequence_length, input_dim); output shape: (batch_size, sequence_length, output_dim).

We add two methods to the basic Module API: get_input_dim() and get_output_dim(). You might need these if you want to construct a Linear layer using the output of this encoder, or to raise sensible errors for mismatched input dimensions.
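The two uses mentioned above (sizing a downstream layer and rejecting mismatched inputs) might look like this in a dependency-free sketch. ToyScalingEncoder and linear_layer_shape are hypothetical names, and nested lists stand in for tensors of shape (batch_size, sequence_length, dim).

```python
from typing import List, Tuple

Batch = List[List[List[float]]]  # (batch_size, sequence_length, dim)


class ToyScalingEncoder:
    """Hypothetical encoder that doubles every value and keeps the dimension."""

    def __init__(self, dim: int) -> None:
        self._dim = dim

    def get_input_dim(self) -> int:
        return self._dim

    def get_output_dim(self) -> int:
        return self._dim

    def __call__(self, batch: Batch) -> Batch:
        for sequence in batch:
            for vector in sequence:
                if len(vector) != self.get_input_dim():
                    raise ValueError(
                        f"expected vectors of size {self.get_input_dim()}, got {len(vector)}"
                    )
        return [[[2.0 * x for x in vector] for vector in sequence] for sequence in batch]


def linear_layer_shape(encoder: ToyScalingEncoder, num_labels: int) -> Tuple[int, int]:
    # The (in_features, out_features) pair a layer stacked on the encoder would need.
    return encoder.get_output_dim(), num_labels
```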

get_input_dim() → int[source]

Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

get_output_dim() → int[source]

Returns the dimension of each vector in the sequence output by this Seq2SeqEncoder. This is not the shape of the returned tensor, but the last element of that shape.

is_bidirectional() → bool[source]

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.

class allennlp.modules.seq2seq_encoders.intra_sentence_attention.IntraSentenceAttentionEncoder(input_dim: int, projection_dim: int = None, similarity_function: allennlp.modules.similarity_functions.similarity_function.SimilarityFunction = DotProductSimilarity(), num_attention_heads: int = 1, combination: str = '1, 2', output_dim: int = None) → None[source]

Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder

An IntraSentenceAttentionEncoder is a Seq2SeqEncoder that merges the original word representations with an attention (for each word) over other words in the sentence. As a Seq2SeqEncoder, the input to this module is of shape (batch_size, num_tokens, input_dim), and the output is of shape (batch_size, num_tokens, output_dim).

We compute the attention using a configurable SimilarityFunction, which could have multiple attention heads. The operation for merging the original representations with the attended representations is also configurable (e.g., you can concatenate them, add them, multiply them, etc.).

Parameters:
input_dim : int

The dimension of the vector for each element in the input sequence; input_tensor.size(-1).

projection_dim : int, optional

If given, we will do a linear projection of the input sequence to this dimension before performing the attention-weighted sum.

similarity_function : SimilarityFunction, optional

The similarity function to use when computing attentions. Default is to use a dot product.

num_attention_heads : int, optional

If this is greater than one (default is 1), we will split the input into several “heads” to compute multi-headed weighted sums. Must be used with a multi-headed similarity function, and you almost certainly want to do a projection in conjunction with the multiple heads.

combination : str, optional

This string defines how we merge the original word representations with the result of the intra-sentence attention. This will be passed to combine_tensors(); see that function for more detail on exactly how this works, but some simple examples are "1,2" for concatenation (the default), "1+2" for adding the two, or "2" for only keeping the attention representation.

output_dim : int, optional (default = None)

The dimension of an optional output projection.
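The combination strings described above can be illustrated with a toy interpreter over plain lists. This is not the library's combine_tensors() implementation; the function combine is a hypothetical sketch that handles only single terms ("1", "2"), element-wise "+" and "*" of two terms, and comma-separated concatenation.

```python
from typing import List

Vector = List[float]


def combine(combination: str, original: Vector, attended: Vector) -> Vector:
    """Toy interpreter for combination strings such as "1,2", "1+2", or "2"."""
    tensors = {"1": original, "2": attended}
    pieces: Vector = []
    # Commas separate pieces that are concatenated in order.
    for term in combination.replace(" ", "").split(","):
        if term in tensors:
            pieces.extend(tensors[term])
        elif len(term) == 3 and term[0] in tensors and term[2] in tensors and term[1] in "+*":
            left, right = tensors[term[0]], tensors[term[2]]
            if term[1] == "+":
                pieces.extend(a + b for a, b in zip(left, right))
            else:
                pieces.extend(a * b for a, b in zip(left, right))
        else:
            raise ValueError(f"unsupported combination term: {term!r}")
    return pieces
```

Note that concatenation changes the size of the result (here it doubles), whereas the element-wise operations keep it the same.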

forward(tokens: torch.Tensor, mask: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim() → int[source]

Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

get_output_dim() → int[source]

Returns the dimension of each vector in the sequence output by this Seq2SeqEncoder. This is not the shape of the returned tensor, but the last element of that shape.

is_bidirectional()[source]

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.

class allennlp.modules.seq2seq_encoders.stacked_self_attention.StackedSelfAttentionEncoder(input_dim: int, hidden_dim: int, projection_dim: int, feedforward_hidden_dim: int, num_layers: int, num_attention_heads: int, use_positional_encoding: bool = True, dropout_prob: float = 0.1, residual_dropout_prob: float = 0.2, attention_dropout_prob: float = 0.1) → None[source]

Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder

Implements a stacked self-attention encoder similar to the Transformer architecture in Attention Is All You Need.

This encoder combines 3 layers in a ‘block’:

  1. A 2 layer FeedForward network.
  2. Multi-headed self attention, which uses 2 learnt linear projections to compute a scaled dot-product similarity between every pair of elements. In the cited paper, the scaling factor is the square root of the key dimension.
  3. Layer Normalisation.

These are then stacked into num_layers layers.

Parameters:
input_dim : int, required.

The input dimension of the encoder.

hidden_dim : int, required.

The hidden dimension used for the input to self attention layers and the output from the feedforward layers.

projection_dim : int, required.

The dimension of the linear projections for the self-attention layers.

feedforward_hidden_dim : int, required.

The middle dimension of the FeedForward network. The input and output dimensions are fixed to ensure sizes match up for the self attention layers.

num_layers : int, required.

The number of stacked self attention -> feedforward -> layer normalisation blocks.

num_attention_heads : int, required.

The number of attention heads to use per layer.

use_positional_encoding : bool, optional, (default = True)

Whether to add sinusoidal frequencies to the input tensor. This is strongly recommended, as without this feature, the self attention layers have no idea of absolute or relative position (as they are just computing pairwise similarity between vectors of elements), which can be important features for many tasks.

dropout_prob : float, optional, (default = 0.1)

The dropout probability for the feedforward network.

residual_dropout_prob : float, optional, (default = 0.2)

The dropout probability for the residual connections.

attention_dropout_prob : float, optional, (default = 0.1)

The dropout probability for the attention distributions in each attention layer.
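For readers unfamiliar with the sinusoidal frequencies mentioned under use_positional_encoding, the sketch below follows the formulation in the Transformer paper, where even dimensions use a sine and odd dimensions use a cosine of the same angle. This is an assumption for illustration only: the exact layout used by this encoder may differ, and sinusoidal_encoding is a hypothetical name.

```python
import math
from typing import List


def sinusoidal_encoding(num_positions: int, dim: int) -> List[List[float]]:
    """Return a (num_positions, dim) table of sinusoidal position features."""
    encoding = []
    for position in range(num_positions):
        row = []
        for index in range(dim):
            # Dimensions 2i and 2i + 1 share one wavelength.
            angle = position / (10000 ** (2 * (index // 2) / dim))
            row.append(math.sin(angle) if index % 2 == 0 else math.cos(angle))
        encoding.append(row)
    return encoding


def add_positions(sequence: List[List[float]]) -> List[List[float]]:
    """Add the position features element-wise to a (sequence_length, dim) input."""
    table = sinusoidal_encoding(len(sequence), len(sequence[0]))
    return [[x + p for x, p in zip(vector, row)] for vector, row in zip(sequence, table)]
```

Because each position receives a distinct pattern, two identical input vectors at different positions no longer look identical to the self attention layers.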

forward(inputs: torch.Tensor, mask: torch.Tensor)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim() → int[source]

Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

get_output_dim() → int[source]

Returns the dimension of each vector in the sequence output by this Seq2SeqEncoder. This is not the shape of the returned tensor, but the last element of that shape.

is_bidirectional()[source]

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.

class allennlp.modules.seq2seq_encoders.multi_head_self_attention.MultiHeadSelfAttention(num_heads: int, input_dim: int, attention_dim: int, values_dim: int, output_projection_dim: int = None, attention_dropout_prob: float = 0.1) → None[source]

Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder

This class implements the key-value scaled dot product attention mechanism detailed in the paper Attention Is All You Need.

The attention mechanism is a weighted sum of a projection V of the inputs, with respect to the scaled, normalised dot product of Q and K, which are also both linear projections of the input. This procedure is repeated for each attention head, using different parameters.
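The weighted sum described above can be sketched for a single head in plain Python. This simplified version assumes Q, K and V have already been projected, omits masking and dropout, and scales by the square root of the key dimension as in the cited paper; the scaling constant used by this class may differ. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # (timesteps, dim)


def softmax(scores: List[float]) -> List[float]:
    """Normalise scores into a probability distribution."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for query in queries:
        # Similarity of this query with every key, scaled.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs
```

With an all-zero query every key scores the same, so the output is simply the mean of the value vectors.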

Parameters:
num_heads : int, required.

The number of attention heads to use.

input_dim : int, required.

The size of the last dimension of the input tensor.

attention_dim : int, required.

The total dimension of the query and key projections which comprise the dot product attention function. Must be divisible by num_heads.

values_dim : int, required.

The total dimension which the input is projected to for representing the values, which are combined using the attention. Must be divisible by num_heads.

output_projection_dim : int, optional (default = None)

The dimensionality of the final output projection. If this is not passed explicitly, the projection has size input_dim.

attention_dropout_prob : float, optional (default = 0.1).

The dropout probability applied to the normalised attention distributions.
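The divisibility requirement on attention_dim and values_dim follows from splitting each projected vector evenly across the heads. A minimal sketch, using a hypothetical helper split_into_heads over a plain list:

```python
from typing import List


def split_into_heads(vector: List[float], num_heads: int) -> List[List[float]]:
    """Split one projected vector into num_heads equally sized chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError(
            f"dimension {len(vector)} is not divisible by num_heads={num_heads}"
        )
    head_dim = len(vector) // num_heads
    return [vector[i * head_dim:(i + 1) * head_dim] for i in range(num_heads)]
```

For example, a projection of size 6 with 3 heads gives each head a slice of size 2, while 4 heads would raise an error.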

forward(inputs: torch.Tensor, mask: torch.LongTensor = None) → torch.FloatTensor[source]

Parameters:
inputs : torch.FloatTensor, required.

A tensor of shape (batch_size, timesteps, input_dim)

mask : torch.LongTensor, optional (default = None).

A tensor of shape (batch_size, timesteps).

Returns:
A tensor of shape (batch_size, timesteps, output_projection_dim), where output_projection_dim = input_dim by default.

get_input_dim()[source]

get_output_dim()[source]

is_bidirectional()[source]

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.

class allennlp.modules.seq2seq_encoders.pass_through_encoder.PassThroughEncoder(input_dim: int) → None[source]

Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder

This class allows you to specify skipping a Seq2SeqEncoder just by changing a configuration file. This is useful for ablations and measuring the impact of different elements of your model.
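Conceptually, a pass-through encoder is an identity function that still reports its dimensions, so it can stand in wherever a Seq2SeqEncoder is expected. The class below is a hypothetical plain-Python sketch of that idea, not the library's implementation; in particular, it ignores the mask entirely.

```python
from typing import Any, Optional


class ToyPassThrough:
    """Hypothetical identity encoder: output equals input."""

    def __init__(self, input_dim: int) -> None:
        self._input_dim = input_dim

    def get_input_dim(self) -> int:
        return self._input_dim

    def get_output_dim(self) -> int:
        # Nothing is transformed, so the output size equals the input size.
        return self._input_dim

    def is_bidirectional(self) -> bool:
        return False

    def __call__(self, inputs: Any, mask: Optional[Any] = None) -> Any:
        # The mask is accepted for interface compatibility but unused here.
        return inputs
```

Swapping a real encoder for such an identity in an ablation leaves the rest of the model unchanged, provided downstream layers are sized from get_output_dim().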

forward(inputs: torch.Tensor, mask: torch.LongTensor = None) → torch.FloatTensor[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim() → int[source]

Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

get_output_dim() → int[source]

Returns the dimension of each vector in the sequence output by this Seq2SeqEncoder. This is not the shape of the returned tensor, but the last element of that shape.

is_bidirectional()[source]

Returns True if this encoder is bidirectional. If so, we assume the forward direction of the encoder is the first half of the final dimension, and the backward direction is the second half.