allennlp.modules.seq2seq_encoders¶
Modules that transform a sequence of input vectors into a sequence of output vectors. Some are just basic wrappers around existing PyTorch modules, others are AllenNLP modules.
The available Seq2Seq encoders are
"alternating_highway_lstm" <allennlp.modules.stacked_alternating_lstm.StackedAlternatingLstm> (GPU only)
"stacked_self_attention"
"multi_head_self_attention"
"pass_through"
"feedforward"

class allennlp.modules.seq2seq_encoders.pytorch_seq2seq_wrapper.PytorchSeq2SeqWrapper(module: torch.nn.modules.module.Module, stateful: bool = False)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
Pytorch's RNNs have two outputs: the hidden state for every time step, and the hidden state at the last time step for every layer. We just want the first one as a single output. This wrapper pulls out that output and adds a get_output_dim() method, which is useful if you want to, e.g., define a linear + softmax layer on top of this to get some distribution over a set of labels. The linear layer needs to know its input dimension before it is called, and you can get that from get_output_dim().
In order to be wrapped with this wrapper, a class must have the following members:
self.input_size: int
self.hidden_size: int
def forward(inputs: PackedSequence, hidden_state: torch.Tensor) -> Tuple[PackedSequence, torch.Tensor]
self.bidirectional: bool (optional)
This is what pytorch's RNNs look like; just make sure your class looks like those, and it should work.
Note that we require you to pass a binary mask of shape (batch_size, sequence_length) when you call this module, to avoid subtle bugs around masking. If you already have a PackedSequence you can pass None as the second parameter.
We support stateful RNNs, where the final state from each batch is used as the initial state for the subsequent batch, by passing stateful=True to the constructor.
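For example, a usage sketch wrapping a standard torch.nn.GRU (all sizes here are made up for illustration):

    import torch
    from allennlp.modules.seq2seq_encoders.pytorch_seq2seq_wrapper import PytorchSeq2SeqWrapper

    # Wrap a batch-first, bidirectional GRU so it exposes get_input_dim()/get_output_dim()
    # and takes an explicit mask.
    encoder = PytorchSeq2SeqWrapper(
        torch.nn.GRU(input_size=50, hidden_size=100, batch_first=True, bidirectional=True)
    )

    inputs = torch.randn(4, 10, 50)             # (batch_size, sequence_length, input_dim)
    mask = torch.ones(4, 10, dtype=torch.long)  # binary mask; 0 marks padded positions
    outputs = encoder(inputs, mask)             # (4, 10, 200): hidden state for every time step
    print(encoder.get_output_dim())             # 200, i.e. 2 * hidden_size for a bidirectional GRU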
forward(self, inputs: torch.Tensor, mask: torch.Tensor, hidden_state: torch.Tensor = None) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

class allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder(stateful: bool = False)[source]¶
Bases: allennlp.modules.encoder_base._EncoderBase, allennlp.common.registrable.Registrable
A Seq2SeqEncoder is a Module that takes as input a sequence of vectors and returns a modified sequence of vectors. Input shape: (batch_size, sequence_length, input_dim); output shape: (batch_size, sequence_length, output_dim).
We add two methods to the basic Module API: get_input_dim() and get_output_dim(). You might need these if you want to construct a Linear layer using the output of this encoder, or to raise sensible errors for mismatching input dimensions.
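For example, a hedged sketch of using get_output_dim() to size a projection layer on top of an encoder (the label count is made up):

    import torch
    from allennlp.modules.seq2seq_encoders.pytorch_seq2seq_wrapper import PytorchSeq2SeqWrapper

    encoder = PytorchSeq2SeqWrapper(
        torch.nn.LSTM(input_size=300, hidden_size=128, batch_first=True)
    )
    num_labels = 7  # hypothetical tag set size
    # The Linear layer needs the encoder's output dimension before it sees any data.
    tag_projection = torch.nn.Linear(encoder.get_output_dim(), num_labels)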
get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.


class allennlp.modules.seq2seq_encoders.intra_sentence_attention.IntraSentenceAttentionEncoder(input_dim: int, projection_dim: int = None, similarity_function: allennlp.modules.similarity_functions.similarity_function.SimilarityFunction = DotProductSimilarity(), num_attention_heads: int = 1, combination: str = '1,2', output_dim: int = None)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
An IntraSentenceAttentionEncoder is a Seq2SeqEncoder that merges the original word representations with an attention (for each word) over other words in the sentence. As a Seq2SeqEncoder, the input to this module is of shape (batch_size, num_tokens, input_dim), and the output is of shape (batch_size, num_tokens, output_dim).
We compute the attention using a configurable SimilarityFunction, which could have multiple attention heads. The operation for merging the original representations with the attended representations is also configurable (e.g., you can concatenate them, add them, multiply them, etc.).
Parameters
input_dim : ``int``
    The dimension of the vector for each element in the input sequence; input_tensor.size(-1).
projection_dim : ``int``, optional
    If given, we will do a linear projection of the input sequence to this dimension before performing the attention-weighted sum.
similarity_function : ``SimilarityFunction``, optional
    The similarity function to use when computing attentions. Default is to use a dot product.
num_attention_heads : ``int``, optional
    If this is greater than one (default is 1), we will split the input into several "heads" to compute multi-headed weighted sums. Must be used with a multi-headed similarity function, and you almost certainly want to do a projection in conjunction with the multiple heads.
combination : ``str``, optional
    This string defines how we merge the original word representations with the result of the intra-sentence attention. This will be passed to combine_tensors(); see that function for more detail on exactly how this works, but some simple examples are "1,2" for concatenation (the default), "1+2" for adding the two, or "2" for only keeping the attention representation.
output_dim : ``int``, optional (default = None)
    The dimension of an optional output projection.
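A minimal construction sketch (dimensions are illustrative; the default dot-product similarity and "1,2" combination are assumed):

    import torch
    from allennlp.modules.seq2seq_encoders.intra_sentence_attention import IntraSentenceAttentionEncoder

    # With the default "1,2" combination the original and attended representations are
    # concatenated, so the output dimension is 2 * input_dim.
    encoder = IntraSentenceAttentionEncoder(input_dim=64)
    tokens = torch.randn(2, 7, 64)
    mask = torch.ones(2, 7)
    output = encoder(tokens, mask)  # (2, 7, 128)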

forward(self, tokens: torch.Tensor, mask: torch.Tensor)[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

class allennlp.modules.seq2seq_encoders.stacked_self_attention.StackedSelfAttentionEncoder(input_dim: int, hidden_dim: int, projection_dim: int, feedforward_hidden_dim: int, num_layers: int, num_attention_heads: int, use_positional_encoding: bool = True, dropout_prob: float = 0.1, residual_dropout_prob: float = 0.2, attention_dropout_prob: float = 0.1)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
Implements a stacked self-attention encoder similar to the Transformer architecture in Attention is all you Need.
This encoder combines 3 layers in a ‘block’:
A 2 layer FeedForward network.
Multi-headed self attention, which uses 2 learnt linear projections to perform a dot-product similarity between every pair of elements scaled by the square root of the sequence length.
Layer Normalisation.
These are then stacked into num_layers layers.
Parameters
input_dim : ``int``, required.
    The input dimension of the encoder.
hidden_dim : ``int``, required.
    The hidden dimension used for the _input_ to self attention layers and the _output_ from the feedforward layers.
projection_dim : ``int``, required.
    The dimension of the linear projections for the self-attention layers.
feedforward_hidden_dim : ``int``, required.
    The middle dimension of the FeedForward network. The input and output dimensions are fixed to ensure sizes match up for the self attention layers.
num_layers : ``int``, required.
    The number of stacked self attention -> feedforward -> layer normalisation blocks.
num_attention_heads : ``int``, required.
    The number of attention heads to use per layer.
use_positional_encoding : ``bool``, optional, (default = True)
    Whether to add sinusoidal frequencies to the input tensor. This is strongly recommended, as without this feature, the self attention layers have no idea of absolute or relative position (as they are just computing pairwise similarity between vectors of elements), which can be important features for many tasks.
dropout_prob : ``float``, optional, (default = 0.1)
    The dropout probability for the feedforward network.
residual_dropout_prob : ``float``, optional, (default = 0.2)
    The dropout probability for the residual connections.
attention_dropout_prob : ``float``, optional, (default = 0.1)
    The dropout probability for the attention distributions in each attention layer.
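A hedged construction sketch (all dimensions are illustrative; projection_dim must be divisible by num_attention_heads):

    import torch
    from allennlp.modules.seq2seq_encoders.stacked_self_attention import StackedSelfAttentionEncoder

    encoder = StackedSelfAttentionEncoder(
        input_dim=64,
        hidden_dim=64,
        projection_dim=64,
        feedforward_hidden_dim=128,
        num_layers=2,
        num_attention_heads=4,
    )
    inputs = torch.randn(2, 9, 64)
    mask = torch.ones(2, 9)
    output = encoder(inputs, mask)  # (2, 9, 64); the output size follows hidden_dim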

forward(self, inputs: torch.Tensor, mask: torch.Tensor)[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

class allennlp.modules.seq2seq_encoders.multi_head_self_attention.MultiHeadSelfAttention(num_heads: int, input_dim: int, attention_dim: int, values_dim: int, output_projection_dim: int = None, attention_dropout_prob: float = 0.1)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
This class implements the key-value scaled dot product attention mechanism detailed in the paper Attention is all you Need.
The attention mechanism is a weighted sum of a projection V of the inputs, with respect to the scaled, normalised dot product of Q and K, which are also both linear projections of the input. This procedure is repeated for each attention head, using different parameters.
Parameters
num_heads : ``int``, required.
    The number of attention heads to use.
input_dim : ``int``, required.
    The size of the last dimension of the input tensor.
attention_dim : ``int``, required.
    The total dimension of the query and key projections which comprise the dot product attention function. Must be divisible by num_heads.
values_dim : ``int``, required.
    The total dimension which the input is projected to for representing the values, which are combined using the attention. Must be divisible by num_heads.
output_projection_dim : ``int``, optional (default = None)
    The dimensionality of the final output projection. If this is not passed explicitly, the projection has size input_dim.
attention_dropout_prob : ``float``, optional (default = 0.1)
    The dropout probability applied to the normalised attention distributions.
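A minimal usage sketch (sizes are made up; attention_dim and values_dim must both be divisible by num_heads):

    import torch
    from allennlp.modules.seq2seq_encoders.multi_head_self_attention import MultiHeadSelfAttention

    attention = MultiHeadSelfAttention(
        num_heads=4,
        input_dim=64,
        attention_dim=32,
        values_dim=32,
    )
    inputs = torch.randn(2, 11, 64)
    mask = torch.ones(2, 11)
    output = attention(inputs, mask)  # (2, 11, 64): the output projection defaults to input_dim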

forward(self, inputs: torch.Tensor, mask: torch.LongTensor = None) → torch.FloatTensor[source]¶
Parameters
inputs : ``torch.FloatTensor``, required.
    A tensor of shape (batch_size, timesteps, input_dim)
mask : ``torch.FloatTensor``, optional (default = None).
    A tensor of shape (batch_size, timesteps).
Returns
A tensor of shape (batch_size, timesteps, output_projection_dim), where output_projection_dim = input_dim by default.

get_input_dim(self)[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

class allennlp.modules.seq2seq_encoders.pass_through_encoder.PassThroughEncoder(input_dim: int)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
This class allows you to specify skipping a Seq2SeqEncoder just by changing a configuration file. This is useful for ablations and measuring the impact of different elements of your model.
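For illustration, a short sketch; the encoder simply returns its input unchanged:

    import torch
    from allennlp.modules.seq2seq_encoders.pass_through_encoder import PassThroughEncoder

    encoder = PassThroughEncoder(input_dim=32)
    inputs = torch.randn(2, 5, 32)
    output = encoder(inputs)  # same shape as the input: (2, 5, 32)
    assert encoder.get_output_dim() == 32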
forward(self, inputs: torch.Tensor, mask: torch.LongTensor = None) → torch.Tensor[source]¶
Parameters
inputs : ``torch.Tensor``, required.
    A tensor of shape (batch_size, timesteps, input_dim)
mask : ``torch.LongTensor``, optional (default = None).
    A tensor of shape (batch_size, timesteps).
Returns
A tensor of shape (batch_size, timesteps, output_dim), where output_dim = input_dim.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.


class allennlp.modules.seq2seq_encoders.gated_cnn_encoder.GatedCnnEncoder(input_dim: int, layers: Sequence[Sequence[Sequence[int]]], dropout: float = 0.0, return_all_layers: bool = False)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
This is work-in-progress and has not been fully tested yet. Use at your own risk!
A Seq2SeqEncoder that uses a Gated CNN. See:
Language Modeling with Gated Convolutional Networks, Yann N. Dauphin et al, ICML 2017 https://arxiv.org/abs/1612.08083
Convolutional Sequence to Sequence Learning, Jonas Gehring et al, ICML 2017 https://arxiv.org/abs/1705.03122
Some possibilities:
Each element of the list is wrapped in a residual block:
    input_dim = 512
    layers = [ [[4, 512]], [[4, 512], [4, 512]], [[4, 512], [4, 512]], [[4, 512], [4, 512]] ]
    dropout = 0.05
A "bottleneck architecture":
    input_dim = 512
    layers = [ [[4, 512]], [[1, 128], [5, 128], [1, 512]], ... ]
An architecture with dilated convolutions:
    input_dim = 512
    layers = [
        [[2, 512, 1]], [[2, 512, 2]], [[2, 512, 4]], [[2, 512, 8]],   # receptive field == 16
        [[2, 512, 1]], [[2, 512, 2]], [[2, 512, 4]], [[2, 512, 8]],   # receptive field == 31
        [[2, 512, 1]], [[2, 512, 2]], [[2, 512, 4]], [[2, 512, 8]],   # receptive field == 46
        [[2, 512, 1]], [[2, 512, 2]], [[2, 512, 4]], [[2, 512, 8]],   # receptive field == 57
    ]
Parameters
input_dim : ``int``
    The dimension of the inputs.
layers : ``Sequence[Sequence[Sequence[int]]]``
    The layer dimensions for each ResidualBlock.
dropout : ``float``, optional (default = 0.0)
    The dropout for each ResidualBlock.
return_all_layers : ``bool``, optional (default: False)
    Whether to return all layers or just the last layer.
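A hedged construction sketch using the residual-block layer specification from the first example above, with smaller dimensions; the output-size comment is an assumption based on this encoder being bidirectional:

    import torch
    from allennlp.modules.seq2seq_encoders.gated_cnn_encoder import GatedCnnEncoder

    encoder = GatedCnnEncoder(
        input_dim=32,
        layers=[[[4, 32]], [[4, 32], [4, 32]]],
        dropout=0.05,
    )
    token_embeddings = torch.randn(2, 6, 32)
    mask = torch.ones(2, 6)
    output = encoder(token_embeddings, mask)
    print(encoder.get_output_dim())  # assumed 2 * input_dim: forward and backward convolutions concatenated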

forward(self, token_embeddings: torch.Tensor, mask: torch.Tensor)[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

class allennlp.modules.seq2seq_encoders.gated_cnn_encoder.ResidualBlock(input_dim: int, layers: Sequence[Sequence[int]], direction: str, do_weight_norm: bool = True, dropout: float = 0.0)[source]¶
Bases: torch.nn.modules.module.Module

forward(self, x: torch.Tensor) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

The BidirectionalTransformerEncoder from Calypso. This is basically the transformer from https://nlp.seas.harvard.edu/2018/04/03/attention.html, so credit to them.
This code should be considered “private” in that we have several transformer implementations and may end up deleting this one. If you use it, consider yourself warned.

class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.BidirectionalLanguageModelTransformer(input_dim: int, hidden_dim: int, num_layers: int, dropout: float = 0.1, input_dropout: float = None, return_all_layers: bool = False)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
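A hedged construction sketch (the module is flagged as private above, so treat this purely as illustration; all sizes are made up):

    import torch
    from allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer import (
        BidirectionalLanguageModelTransformer,
    )

    encoder = BidirectionalLanguageModelTransformer(input_dim=32, hidden_dim=64, num_layers=2)
    token_embeddings = torch.randn(2, 8, 32)
    mask = torch.ones(2, 8)
    output = encoder(token_embeddings, mask)
    print(encoder.get_output_dim())  # assumed 2 * input_dim: forward and backward passes concatenated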

forward(self, token_embeddings: torch.Tensor, mask: torch.Tensor) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_attention_masks(self, mask: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶
Returns 2 masks of shape (batch_size, timesteps, timesteps) representing 1) non-padded elements, and 2) elements of the sequence which are permitted to be involved in attention at a given timestep.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.


class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.EncoderLayer(size: int, self_attn: torch.nn.modules.module.Module, feed_forward: torch.nn.modules.module.Module, dropout: float)[source]¶
Bases: torch.nn.modules.module.Module
Encoder is made up of self-attn and feed forward (defined below)

class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.MultiHeadedAttention(num_heads: int, input_dim: int, dropout: float = 0.1)[source]¶
Bases: torch.nn.modules.module.Module

forward(self, query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, mask: torch.Tensor = None) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.PositionalEncoding(input_dim: int, max_len: int = 5000)[source]¶
Bases: torch.nn.modules.module.Module
Implement the Positional Encoding function.
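For reference, a sketch of the sinusoidal encoding from Attention is all you Need that this module implements; this is a standalone reimplementation for illustration (it assumes an even input_dim), not necessarily the exact code in this class:

    import math
    import torch

    def sinusoidal_encoding(max_len: int, input_dim: int) -> torch.Tensor:
        # positions: (max_len, 1); frequencies follow the 1 / 10000^(2i / input_dim) schedule.
        positions = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, input_dim, 2).float() * -(math.log(10000.0) / input_dim))
        encoding = torch.zeros(max_len, input_dim)
        encoding[:, 0::2] = torch.sin(positions * div_term)  # even dimensions
        encoding[:, 1::2] = torch.cos(positions * div_term)  # odd dimensions
        return encoding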

forward(self, x: torch.Tensor) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.PositionwiseFeedForward(input_dim: int, ff_dim: int, dropout: float = 0.1)[source]¶
Bases: torch.nn.modules.module.Module
Implements FFN equation.

forward(self, x: torch.Tensor) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.


class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.SublayerConnection(size: int, dropout: float)[source]¶
Bases: torch.nn.modules.module.Module
A residual connection followed by a layer norm. Note for code simplicity the norm is first as opposed to last.

class allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.TransformerEncoder(layer: torch.nn.modules.module.Module, num_layers: int, return_all_layers: bool = False)[source]¶
Bases: torch.nn.modules.module.Module
Core encoder is a stack of N layers

allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, mask: torch.Tensor = None, dropout: Callable = None) → Tuple[torch.Tensor, torch.Tensor][source]¶
Compute 'Scaled Dot Product Attention'

allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.make_model(num_layers: int = 6, input_size: int = 512, hidden_size: int = 2048, heads: int = 8, dropout: float = 0.1, return_all_layers: bool = False) → allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.TransformerEncoder[source]¶
Helper: Construct a model from hyperparameters.

allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer.subsequent_mask(size: int, device: str = 'cpu') → torch.Tensor[source]¶
Mask out subsequent positions.
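A short sketch exercising the two helpers above; the mask shape noted in the comment follows the annotated-transformer convention this code is based on, so treat it as an assumption:

    from allennlp.modules.seq2seq_encoders.bidirectional_language_model_transformer import (
        make_model,
        subsequent_mask,
    )

    # A small transformer stack built from hyperparameters.
    transformer = make_model(num_layers=2, input_size=64, hidden_size=128, heads=4)

    # A mask that lets each position attend only to itself and earlier positions.
    causal_mask = subsequent_mask(10)
    print(causal_mask.shape)  # expected (1, 10, 10)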

class allennlp.modules.seq2seq_encoders.feedforward_encoder.FeedForwardEncoder(feedforward: allennlp.modules.feedforward.FeedForward)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
This class applies the FeedForward to each item in the sequence.

forward(self, inputs: torch.Tensor, mask: torch.LongTensor = None) → torch.Tensor[source]¶
Parameters
inputs : ``torch.Tensor``, required.
    A tensor of shape (batch_size, timesteps, input_dim)
mask : ``torch.LongTensor``, optional (default = None).
    A tensor of shape (batch_size, timesteps).
Returns
A tensor of shape (batch_size, timesteps, output_dim).
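A hedged sketch wrapping a two-layer FeedForward; the Activation.by_name lookup is part of the allennlp.nn API and is assumed here:

    import torch
    from allennlp.modules.feedforward import FeedForward
    from allennlp.modules.seq2seq_encoders.feedforward_encoder import FeedForwardEncoder
    from allennlp.nn import Activation

    feedforward = FeedForward(
        input_dim=16,
        num_layers=2,
        hidden_dims=[32, 16],
        activations=[Activation.by_name("relu")(), Activation.by_name("linear")()],
    )
    encoder = FeedForwardEncoder(feedforward)
    inputs = torch.randn(2, 5, 16)
    output = encoder(inputs)  # (2, 5, 16): the FeedForward is applied to every position independently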

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.


class allennlp.modules.seq2seq_encoders.qanet_encoder.QaNetEncoder(input_dim: int, hidden_dim: int, attention_projection_dim: int, feedforward_hidden_dim: int, num_blocks: int, num_convs_per_block: int, conv_kernel_size: int, num_attention_heads: int, use_positional_encoding: bool = True, dropout_prob: float = 0.1, layer_dropout_undecayed_prob: float = 0.1, attention_dropout_prob: float = 0)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
Stack multiple QANetEncoderBlock into one sequence encoder.
Parameters
input_dim : ``int``, required.
    The input dimension of the encoder.
hidden_dim : ``int``, required.
    The hidden dimension used for convolution output channels, multi-head attention output, and the final output of the feedforward layer.
attention_projection_dim : ``int``, required.
    The dimension of the linear projections for the self-attention layers.
feedforward_hidden_dim : ``int``, required.
    The middle dimension of the FeedForward network. The input and output dimensions are fixed to ensure sizes match up for the self attention layers.
num_blocks : ``int``, required.
    The number of stacked encoder blocks.
num_convs_per_block : ``int``, required.
    The number of convolutions in each block.
conv_kernel_size : ``int``, required.
    The kernel size for convolution.
num_attention_heads : ``int``, required.
    The number of attention heads to use per layer.
use_positional_encoding : ``bool``, optional, (default = True)
    Whether to add sinusoidal frequencies to the input tensor. This is strongly recommended, as without this feature, the self attention layers have no idea of absolute or relative position (as they are just computing pairwise similarity between vectors of elements), which can be important features for many tasks.
dropout_prob : ``float``, optional, (default = 0.1)
    The dropout probability for the feedforward network.
layer_dropout_undecayed_prob : ``float``, optional, (default = 0.1)
    The initial dropout probability for layer dropout, and this might decay w.r.t. the depth of the layer. For each mini-batch, the convolution/attention/ffn sublayer is stochastically dropped according to its layer dropout probability.
attention_dropout_prob : ``float``, optional, (default = 0)
    The dropout probability for the attention distributions in the attention layer.
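A hedged construction sketch (all sizes are illustrative; attention_projection_dim must be divisible by num_attention_heads, and the shape comment assumes the blocks keep hidden_dim as the output size):

    import torch
    from allennlp.modules.seq2seq_encoders.qanet_encoder import QaNetEncoder

    encoder = QaNetEncoder(
        input_dim=16,
        hidden_dim=32,
        attention_projection_dim=32,
        feedforward_hidden_dim=32,
        num_blocks=2,
        num_convs_per_block=2,
        conv_kernel_size=5,
        num_attention_heads=4,
    )
    inputs = torch.randn(2, 12, 16)
    mask = torch.ones(2, 12)
    output = encoder(inputs, mask)  # (2, 12, 32)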

forward(self, inputs: torch.Tensor, mask: torch.Tensor = None) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.

class allennlp.modules.seq2seq_encoders.qanet_encoder.QaNetEncoderBlock(input_dim: int, hidden_dim: int, attention_projection_dim: int, feedforward_hidden_dim: int, num_convs: int, conv_kernel_size: int, num_attention_heads: int, use_positional_encoding: bool = True, dropout_prob: float = 0.1, layer_dropout_undecayed_prob: float = 0.1, attention_dropout_prob: float = 0)[source]¶
Bases: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder
Implements the encoder block described in QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension.
One encoder block mainly contains 4 parts:
Add position embedding.
Several depthwise separable convolutions.
Multi-headed self attention, which uses 2 learnt linear projections to perform a dot-product similarity between every pair of elements scaled by the square root of the sequence length.
A two-layer FeedForward network.
Parameters
input_dim : ``int``, required.
    The input dimension of the encoder.
hidden_dim : ``int``, required.
    The hidden dimension used for convolution output channels, multi-head attention output, and the final output of the feedforward layer.
attention_projection_dim : ``int``, required.
    The dimension of the linear projections for the self-attention layers.
feedforward_hidden_dim : ``int``, required.
    The middle dimension of the FeedForward network. The input and output dimensions are fixed to ensure sizes match up for the self attention layers.
num_convs : ``int``, required.
    The number of convolutions in each block.
conv_kernel_size : ``int``, required.
    The kernel size for convolution.
num_attention_heads : ``int``, required.
    The number of attention heads to use per layer.
use_positional_encoding : ``bool``, optional, (default = True)
    Whether to add sinusoidal frequencies to the input tensor. This is strongly recommended, as without this feature, the self attention layers have no idea of absolute or relative position (as they are just computing pairwise similarity between vectors of elements), which can be important features for many tasks.
dropout_prob : ``float``, optional, (default = 0.1)
    The dropout probability for the feedforward network.
layer_dropout_undecayed_prob : ``float``, optional, (default = 0.1)
    The initial dropout probability for layer dropout, and this might decay w.r.t. the depth of the layer. For each mini-batch, the convolution/attention/ffn sublayer is randomly dropped according to its layer dropout probability.
attention_dropout_prob : ``float``, optional, (default = 0)
    The dropout probability for the attention distributions in the attention layer.

forward(self, inputs: torch.Tensor, mask: torch.Tensor = None) → torch.Tensor[source]¶
Defines the computation performed at every call. Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_input_dim(self) → int[source]¶
Returns the dimension of the vector input for each element in the sequence input to a Seq2SeqEncoder. This is not the shape of the input tensor, but the last element of that shape.