class allennlp.models.bidirectional_lm.BidirectionalLanguageModel(vocab: allennlp.data.vocabulary.Vocabulary, text_field_embedder: allennlp.modules.text_field_embedders.text_field_embedder.TextFieldEmbedder, contextualizer: allennlp.modules.seq2seq_encoders.seq2seq_encoder.Seq2SeqEncoder, dropout: float = None, loss_scale: typing.Union[float, str] = 1.0, num_samples: int = None, sparse_embeddings: bool = False, initializer: allennlp.nn.initializers.InitializerApplicator = None) → None

Bases: allennlp.models.model.Model

The BidirectionalLanguageModel applies a bidirectional “contextualizing” Seq2SeqEncoder to uncontextualized embeddings and computes the language modeling loss with a softmax loss module (_SoftmaxLoss, defined in the same source module).

It is IMPORTANT that your bidirectional Seq2SeqEncoder does not do any “peeking ahead”. That is, for its forward direction it should only consider embeddings at previous timesteps, and for its backward direction only embeddings at subsequent timesteps. If this condition is not met, your language model is cheating.
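One way to test the no-peeking condition for the forward direction is to perturb a single timestep and confirm that no earlier output changes. The sketch below is a toy illustration in plain Python, not part of AllenNLP: `running_sum`, `centered_average`, and `forward_direction_peeks` are hypothetical names, and scalar inputs stand in for embeddings.

```python
from typing import Callable, List, Sequence

# Hypothetical type for a toy one-direction encoder over scalar "embeddings".
Encoder = Callable[[Sequence[float]], List[float]]


def running_sum(xs: Sequence[float]) -> List[float]:
    """Toy forward-direction encoder: output t depends only on inputs 0..t."""
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out


def centered_average(xs: Sequence[float]) -> List[float]:
    """Toy encoder that peeks ahead: output t also uses input t + 1."""
    out = []
    for t in range(len(xs)):
        window = xs[max(0, t - 1): t + 2]
        out.append(sum(window) / len(window))
    return out


def forward_direction_peeks(encoder: Encoder, xs: Sequence[float],
                            delta: float = 1.0) -> bool:
    """Return True if changing input t alters any output before t."""
    base = encoder(xs)
    for t in range(1, len(xs)):
        perturbed = list(xs)
        perturbed[t] += delta
        changed = encoder(perturbed)
        if any(abs(a - b) > 1e-9 for a, b in zip(base[:t], changed[:t])):
            return True
    return False


if __name__ == "__main__":
    inputs = [1.0, 2.0, 3.0, 4.0]
    print(forward_direction_peeks(running_sum, inputs))
    print(forward_direction_peeks(centered_average, inputs))
```

The same idea applies to the backward direction by reversing the sequence before the check. For a real contextualizer, the perturbation would be applied to an embedding tensor rather than a list of floats.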

vocab: ``Vocabulary``
text_field_embedder: ``TextFieldEmbedder``

Used to embed the indexed tokens we get in forward.

contextualizer: ``Seq2SeqEncoder``

Used to “contextualize” the embeddings. As described above, this encoder must not cheat by peeking ahead.

dropout: ``float``, optional (default: None)

If specified, dropout is applied to the contextualized embeddings before computation of the softmax. The contextualized embeddings themselves are returned without dropout.

loss_scale: ``Union[float, str]``, optional (default: 1.0)

This scaling factor is applied to the average language model loss. You can also specify "n_samples", in which case the total loss across all predictions is computed instead of the average.
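The effect of loss_scale can be illustrated with a small plain-Python sketch. It assumes that the average loss is the mean of the forward and backward negative log likelihoods per prediction, and that "n_samples" amounts to multiplying that average by the number of predictions. The function name `scale_lm_loss` and its arguments are hypothetical and are not part of the AllenNLP API.

```python
from typing import Union


def scale_lm_loss(forward_nll_sum: float,
                  backward_nll_sum: float,
                  num_targets: int,
                  loss_scale: Union[float, str] = 1.0) -> float:
    """Sketch of loss scaling under the assumptions stated in the lead-in."""
    if num_targets <= 0:
        raise ValueError("num_targets must be positive")
    # Average the two directions and normalise by the number of predictions.
    average_loss = 0.5 * (forward_nll_sum + backward_nll_sum) / num_targets
    if loss_scale == "n_samples":
        # Assumed behaviour: undo the per-prediction normalisation.
        return average_loss * num_targets
    return average_loss * float(loss_scale)


if __name__ == "__main__":
    print(scale_lm_loss(6.0, 10.0, num_targets=4))
    print(scale_lm_loss(6.0, 10.0, num_targets=4, loss_scale=0.5))
    print(scale_lm_loss(6.0, 10.0, num_targets=4, loss_scale="n_samples"))
```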

num_samples: ``int``, optional (default: None)

If provided, the model will use SampledSoftmaxLoss with the specified number of samples. Otherwise, it will use the full _SoftmaxLoss defined in the same source module.

sparse_embeddings: ``bool``, optional (default: False)

Passed on to SampledSoftmaxLoss if True.

delete_softmax() → None

Remove the softmax weights. Useful for saving memory when calculating the loss is not necessary, e.g. in an embedder.

forward(source: typing.Dict[str, torch.LongTensor]) → typing.Dict[str, torch.Tensor]

Computes the averaged forward and backward LM loss from the batch.

By convention, the input dict is required to have at least a "tokens" entry that’s the output of a SingleIdTokenIndexer, which is used to compute the language model targets.

source: ``Dict[str, torch.LongTensor]``, required.

The output of Batch.as_tensor_dict() for a batch of sentences.
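How language model targets can be derived from the token ids is sketched below in plain Python. It is an illustration under stated assumptions, not the library's code: the forward target at each position is taken to be the next token, the backward target the previous token, and id 0 is assumed to mean "no target" (padding). The name `lm_targets` is hypothetical.

```python
from typing import List, Tuple


def lm_targets(token_ids: List[List[int]],
               padding_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Shift each row of token ids to get forward and backward targets."""
    forward_targets, backward_targets = [], []
    for row in token_ids:
        if not row:
            # Keep empty rows empty so shapes stay aligned with the input.
            forward_targets.append([])
            backward_targets.append([])
            continue
        # Forward direction predicts the next token; the last position has none.
        forward_targets.append(row[1:] + [padding_id])
        # Backward direction predicts the previous token; the first has none.
        backward_targets.append([padding_id] + row[:-1])
    return forward_targets, backward_targets


if __name__ == "__main__":
    fwd, bwd = lm_targets([[5, 6, 7]])
    print(fwd)
    print(bwd)
```

Positions whose target equals the padding id would then be excluded from the loss.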

Dict with keys:
``'loss'``: ``torch.Tensor``

averaged forward/backward negative log likelihood

``'forward_loss'``: ``torch.Tensor``

forward direction negative log likelihood

``'backward_loss'``: ``torch.Tensor``

backward direction negative log likelihood

``'lm_embeddings'``: ``Union[torch.Tensor, List[torch.Tensor]]``

(batch_size, timesteps, embed_dim) tensor of top layer contextual representations or list of all layers. No dropout applied.

``'noncontextual_token_embeddings'``: ``torch.Tensor``

(batch_size, timesteps, token_embed_dim) tensor of bottom layer noncontextual representations

``'mask'``: ``torch.Tensor``

(batch_size, timesteps) mask for the embeddings

num_layers() → int

Returns the depth of this LM. That is, how many layers the contextualizer has plus one for the non-contextual layer.