Configuring Experiments#

Now that we know how to train and evaluate models, let's take a deeper look at our experiment configuration file, tutorials/getting_started/walk_through_allennlp/simple_tagger.json.

The configuration is a Jsonnet file that defines all the parameters for our experiment and model. Don't worry if you're not familiar with Jsonnet: any JSON file is valid Jsonnet, and indeed the configuration file we use in this tutorial is just JSON.

In this tutorial we'll go through each section of the configuration file in detail, explaining what all the parameters mean.

A preliminary: Registrable and from_params#

Most AllenNLP classes inherit from the Registrable base class, which gives them a named registry for their subclasses. This means that if we had a Model(Registrable) base class (we do), and we decorated a subclass like

@Model.register("custom")
class CustomModel(Model):
    ...

then we would be able to recover the CustomModel class using

Model.by_name("custom")

By convention, all such classes have a from_params factory method that allows you to instantiate instances from a Params object, which is basically a dict of parameters with some added functionality that we won't get into here.
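For example, a Params object behaves much like a dictionary. Here's a minimal sketch (with made-up parameter names) of how it's typically consumed:

from allennlp.common import Params

params = Params({"type": "custom", "hidden_dim": 100})

model_type = params.pop("type")        # "custom" -- works like dict.pop
hidden_dim = params.pop("hidden_dim")  # 100
params.assert_empty("CustomModel")     # complains if any parameters were never used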

This is how AllenNLP is able to use configuration files to instantiate the objects it needs. It can do (in essence):

# Grab the part of the `config` that defines the model
model_params = config.pop("model")

# Find out which model subclass we want
model_name = model_params.pop("type")

# Instantiate that subclass with the remaining model params
model = Model.by_name(model_name).from_params(model_params)

Because a class doesn't get registered until it's loaded, any code that uses BaseClass.by_name('subclass_name') must have already imported the code for the subclass. In particular, this means that once you start creating your own named models and helper classes, the allennlp command will not be aware of them unless you specify them with --include-package. You can also create your own script that imports your custom classes and then calls allennlp.commands.main().
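A sketch of such a driver script might look like this, where my_library is a placeholder for whatever package contains your registered classes:

import my_library  # importing this package runs its @register decorators

from allennlp.commands import main

if __name__ == "__main__":
    # Behaves just like the `allennlp` command, but knows about your classes.
    main()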

Batches, Instances, and Fields#

We train and evaluate our models on Batches. A Batch is a collection of Instances. In our tagging experiment, each dataset is a collection of tagged sentences, and each instance is one of those tagged sentences.

An instance consists of Fields, each of which represents some part of the instance as arrays suitable for feeding into a model.

In our tagging setup, each instance will contain a TextField representing the words/tokens of the sentence and a SequenceLabelField representing the corresponding part-of-speech tags.
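For example, one of our tagged sentences would (roughly) become an Instance like the following. This is just a sketch to show the shape of things; the DatasetReader described below builds these for us:

from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.token_indexers import SingleIdTokenIndexer
from allennlp.data.tokenizers import Token

tokens = [Token(word) for word in ["The", "detectives", "placed", "Barco", "under", "arrest"]]
tags = ["at", "nns", "vbd", "np", "in", "nn"]

# The TextField holds the words; the SequenceLabelField holds one tag per word.
sentence_field = TextField(tokens, token_indexers={"tokens": SingleIdTokenIndexer()})
instance = Instance({"tokens": sentence_field,
                     "tags": SequenceLabelField(tags, sentence_field)})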

How do we turn a text file full of sentences into Batches? With a DatasetReader specified by our configuration file.

DatasetReaders#

The first section of our configuration file defines the dataset_reader:

  "dataset_reader": {
    "type": "sequence_tagging",
    "word_tag_delimiter": "/",
    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "token_characters": {
        "type": "characters"
      }
    }
  }

Here we've specified that we want to use the DatasetReader subclass that's registered under the name "sequence_tagging". Unsurprisingly, this is the SequenceTaggingDatasetReader subclass. This reader assumes a text file of newline-separated sentences, where each sentence looks like the following for some "word tag delimiter" {wtd} and some "token delimiter" {td}.

word1{wtd}tag1{td}word2{wtd}tag2{td}...{td}wordn{wtd}tagn

Our data files look like

The/at detectives/nns placed/vbd Barco/np under/in arrest/nn

which is why we need to specify

    "word_tag_delimiter": "/",

We don't need to specify anything for the "token delimiter", since the default split-on-whitespace behavior is already correct.

If you look at the code for SequenceTaggingDatasetReader.read(), it turns each sentence into a TextField of tokens and a SequenceLabelField of tags. The latter isn't really configurable, but the former wants a dictionary of TokenIndexers that indicate how to convert the tokens into arrays.

Our configuration specifies two token indexers:

    "token_indexers": {
      "tokens": {
        "type": "single_id",
        "lowercase_tokens": true
      },
      "token_characters": {
        "type": "characters"
      }
    }

The first, "tokens", is a SingleIdTokenIndexer that just represents each token (word) as a single integer. The configuration also specifies that we lowercase the tokens before encoding; that is, that this token indexer should ignore case.

The second, "token_characters", is a TokenCharactersIndexer that represents each token as a list of int-encoded characters.

Notice that this gives us two different encodings for each token. Each encoding has a name, in this case "tokens" and "token_characters", and these names will be referenced later by the model.
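If you were building the reader in code rather than from configuration, the equivalent would look roughly like this sketch (assuming the AllenNLP version this tutorial targets):

from allennlp.data.dataset_readers import SequenceTaggingDatasetReader
from allennlp.data.token_indexers import SingleIdTokenIndexer, TokenCharactersIndexer

reader = SequenceTaggingDatasetReader(
    word_tag_delimiter="/",
    token_indexers={
        "tokens": SingleIdTokenIndexer(lowercase_tokens=True),
        "token_characters": TokenCharactersIndexer()
    })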

Training and Validation Data#

The next section specifies the data to train and validate the model on:

  "train_data_path": "https://allennlp.s3.amazonaws.com/datasets/getting-started/sentences.small.train",
  "validation_data_path": "https://allennlp.s3.amazonaws.com/datasets/getting-started/sentences.small.dev",

They can be specified either as local paths on your machine or as URLs to files hosted on, for example, Amazon S3. In the latter case, AllenNLP will cache (and reuse) the downloaded files in ~/.allennlp/datasets, using the ETag to determine when to download a new version.
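Under the hood this goes through AllenNLP's cached_path helper, which you can also call directly; a quick sketch:

from allennlp.common.file_utils import cached_path

# Downloads the file if it isn't already cached and returns the local path;
# a later call reuses the cached copy unless the ETag has changed.
local_path = cached_path(
    "https://allennlp.s3.amazonaws.com/datasets/getting-started/sentences.small.train")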

The Model#

The next section configures our model.

  "model": {
    "type": "simple_tagger",

This indicates we want to use the Model subclass that's registered as "simple_tagger", which is the SimpleTagger model.

If you look at its code, you'll see it consists of a TextFieldEmbedder that embeds the output of our text fields, a Seq2SeqEncoder that transforms that sequence into an output sequence, and a linear layer that converts the encoder outputs into logits over the possible tags. (The last layer is not configurable and so won't appear in our configuration file.)

The Text Field Embedder#

Let's first look at the text field embedder configuration:

  "text_field_embedder": {
    "tokens": {
      "type": "embedding",
      "embedding_dim": 50
    },
    "token_characters": {
      "type": "character_encoding",
      "embedding": {
        "embedding_dim": 8
      },
      "encoder": {
        "type": "cnn",
        "embedding_dim": 8,
        "num_filters": 50,
        "ngram_filter_sizes": [
          5
        ]
      },
      "dropout": 0.2
    }
  },

You can see that it has an entry for each of the named encodings in our TextField. Each entry specifies a TokenEmbedder that indicates how to embed the tokens encoded with that name. The TextFieldEmbedder's output is the concatenation of these embeddings.

The "tokens" input (which consists of integer encodings of the lowercased words in the input) gets fed into an Embedding module that embeds the vocabulary words in a 50-dimensional space, as specified by the embedding_dim parameter.

The "token_characters" input (which consists of integer-sequence encodings of the characters in each word) gets fed into a TokenCharactersEncoder, which embeds the characters in an 8-dimensional space and then applies a CnnEncoder that uses 50 filters and so also produces a 50-dimensional output. You can see that this encoder also uses a 20% dropout during training.

The output of this TextFieldEmbedder is a 50-dimensional vector for "tokens" concatenated with a 50-dimensional vector for "token_characters"; that is, a 100-dimensional vector.
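For reference, a hand-built equivalent of this embedder would look something like the sketch below; the vocabulary sizes are placeholders that would normally come from the Vocabulary:

from allennlp.modules.text_field_embedders import BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding, TokenCharactersEncoder
from allennlp.modules.seq2vec_encoders import CnnEncoder

word_embedding = Embedding(num_embeddings=10000, embedding_dim=50)  # placeholder vocab size
char_embedding = Embedding(num_embeddings=100, embedding_dim=8)     # placeholder vocab size
char_encoder = TokenCharactersEncoder(
    embedding=char_embedding,
    encoder=CnnEncoder(embedding_dim=8, num_filters=50, ngram_filter_sizes=(5,)),
    dropout=0.2)

# Concatenates the 50-dim word embedding with the 50-dim character CNN output.
text_field_embedder = BasicTextFieldEmbedder({"tokens": word_embedding,
                                              "token_characters": char_encoder})
# text_field_embedder.get_output_dim() -> 100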

Because both the encoding of TextFields and the TextFieldEmbedder are configurable in this way, it is trivial to experiment with different word representations as input to your model, switching between simple word embeddings, word embeddings concatenated with a character-level CNN, or even using a pre-trained model to get word-in-context embeddings, without changing a single line of code.

The Seq2SeqEncoder#

The output of the TextFieldEmbedder is processed by the "encoder", which needs to be a Seq2SeqEncoder:

    "encoder": {
      "type": "lstm",
      "input_size": 100,
      "hidden_size": 100,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    }

Here the "lstm" encoder is just a thin wrapper around torch.nn.LSTM, and its parameters are simply passed through to the PyTorch constructor. Its input size needs to match the 100-dimensional output size of the previous embedding layer.

And, as mentioned above, the output of this layer gets passed to a linear layer that doesn't need any configuration. That's all for the model.

Training the Model#

The rest of the config file is dedicated to the training process.

  "iterator": {"type": "basic", "batch_size": 32},

We'll iterate over our datasets using a BasicIterator that pads our data and processes it in batches of size 32.
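The in-code equivalent is a short sketch:

from allennlp.data.iterators import BasicIterator

iterator = BasicIterator(batch_size=32)
# The iterator needs the vocabulary before it can turn strings into indices:
# iterator.index_with(vocab)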

  "trainer": {
    "optimizer": "adam",
    "num_epochs": 40,
    "patience": 10,
    "cuda_device": -1
  }
}

Finally, we'll optimize using torch.optim.Adam with its default parameters; we'll run the training for 40 epochs; we'll stop early if we get no improvement for 10 epochs; and we'll train on the CPU. If you wanted to train on a GPU, you'd change cuda_device to its device id. If you have just one GPU, that should be 0.
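If you were driving the training loop yourself, these settings would map onto the Trainer roughly as in the sketch below, where model, iterator, and the two datasets are assumed to exist already:

import torch
from allennlp.training.trainer import Trainer

trainer = Trainer(model=model,
                  optimizer=torch.optim.Adam(model.parameters()),
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  num_epochs=40,
                  patience=10,
                  cuda_device=-1)
metrics = trainer.train()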

That's our entire experiment configuration. If we want to change our optimizer, our batch size, our embedding dimensions, or any other hyperparameters, all we need to do is modify this config file and train another model.

The training configuration is always saved as part of the model archive, which means that you can always see how a saved model was trained.

Next Steps#

Continue on to our Creating a Model tutorial.