# allennlp.training.metrics

A Metric is some quantity or quantities that can be accumulated during training or evaluation; for example, accuracy or F1 score.

class allennlp.training.metrics.metric.Metric[source]

A very general abstract class representing a metric which can be accumulated.

get_metric(reset: bool) → typing.Union[float, typing.Tuple[float, ...], typing.Dict[str, float]][source]

Compute and return the metric. Optionally also call self.reset().

reset() → None[source]

Reset any accumulators or internal state.

static unwrap_to_tensors(*tensors: torch.Tensor)[source]

If you pass gradient-tracking Tensors to a Metric, you will leak a large amount of memory, because the references held by the Metric prevent the computation graph from being garbage collected. This method ensures that you are working with the tensors directly, detached from the graph, and that they are on the CPU.
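The accumulate, get_metric, reset contract described above can be illustrated with a minimal pure-Python sketch. The names SketchMetric and RunningAverage are hypothetical and not part of AllenNLP; the real classes operate on torch tensors, which this sketch deliberately avoids.

```python
from typing import Dict, Tuple, Union

MetricValue = Union[float, Tuple[float, ...], Dict[str, float]]


class SketchMetric:
    """Hypothetical stand-in for the abstract interface described above."""

    def __call__(self, *args) -> None:
        raise NotImplementedError

    def get_metric(self, reset: bool) -> MetricValue:
        raise NotImplementedError

    def reset(self) -> None:
        raise NotImplementedError


class RunningAverage(SketchMetric):
    """Accumulates externally computed values and reports their mean."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def __call__(self, value: float) -> None:
        self._total += float(value)
        self._count += 1

    def get_metric(self, reset: bool = False) -> float:
        average = self._total / self._count if self._count > 0 else 0.0
        if reset:
            self.reset()  # clear the accumulators after reading them
        return average

    def reset(self) -> None:
        self._total = 0.0
        self._count = 0


metric = RunningAverage()
for batch_value in (0.5, 1.5, 1.0):
    metric(batch_value)
epoch_value = metric.get_metric(reset=True)  # mean over the three batches
after_reset = metric.get_metric()            # accumulators are empty again
```

Calling get_metric(reset=True) at the end of an epoch is the usual way to read a value and start the next epoch from a clean state.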

class allennlp.training.metrics.attachment_scores.AttachmentScores(ignore_classes: typing.List[int] = None) → None[source]

Computes labeled and unlabeled attachment scores for a dependency parse, as well as sentence level exact match for both labeled and unlabeled trees. Note that the input to this metric is the sampled predictions, not the distribution itself.
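As a rough illustration of the four quantities named above, the following sketch computes them over plain Python lists. The function name, the result keys (UAS, LAS, UEM, LEM), and the choice to match ignore_classes against the gold label are assumptions of this sketch; masking of padded positions is omitted.

```python
from typing import Dict, List, Optional


def attachment_scores(
    predicted_heads: List[List[int]],
    predicted_labels: List[List[int]],
    gold_heads: List[List[int]],
    gold_labels: List[List[int]],
    ignore_classes: Optional[List[int]] = None,
) -> Dict[str, float]:
    """Hypothetical helper: word-level and sentence-level attachment scores."""
    ignored = set(ignore_classes or [])
    words = unlabeled = labeled = 0
    sentences = unlabeled_exact = labeled_exact = 0
    for p_heads, p_labels, g_heads, g_labels in zip(
        predicted_heads, predicted_labels, gold_heads, gold_labels
    ):
        sentence_unlabeled_ok = True
        sentence_labeled_ok = True
        for p_head, p_label, g_head, g_label in zip(p_heads, p_labels, g_heads, g_labels):
            if g_label in ignored:
                continue  # assumption: ignored classes are matched on the gold label
            head_ok = p_head == g_head
            both_ok = head_ok and p_label == g_label
            words += 1
            unlabeled += int(head_ok)
            labeled += int(both_ok)
            sentence_unlabeled_ok = sentence_unlabeled_ok and head_ok
            sentence_labeled_ok = sentence_labeled_ok and both_ok
        sentences += 1
        unlabeled_exact += int(sentence_unlabeled_ok)
        labeled_exact += int(sentence_labeled_ok)
    return {
        "UAS": unlabeled / words if words else 0.0,
        "LAS": labeled / words if words else 0.0,
        "UEM": unlabeled_exact / sentences if sentences else 0.0,
        "LEM": labeled_exact / sentences if sentences else 0.0,
    }


# Two sentences: every head is right, one label in the second sentence is wrong.
scores = attachment_scores(
    predicted_heads=[[2, 0, 2], [0, 1]],
    predicted_labels=[[1, 0, 2], [0, 4]],
    gold_heads=[[2, 0, 2], [0, 1]],
    gold_labels=[[1, 0, 2], [0, 3]],
)
```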

Parameters: ignore_classes : List[int], optional (default = None) A list of label ids to ignore when computing metrics.
get_metric(reset: bool = False)[source]
Returns: The accumulated metrics as a dictionary.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.average.Average → None[source]

This Metric breaks with the typical Metric API and just stores values that were computed in some fashion outside of a Metric. If you have some external code that computes the metric for you, for instance, you can use this to report the average result using our Metric API.

get_metric(reset: bool = False)[source]
Returns: The average of all values that were passed to __call__.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.boolean_accuracy.BooleanAccuracy → None[source]

Just checks batch-equality of two tensors and computes an accuracy metric based on that. That is, if your prediction has shape (batch_size, dim_1, ..., dim_n), this metric considers that as a set of batch_size predictions and checks that each is entirely correct across the remaining dims. This means the denominator in the accuracy computation is batch_size, with the caveat that predictions that are totally masked are ignored (in which case the denominator is the number of predictions that have at least one unmasked element).

This is similar to CategoricalAccuracy, if you’ve already done a .max() on your predictions. If you have categorical output, though, you should typically just use CategoricalAccuracy. The reason you might want to use this instead is if you’ve done some kind of constrained inference and don’t have a prediction tensor that matches the API of CategoricalAccuracy, which assumes a final dimension of size num_classes.
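The batch-equality rule and the masked denominator can be sketched in plain Python. The function boolean_accuracy is a hypothetical helper, and it assumes that the non-batch dimensions have already been flattened into one row per prediction.

```python
from typing import List, Optional


def boolean_accuracy(
    predictions: List[List[int]],
    gold: List[List[int]],
    mask: Optional[List[List[int]]] = None,
) -> float:
    """Fraction of rows whose unmasked elements all match the gold row."""
    correct = 0
    counted = 0
    for index, (predicted_row, gold_row) in enumerate(zip(predictions, gold)):
        mask_row = mask[index] if mask is not None else [1] * len(predicted_row)
        if not any(mask_row):
            continue  # totally masked predictions do not enter the denominator
        counted += 1
        if all(p == g for p, g, m in zip(predicted_row, gold_row, mask_row) if m):
            correct += 1
    return correct / counted if counted else 0.0


predictions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
gold = [[1, 2, 3], [4, 0, 6], [0, 0, 0]]
mask = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]

masked_accuracy = boolean_accuracy(predictions, gold, mask)  # denominator is 2
unmasked_accuracy = boolean_accuracy(predictions, gold)      # denominator is 3
```

In the masked call, the second row counts as correct because its only mismatch is masked out, and the third row is skipped entirely.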

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.bleu.BLEU(ngram_weights: typing.Iterable[float] = (0.25, 0.25, 0.25, 0.25), exclude_indices: typing.Set[int] = None) → None[source]

Bilingual Evaluation Understudy (BLEU).

BLEU is a common metric used for evaluating the quality of machine translations against a set of reference translations. See Papineni et al., “BLEU: a method for automatic evaluation of machine translation”, 2002.

Parameters:

• ngram_weights : Iterable[float], optional (default = (0.25, 0.25, 0.25, 0.25)). Weights to assign to scores for each ngram size.
• exclude_indices : Set[int], optional (default = None). Indices to exclude when calculating ngrams. This should usually include the indices of the start, end, and pad tokens.

Notes

We chose to implement this from scratch instead of wrapping an existing implementation (such as nltk.translate.bleu_score) for two reasons. First, we wanted to pass tensors directly to this metric instead of first converting the tensors to lists of strings. Second, functions like nltk.translate.bleu_score.corpus_bleu() are meant to be called once over the entire corpus, whereas it is more efficient in our use case to update the running precision counts every batch.

This implementation only considers a reference set of size 1, i.e. a single gold target sequence for each predicted sequence.
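The idea of updating running precision counts every batch, with one reference per prediction, might look like the sketch below. RunningBleu is a hypothetical class working on lists of token indices, not the library's implementation. Returning 0.0 when any n-gram size has no match, and the "BLEU" result key, are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List, Optional, Sequence, Set


class RunningBleu:
    """Hypothetical corpus-level BLEU accumulator (one reference per prediction)."""

    def __init__(
        self,
        ngram_weights: Iterable[float] = (0.25, 0.25, 0.25, 0.25),
        exclude_indices: Optional[Set[int]] = None,
    ) -> None:
        self._weights = tuple(ngram_weights)
        self._exclude = set(exclude_indices or ())
        self.reset()

    def reset(self) -> None:
        self._matches: Counter = Counter()  # clipped n-gram matches, keyed by n
        self._totals: Counter = Counter()   # predicted n-grams, keyed by n
        self._prediction_length = 0
        self._reference_length = 0

    def _ngrams(self, tokens: Sequence[int], n: int) -> Counter:
        counts: Counter = Counter()
        for start in range(len(tokens) - n + 1):
            ngram = tuple(tokens[start:start + n])
            if not any(token in self._exclude for token in ngram):
                counts[ngram] += 1
        return counts

    def __call__(self, predictions: List[List[int]], references: List[List[int]]) -> None:
        for predicted, reference in zip(predictions, references):
            for n in range(1, len(self._weights) + 1):
                predicted_counts = self._ngrams(predicted, n)
                reference_counts = self._ngrams(reference, n)
                # Counter intersection clips each count at the reference count.
                self._matches[n] += sum((predicted_counts & reference_counts).values())
                self._totals[n] += sum(predicted_counts.values())
            self._prediction_length += sum(t not in self._exclude for t in predicted)
            self._reference_length += sum(t not in self._exclude for t in reference)

    def get_metric(self, reset: bool = False) -> Dict[str, float]:
        bleu = 0.0
        used = [n for n, w in enumerate(self._weights, start=1) if w > 0]
        if self._prediction_length > 0 and all(self._matches[n] > 0 for n in used):
            log_precision = sum(
                self._weights[n - 1] * math.log(self._matches[n] / self._totals[n])
                for n in used
            )
            if self._prediction_length > self._reference_length:
                brevity_penalty = 1.0
            else:
                brevity_penalty = math.exp(
                    1.0 - self._reference_length / self._prediction_length
                )
            bleu = brevity_penalty * math.exp(log_precision)
        if reset:
            self.reset()
        return {"BLEU": bleu}


# Index 0 plays the role of a pad token and is excluded from all counts.
bleu = RunningBleu(ngram_weights=(0.5, 0.5), exclude_indices={0})
bleu(predictions=[[1, 2, 3, 4, 0]], references=[[1, 2, 3, 5, 0]])
bleu_value = bleu.get_metric()["BLEU"]
```

In the usage above, 3 of 4 unigrams and 2 of 3 bigrams match and the lengths are equal, so the score is the geometric mean of 3/4 and 2/3.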

get_metric(reset: bool = False) → typing.Dict[str, float][source]

Compute and return the metric. Optionally also call self.reset().

reset() → None[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.categorical_accuracy.CategoricalAccuracy(top_k: int = 1, tie_break: bool = False) → None[source]

Categorical Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class. When tie_break is enabled, the credit for an item is distributed equally among the classes that share the maximum predicted score.
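A small sketch of top-k accuracy with optional tie breaking, written over plain lists of scores. The helper top_k_accuracy is hypothetical, and restricting tie_break to top_k = 1 is an assumption made here to keep the credit-splitting rule unambiguous.

```python
from typing import List


def top_k_accuracy(
    scores: List[List[float]],
    gold: List[int],
    top_k: int = 1,
    tie_break: bool = False,
) -> float:
    """Hypothetical helper: top-k accuracy over rows of class scores."""
    if tie_break and top_k != 1:
        raise ValueError("this sketch only supports tie_break with top_k = 1")
    correct = 0.0
    for row, label in zip(scores, gold):
        if tie_break:
            best = max(row)
            tied = [index for index, score in enumerate(row) if score == best]
            if label in tied:
                correct += 1.0 / len(tied)  # share the credit among tied classes
        else:
            ranked = sorted(range(len(row)), key=lambda index: row[index], reverse=True)
            if label in ranked[:top_k]:
                correct += 1.0
    return correct / len(gold) if gold else 0.0


scores = [[0.1, 0.7, 0.2], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
gold = [1, 1, 0]

top_1 = top_k_accuracy(scores, gold)
top_2 = top_k_accuracy(scores, gold, top_k=2)
tie_broken = top_k_accuracy(scores, gold, tie_break=True)
```

In the second row, classes 0 and 1 tie for the maximum, so with tie breaking the gold class 1 earns half a point instead of depending on the order in which ties happen to be ranked.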

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.conll_coref_scores.ConllCorefScores → None[source]
static get_gold_clusters(gold_clusters)[source]
get_metric(reset: bool = False) → typing.Tuple[float, float, float][source]

Compute and return the metric. Optionally also call self.reset().

static get_predicted_clusters(top_spans: torch.Tensor, antecedent_indices: torch.Tensor, predicted_antecedents: torch.Tensor) → typing.Tuple[typing.List[typing.Tuple[typing.Tuple[int, int], ...]], typing.Dict[typing.Tuple[int, int], typing.Tuple[typing.Tuple[int, int], ...]]][source]
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.conll_coref_scores.Scorer(metric)[source]

Bases: object

Mostly borrowed from <https://github.com/clarkkev/deep-coref/blob/master/evaluation.py>

static b_cubed(clusters, mention_to_gold)[source]

Averaged per-mention precision and recall. <https://pdfs.semanticscholar.org/cfe3/c24695f1c14b78a5b8e95bcbd1c666140fd1.pdf>
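To make "averaged per-mention precision and recall" concrete, here is a simplified sketch. The helpers mention_map and b_cubed_precision are hypothetical, mentions are plain integers rather than spans, and no special handling of singleton clusters is attempted, so the numbers may differ from the borrowed implementation.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple

Cluster = Tuple[Hashable, ...]


def mention_map(clusters: Sequence[Cluster]) -> Dict[Hashable, Cluster]:
    """Map every mention to the cluster that contains it."""
    return {mention: cluster for cluster in clusters for mention in cluster}


def b_cubed_precision(
    clusters: Sequence[Cluster], mention_to_gold: Dict[Hashable, Cluster]
) -> float:
    """Average, over mentions, of the share of each mention's cluster that is correct."""
    numerator = 0.0
    denominator = 0
    for cluster in clusters:
        if not cluster:
            continue
        gold_counts = Counter(
            mention_to_gold[mention] for mention in cluster if mention in mention_to_gold
        )
        # `count` mentions of this cluster share a gold cluster; each of them
        # has precision count / len(cluster), so together they add count^2 / len.
        numerator += sum(count * count for count in gold_counts.values()) / len(cluster)
        denominator += len(cluster)
    return numerator / denominator if denominator else 0.0


gold = [(1, 2, 3), (4, 5)]
predicted = [(1, 2), (3, 4, 5)]

precision = b_cubed_precision(predicted, mention_map(gold))
# Recall is the same computation with the roles of gold and predicted swapped.
recall = b_cubed_precision(gold, mention_map(predicted))
```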

static ceafe(clusters, gold_clusters)[source]

Computes the Constrained Entity-Alignment F-Measure (CEAF) for evaluating coreference. Gold and predicted mentions are aligned into clusterings which maximise a metric, in this case the F measure between gold and predicted clusters.

get_f1()[source]
get_precision()[source]
get_prf()[source]
get_recall()[source]
static muc(clusters, mention_to_gold)[source]

Counts the mentions in each predicted cluster which need to be re-allocated in order for each predicted cluster to be contained by the respective gold cluster. <http://aclweb.org/anthology/M/M95/M95-1005.pdf>

static phi4(gold_clustering, predicted_clustering)[source]

Subroutine for ceafe. Computes the mention F measure between gold and predicted mentions in a cluster.

update(predicted, gold, mention_to_predicted, mention_to_gold)[source]
class allennlp.training.metrics.covariance.Covariance → None[source]

This Metric calculates the unbiased sample covariance between two tensors. Each element in the two tensors is assumed to be a different observation of the variable (i.e., the input tensors are implicitly flattened into vectors and the covariance is calculated between the vectors).

This implementation is mostly modeled after the streaming_covariance function in TensorFlow. See: https://github.com/tensorflow/tensorflow/blob/v1.10.1/tensorflow/contrib/metrics/python/ops/metric_ops.py#L3127

The following is copied from the TensorFlow documentation:

The algorithm used for this online computation is described in https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online . Specifically, the formula used to combine two sample comoments is

C_AB = C_A + C_B + (E[x_A] - E[x_B]) * (E[y_A] - E[y_B]) * n_A * n_B / n_AB

The comoment for a single batch of data is simply sum((x - E[x]) * (y - E[y])), optionally masked.
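The comoment-merging formula can be exercised with a short pure-Python sketch. RunningCovariance is a hypothetical class operating on lists of floats; masking is left out.

```python
from typing import Sequence


class RunningCovariance:
    """Hypothetical streaming estimator of the unbiased sample covariance."""

    def __init__(self) -> None:
        self.reset()

    def reset(self) -> None:
        self._count = 0
        self._mean_x = 0.0
        self._mean_y = 0.0
        self._comoment = 0.0

    def __call__(self, xs: Sequence[float], ys: Sequence[float]) -> None:
        n_b = len(xs)
        if n_b == 0:
            return
        mean_x_b = sum(xs) / n_b
        mean_y_b = sum(ys) / n_b
        # Comoment of the new batch on its own.
        comoment_b = sum((x - mean_x_b) * (y - mean_y_b) for x, y in zip(xs, ys))

        n_a = self._count
        n_ab = n_a + n_b
        delta_x = mean_x_b - self._mean_x
        delta_y = mean_y_b - self._mean_y
        # C_AB = C_A + C_B + (E[x_A] - E[x_B]) * (E[y_A] - E[y_B]) * n_A * n_B / n_AB
        self._comoment += comoment_b + delta_x * delta_y * n_a * n_b / n_ab
        self._mean_x += delta_x * n_b / n_ab
        self._mean_y += delta_y * n_b / n_ab
        self._count = n_ab

    def get_metric(self, reset: bool = False) -> float:
        covariance = self._comoment / (self._count - 1) if self._count > 1 else 0.0
        if reset:
            self.reset()
        return covariance


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

metric = RunningCovariance()
for start, end in ((0, 2), (2, 5), (5, 6)):  # three unevenly sized batches
    metric(xs[start:end], ys[start:end])
streamed_covariance = metric.get_metric()
```

Feeding the six observations in three batches gives the same result as computing the covariance over all of them at once, which is the point of the merge formula.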

get_metric(reset: bool = False)[source]
Returns: The accumulated covariance.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.entropy.Entropy → None[source]
get_metric(reset: bool = False)[source]
Returns: The scalar average entropy.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.evalb_bracketing_scorer.EvalbBracketingScorer(evalb_directory_path: str = '/local/deploy/agent5/work/8feb324ce7c68d53/allennlp/tools/EVALB', evalb_param_filename: str = 'COLLINS.prm') → None[source]

This class uses the external EVALB software for computing a broad range of metrics on parse trees. Here, we use it to compute the Precision, Recall and F1 metrics. You can download the source for EVALB from here: <http://nlp.cs.nyu.edu/evalb/>.

Note that this software is 20 years old. To compile it on a modern system, you may need to remove an include <malloc.h> statement from evalb.c.

AllenNLP contains the EVALB software, but you will need to compile it yourself before using it because the binary it generates is system dependent. To build it, run make inside the allennlp/tools/EVALB directory.

Note that this metric reads and writes from disk quite a bit. You probably don’t want to include it in your training loop; instead, you should calculate this on a validation set only.

Parameters:

• evalb_directory_path : str, required. The directory containing the EVALB executable.
• evalb_param_filename : str, optional (default = “COLLINS.prm”). The relative name of the EVALB configuration file used when scoring the trees. By default, this uses the COLLINS.prm configuration file which comes with EVALB. This configuration ignores POS tags and some punctuation labels.
static clean_evalb(evalb_directory_path: str = '/local/deploy/agent5/work/8feb324ce7c68d53/allennlp/tools/EVALB')[source]
static compile_evalb(evalb_directory_path: str = '/local/deploy/agent5/work/8feb324ce7c68d53/allennlp/tools/EVALB')[source]
get_metric(reset: bool = False)[source]
Returns: The average precision, recall and f1.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.f1_measure.F1Measure(positive_label: int) → None[source]

Computes Precision, Recall and F1 with respect to a given positive_label. For example, for a BIO tagging scheme, you would pass the classification index of the tag you are interested in, resulting in the Precision, Recall and F1 score being calculated for this tag only.
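The per-label counting behind these three numbers can be sketched as follows. The function precision_recall_f1 is a hypothetical helper over lists of predicted and gold class indices; returning 0.0 for empty denominators is a choice made for this sketch.

```python
from typing import List, Tuple


def precision_recall_f1(
    predicted_labels: List[int], gold_labels: List[int], positive_label: int
) -> Tuple[float, float, float]:
    """Precision, recall and F1 for a single positive class index."""
    true_positives = false_positives = false_negatives = 0
    for predicted, gold in zip(predicted_labels, gold_labels):
        if predicted == positive_label and gold == positive_label:
            true_positives += 1
        elif predicted == positive_label:
            false_positives += 1
        elif gold == positive_label:
            false_negatives += 1
    predicted_total = true_positives + false_positives
    gold_total = true_positives + false_negatives
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / gold_total if gold_total else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Class index 1 is the tag of interest; all other indices count as negatives.
precision, recall, f1 = precision_recall_f1(
    predicted_labels=[1, 1, 0, 2, 1, 0],
    gold_labels=[1, 0, 0, 1, 1, 2],
    positive_label=1,
)
```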

get_metric(reset: bool = False)[source]
Returns: A tuple of the following metrics based on the accumulated count statistics:

• precision : float
• recall : float
• f1-measure : float
reset()[source]
class allennlp.training.metrics.mean_absolute_error.MeanAbsoluteError → None[source]

This Metric calculates the mean absolute error (MAE) between two tensors.

get_metric(reset: bool = False)[source]
Returns: The accumulated mean absolute error.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.mention_recall.MentionRecall → None[source]
get_metric(reset: bool = False) → float[source]

Compute and return the metric. Optionally also call self.reset().

reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.pearson_correlation.PearsonCorrelation → None[source]

This Metric calculates the sample Pearson correlation coefficient (r) between two tensors. Each element in the two tensors is assumed to be a different observation of the variable (i.e., the input tensors are implicitly flattened into vectors and the correlation is calculated between the vectors).

This implementation is mostly modeled after the streaming_pearson_correlation function in TensorFlow. See https://github.com/tensorflow/tensorflow/blob/v1.10.1/tensorflow/contrib/metrics/python/ops/metric_ops.py#L3267

This metric delegates to the Covariance metric the tracking of three [co]variances:

• covariance(predictions, labels), i.e. covariance
• covariance(predictions, predictions), i.e. variance of predictions
• covariance(labels, labels), i.e. variance of labels

If we have these values, the sample Pearson correlation coefficient is simply:

r = covariance / (sqrt(predictions_variance) * sqrt(labels_variance))
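A short sketch of how the three [co]variances combine into r: the covariance is divided by the product of the two standard deviations. The helpers sample_covariance and pearson_r are hypothetical and compute everything in one pass over plain lists rather than by streaming.

```python
import math
from typing import Sequence


def sample_covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Unbiased sample covariance of two equally long sequences."""
    n = len(xs)
    if n < 2:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r(predictions: Sequence[float], labels: Sequence[float]) -> float:
    covariance = sample_covariance(predictions, labels)
    predictions_variance = sample_covariance(predictions, predictions)
    labels_variance = sample_covariance(labels, labels)
    denominator = math.sqrt(predictions_variance) * math.sqrt(labels_variance)
    # A constant input has zero variance; this sketch reports 0.0 in that case.
    return covariance / denominator if denominator > 0 else 0.0


predictions = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
r = pearson_r(predictions, labels)
```

Because the covariance is normalized by both standard deviations, r always lies between -1 and 1, which is a quick sanity check on any implementation.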

get_metric(reset: bool = False)[source]
Returns: The accumulated sample Pearson correlation.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.sequence_accuracy.SequenceAccuracy → None[source]

Sequence Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class.

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.span_based_f1_measure.SpanBasedF1Measure(vocabulary: allennlp.data.vocabulary.Vocabulary, tag_namespace: str = 'tags', ignore_classes: typing.List[str] = None, label_encoding: typing.Union[str, NoneType] = 'BIO', tags_to_spans_function: typing.Union[typing.Callable[[typing.List[str], typing.Union[typing.List[str], NoneType]], typing.List[typing.Tuple[str, typing.Tuple[int, int]]]], NoneType] = None) → None[source]

The CoNLL SRL metrics are based on exact span matching. This metric implements span-based precision and recall metrics for a BIO tagging scheme. It produces precision, recall and F1 measures per tag, as well as overall statistics. Note that the implementation of this metric is not exactly the same as the Perl script used to evaluate the CoNLL 2005 data. In particular, it does not consider continuations or reference spans as constituents of the original span. However, it is a close proxy, which can be helpful for judging model performance during training. This metric works properly when the spans are unlabeled (i.e., your labels are simply “B”, “I”, “O” if using the “BIO” label encoding).
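Exact span matching over BIO tags can be sketched for a single sentence as below. The helpers bio_spans, prf and span_f1 are hypothetical, the nested result layout is an assumption of this sketch, and an ill-formed I- tag is treated here as opening a new span.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

Span = Tuple[str, Tuple[int, int]]  # (label, (start, end)) with inclusive end


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect labeled spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start: Optional[int] = None
    label: Optional[str] = None
    for index, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")  # "B" alone yields an empty label
        continues = prefix == "I" and label is not None and tag_label == label
        if continues:
            continue
        if label is not None:
            spans.add((label, (start, index - 1)))  # close the open span
        if prefix in ("B", "I"):
            start, label = index, tag_label
        else:
            start, label = None, None
    if label is not None:
        spans.add((label, (start, len(tags) - 1)))
    return spans


def prf(tp: int, fp: int, fn: int) -> Dict[str, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1-measure": f1}


def span_f1(predicted_tags: List[str], gold_tags: List[str]) -> Dict[str, Dict[str, float]]:
    predicted, gold = bio_spans(predicted_tags), bio_spans(gold_tags)
    tp: Counter = Counter(label for label, _ in predicted & gold)
    fp: Counter = Counter(label for label, _ in predicted - gold)
    fn: Counter = Counter(label for label, _ in gold - predicted)
    result = {label: prf(tp[label], fp[label], fn[label]) for label in set(tp) | set(fp) | set(fn)}
    result["overall"] = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return result


gold_tags = ["B-ARG0", "I-ARG0", "O", "B-V", "B-ARG1", "I-ARG1"]
predicted_tags = ["B-ARG0", "I-ARG0", "O", "B-V", "B-ARG1", "O"]
result = span_f1(predicted_tags, gold_tags)
```

The predicted ARG1 span is one token too short, so it counts as both a false positive and a false negative: a span is only correct when label, start and end all match.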

get_metric(reset: bool = False)[source]
Returns: A Dict per label containing the following span-based metrics:

• precision : float
• recall : float
• f1-measure : float

Additionally, an overall key is included, which provides the precision, recall and f1-measure for all spans.
reset()[source]

class allennlp.training.metrics.squad_em_and_f1.SquadEmAndF1 → None[source]

This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 score using the official SQuAD evaluation script.
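The scoring idea can be sketched in the spirit of the SQuAD v1.1 evaluation script: answers are normalized, then compared exactly and by token overlap, taking the best score over all labeled answers. The function names below are hypothetical, and the normalization steps are written from memory of that script rather than copied from it.

```python
import re
import string
from collections import Counter
from typing import List, Tuple

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, answer: str) -> float:
    return float(normalize_answer(prediction) == normalize_answer(answer))


def token_f1(prediction: str, answer: str) -> float:
    predicted_tokens = normalize_answer(prediction).split()
    answer_tokens = normalize_answer(answer).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears on both sides.
    shared = sum((Counter(predicted_tokens) & Counter(answer_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(predicted_tokens)
    recall = shared / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)


def em_and_f1(best_span_string: str, answer_strings: List[str]) -> Tuple[float, float]:
    """Best exact match and best F1 over all labeled answers, in that order."""
    em = max(exact_match(best_span_string, answer) for answer in answer_strings)
    f1 = max(token_f1(best_span_string, answer) for answer in answer_strings)
    return em, f1


em, f1 = em_and_f1("The Eiffel Tower!", ["Eiffel Tower", "the tower"])
partial_f1 = token_f1("the cat sat", "cat sat down")
```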

get_metric(reset: bool = False) → typing.Tuple[float, float][source]
Returns: Average exact match and F1 score (in that order) as computed by the official SQuAD script over all inputs.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.unigram_recall.UnigramRecall → None[source]

Unigram top-K recall. This does not take word order into account. Assumes integer labels, with each item to be classified having a single correct class.

get_metric(reset: bool = False)[source]
Returns: The accumulated recall.
reset()[source]

Reset any accumulators or internal state.