# allennlp.training.metrics

A Metric is some quantity or quantities that can be accumulated during training or evaluation; for example, accuracy or F1 score.

class allennlp.training.metrics.metric.Metric[source]

A very general abstract class representing a metric which can be accumulated.

get_metric(self, reset:bool) → Union[float, Tuple[float, ...], Dict[str, float], Dict[str, List[float]]][source]

Compute and return the metric. Optionally also call self.reset().

reset(self) → None[source]

Reset any accumulators or internal state.

static unwrap_to_tensors(*tensors:torch.Tensor)[source]

If you actually passed gradient-tracking Tensors to a Metric, there will be a huge memory leak, because it will prevent garbage collection for the computation graph. This method ensures that you’re using tensors directly and that they are on the CPU.
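The accumulate, report and reset contract of Metric can be illustrated without the library. The following is a minimal sketch in plain Python; `RunningMean` is a hypothetical stand-in that mirrors the method names above and does not subclass the real Metric.

```python
class RunningMean:
    """Hypothetical stand-in that mirrors the Metric contract; not part of AllenNLP."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def __call__(self, value: float) -> None:
        # Accumulate one observation (for example, one batch-level value).
        self._total += float(value)
        self._count += 1

    def get_metric(self, reset: bool = False) -> float:
        # Compute the metric from the accumulators, then optionally reset them.
        mean = self._total / self._count if self._count else 0.0
        if reset:
            self.reset()
        return mean

    def reset(self) -> None:
        self._total = 0.0
        self._count = 0


metric = RunningMean()
for batch_value in (1.0, 2.0, 3.0):
    metric(batch_value)
epoch_value = metric.get_metric(reset=True)  # mean of the three values
after_reset = metric.get_metric()            # accumulators are empty again
```

Calling get_metric(reset=True) at the end of an epoch is the usual way to make sure the next epoch starts from empty accumulators.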

class allennlp.training.metrics.attachment_scores.AttachmentScores(ignore_classes: List[int] = None)[source]

Computes labeled and unlabeled attachment scores for a dependency parse, as well as sentence level exact match for both labeled and unlabeled trees. Note that the input to this metric is the sampled predictions, not the distribution itself.

Parameters
ignore_classes: List[int], optional (default = None)

A list of label ids to ignore when computing metrics.
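As a rough illustration of the two scores, the sketch below computes unlabeled (UAS) and labeled (LAS) attachment scores for a single sentence in plain Python. The function name and the list-based inputs are hypothetical, and skipping tokens whose gold label is in ignore_classes is an assumption about how the filter is applied.

```python
from typing import List, Optional, Tuple


def attachment_scores_sketch(
    predicted_heads: List[int],
    predicted_labels: List[int],
    gold_heads: List[int],
    gold_labels: List[int],
    ignore_classes: Optional[List[int]] = None,
) -> Tuple[float, float]:
    """Return (UAS, LAS) for one sentence; an illustrative sketch only."""
    ignored = set(ignore_classes or [])
    total = unlabeled = labeled = 0
    for p_head, p_label, g_head, g_label in zip(
        predicted_heads, predicted_labels, gold_heads, gold_labels
    ):
        if g_label in ignored:
            continue  # assumption: tokens are filtered by their gold label
        total += 1
        if p_head == g_head:
            unlabeled += 1  # head is correct
            if p_label == g_label:
                labeled += 1  # head and label are both correct
    if total == 0:
        return 0.0, 0.0
    return unlabeled / total, labeled / total


# The last token has gold label 9, which is ignored.
uas, las = attachment_scores_sketch(
    [2, 0, 2, 3], [1, 0, 2, 5], [2, 0, 2, 2], [1, 0, 3, 9], ignore_classes=[9]
)
```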

get_metric(self, reset:bool=False)[source]
Returns
The accumulated metrics as a dictionary.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.auc.Auc(positive_label=1)[source]

The AUC Metric measures the area under the receiver-operating characteristic (ROC) curve for binary classification problems.
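One way to read the area under the ROC curve is as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The sketch below computes that quantity directly in plain Python; `roc_auc_sketch` is a hypothetical name, and this pairwise formulation is used for clarity, not because the library is known to compute it this way.

```python
from typing import Sequence


def roc_auc_sketch(
    scores: Sequence[float], labels: Sequence[int], positive_label: int = 1
) -> float:
    """Pairwise (rank-based) AUC for binary labels; quadratic time, for illustration."""
    positives = [s for s, y in zip(scores, labels) if y == positive_label]
    negatives = [s for s, y in zip(scores, labels) if y != positive_label]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


auc = roc_auc_sketch([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```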

get_metric(self, reset:bool=False)[source]

Compute and return the metric. Optionally also call self.reset().

reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.average.Average[source]

This Metric breaks with the typical Metric API and just stores values that were computed in some fashion outside of a Metric. If you have some external code that computes the metric for you, for instance, you can use this to report the average result using our Metric API.

get_metric(self, reset:bool=False)[source]
Returns
The average of all values that were passed to __call__.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.boolean_accuracy.BooleanAccuracy[source]

Just checks batch-equality of two tensors and computes an accuracy metric based on that. That is, if your prediction has shape (batch_size, dim_1, …, dim_n), this metric considers that as a set of batch_size predictions and checks that each is entirely correct across the remaining dims. This means the denominator in the accuracy computation is batch_size, with the caveat that predictions that are totally masked are ignored (in which case the denominator is the number of predictions that have at least one unmasked element).

This is similar to CategoricalAccuracy, if you’ve already done a .max() on your predictions. If you have categorical output, though, you should typically just use CategoricalAccuracy. The reason you might want to use this instead is if you’ve done some kind of constrained inference and don’t have a prediction tensor that matches the API of CategoricalAccuracy, which assumes a final dimension of size num_classes.
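The counting rule described above can be sketched in plain Python. Here each prediction is a flat list (the remaining dimensions already flattened), `boolean_accuracy_sketch` is a hypothetical name, and treating masked positions as always matching is an assumption.

```python
from typing import List, Optional


def boolean_accuracy_sketch(
    predictions: List[List[int]],
    gold: List[List[int]],
    mask: Optional[List[List[int]]] = None,
) -> float:
    correct = 0
    counted = 0
    for i, (pred_row, gold_row) in enumerate(zip(predictions, gold)):
        mask_row = mask[i] if mask is not None else [1] * len(pred_row)
        if not any(mask_row):
            continue  # totally masked predictions do not enter the denominator
        counted += 1
        # A prediction is correct only if every unmasked element matches.
        if all(p == g for p, g, m in zip(pred_row, gold_row, mask_row) if m):
            correct += 1
    return correct / counted if counted else 0.0


accuracy = boolean_accuracy_sketch(
    predictions=[[1, 2], [3, 4], [5, 6], [7, 8]],
    gold=[[1, 2], [3, 0], [9, 9], [7, 0]],
    mask=[[1, 1], [1, 0], [0, 0], [1, 1]],
)
```

In this example the third prediction is fully masked and therefore skipped, so the denominator is 3 and not 4.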

get_metric(self, reset:bool=False)[source]
Returns
The accumulated accuracy.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.bleu.BLEU(ngram_weights: Iterable[float] = (0.25, 0.25, 0.25, 0.25), exclude_indices: Set[int] = None)[source]

Bilingual Evaluation Understudy (BLEU).

BLEU is a common metric used for evaluating the quality of machine translations against a set of reference translations. See Papineni et al., “BLEU: a method for automatic evaluation of machine translation”, 2002.

Parameters
ngram_weights: Iterable[float], optional (default = (0.25, 0.25, 0.25, 0.25))

Weights to assign to scores for each ngram size.

exclude_indices: Set[int], optional (default = None)

Indices to exclude when calculating ngrams. This should usually include the indices of the start, end, and pad tokens.

Notes

We chose to implement this from scratch instead of wrapping an existing implementation (such as nltk.translate.bleu_score) for two reasons. First, so that we could pass tensors directly to this metric instead of first converting the tensors to lists of strings. Second, because functions like nltk.translate.bleu_score.corpus_bleu() are meant to be called once over the entire corpus, whereas it is more efficient in our use case to update the running precision counts every batch.

This implementation only considers a reference set of size 1, i.e. a single gold target sequence for each predicted sequence.
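To make the idea of running precision counts concrete, here is a plain-Python sketch of single-reference BLEU over integer token ids, with clipped n-gram matches accumulated batch by batch and a brevity penalty applied at the end. All names are hypothetical, the excluded indices in the example are invented, and the handling of edge cases (such as returning 0 when any n-gram order has no matches) is an assumption, not a description of the library's behavior.

```python
import math
from collections import Counter
from typing import FrozenSet, Iterable, List, Sequence, Tuple


def ngram_counts(tokens: Sequence[int], n: int, exclude: FrozenSet[int]) -> Counter:
    """Count n-grams of size n, skipping any n-gram that contains an excluded index."""
    counts: Counter = Counter()
    for start in range(len(tokens) - n + 1):
        ngram = tuple(tokens[start:start + n])
        if not any(token in exclude for token in ngram):
            counts[ngram] += 1
    return counts


def corpus_bleu_sketch(
    batches: Iterable[List[Tuple[Sequence[int], Sequence[int]]]],
    ngram_weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),
    exclude_indices: FrozenSet[int] = frozenset(),
) -> float:
    matches: Counter = Counter()
    totals: Counter = Counter()
    predicted_length = reference_length = 0
    orders = range(1, len(ngram_weights) + 1)
    for batch in batches:  # counts are updated once per batch
        for predicted, reference in batch:
            predicted_length += sum(1 for t in predicted if t not in exclude_indices)
            reference_length += sum(1 for t in reference if t not in exclude_indices)
            for n in orders:
                predicted_counts = ngram_counts(predicted, n, exclude_indices)
                reference_counts = ngram_counts(reference, n, exclude_indices)
                # Clipping: credit an n-gram at most as often as the reference contains it.
                matches[n] += sum(
                    min(count, reference_counts[gram])
                    for gram, count in predicted_counts.items()
                )
                totals[n] += sum(predicted_counts.values())
    if predicted_length == 0 or any(matches[n] == 0 for n in orders):
        return 0.0  # assumption: no smoothing in this sketch
    log_precision = sum(
        weight * math.log(matches[n] / totals[n])
        for n, weight in enumerate(ngram_weights, start=1)
    )
    if predicted_length > reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - reference_length / predicted_length)
    return brevity_penalty * math.exp(log_precision)


exclude = frozenset({0, 1, 2})  # hypothetical pad, start and end indices
batch_one = [([1, 5, 6, 7, 8, 2], [1, 5, 6, 7, 8, 2])]
batch_two = [([1, 9, 10, 11, 12, 2, 0], [1, 9, 10, 11, 12, 2, 0])]
perfect = corpus_bleu_sketch([batch_one, batch_two], exclude_indices=exclude)
disjoint = corpus_bleu_sketch(
    [[([1, 5, 6, 7, 8, 2], [1, 9, 10, 11, 12, 2])]], exclude_indices=exclude
)
```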

get_metric(self, reset:bool=False) → Dict[str, float][source]

Compute and return the metric. Optionally also call self.reset().

reset(self) → None[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.categorical_accuracy.CategoricalAccuracy(top_k: int = 1, tie_break: bool = False)[source]

Categorical Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class. With tie_break enabled, credit is distributed equally among the classes that share the same maximum predicted score.
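The sketch below illustrates both behaviors in plain Python on lists of per-class scores. The function names are hypothetical, and giving a tied instance a credit of 1 / (number of tied classes) is an assumption about how "equal distribution" is realized.

```python
from typing import List


def top_k_accuracy_sketch(scores: List[List[float]], gold: List[int], top_k: int = 1) -> float:
    correct = 0
    for row, label in zip(scores, gold):
        # Class indices sorted by descending score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:top_k]:
            correct += 1
    return correct / len(gold)


def tie_break_accuracy_sketch(scores: List[List[float]], gold: List[int]) -> float:
    credit = 0.0
    for row, label in zip(scores, gold):
        best = max(row)
        tied = sum(1 for s in row if s == best)
        if row[label] == best:
            credit += 1.0 / tied  # share the credit among all tied classes
    return credit / len(gold)


scores = [[0.1, 0.7, 0.2], [0.5, 0.5, 0.0], [0.3, 0.3, 0.4]]
gold = [1, 0, 0]
top2 = top_k_accuracy_sketch(scores, gold, top_k=2)
tie_broken = tie_break_accuracy_sketch(scores, gold)
```

In the second row two classes tie for the maximum, so the gold class earns half a point under tie breaking.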

get_metric(self, reset:bool=False)[source]
Returns
The accumulated accuracy.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.conll_coref_scores.ConllCorefScores[source]
static get_gold_clusters(gold_clusters)[source]
get_metric(self, reset:bool=False) → Tuple[float, float, float][source]

Compute and return the metric. Optionally also call self.reset().

static get_predicted_clusters(top_spans:torch.Tensor, antecedent_indices:torch.Tensor, predicted_antecedents:torch.Tensor) → Tuple[List[Tuple[Tuple[int, int], ...]], Dict[Tuple[int, int], Tuple[Tuple[int, int], ...]]][source]
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.conll_coref_scores.Scorer(metric)[source]

Bases: object

Mostly borrowed from <https://github.com/clarkkev/deep-coref/blob/master/evaluation.py>

static b_cubed(clusters, mention_to_gold)[source]

Averaged per-mention precision and recall. <https://pdfs.semanticscholar.org/cfe3/c24695f1c14b78a5b8e95bcbd1c666140fd1.pdf>

static ceafe(clusters, gold_clusters)[source]

Computes the Constrained Entity-Alignment F-Measure (CEAF) for evaluating coreference. Gold and predicted mentions are aligned into clusterings which maximise a metric; in this case, the F measure between gold and predicted clusters.

get_f1(self)[source]
get_precision(self)[source]
get_prf(self)[source]
get_recall(self)[source]
static muc(clusters, mention_to_gold)[source]

Counts the mentions in each predicted cluster which need to be re-allocated in order for each predicted cluster to be contained by the respective gold cluster. <https://aclweb.org/anthology/M/M95/M95-1005.pdf>

static phi4(gold_clustering, predicted_clustering)[source]

Subroutine for ceafe. Computes the mention F measure between gold and predicted mentions in a cluster.
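A mention-level F measure between two clusters can be written as twice the number of shared mentions divided by the sum of the two cluster sizes. The sketch below assumes mentions are (start, end) tuples; `mention_f_measure` is a hypothetical name used for illustration.

```python
from typing import Tuple

Mention = Tuple[int, int]


def mention_f_measure(
    gold_cluster: Tuple[Mention, ...], predicted_cluster: Tuple[Mention, ...]
) -> float:
    # Harmonic mean of |overlap| / |predicted| and |overlap| / |gold|.
    overlap = len(set(gold_cluster) & set(predicted_cluster))
    return 2 * overlap / (len(gold_cluster) + len(predicted_cluster))


similarity = mention_f_measure(((0, 1), (3, 4), (7, 7)), ((0, 1), (3, 4)))
```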

update(self, predicted, gold, mention_to_predicted, mention_to_gold)[source]
class allennlp.training.metrics.covariance.Covariance[source]

This Metric calculates the unbiased sample covariance between two tensors. Each element in the two tensors is assumed to be a different observation of the variable (i.e., the input tensors are implicitly flattened into vectors and the covariance is calculated between the vectors).

This implementation is mostly modeled after the streaming_covariance function in Tensorflow. See: https://github.com/tensorflow/tensorflow/blob/v1.10.1/tensorflow/contrib/metrics/python/ops/metric_ops.py#L3127

The following is copied from the Tensorflow documentation:

The algorithm used for this online computation is described in <https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Online>. Specifically, the formula used to combine two sample comoments is

C_AB = C_A + C_B + (E[x_A] - E[x_B]) * (E[y_A] - E[y_B]) * n_A * n_B / n_AB

The comoment for a single batch of data is simply sum((x - E[x]) * (y - E[y])), optionally masked.
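The comoment-combining formula can be checked with a small plain-Python sketch. `StreamingCovariance` is a hypothetical class that follows the formula term by term; it omits masking and is not the library's implementation.

```python
from typing import Sequence


class StreamingCovariance:
    """Illustrative batch-wise accumulator for the unbiased sample covariance."""

    def __init__(self) -> None:
        self.count = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.comoment = 0.0

    def update(self, xs: Sequence[float], ys: Sequence[float]) -> None:
        n_b = len(xs)
        if n_b == 0:
            return
        mean_xb = sum(xs) / n_b
        mean_yb = sum(ys) / n_b
        # Comoment of the new batch: sum((x - E[x]) * (y - E[y])).
        c_b = sum((x - mean_xb) * (y - mean_yb) for x, y in zip(xs, ys))
        n_a = self.count
        n_ab = n_a + n_b
        # C_AB = C_A + C_B + (E[x_A] - E[x_B]) * (E[y_A] - E[y_B]) * n_A * n_B / n_AB
        self.comoment += c_b + (
            (self.mean_x - mean_xb) * (self.mean_y - mean_yb) * n_a * n_b / n_ab
        )
        self.mean_x = (n_a * self.mean_x + n_b * mean_xb) / n_ab
        self.mean_y = (n_a * self.mean_y + n_b * mean_yb) / n_ab
        self.count = n_ab

    def value(self) -> float:
        # Unbiased estimate: divide the comoment by n - 1.
        return self.comoment / (self.count - 1) if self.count > 1 else 0.0


streaming = StreamingCovariance()
streaming.update([1.0, 2.0], [2.0, 4.0])
streaming.update([3.0, 4.0, 5.0], [6.0, 8.0, 11.0])
streamed_covariance = streaming.value()

# Direct computation over all five observations, for comparison.
all_x = [1.0, 2.0, 3.0, 4.0, 5.0]
all_y = [2.0, 4.0, 6.0, 8.0, 11.0]
mean_x = sum(all_x) / len(all_x)
mean_y = sum(all_y) / len(all_y)
direct_covariance = sum(
    (x - mean_x) * (y - mean_y) for x, y in zip(all_x, all_y)
) / (len(all_x) - 1)
```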

get_metric(self, reset:bool=False)[source]
Returns
The accumulated covariance.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.drop_em_and_f1.DropEmAndF1[source]

This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 score using the official DROP evaluator (which has special handling for numbers and for questions with multiple answer spans, among other things).

get_metric(self, reset:bool=False) → Tuple[float, float][source]
Returns
Average exact match and F1 score (in that order) as computed by the official DROP script over all inputs.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.entropy.Entropy[source]
get_metric(self, reset:bool=False)[source]
Returns
The scalar average entropy.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.evalb_bracketing_scorer.EvalbBracketingScorer(evalb_directory_path: str = '/local/deploy/agent6/work/8feb324ce7c68d53/allennlp/tools/EVALB', evalb_param_filename: str = 'COLLINS.prm')[source]

This class uses the external EVALB software for computing a broad range of metrics on parse trees. Here, we use it to compute the Precision, Recall and F1 metrics. You can download the source for EVALB from here: <https://nlp.cs.nyu.edu/evalb/>.

Note that this software is 20 years old. To compile it on a modern system, you may need to remove the #include <malloc.h> statement from evalb.c.

AllenNLP contains the EVALB software, but you will need to compile it yourself before using it because the binary it generates is system dependent. To build it, run make inside the allennlp/tools/EVALB directory.

Note that this metric reads and writes from disk quite a bit. You probably don’t want to include it in your training loop; instead, you should calculate this on a validation set only.

Parameters
evalb_directory_path: str, optional (defaults to the EVALB directory shipped with AllenNLP)

The directory containing the EVALB executable.

evalb_param_filename: str, optional (default = “COLLINS.prm”)

The relative name of the EVALB configuration file used when scoring the trees. By default, this uses the COLLINS.prm configuration file which comes with EVALB. This configuration ignores POS tags and some punctuation labels.

static clean_evalb(evalb_directory_path:str='/local/deploy/agent6/work/8feb324ce7c68d53/allennlp/tools/EVALB')[source]
static compile_evalb(evalb_directory_path:str='/local/deploy/agent6/work/8feb324ce7c68d53/allennlp/tools/EVALB')[source]
get_metric(self, reset:bool=False)[source]
Returns
The average precision, recall and f1.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.fbeta_measure.FBetaMeasure(beta: float = 1.0, average: str = None, labels: List[int] = None)[source]

Compute precision, recall, F-measure and support for each class.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

If we have precision and recall, the F-beta score is simply: F-beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.

The support is the number of occurrences of each class in y_true.
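The definitions above translate directly into a few lines of plain Python. The helper name and the zero-division convention (returning 0.0 when a denominator is zero) are assumptions made for this sketch.

```python
from typing import Tuple


def precision_recall_fbeta(
    tp: int, fp: int, fn: int, beta: float = 1.0
) -> Tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = beta ** 2 * precision + recall
    # F-beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    fbeta = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, fbeta


# Precision (0.8) is higher than recall (0.5) for these counts.
precision, recall, f1 = precision_recall_fbeta(tp=8, fp=2, fn=8, beta=1.0)
_, _, f2 = precision_recall_fbeta(tp=8, fp=2, fn=8, beta=2.0)
```

Because recall is the weaker of the two here, raising beta from 1 to 2 lowers the score, which reflects the heavier weight placed on recall.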

Parameters
beta: float, optional (default = 1.0)

The strength of recall versus precision in the F-score.

average: string, [None (default), ‘micro’, ‘macro’]

If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

'micro':

Calculate metrics globally by counting the total true positives, false negatives and false positives.

'macro':

Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

labels: list, optional

The set of labels to include and their order if average is None. Labels present in the data can be excluded, for example to calculate a multi-class average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average.
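The difference between the two averaging modes is easiest to see on imbalanced data. The sketch below uses invented per-label counts for a two-class problem and plain Python; the names are hypothetical.

```python
from typing import Dict, Tuple


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Invented counts: label -> (true positives, false positives, false negatives).
counts: Dict[int, Tuple[int, int, int]] = {0: (90, 10, 5), 1: (5, 5, 10)}

# 'macro': compute the score per label, then take the unweighted mean.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# 'micro': pool the counts over all labels, then compute the score once.
micro_f1 = f1_from_counts(
    sum(c[0] for c in counts.values()),
    sum(c[1] for c in counts.values()),
    sum(c[2] for c in counts.values()),
)
```

The rare label 1 has a low F1 (0.4), which pulls the macro average well below the micro average, since the latter is dominated by the frequent label.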

get_metric(self, reset:bool=False)[source]
Returns
A tuple of the following metrics based on the accumulated count statistics:
precisions: List[float]
recalls: List[float]
f1-measures: List[float]
If self.average is not None, you will get float instead of List[float].
reset(self) → None[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.f1_measure.F1Measure(positive_label: int)[source]

Computes Precision, Recall and F1 with respect to a given positive_label. For example, for a BIO tagging scheme, you would pass the classification index of the tag you are interested in, resulting in the Precision, Recall and F1 score being calculated for this tag only.

get_metric(self, reset:bool=False) → Tuple[float, float, float][source]
Returns
A tuple of the following metrics based on the accumulated count statistics:
precision: float
recall: float
f1-measure: float
class allennlp.training.metrics.mean_absolute_error.MeanAbsoluteError[source]

This Metric calculates the mean absolute error (MAE) between two tensors.

get_metric(self, reset:bool=False)[source]
Returns
The accumulated mean absolute error.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.mention_recall.MentionRecall[source]
get_metric(self, reset:bool=False) → float[source]

Compute and return the metric. Optionally also call self.reset().

reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.pearson_correlation.PearsonCorrelation[source]

This Metric calculates the sample Pearson correlation coefficient (r) between two tensors. Each element in the two tensors is assumed to be a different observation of the variable (i.e., the input tensors are implicitly flattened into vectors and the correlation is calculated between the vectors).

This implementation is mostly modeled after the streaming_pearson_correlation function in Tensorflow. See https://github.com/tensorflow/tensorflow/blob/v1.10.1/tensorflow/contrib/metrics/python/ops/metric_ops.py#L3267

This metric delegates to the Covariance metric the tracking of three [co]variances:

• covariance(predictions, labels), i.e. covariance

• covariance(predictions, predictions), i.e. variance of predictions

• covariance(labels, labels), i.e. variance of labels

If we have these values, the sample Pearson correlation coefficient is simply:

r = covariance / (sqrt(predictions_variance) * sqrt(labels_variance))

If predictions_variance or labels_variance is 0, r is 0.
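The relationship between the three [co]variances and r can be sketched in plain Python. The helpers below are hypothetical and compute each covariance in one pass over lists, not in a streaming fashion.

```python
import math
from typing import Sequence


def sample_covariance(xs: Sequence[float], ys: Sequence[float]) -> float:
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_r_sketch(predictions: Sequence[float], labels: Sequence[float]) -> float:
    covariance = sample_covariance(predictions, labels)
    predictions_variance = sample_covariance(predictions, predictions)
    labels_variance = sample_covariance(labels, labels)
    if predictions_variance == 0 or labels_variance == 0:
        return 0.0  # r is defined as 0 when either variance is 0
    return covariance / (math.sqrt(predictions_variance) * math.sqrt(labels_variance))


linear_r = pearson_r_sketch([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
constant_r = pearson_r_sketch([5.0, 5.0, 5.0, 5.0], [2.0, 4.0, 6.0, 8.0])
```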

get_metric(self, reset:bool=False)[source]
Returns
The accumulated sample Pearson correlation.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.perplexity.Perplexity[source]

Perplexity is a common metric used for evaluating how well a language model predicts a sample.

Notes

Assumes negative log likelihood loss of each batch (base e). Provides the average perplexity of the batches.
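As a small worked example, the sketch below turns per-batch negative log likelihood losses (natural log) into a perplexity. It assumes the reported value is the exponential of the mean per-batch loss; the text above does not spell out the exact averaging, so treat this as one plausible reading. The function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity_sketch(batch_losses: Sequence[float]) -> float:
    # Assumption: average the base-e losses first, then exponentiate.
    average_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(average_loss)


# Two batches whose individual perplexities would be 4 and 16.
perplexity = perplexity_sketch([math.log(4.0), math.log(16.0)])
```

Under this reading the result is the geometric mean of the per-batch perplexities (8 here), not their arithmetic mean (10).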

get_metric(self, reset:bool=False) → float[source]
Returns
The accumulated perplexity.
class allennlp.training.metrics.sequence_accuracy.SequenceAccuracy[source]

Sequence Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class.

get_metric(self, reset:bool=False)[source]
Returns
The accumulated accuracy.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.span_based_f1_measure.SpanBasedF1Measure(vocabulary: allennlp.data.vocabulary.Vocabulary, tag_namespace: str = 'tags', ignore_classes: List[str] = None, label_encoding: Optional[str] = 'BIO', tags_to_spans_function: Optional[Callable[[List[str], Optional[List[str]]], List[Tuple[str, Tuple[int, int]]]]] = None)[source]

The CoNLL SRL metrics are based on exact span matching. This metric implements span-based precision and recall metrics for a BIO tagging scheme. It will produce precision, recall and F1 measures per tag, as well as overall statistics. Note that the implementation of this metric is not exactly the same as the perl script used to evaluate the CoNLL 2005 data; in particular, it does not consider continuations or reference spans as constituents of the original span. However, it is a close proxy, which can be helpful for judging model performance during training. This metric works properly when the spans are unlabeled (i.e., your labels are simply “B”, “I”, “O” if using the “BIO” label encoding).
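Exact span matching can be illustrated with a short plain-Python sketch: decode labeled BIO tags into (label, (start, end)) spans, then score a prediction by the spans it shares with the gold sequence. The function names are hypothetical, and starting a new span at an I- tag that does not continue an open span is an assumption about how ill-formed sequences are handled.

```python
from typing import List, Set, Tuple

Span = Tuple[str, Tuple[int, int]]


def bio_tags_to_spans_sketch(tags: List[str]) -> Set[Span]:
    spans: Set[Span] = set()
    start, label = None, None
    for index, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and start is not None and tag_label == label:
            continue  # the open span keeps growing
        if start is not None:
            spans.add((label, (start, index - 1)))  # inclusive end index
            start, label = None, None
        if prefix in ("B", "I"):
            start, label = index, tag_label
    return spans


def span_prf_sketch(predicted_tags: List[str], gold_tags: List[str]) -> Tuple[float, float, float]:
    predicted = bio_tags_to_spans_sketch(predicted_tags)
    gold = bio_tags_to_spans_sketch(gold_tags)
    true_positives = len(predicted & gold)  # label and both boundaries must match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold_tags = ["B-ARG0", "I-ARG0", "O", "B-V", "B-ARG1", "I-ARG1"]
predicted_tags = ["B-ARG0", "I-ARG0", "O", "B-V", "B-ARG1", "O"]
gold_spans = bio_tags_to_spans_sketch(gold_tags)
precision, recall, f1 = span_prf_sketch(predicted_tags, gold_tags)
```

The predicted ARG1 span stops one token early, so it earns no credit even though it overlaps the gold span.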

get_metric(self, reset:bool=False)[source]
Returns
A Dict per label containing the following span-based metrics:
precision: float
recall: float
f1-measure: float
Additionally, an overall key is included, which provides the precision, recall and f1-measure for all spans.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.squad_em_and_f1.SquadEmAndF1[source]

This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 score using the official SQuAD evaluation script.
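For intuition, the sketch below computes an exact match and a token-level F1 in the style of SQuAD evaluation: answers are normalized (lowercased, punctuation and English articles removed, whitespace collapsed), and the best score over all labeled answers is kept. It is written from memory of how that style of evaluation works and is not a copy of the official script; all function names are hypothetical.

```python
import re
import string
from collections import Counter
from typing import List, Tuple


def normalize_answer(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)  # drop English articles
    return " ".join(text.split())  # collapse whitespace


def token_f1(prediction: str, answer: str) -> float:
    predicted_tokens = normalize_answer(prediction).split()
    answer_tokens = normalize_answer(answer).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(predicted_tokens) & Counter(answer_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted_tokens)
    recall = overlap / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)


def em_and_f1_sketch(prediction: str, answers: List[str]) -> Tuple[float, float]:
    exact_match = max(
        float(normalize_answer(prediction) == normalize_answer(a)) for a in answers
    )
    f1 = max(token_f1(prediction, a) for a in answers)
    return exact_match, f1


answers = ["Eiffel Tower", "Eiffel Tower in Paris"]
exact_case = em_and_f1_sketch("the Eiffel Tower", answers)
partial_case = em_and_f1_sketch("Tower in Paris", answers)
```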

get_metric(self, reset:bool=False) → Tuple[float, float][source]
Returns
Average exact match and F1 score (in that order) as computed by the official SQuAD script over all inputs.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.srl_eval_scorer.SrlEvalScorer(srl_eval_path: str = '/local/deploy/agent6/work/8feb324ce7c68d53/allennlp/tools/srl-eval.pl', ignore_classes: List[str] = None)[source]

This class uses the external srl-eval.pl script for computing the CoNLL SRL metrics.

AllenNLP contains the srl-eval.pl script, but you will need Perl 5.x installed to run it.

Note that this metric reads and writes from disk quite a bit. In particular, it writes and subsequently reads two files per __call__, which is typically invoked once per batch. You probably don’t want to include it in your training loop; instead, you should calculate this on a validation set only.

Parameters
srl_eval_path: str, optional.

The path to the srl-eval.pl script.

ignore_classes: List[str], optional (default=None).

A list of classes to ignore.

get_metric(self, reset:bool=False)[source]
Returns
A Dict per label containing the following span-based metrics:
precision: float
recall: float
f1-measure: float
Additionally, an overall key is included, which provides the precision, recall and f1-measure for all spans.
reset(self)[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.unigram_recall.UnigramRecall[source]

Unigram top-K recall. This does not take word order into account. Assumes integer labels, with each item to be classified having a single correct class.

get_metric(self, reset:bool=False)[source]
Returns
The accumulated recall.
reset(self)[source]

Reset any accumulators or internal state.