allennlp.training.metrics

A Metric is some quantity or quantities that can be accumulated during training or evaluation; for example, accuracy or F1 score.

class allennlp.training.metrics.metric.Metric[source]

Bases: allennlp.common.registrable.Registrable

A very general abstract class representing a metric which can be accumulated.

classmethod from_params(params: allennlp.common.params.Params, vocab: typing.Union[allennlp.data.vocabulary.Vocabulary, NoneType] = None)[source]
get_metric(reset: bool) → typing.Union[float, typing.Tuple[float, ...], typing.Dict[str, float]][source]

Compute and return the metric. Optionally also call self.reset().

reset() → None[source]

Reset any accumulators or internal state.

static unwrap_to_tensors(*tensors)[source]

If you pass Variables to a Metric instead of Tensors, there will be a huge memory leak, because they prevent garbage collection of the computation graph. This method ensures that you are working with tensors directly and that they are on the CPU.
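
The update interface is not spelled out above (only get_metric() and reset() are documented), so the __call__ signature in the sketch below is an assumption made for illustration; it shows one plausible way to subclass Metric, using unwrap_to_tensors to guard against accumulating graph-tracking values:

    import torch
    from allennlp.training.metrics.metric import Metric

    # A minimal sketch of a custom metric. Only get_metric() and reset() are part
    # of the documented interface; the __call__ update method and the "total"
    # registration name used here are assumptions for illustration.
    @Metric.register("total")  # Metric is Registrable, so subclasses may register a name
    class Total(Metric):
        def __init__(self) -> None:
            self._total = 0.0

        def __call__(self, value: torch.Tensor) -> None:
            # Guard against accumulating graph-tracking values, which would keep
            # the whole computation graph alive (see unwrap_to_tensors above).
            value, = self.unwrap_to_tensors(value)
            self._total += float(value.sum())

        def get_metric(self, reset: bool = False) -> float:
            total = self._total
            if reset:
                self.reset()
            return total

        def reset(self) -> None:
            self._total = 0.0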

class allennlp.training.metrics.average.Average → None[source]

Bases: allennlp.training.metrics.metric.Metric

This Metric breaks with the typical Metric API: it simply stores values that were computed in some fashion outside of a Metric. If, for instance, you have external code that computes the metric for you, you can use this class to report the average result through the Metric API.

get_metric(reset: bool = False)[source]
Returns: The average of all values that were passed to __call__.
reset()[source]

Reset any accumulators or internal state.
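
A small usage sketch; the update signature is not shown above, so the assumption here is that __call__ accepts one externally computed value per update:

    from allennlp.training.metrics.average import Average

    average = Average()
    for loss in [0.9, 0.7, 0.4]:       # values computed by some code outside the metric
        average(loss)                  # assumed update signature: one value per call

    print(average.get_metric(reset=True))   # -> 0.666..., and the accumulator is cleared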

class allennlp.training.metrics.boolean_accuracy.BooleanAccuracy → None[source]

Bases: allennlp.training.metrics.metric.Metric

Just checks batch-equality of two tensors and computes an accuracy metric based on that. This is similar to CategoricalAccuracy, if you’ve already done a .max() on your predictions. If you have categorical output, though, you should typically just use CategoricalAccuracy. The reason you might want to use this instead is if you’ve done some kind of constrained inference and don’t have a prediction tensor that matches the API of CategoricalAccuracy, which assumes a final dimension of size num_classes.

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.
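
A usage sketch, assuming the metric is updated via __call__(predictions, gold_labels) on a recent PyTorch (the update signature is not documented above):

    import torch
    from allennlp.training.metrics.boolean_accuracy import BooleanAccuracy

    accuracy = BooleanAccuracy()

    predictions = torch.tensor([[1, 2, 3],
                                [4, 5, 6]])
    gold_labels = torch.tensor([[1, 2, 3],
                                [4, 5, 7]])   # second row differs in its last element

    accuracy(predictions, gold_labels)
    print(accuracy.get_metric(reset=True))    # 0.5: exactly one of the two rows matches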

class allennlp.training.metrics.categorical_accuracy.CategoricalAccuracy(top_k: int = 1) → None[source]

Bases: allennlp.training.metrics.metric.Metric

Categorical Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class.

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.
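
A usage sketch, assuming the metric is updated via __call__(predictions, gold_labels), where predictions carries a trailing num_classes dimension and gold_labels holds integer class indices (the update signature itself is not documented above):

    import torch
    from allennlp.training.metrics.categorical_accuracy import CategoricalAccuracy

    accuracy = CategoricalAccuracy(top_k=1)

    predictions = torch.tensor([[0.1, 0.7, 0.2],
                                [0.6, 0.3, 0.1]])   # (batch_size, num_classes) scores
    gold_labels = torch.tensor([1, 2])              # one integer label per instance

    accuracy(predictions, gold_labels)
    print(accuracy.get_metric(reset=True))          # 0.5: only the first prediction is correct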

class allennlp.training.metrics.entropy.Entropy → None[source]

Bases: allennlp.training.metrics.metric.Metric

get_metric(reset: bool = False)[source]
Returns: The scalar average entropy.
reset()[source]

Reset any accumulators or internal state.
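
The class carries no description above, so the sketch below rests on an assumption: that the metric is updated with a batch of unnormalised logits and reports the average entropy of the implied distributions:

    import torch
    from allennlp.training.metrics.entropy import Entropy

    entropy = Entropy()

    logits = torch.tensor([[2.0, 0.5, 0.1],
                           [0.0, 0.0, 0.0]])   # second row implies a uniform distribution
    entropy(logits)                            # assumed update signature

    print(entropy.get_metric(reset=True))      # scalar average entropy over the batch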

class allennlp.training.metrics.f1_measure.F1Measure(positive_label: int) → None[source]

Bases: allennlp.training.metrics.metric.Metric

Computes Precision, Recall and F1 with respect to a given positive_label. For example, for a BIO tagging scheme, you would pass the classification index of the tag you are interested in, resulting in the Precision, Recall and F1 score being calculated for this tag only.

get_metric(reset: bool = False)[source]

Returns: A tuple of the following metrics, based on the accumulated count statistics:

precision : float
recall : float
f1-measure : float

reset()[source]

Reset any accumulators or internal state.
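
A usage sketch, assuming the metric is updated via __call__(predictions, gold_labels) with class scores in the final dimension of predictions (the update signature is not documented above):

    import torch
    from allennlp.training.metrics.f1_measure import F1Measure

    f1 = F1Measure(positive_label=1)     # track precision/recall/F1 for class index 1 only

    predictions = torch.tensor([[0.2, 0.8],    # predicted class 1
                                [0.9, 0.1],    # predicted class 0
                                [0.3, 0.7]])   # predicted class 1
    gold_labels = torch.tensor([1, 1, 0])

    f1(predictions, gold_labels)
    precision, recall, f1_measure = f1.get_metric(reset=True)
    print(precision, recall, f1_measure)       # 0.5, 0.5, 0.5 on this toy batch
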
class allennlp.training.metrics.span_based_f1_measure.SpanBasedF1Measure(vocabulary: allennlp.data.vocabulary.Vocabulary, tag_namespace: str = 'tags', ignore_classes: typing.List[str] = None) → None[source]

Bases: allennlp.training.metrics.metric.Metric

The Conll SRL metrics are based on exact span matching. This metric implements span-based precision and recall metrics for a BIO tagging scheme. It will produce precision, recall and F1 measures per tag, as well as overall statistics. Note that the implementation of this metric is not exactly the same as the perl script used to evaluate the CONLL 2005 data - in particular, it does not consider continuations or reference spans as constituents of the original span. However, it is a close proxy, which can be helpful for judging model performance during training.

get_metric(reset: bool = False)[source]

Returns: A Dict per label containing the following span-based metrics:

precision : float
recall : float
f1-measure : float

Additionally, an "overall" key is included, which provides the precision, recall and f1-measure for all spans.

reset()[source]

Reset any accumulators or internal state.
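
A usage sketch. The BIO tag vocabulary is built by hand here purely for illustration (in a real model it comes from the dataset), and both the __call__(predictions, gold_labels, mask) update and the "f1-measure-overall" key name are assumptions, since only get_metric() and reset() are documented above:

    import torch
    from allennlp.data.vocabulary import Vocabulary
    from allennlp.training.metrics.span_based_f1_measure import SpanBasedF1Measure

    # Build a tiny BIO tag namespace by hand for the example.
    vocab = Vocabulary()
    for tag in ["O", "B-ARG0", "I-ARG0", "B-V"]:
        vocab.add_token_to_namespace(tag, namespace="tags")

    metric = SpanBasedF1Measure(vocab, tag_namespace="tags")

    num_tags = vocab.get_vocab_size("tags")
    gold_labels = torch.tensor([[1, 2, 0, 3]])           # B-ARG0 I-ARG0 O B-V
    # One-hot "predictions" that exactly reproduce the gold tags.
    predictions = torch.nn.functional.one_hot(gold_labels, num_classes=num_tags).float()
    mask = torch.ones(1, 4)

    metric(predictions, gold_labels, mask)
    all_metrics = metric.get_metric(reset=True)
    print(all_metrics["f1-measure-overall"])             # 1.0, since every span matches exactly
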
class allennlp.training.metrics.squad_em_and_f1.SquadEmAndF1 → None[source]

Bases: allennlp.training.metrics.metric.Metric

This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 scores using the official SQuAD evaluation script.

get_metric(reset: bool = False) → typing.Tuple[float, float][source]

Returns: Average exact match and F1 score (in that order), as computed by the official SQuAD script over all inputs.

reset()[source]

Reset any accumulators or internal state.
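
A usage sketch, assuming the metric is updated via __call__(best_span_string, answer_strings), i.e. the model's best span string and the list of gold answer strings for that instance (only get_metric() and reset() are documented above):

    from allennlp.training.metrics.squad_em_and_f1 import SquadEmAndF1

    metric = SquadEmAndF1()

    # One update per predicted answer, with the gold answer strings from the data.
    metric("the Eiffel Tower", ["The Eiffel Tower", "Eiffel Tower"])
    metric("Paris", ["the city of Paris"])

    exact_match, f1 = metric.get_metric(reset=True)
    print(exact_match, f1)   # averages over both predictions, per the official SQuAD script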