allennlp.training.metrics

A Metric is some quantity or quantities that can be accumulated during training or evaluation; for example, accuracy or F1 score.

class allennlp.training.metrics.metric.Metric[source]

Bases: allennlp.common.registrable.Registrable

A very general abstract class representing a metric which can be accumulated.

classmethod from_params(params: allennlp.common.params.Params, vocab: typing.Union[allennlp.data.vocabulary.Vocabulary, NoneType] = None)[source]
get_metric(reset: bool) → typing.Union[float, typing.Tuple[float, ...], typing.Dict[str, float]][source]

Compute and return the metric. Optionally also call self.reset().

reset() → None[source]

Reset any accumulators or internal state.

static unwrap_to_tensors(*tensors)[source]

If you actually passed in Variables to a Metric instead of Tensors, there will be a huge memory leak, because it will prevent garbage collection for the computation graph. This method ensures that you’re using tensors directly and that they are on the CPU.
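
For instance, a custom metric typically calls this at the top of its __call__ before accumulating anything. The metric below is purely illustrative (it is not part of the library); it is a minimal sketch of the pattern:

    import torch
    from allennlp.training.metrics.metric import Metric

    class SumOfSquaredError(Metric):
        # Illustrative metric: accumulates the total squared error over batches.
        def __init__(self) -> None:
            self._total = 0.0

        def __call__(self, predictions, gold_labels):
            # Detach the inputs from the computation graph and move them to the
            # CPU before accumulating, so the graph can be garbage collected.
            predictions, gold_labels = self.unwrap_to_tensors(predictions, gold_labels)
            self._total += float(((predictions - gold_labels) ** 2).sum())

        def get_metric(self, reset: bool = False) -> float:
            total = self._total
            if reset:
                self.reset()
            return total

        def reset(self) -> None:
            self._total = 0.0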

class allennlp.training.metrics.average.Average → None[source]

Bases: allennlp.training.metrics.metric.Metric

This Metric breaks with the typical Metric API and just stores values that were computed in some fashion outside of a Metric. If you have some external code that computes the metric for you, for instance, you can use this to report the average result using our Metric API.

get_metric(reset: bool = False)[source]
Returns: The average of all values that were passed to __call__.
reset()[source]

Reset any accumulators or internal state.
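
A minimal usage sketch for Average (the values recorded here are arbitrary):

    from allennlp.training.metrics import Average

    average = Average()
    # Record values that were computed elsewhere, e.g. a per-batch score.
    average(2.0)
    average(4.0)
    print(average.get_metric())            # 3.0
    print(average.get_metric(reset=True))  # 3.0, and the accumulator is cleared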

class allennlp.training.metrics.boolean_accuracy.BooleanAccuracy → None[source]

Bases: allennlp.training.metrics.metric.Metric

Just checks batch-equality of two tensors and computes an accuracy metric based on that. This is similar to CategoricalAccuracy, if you’ve already done a .max() on your predictions. If you have categorical output, though, you should typically just use CategoricalAccuracy. The reason you might want to use this instead is if you’ve done some kind of constrained inference and don’t have a prediction tensor that matches the API of CategoricalAccuracy, which assumes a final dimension of size num_classes.

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.
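
A minimal sketch of BooleanAccuracy (the tensors are illustrative):

    import torch
    from allennlp.training.metrics import BooleanAccuracy

    accuracy = BooleanAccuracy()
    predictions = torch.tensor([[0, 1], [2, 3]])
    gold_labels = torch.tensor([[0, 1], [2, 4]])
    # An instance only counts as correct if every element of its row matches.
    accuracy(predictions, gold_labels)
    print(accuracy.get_metric())  # 0.5: the first row matches exactly, the second does not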

class allennlp.training.metrics.categorical_accuracy.CategoricalAccuracy(top_k: int = 1) → None[source]

Bases: allennlp.training.metrics.metric.Metric

Categorical Top-K accuracy. Assumes integer labels, with each item to be classified having a single correct class.

get_metric(reset: bool = False)[source]
Returns: The accumulated accuracy.
reset()[source]

Reset any accumulators or internal state.
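
A minimal sketch of CategoricalAccuracy with top_k=2 (the scores are illustrative):

    import torch
    from allennlp.training.metrics import CategoricalAccuracy

    accuracy = CategoricalAccuracy(top_k=2)
    # Shape (batch_size, num_classes); these can be logits or probabilities.
    predictions = torch.tensor([[0.1, 0.7, 0.2],
                                [0.6, 0.3, 0.1]])
    gold_labels = torch.tensor([2, 0])
    accuracy(predictions, gold_labels)
    print(accuracy.get_metric())  # 1.0: both gold labels fall within the top 2 predictions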

class allennlp.training.metrics.entropy.Entropy → None[source]

Bases: allennlp.training.metrics.metric.Metric

get_metric(reset: bool = False)[source]
Returns: The scalar average entropy.
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.f1_measure.F1Measure(positive_label: int) → None[source]

Bases: allennlp.training.metrics.metric.Metric

Computes Precision, Recall and F1 with respect to a given positive_label. For example, for a BIO tagging scheme, you would pass the classification index of the tag you are interested in, resulting in the Precision, Recall and F1 score being calculated for this tag only.

get_metric(reset: bool = False)[source]
Returns:

A tuple of the following metrics based on the accumulated count statistics:

precision : float
recall : float
f1-measure : float

reset()[source]
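
A minimal sketch of F1Measure, treating class 1 as the positive label (the scores are illustrative):

    import torch
    from allennlp.training.metrics import F1Measure

    f1_metric = F1Measure(positive_label=1)
    # Shape (batch_size, num_classes) class scores, and integer gold labels.
    predictions = torch.tensor([[0.2, 0.8],
                                [0.9, 0.1],
                                [0.3, 0.7]])
    gold_labels = torch.tensor([1, 1, 0])
    f1_metric(predictions, gold_labels)
    precision, recall, f1 = f1_metric.get_metric(reset=True)
    # For these inputs, precision, recall and f1 are each 0.5.
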
class allennlp.training.metrics.span_based_f1_measure.SpanBasedF1Measure(vocabulary: allennlp.data.vocabulary.Vocabulary, tag_namespace: str = 'tags', ignore_classes: typing.List[str] = None) → None[source]

Bases: allennlp.training.metrics.metric.Metric

The Conll SRL metrics are based on exact span matching. This metric implements span-based precision and recall metrics for a BIO tagging scheme. It will produce precision, recall and F1 measures per tag, as well as overall statistics. Note that the implementation of this metric is not exactly the same as the perl script used to evaluate the CONLL 2005 data - particularly, it does not consider continuations or reference spans as constituents of the original span. However, it is a close proxy, which can be helpful for judging model performance during training.

get_metric(reset: bool = False)[source]
Returns:

A Dict containing the following span-based metrics for each label:

precision : float
recall : float
f1-measure : float

Additionally, an overall key is included, which provides the precision, recall and f1-measure for all spans.

reset()[source]
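
A sketch of SpanBasedF1Measure usage. The tag set, vocabulary and tensors below are illustrative; in practice the vocabulary comes from your dataset and the tag scores from your model:

    import torch
    from allennlp.data.vocabulary import Vocabulary
    from allennlp.training.metrics import SpanBasedF1Measure

    vocab = Vocabulary()
    for tag in ["O", "B-ARG0", "I-ARG0", "B-V"]:
        vocab.add_token_to_namespace(tag, namespace="labels")

    metric = SpanBasedF1Measure(vocab, tag_namespace="labels")
    num_tags = vocab.get_vocab_size("labels")
    # predictions: (batch_size, sequence_length, num_tags) tag scores;
    # gold_labels: (batch_size, sequence_length) gold tag indices.
    gold_labels = torch.tensor([[1, 2, 3, 0]])
    predictions = torch.nn.functional.one_hot(gold_labels, num_tags).float()
    mask = torch.ones_like(gold_labels)
    metric(predictions, gold_labels, mask)
    all_metrics = metric.get_metric(reset=True)
    # The dict contains per-tag entries (e.g. "f1-measure-ARG0") as well as
    # "precision-overall", "recall-overall" and "f1-measure-overall".
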
class allennlp.training.metrics.squad_em_and_f1.SquadEmAndF1 → None[source]

Bases: allennlp.training.metrics.metric.Metric

This Metric takes the best span string computed by a model, along with the answer strings labeled in the data, and computes exact match and F1 score using the official SQuAD evaluation script.

get_metric(reset: bool = False) → typing.Tuple[float, float][source]
Returns:

Average exact match and F1 score (in that order), as computed by the official SQuAD script, over all inputs.

reset()[source]

Reset any accumulators or internal state.
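
A minimal sketch of SquadEmAndF1 (the span and answer strings are illustrative):

    from allennlp.training.metrics import SquadEmAndF1

    metric = SquadEmAndF1()
    # The best span string comes from the model; the answer strings come from the data.
    metric("the Norman conquest", ["the Norman conquest", "Norman conquest of England"])
    exact_match, f1_score = metric.get_metric(reset=True)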

class allennlp.training.metrics.mention_recall.MentionRecall → None[source]

Bases: allennlp.training.metrics.metric.Metric

get_metric(reset: bool = False) → float[source]

Compute and return the metric. Optionally also call self.reset().

reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.conll_coref_scores.ConllCorefScores → None[source]

Bases: allennlp.training.metrics.metric.Metric

static get_gold_clusters(gold_clusters)[source]
get_metric(reset: bool = False) → typing.Tuple[float, float, float][source]

Compute and return the metric. Optionally also call self.reset().

static get_predicted_clusters(top_spans, antecedent_indices, predicted_antecedents)[source]
reset()[source]

Reset any accumulators or internal state.

class allennlp.training.metrics.conll_coref_scores.Scorer(metric)[source]

Bases: object

Mostly borrowed from <https://github.com/clarkkev/deep-coref/blob/master/evaluation.py>

static b_cubed(clusters, mention_to_gold)[source]

Averaged per-mention precision and recall. <https://pdfs.semanticscholar.org/cfe3/c24695f1c14b78a5b8e95bcbd1c666140fd1.pdf>

static ceafe(clusters, gold_clusters)[source]

Computes the Constrained Entity-Alignment F-Measure (CEAF) for evaluating coreference. Gold and predicted mentions are aligned into clusterings which maximise a metric - in this case, the F measure between gold and predicted clusters.

<https://www.semanticscholar.org/paper/On-Coreference-Resolution-Performance-Metrics-Luo/de133c1f22d0dfe12539e25dda70f28672459b99>

get_f1()[source]
get_precision()[source]
get_prf()[source]
get_recall()[source]
static muc(clusters, mention_to_gold)[source]

Counts the mentions in each predicted cluster which need to be re-allocated in order for each predicted cluster to be contained by the respective gold cluster. <http://aclweb.org/anthology/M/M95/M95-1005.pdf>

static phi4(gold_clustering, predicted_clustering)[source]

Subroutine for ceafe. Computes the mention F measure between gold and predicted mentions in a cluster.

update(predicted, gold, mention_to_predicted, mention_to_gold)[source]
class allennlp.training.metrics.evalb_bracketing_scorer.EvalbBracketingScorer(evalb_directory_path: str, evalb_param_filename: str = 'COLLINS.prm') → None[source]

Bases: allennlp.training.metrics.metric.Metric

This class uses the external EVALB software for computing a broad range of metrics on parse trees. Here, we use it to compute the Precision, Recall and F1 metrics. You can download the source for EVALB from here: <http://nlp.cs.nyu.edu/evalb/>.

Note that this software is 20 years old. In order to compile it on modern hardware, you may need to remove the #include <malloc.h> statement in evalb.c before it will compile.

AllenNLP contains the EVALB software, but you will need to compile it yourself before using it because the binary it generates is system dependent. To build it, run make inside the scripts/EVALB directory.

Note that this metric reads and writes from disk quite a bit. You probably don’t want to include it in your training loop; instead, you should calculate this on a validation set only.

Parameters:

evalb_directory_path : str, required.

The directory containing the EVALB executable.

evalb_param_filename : str, optional (default = "COLLINS.prm")

The relative name of the EVALB configuration file used when scoring the trees. By default, this uses the COLLINS.prm configuration file which comes with EVALB. This configuration ignores POS tags and some punctuation labels.

get_metric(reset: bool = False)[source]
Returns: The average precision, recall and f1.
reset()[source]

Reset any accumulators or internal state.
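
A sketch of how the scorer might be used, assuming EVALB has already been compiled in scripts/EVALB and that nltk is installed (the trees are illustrative):

    from nltk import Tree
    from allennlp.training.metrics import EvalbBracketingScorer

    scorer = EvalbBracketingScorer(evalb_directory_path="scripts/EVALB")
    tree_string = "(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))"
    predicted_trees = [Tree.fromstring(tree_string)]
    gold_trees = [Tree.fromstring(tree_string)]
    # The scorer takes lists of predicted and gold nltk Trees.
    scorer(predicted_trees, gold_trees)
    metrics = scorer.get_metric()  # average precision, recall and f1 over the trees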