A Question Understanding Benchmark

Break is a question understanding dataset aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example pairs a natural language question with its QDMR representation. Break contains human-composed questions sampled from 10 leading question-answering benchmarks over text, images and databases. The dataset was created by a team of NLP researchers at Tel Aviv University and the Allen Institute for AI.
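
The snippet below is a minimal sketch of loading one Break split and breaking a QDMR annotation into its numbered steps. The file path and column names (question_text, decomposition) are illustrative assumptions; adjust them to the CSV files you download from the repository. QDMR steps are separated by ";" and refer to the outputs of earlier steps with "#k" placeholders.

```python
# Minimal sketch: read one Break split and print a QDMR decomposition step by step.
# Assumptions: the CSV path and the question_text / decomposition column names.
import pandas as pd

train = pd.read_csv("Break-dataset/QDMR/train.csv")

example = train.iloc[0]
print("Question:", example["question_text"])

# QDMR steps are ";"-separated; "#k" tokens reference the output of step k.
steps = [s.strip() for s in example["decomposition"].split(";")]
for i, step in enumerate(steps, start=1):
    print(f"{i}. {step}")
```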

For more details on Break, please refer to our TACL 2020 paper and to our blog post.

Question-Answering Datasets

The Break dataset contains questions from the following 10 datasets: Academic, ATIS, CLEVR-humans, ComQA, ComplexWebQuestions (CWQ), DROP, GeoQuery, HotpotQA, NLVR2, and Spider.

For the full dataset statistics please refer to our repository.

Paper

Break It Down: A Question Understanding Benchmark
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch and Jonathan Berant
Transactions of the Association for Computational Linguistics (TACL), 2020

@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}

Authors

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch and Jonathan Berant

Tel Aviv University and the Allen Institute for AI

Leaderboard

Submission

Evaluating predictions for the hidden test set is done via the AI2 Leaderboard page. Log on to the leaderboard website and follow the submission instructions.

Because the GED metric is computed with an approximation algorithm, evaluation may take several hours. The approximation also yields GED values that differ slightly from those reported in the paper.
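
For intuition, the following is a rough sketch, not the official evaluation code, of how an approximate graph edit distance between a predicted and a gold QDMR decomposition might be computed with networkx. Each step becomes a node and each "#k" reference becomes an edge; the graph construction, node-matching rule, and timeout are illustrative assumptions, which also shows why a bounded approximation can drift from the exact GED.

```python
# Rough sketch of approximate GED between two QDMR decompositions (not the
# official evaluation script; graph construction and timeout are assumptions).
import re
import networkx as nx

def qdmr_to_graph(decomposition: str) -> nx.DiGraph:
    """Build a step graph: one node per ";"-separated step, edges for "#k" references."""
    graph = nx.DiGraph()
    steps = [s.strip() for s in decomposition.split(";")]
    for i, step in enumerate(steps, start=1):
        graph.add_node(i, label=step)
        for ref in re.findall(r"#(\d+)", step):  # "#k" points back to step k
            graph.add_edge(int(ref), i)
    return graph

def approximate_ged(pred: str, gold: str, timeout: float = 10.0) -> float:
    # The exact GED search is exponential in the worst case, so networkx is
    # given a time budget; bounding the search makes the result approximate.
    return nx.graph_edit_distance(
        qdmr_to_graph(pred),
        qdmr_to_graph(gold),
        node_match=lambda a, b: a["label"] == b["label"],
        timeout=timeout,
    )

print(approximate_ged(
    "return flights ;return #1 from Denver ;return #2 to Boston",
    "return flights from Denver ;return #1 to Boston",
))
```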

Results

Break

| Rank | Submission | Created | EM Dev. | EM Test | SARI Dev. | SARI Test | GED Dev. | GED Test |
|------|------------|---------|---------|---------|-----------|-----------|----------|----------|
| 1 | Curriculum-trained CopyNet (Chris Coleman and Alex Reneau, Northwestern University) | Jul 1, 2020 | _ | 0.163 | _ | 0.757 | _ | 0.271 |
| 2 | CopyNet (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.154 | 0.157 | 0.748 | 0.746 | 0.318 | 0.322 |
| 3 | RuleBased (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.002 | 0.003 | 0.508 | 0.506 | 0.799 | 0.802 |

Break High-level

| Rank | Submission | Created | EM Dev. | EM Test | SARI Dev. | SARI Test | GED Dev. | GED Test |
|------|------------|---------|---------|---------|-----------|-----------|----------|----------|
| 1 | CopyNet (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.081 | 0.083 | 0.722 | 0.722 | 0.319 | 0.316 |
| 2 | RuleBased (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.010 | 0.012 | 0.554 | 0.554 | 0.659 | 0.652 |

Explore

To view (many) more question decomposition examples, explore Break.

Download