A Question Understanding Benchmark

Break is a question understanding dataset aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example pairs a natural language question with its QDMR representation. Break contains human-composed questions sampled from 10 leading question-answering benchmarks over text, images and databases. The dataset was created by a team of NLP researchers at Tel Aviv University and the Allen Institute for AI.
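
The snippet below is a minimal sketch of loading one Break split and breaking a QDMR annotation into its numbered steps. The file path and column names (question_text, decomposition) are illustrative assumptions; adjust them to the CSV files you download from the repository. QDMR steps are separated by ";" and refer to the outputs of earlier steps with "#k" placeholders.

```python
# Minimal sketch: read one Break split and print a QDMR decomposition step by step.
# Assumptions: the CSV path and the question_text / decomposition column names.
import pandas as pd

train = pd.read_csv("Break-dataset/QDMR/train.csv")

example = train.iloc[0]
print("Question:", example["question_text"])

# QDMR steps are ";"-separated; "#k" tokens reference the output of step k.
steps = [s.strip() for s in example["decomposition"].split(";")]
for i, step in enumerate(steps, start=1):
    print(f"{i}. {step}")
```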

For more details on Break, please refer to our TACL 2020 paper and to our blog post.

Question-Answering Datasets

The Break dataset contains questions from the following 10 datasets: Academic, ATIS, CLEVR-humans, ComQA, ComplexWebQuestions (CWQ), DROP, GeoQuery, HotpotQA, NLVR2, and Spider.

For the full dataset statistics please refer to our repository.

Paper

Break It Down: A Question Understanding Benchmark
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch and Jonathan Berant
Transactions of the Association for Computational Linguistics (TACL), 2020

@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}

Authors

Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch and Jonathan Berant

Tel Aviv University and the Allen Institute for AI

Leaderboard

Submission

Evaluating predictions for the hidden test set is done via the AI2 Leaderboard page. Log on to the leaderboard website and follow the submission instructions.

Because the GED metric is computed with an approximation algorithm, evaluation may take several hours. The approximation also yields GED values that differ slightly from those reported in the paper.
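
For intuition, the following is a rough sketch, not the official evaluation code, of how an approximate graph edit distance between a predicted and a gold QDMR decomposition might be computed with networkx. Each step becomes a node and each "#k" reference becomes an edge; the graph construction, node-matching rule, and timeout are illustrative assumptions, which also shows why a bounded approximation can drift from the exact GED.

```python
# Rough sketch of approximate GED between two QDMR decompositions (not the
# official evaluation script; graph construction and timeout are assumptions).
import re
import networkx as nx

def qdmr_to_graph(decomposition: str) -> nx.DiGraph:
    """Build a step graph: one node per ";"-separated step, edges for "#k" references."""
    graph = nx.DiGraph()
    steps = [s.strip() for s in decomposition.split(";")]
    for i, step in enumerate(steps, start=1):
        graph.add_node(i, label=step)
        for ref in re.findall(r"#(\d+)", step):  # "#k" points back to step k
            graph.add_edge(int(ref), i)
    return graph

def approximate_ged(pred: str, gold: str, timeout: float = 10.0) -> float:
    # The exact GED search is exponential in the worst case, so networkx is
    # given a time budget; bounding the search makes the result approximate.
    return nx.graph_edit_distance(
        qdmr_to_graph(pred),
        qdmr_to_graph(gold),
        node_match=lambda a, b: a["label"] == b["label"],
        timeout=timeout,
    )

print(approximate_ged(
    "return flights ;return #1 from Denver ;return #2 to Boston",
    "return flights from Denver ;return #1 to Boston",
))
```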

Results

Break

| Rank | Submission | Created | EM Dev. | EM Test | SARI Dev. | SARI Test | GED Dev. | GED Test |
|------|------------|---------|---------|---------|-----------|-----------|----------|----------|
| 1 | Curriculum-trained CopyNet (Chris Coleman and Alex Reneau, Northwestern University) | Jul 1, 2020 | _ | 0.163 | _ | 0.757 | _ | 0.271 |
| 2 | CopyNet (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.154 | 0.157 | 0.748 | 0.746 | 0.318 | 0.322 |
| 3 | RuleBased (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.002 | 0.003 | 0.508 | 0.506 | 0.799 | 0.802 |

Break High-level

| Rank | Submission | Created | EM Dev. | EM Test | SARI Dev. | SARI Test | GED Dev. | GED Test |
|------|------------|---------|---------|---------|-----------|-----------|----------|----------|
| 1 | CopyNet (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.081 | 0.083 | 0.722 | 0.722 | 0.319 | 0.316 |
| 2 | RuleBased (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.010 | 0.012 | 0.554 | 0.554 | 0.659 | 0.652 |

Explore

To view (many) more question decomposition examples, explore Break.

Download