Break: A Question Understanding Benchmark
Break is a question understanding dataset aimed at training models to reason over complex questions. It features 83,978 natural language questions, annotated with a new meaning representation, Question Decomposition Meaning Representation (QDMR). Each example pairs a natural language question with its QDMR: an ordered list of atomic steps that, executed in sequence, answer the question. Break contains human-composed questions sampled from 10 leading question answering benchmarks over text, images, and databases. The dataset was created by a team of NLP researchers at Tel Aviv University and the Allen Institute for AI.
For more details on Break, please refer to our TACL 2020 paper and to our blogpost.
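In the dataset files, each decomposition is stored as a single string whose steps are separated by semicolons, with later steps referencing earlier ones via placeholders such as #1 and #2. The minimal sketch below illustrates this textual format; the toy question is illustrative only, and the exact file/column layout of the released CSVs is documented in the GitHub repository.

```python
# Minimal sketch of QDMR's textual format: steps separated by ';',
# with '#1', '#2', ... referencing the results of earlier steps.
# The example question is a made-up illustration, not a dataset record.

def parse_qdmr(decomposition: str) -> list[str]:
    """Split a QDMR string into its ordered steps."""
    return [step.strip() for step in decomposition.split(";")]

qdmr = ("return flights from Denver ; "
        "return #1 to Philadelphia ; "
        "return #2 that arrive before noon")

for i, step in enumerate(parse_qdmr(qdmr), start=1):
    print(f"#{i}: {step}")
```

Each step corresponds to a single operation over the results of previous steps, which is what makes QDMR amenable to questions over text, images, and databases alike.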
Question-Answering Datasets
The Break dataset contains questions from the following 10 datasets:
- Semantic Parsing: Academic, ATIS, GeoQuery, Spider
- Visual Question Answering: CLEVR-humans, NLVR2
- Reading Comprehension (and KB-QA): ComQA, ComplexWebQuestions, DROP, HotpotQA
For the full dataset statistics please refer to our repository.
Paper
Break It Down: A Question Understanding Benchmark
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch and Jonathan Berant
Transactions of the Association for Computational Linguistics (TACL), 2020
@article{Wolfson2020Break,
  title={Break It Down: A Question Understanding Benchmark},
  author={Wolfson, Tomer and Geva, Mor and Gupta, Ankit and Gardner, Matt and Goldberg, Yoav and Deutch, Daniel and Berant, Jonathan},
  journal={Transactions of the Association for Computational Linguistics},
  year={2020},
}
Authors
Tomer Wolfson, Mor Geva, Ankit Gupta, Matt Gardner, Yoav Goldberg, Daniel Deutch, and Jonathan Berant (Tel Aviv University and the Allen Institute for AI).
Leaderboard
Submission
Predictions on the hidden test set are evaluated via the AI2 Leaderboard page. Log in to the leaderboard website and follow the submission instructions.
Because the GED metric is computed with an approximation algorithm, evaluation may take several hours. The approximation also yields slightly different GED values than those reported in the paper.
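For intuition about why GED is approximated: exact graph edit distance is NP-hard, so evaluators rely on anytime approximations that trade exactness for speed. The toy sketch below is not the official Break evaluator; it uses networkx on made-up stand-in graphs (nodes as QDMR steps, edges as #-references) purely to show the approximation pattern.

```python
# Toy illustration of approximate graph edit distance (GED).
# NOT the official Break evaluator; the graphs are made-up stand-ins
# for decomposition graphs (nodes = QDMR steps, edges = #-references).
import networkx as nx

gold = nx.Graph([("step1", "step2"), ("step2", "step3")])
pred = nx.Graph([("step1", "step2"), ("step1", "step3")])

# optimize_graph_edit_distance yields successively tighter upper bounds
# on the exact GED; taking only the first keeps the computation fast
# at the cost of exactness.
approx_ged = next(nx.optimize_graph_edit_distance(gold, pred))
print("approximate GED:", approx_ged)
```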
Results
Break
Rank | Submission | Created | EM Dev. | EM Test | SARI Dev. | SARI Test | GED Dev. | GED Test |
---|---|---|---|---|---|---|---|---|
1 | Curriculum-trained CopyNet (Chris Coleman and Alex Reneau, Northwestern University) | Jul 1, 2020 | _ | 0.163 | _ | 0.757 | _ | 0.271 |
2 | CopyNet (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.154 | 0.157 | 0.748 | 0.746 | 0.318 | 0.322 |
3 | RuleBased (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.002 | 0.003 | 0.508 | 0.506 | 0.799 | 0.802 |
Break High-level
Rank | Submission | Created | EM Dev. | EM Test | SARI Dev. | SARI Test | GED Dev. | GED Test |
---|---|---|---|---|---|---|---|---|
1 | CopyNet (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.081 | 0.083 | 0.722 | 0.722 | 0.319 | 0.316 |
2 | RuleBased (Wolfson et al., TACL 2020) | Feb 1, 2020 | 0.010 | 0.012 | 0.554 | 0.554 | 0.659 | 0.652 |
Explore
To view (many) more question decomposition examples, explore Break.
Download
- For the full documentation of the dataset and its format, please refer to our GitHub repository.
- Click here to download Break.
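Break can also be loaded programmatically. The sketch below uses the Hugging Face datasets library; the dataset id "break_data" and the config name "QDMR" are assumptions that should be verified on the Hub before use.

```python
# Optional: load Break via the Hugging Face `datasets` library.
# The dataset id and config name below are assumptions; verify on the Hub.
from datasets import load_dataset

break_qdmr = load_dataset("break_data", "QDMR")  # low-level QDMR subset
print(break_qdmr["train"][0])  # one question with its decomposition
```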