This page collects NLU datasets proposed in 2018.
| Dataset | Task | Style | Size | Source | Venue | Web | Misc | Similar datasets |
|---|---|---|---|---|---|---|---|---|
| CoQA | RC | free form (+no ans) | 127k | various articles | TACL? | url | conversational questions | QuAC |
| QuAC | RC | extraction (+no ans) | 100k | Wikipedia | EMNLP2018 | url | conversational questions | CoQA |
| HotpotQA | RC | extraction | 113k | Wikipedia | EMNLP2018 | url | multi-hop reasoning | QAngaroo |
| SWAG | QA | multiple choice | 113k | video caption | EMNLP2018 | url | situational commonsense reasoning | |
| DNC | NLI | textual entailment | 570k | NLP tasks | EMNLP2018 | url | diverse NLI | SNLI, MultiNLI |
| OpenBookQA | QA | multiple choice | 6k | science facts | EMNLP2018 | url | external knowledge | ARC |
| RecipeQA | RC+ | various | 36k | recipe | EMNLP2018 | url | multimodal comprehension | TextbookQA, FigureQA |
| CLOTH | RC | cloze | 99k | English exams | EMNLP2018 | url | | RACE |
| DuoRC | RC | extraction | 186k | movie plot | ACL2018 | url | | NarrativeQA |
| SQuAD2.0 | RC | extraction (+no ans) | 150k | Wikipedia | ACL2018 | url | no answer: 50k | NewsQA |
| CliCR | RC | cloze | 100k | clinical case text | NAACL2018 | url | | |
| FEVER | NLI? | fact verification | 185k | Wikipedia | NAACL2018 | url | | |
| MultiRC | RC | multiple choice | 6k+ | various articles | NAACL2018 | url | multiple sentence reasoning | MCTest |
| ProPara | RC | various | 2k | procedural text | NAACL2018 | url | | bAbI, SCoNE |
| ARC | RC | multiple choice | 8k | science exam | ? | url | easy 5197, challenge 2590 | |

TODO:
- Interpretation of Natural Language Rules in Conversational Machine Reading (Saeidi+ 2018, EMNLP)
- Multi-Relational Question Answering from Narratives: Machine Reading and Reasoning in Simulated Worlds (Labutov+ 2018, ACL)
- Event2Mind: Commonsense Inference on Events, Intents, and Reactions (Rashkin+ 2018, ACL)
- Modeling Naive Psychology of Characters in Simple Commonsense Stories (Rashkin+ 2018, ACL)
- emrQA: A Large Corpus for Question Answering on Electronic Medical Records (Pampari+ 2018, EMNLP)

Note:
- QA = question answering
- RC = reading comprehension, i.e. question answering over a given context
- NLI = natural language inference, also known as recognizing textual entailment
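
To make the three task abbreviations concrete, here is a minimal sketch of how their inputs differ in shape. The class names, field names, and the `task_of` helper are hypothetical, chosen only for illustration; they are not the schema of any dataset listed on this page, and the toy strings are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """QA: a question answered without a supplied context passage."""
    question: str
    answer: str


@dataclass
class RCExample:
    """RC: a question answered against a supplied context passage."""
    context: str
    question: str
    # None stands in for the "+no ans" style (unanswerable questions).
    answer: Optional[str]


@dataclass
class NLIExample:
    """NLI: decide how a hypothesis relates to a premise."""
    premise: str
    hypothesis: str
    label: str  # illustrative label set: entailment / contradiction / neutral


def task_of(example: object) -> str:
    """Return the task abbreviation used in the table for an example."""
    if isinstance(example, NLIExample):
        return "NLI"
    if isinstance(example, RCExample):
        return "RC"
    if isinstance(example, QAExample):
        return "QA"
    raise TypeError(f"unknown example type: {type(example).__name__}")


# Toy, made-up instances (not drawn from any real dataset).
qa = QAExample(question="What do plants need for photosynthesis?", answer="light")
rc = RCExample(
    context="The cat sat on the mat.",
    question="Where did the cat sit?",
    answer="on the mat",
)
rc_no_answer = RCExample(
    context="The cat sat on the mat.",
    question="What colour was the dog?",
    answer=None,
)
nli = NLIExample(
    premise="A man is playing a guitar.",
    hypothesis="A person is making music.",
    label="entailment",
)

for ex in (qa, rc, rc_no_answer, nli):
    print(task_of(ex), ex)
```

The only structural difference between the QA and RC sketches is the `context` field, which mirrors the note's definition of RC as question answering with a context.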