RuBQ 2.0: An Innovated Russian Question Answering Dataset
The paper describes the second version of RuBQ, a Russian dataset for knowledge base question answering (KBQA) over Wikidata. Whereas the first version builds on Q&A pairs harvested online, the extension is based on questions obtained through search engine query suggestion services. The questions underwent crowdsourced and in-house annotation in a quite different fashion compared to the first edition. The dataset doubled in size: RuBQ 2.0 contains 2,910 questions along with the answers and SPARQL queries. The dataset also incorporates answer-bearing paragraphs from Wikipedia for the majority of questions. The dataset is suitable for the evaluation of KBQA, machine reading comprehension (MRC), hybrid questions answering, as well as semantic parsing. We provide the analysis of the dataset and report several KBQA and MRC baseline results. The dataset is freely available under the CC-BY-4.0 license.