RuCoLA: Russian Corpus of Linguistic Acceptability

Mikhailov V., Shamardina T., Ryabinin M., Pestova A., Smurov I., Artemova E.

P. 5207–5227.

Linguistic acceptability (LA) attracts the attention of the research community due to its many uses, such as testing the grammatical knowledge of language models and filtering implausible texts with acceptability classifiers. However, the application scope of LA in languages other than English is limited due to the lack of high-quality resources. To this end, we introduce the Russian Corpus of Linguistic Acceptability (RuCoLA), built from the ground up under the well-established binary LA approach. RuCoLA consists of k in-domain sentences from linguistic publications and k out-of-domain sentences produced by generative models. The out-of-domain set is created to facilitate the practical use of acceptability for improving language generation. Our paper describes the data collection protocol and presents a fine-grained analysis of acceptability classification experiments with a range of baseline approaches. In particular, we demonstrate that the most widely used language models still fall behind humans by a large margin, especially when detecting morphological and semantic errors. We release RuCoLA, the code of experiments, and a public leaderboard (rucola-benchmark.com) to assess the linguistic competence of language models for Russian.

Language: English

Keywords: grammaticality, linguistic acceptability, language model

Publication based on the results of:

Development of mathematical models and methods for natural language processing, knowledge discovery in data and recommender systems (2022)

In book

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Association for Computational Linguistics, 2022.

A Language Model for Grammatical Error Correction in L2 Russian

Remnev N., Obiedkov S., Rakhilina E. V. et al. Series Computer Science "arxiv.org", 2023.

Grammatical error correction is one of the fundamental tasks in Natural Language Processing. For the Russian language, most of the spellcheckers available correct typos and other simple errors with high accuracy, but often fail when faced with non-native (L2) writing, since the latter contains errors that are not typical for native speakers. In this paper, ...

Added: October 30, 2024

Corpora as indicators of (non-)existence

Piperski A., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (2015). Moscow: RSUH Publishing, 2015. P. 494–500.

This paper discusses the notions of acceptability, occurrence, grammaticality and existence, and focuses on the relationship between corpus linguistics and the question of the existence of lexical items. Since corpora are almost exclusively samples from larger populations, it is claimed that they cannot provide evidence for non-existence of words, collocations or constructions. This is because ...

Added: March 13, 2016