Automatic Detection and Correction of Derivational Errors in Written Russian as a Foreign Language
Learner corpora are among the most valuable sources of statistical data on learners' errors. For instance, data from foreign-language learner corpora can be used in Second Language Acquisition research. However, the representativeness of a corpus depends strongly on the quality of its error markup, which is most often produced manually and is therefore time-consuming and painstaking for annotators. To make annotation easier, additional tools such as spellcheckers are usually used. This paper focuses on developing a program for the automatic correction of derivational errors made by learners of Russian as a foreign language. Derivational errors are uncommon among adult native speakers of Russian (L1) but occur quite often in the written texts and speech of learners of Russian as a foreign language (L2) [Chernigovskaya, Gor, 2000]; we chose them as the scope of our research because correcting such mistakes is a formidable challenge for existing spellcheckers. Using data from the Russian Learner Corpus (http://www.web-corpora.net/RLC/), we tested two existing approaches to this kind of problem. The first, based on the finite-state automaton principle developed by Dickinson and Herring (2008), was tested as an algorithm for detecting derivational errors. The second, which relies on the noisy channel model of Brill and Moore (2000), was used to study error correction. After analyzing the effectiveness of both, we developed our own system for the autocorrection of derivational errors. In our program, the algorithm of Dickinson and Herring serves as the module for detecting word-formation errors. The noisy channel model was rejected; instead, we use the Continuous Bag of Words FastText model, which is based on Harris's theory of distributional semantics. In addition, filtering rules were developed to correct frequent errors that the model is unable to handle.
A dictionary of word paradigms is used to automatically restore the correct grammatical word form. The model's results were validated on data from the Russian Learner Corpus.
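The detect, suggest, and restore stages described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every name and data item in it is hypothetical. A root-plus-suffix check stands in for the finite-state analysis of Dickinson and Herring, cosine similarity over character trigrams stands in for FastText CBOW vectors, and a three-entry dictionary stands in for the paradigm dictionary. The input is assumed to be already lemmatized and tagged, and the erroneous word "читаник" is invented for illustration.

```python
from collections import Counter
from math import sqrt

# Toy inventories (hypothetical; a real system would use full resources).
LEXICON = {"читатель", "писатель", "учитель", "школьник"}
ROOTS = {"чита", "писа", "учи", "школь"}
SUFFIXES = {"тель", "ник", "щик"}
# Toy paradigm dictionary: lemma -> {grammatical tag -> word form}.
PARADIGMS = {
    "читатель": {"nom_sg": "читатель", "gen_sg": "читателя", "nom_pl": "читатели"},
}


def is_derivational_error(word):
    """Flag a non-lexicon word that still parses as known root + known suffix."""
    if word in LEXICON:
        return False
    return any(word.startswith(r) and word[len(r):] in SUFFIXES for r in ROOTS)


def char_ngrams(word, n=3):
    """Count character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def suggest(word):
    """Return the lexicon lemma most similar to the word (stand-in for vector lookup)."""
    target = char_ngrams(word)
    return max(sorted(LEXICON), key=lambda c: cosine(target, char_ngrams(c)))


def correct(lemma, tag):
    """Detect an error, pick a replacement lemma, and restore the tagged form."""
    if not is_derivational_error(lemma):
        return lemma
    candidate = suggest(lemma)
    # Fall back to the bare lemma when no paradigm entry is available.
    return PARADIGMS.get(candidate, {}).get(tag, candidate)


print(correct("читаник", "gen_sg"))
```

Keeping the three stages as separate functions reflects the modular design outlined in the abstract: the similarity function could be replaced by nearest-neighbour lookup in a trained embedding model without changing the detection or paradigm steps.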