Book chapter

A Complex Approach to Spellchecking and Autocorrection for Russian

P. 1-13.
Dereza O., Fenogenova A., Kayutenko D., Marakasova A.

This study discusses a number of methods that can be used jointly for error detection and correction, namely blacklists and pre-compiled dictionaries, a word2vec model, an N-gram language model, and a tripartite error model. Our system consists of two standalone modules: an error detection confidence classifier, built with supervised machine learning methods, and a corrector that processes the words the classifier flags as misspellings. The error detection classifier uses word2vec filtered vector scores as one of its features. In addition, to achieve higher accuracy with little training data, we use a hybrid error model that combines three approaches: the traditional channel model that uses single-letter edits, the model introduced by Brill and Moore, and an extended version of the channel model that uses wider-context edits. Combining these tools and methods, we achieved promising results: our system effectively handles both known and unknown words, including difficult cases such as slang.
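
To make the detect-then-correct pipeline concrete, the following is a minimal sketch and not the authors' implementation. It covers only the simplest of the three error-model components, single-letter edits, and makes several loudly simplifying assumptions: a dictionary lookup stands in for the supervised confidence classifier, unigram counts stand in for the N-gram language model, and all edits are treated as equally likely. The lexicon, its counts, and the function names (`single_edits`, `is_suspicious`, `correct`) are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy lexicon with made-up counts (assumption, not data from the study).
WORD_COUNTS = Counter({"привет": 50, "привес": 1, "спасибо": 40, "сегодня": 30})
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"


def single_edits(word):
    """Return every string one deletion, transposition, substitution, or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [
        left + c + right[1:] for left, right in splits if right for c in ALPHABET
    ]
    inserts = [left + c + right for left, right in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def is_suspicious(word):
    """Detection step: flag out-of-lexicon words (a stand-in for the classifier)."""
    return word not in WORD_COUNTS


def correct(word):
    """Correction step: choose the most frequent lexicon word one edit away."""
    if not is_suspicious(word):
        return word
    candidates = [w for w in single_edits(word) if w in WORD_COUNTS]
    if not candidates:
        # No candidate within one edit: leave the word unchanged.
        return word
    return max(candidates, key=lambda w: WORD_COUNTS[w])
```

With this toy lexicon, `correct("превет")` returns `"привет"`, and `correct("привеь")` also returns `"привет"` because it outranks the rarer candidate `"привес"` by count. A fuller system along the lines described above would replace the uniform edit assumption with learned edit probabilities, including multi-letter edits in the style of Brill and Moore, and would rank candidates with contextual language-model scores.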