Comparative Analysis of Anglicism Distribution in Russian Social Network Texts
With globalization, the number of English words borrowed into other languages has increased rapidly. In automatic speech recognition, spell-checking, tagging, and other natural language processing software, loanwords are not easily recognized and should be evaluated separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea that an original Latin-script word and its Cyrillic analogue are simultaneously similar in spelling, phonetics, and semantics. We use a set of transliteration, phonetic transcription, and morphological analysis methods to generate candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, overlaps well with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
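The generate-then-filter idea can be illustrated with a minimal sketch. This is not the authors' implementation: the `TRANSLIT` table, the toy vectors in `EN_VECTORS` and `RU_VECTORS`, the `threshold` value, and the function names are all hypothetical, and the sketch assumes that English and Russian word vectors already live in a shared space. Phonetic transcription and morphological analysis are left out for brevity.

```python
import math

# Hypothetical, deliberately tiny Latin-to-Cyrillic table (an assumption for
# illustration only; a real system would need many-to-many rules).
TRANSLIT = {
    "sh": "ш", "ch": "ч", "th": "т", "ck": "к", "oo": "у", "ee": "и",
    "a": "а", "b": "б", "c": "к", "d": "д", "e": "е", "f": "ф", "g": "г",
    "h": "х", "i": "и", "j": "дж", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "q": "к", "r": "р", "s": "с", "t": "т", "u": "у",
    "v": "в", "w": "в", "x": "кс", "y": "й", "z": "з",
}


def transliterate(word):
    """Greedy transliteration: try a two-letter rule first, then one letter."""
    word = word.lower()
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in TRANSLIT:
            out.append(TRANSLIT[pair])
            i += 2
        elif word[i] in TRANSLIT:
            out.append(TRANSLIT[word[i]])
            i += 1
        else:
            i += 1  # skip characters that the toy table does not cover
    return "".join(out)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def detect_anglicisms(english_vectors, russian_vectors, threshold=0.7):
    """Keep (English, Cyrillic) pairs that match in spelling AND in meaning."""
    found = []
    for en_word, en_vec in english_vectors.items():
        candidate = transliterate(en_word)        # spelling hypothesis
        ru_vec = russian_vectors.get(candidate)   # is it attested in the corpus?
        if ru_vec is not None and cosine(en_vec, ru_vec) >= threshold:
            found.append((en_word, candidate))    # semantic filter passed
    return found


# Toy vectors, invented for this example (assumed to share one vector space).
EN_VECTORS = {"blog": [0.9, 0.1, 0.0], "post": [0.8, 0.2, 0.1], "most": [0.1, 0.2, 0.9]}
RU_VECTORS = {"блог": [0.85, 0.15, 0.05], "пост": [0.7, 0.3, 0.1], "мост": [0.9, 0.1, 0.0]}

print(detect_anglicisms(EN_VECTORS, RU_VECTORS))
```

In this toy data, "most" transliterates to "мост" (Russian for "bridge"), which exists in the Russian vocabulary but has a dissimilar vector, so the semantic filter rejects it while "blog" and "post" are kept. This is the role the distributional filter plays: spelling similarity alone produces false hypotheses.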