A General Method Applicable to the Search for Anglicisms in Russian Social Network Texts
With globalization, the number of borrowings from English has rapidly increased in languages all over the world. In automatic speech recognition, spell-checking, tagging and other natural language processing tasks, loanwords frequently cause problems and should be treated separately. In this paper we present a corpus-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea that the original Latin-script word and its Cyrillic analogue are simultaneously similar in spelling, phonetics and semantics. We use a set of transliteration, phonetic transcription and morphological analysis methods to generate candidate hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, overlaps well with a manually compiled dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
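The two-stage idea described above (generate Cyrillic hypotheses, then filter them semantically) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the implementation used in this work: the letter mapping `LATIN_TO_CYRILLIC`, the functions `candidates`, `cosine` and `detect`, the threshold value, and the toy two-dimensional vectors, which are assumed to live in a shared English-Russian space. Phonetic transcription and morphological analysis are omitted.

```python
from itertools import product
from math import sqrt

# Toy letter mapping (an assumption; real transliteration and phonetic
# transcription rules are far richer than one-letter substitutions).
LATIN_TO_CYRILLIC = {
    "b": ["б"], "l": ["л"], "o": ["о"], "g": ["г"],
    "f": ["ф"], "a": ["а", "э"], "n": ["н"],
}

def candidates(word):
    """Return every Cyrillic spelling reachable through the letter mapping."""
    options = [LATIN_TO_CYRILLIC.get(ch) for ch in word.lower()]
    if not options or any(opt is None for opt in options):
        return set()  # empty input or a letter the toy mapping does not cover
    return {"".join(combo) for combo in product(*options)}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def detect(english_words, vocab, vec_en, vec_ru, threshold=0.5):
    """Keep hypotheses that occur in the corpus vocabulary and are
    semantically close to the source word (hypothetical threshold)."""
    found = {}
    for word in english_words:
        for cand in sorted(candidates(word)):
            if cand not in vocab or word not in vec_en or cand not in vec_ru:
                continue
            if cosine(vec_en[word], vec_ru[cand]) >= threshold:
                found.setdefault(word, []).append(cand)
    return found

# Toy data: "блог" is a borrowing of "blog"; "он" ("he") only looks like "on".
vocab = {"блог", "он"}
vec_en = {"blog": [0.9, 0.1], "on": [0.1, 0.9]}
vec_ru = {"блог": [0.8, 0.2], "он": [0.9, -0.3]}
result = detect(["blog", "on"], vocab, vec_en, vec_ru)
```

In this toy run the pair "on" / "он" passes the spelling stage but is rejected by the semantic filter, which is the role the distributional models play in the method: surface similarity alone produces coincidental matches.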