Modeling lemma frequency bands for lexical complexity assessment of Russian texts

P. 76–92.
Blinova O. V., Tarasov N., Blekanov I., Modina V.

The paper addresses the problem of modeling general-language frequency using data from large Russian corpora. Our goal is to develop a methodology for compiling a consolidated frequency list that can later be used to assess the lexical complexity of Russian texts.
We compared four frequency lists derived from four corpora (Russian National Corpus, ruTenTen11, Araneum Russicum III Maximum, and Taiga). First, we applied rank correlation analysis. Second, we used the “coverage” and “enrichment” measures. Third, we applied the “sum of minimal frequencies” measure. We found significant differences between the compared frequency lists, both in ranking and in relative frequencies. The “coverage” measure showed that the frequency lists are by no means interchangeable. Therefore, none of the corpora in question can be excluded when compiling a consolidated frequency list.
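As a rough illustration of two of the comparisons named above, the sketch below computes a Spearman rank correlation and a sum of minimal relative frequencies for two toy frequency lists. The abstract does not define these measures, so the formulas are assumptions: Spearman's coefficient is taken as the Pearson correlation of ranks over shared lemmas, and the sum of minimal frequencies as the sum, over shared lemmas, of the smaller of the two relative frequencies. The function names and the toy counts are hypothetical and are not data from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values in descending order (rank 1 = most frequent), averaging ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; assumes neither sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def spearman(list_a, list_b):
    """Spearman rank correlation over lemmas present in both lists."""
    shared = sorted(set(list_a) & set(list_b))
    ranks_a = average_ranks([list_a[lemma] for lemma in shared])
    ranks_b = average_ranks([list_b[lemma] for lemma in shared])
    return pearson(ranks_a, ranks_b)


def sum_of_min_frequencies(list_a, list_b):
    """Assumed definition: sum of min(relative frequency in A, in B) over shared lemmas."""
    total_a = sum(list_a.values())
    total_b = sum(list_b.values())
    shared = set(list_a) & set(list_b)
    return sum(min(list_a[l] / total_a, list_b[l] / total_b) for l in shared)


# Hypothetical raw counts for transliterated Russian lemmas.
corpus_a = {"i": 500, "v": 300, "dom": 150, "kot": 50}
corpus_b = {"i": 400, "v": 300, "kot": 200, "dom": 100}

print(spearman(corpus_a, corpus_b))
print(sum_of_min_frequencies(corpus_a, corpus_b))
```

Under this assumed definition, a sum of minimal frequencies close to 1 would indicate that two lists distribute probability mass over lemmas in nearly the same way, while lower values indicate divergence.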
For a more detailed comparison of the frequency lists across frequency bands, the ranked frequency list based on RNC data was divided into four equal parts. Then four random samples (containing 20 lemmas from each quartile) were formed.
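The quartile sampling step might be sketched as follows. This assumes that "equal parts" means equal numbers of lemmas in rank order and that one sample of 20 lemmas is drawn per quartile; the abstract leaves both points open. The function name, the seed, and the placeholder lemma list are hypothetical.

```python
import random


def quartile_samples(ranked_lemmas, sample_size=20, seed=0):
    """Split a rank-ordered lemma list into four equal parts and sample from each."""
    rng = random.Random(seed)  # fixed seed so the illustration is reproducible
    n = len(ranked_lemmas)
    bounds = [round(n * k / 4) for k in range(5)]
    quartiles = [ranked_lemmas[bounds[k]:bounds[k + 1]] for k in range(4)]
    # Sample without replacement; shrink the sample if a quartile is too small.
    return [rng.sample(q, min(sample_size, len(q))) for q in quartiles]


# Placeholder list standing in for a frequency list sorted by descending frequency.
ranked = [f"lemma_{i:04d}" for i in range(1, 1001)]
samples = quartile_samples(ranked)

for number, sample in enumerate(samples, start=1):
    print(number, len(sample), sample[:3])
```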
Because the ipm (instances per million) measure takes a wide range of values, relative frequency values are difficult to interpret. In addition, there are no reliable thresholds separating high-frequency, mid-frequency, and low-frequency lemmas. Meanwhile, to assess the lexical complexity of texts, it is useful to have a convenient way of distributing lemmas with given frequencies over the bands of the frequency list. Therefore, we decided to assign lemmas “Zipf-values”, which make the frequency data interpretable because the range of the measure's values is small.
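The abstract does not give the formula for the Zipf-values. A minimal sketch, assuming the commonly used Zipf scale (the base-10 logarithm of the frequency per million words, plus 3), shows how a wide ipm range collapses into a small one. The function name and the example ipm values are illustrative only.

```python
from math import log10


def zipf_value(ipm):
    """Assumed Zipf scale: log10(frequency per million) + 3."""
    if ipm <= 0:
        raise ValueError("ipm must be positive")
    return log10(ipm) + 3


# Illustrative ipm values spanning five orders of magnitude.
for ipm in (0.01, 1, 100, 1000):
    print(ipm, round(zipf_value(ipm), 2))
```

On this scale, frequencies from 0.01 ipm to 1000 ipm map to values between 1 and 6, which is what makes band boundaries easier to state and compare.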
The result of our work will be a publicly accessible reference resource called “Frequentator”, which will provide interpretable information about the frequency of Russian words.

The presented research was supported by the Russian Science Foundation, project #19-18-00525 “Understanding official Russian: the legal and linguistic issues”.

Language: English
Keywords: Russian; corpora; language corpora; lexical complexity; lemma frequency lists; general-language frequency; frequency bands; low-frequency words

In book

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, June 17–20, 2020)
Issue 19(26). Moscow: RSUH Publishing House, 2020.