

Comparative Study Of Data Clustering Algorithms And Analysis Of The Keywords Extraction Efficiency: Learner Corpus Case

NRU HSE, 2020.
Scherbakova A.
Language: English
Keywords: embedding, learner corpus, keyword extraction, metadata management, clustering analysis
Publication based on the results of:
Automated Detection of Writing Inaccuracies for Students of English in Russia (2019)
Similar publications
Bridging the Semantic Gap in Metadata Management using Large Language Models
Сулейкин А. С., Сорокина В., Пятецкий В. Е., in: 2025 7th International Conference on Control Systems, Mathematical Modeling, Automation and Energy Efficiency. [s.n.], 2025. P. 748–753.
Effective metadata management is fundamental to data governance, ensuring that data assets are discoverable, understandable, and usable across the enterprise. However, traditional metadata systems often remain purely technical, describing structures without conveying business meaning. This disconnect — known as the semantic gap — limits the interpretability and value of metadata for business users. To address ...
Added: April 17, 2026
Syntactic complexity measures as linguistic correlates of proficiency level in learner Russian
Kisselev O., Klimov A., Mihail Kopotev, in: Complexity, Accuracy and Fluency in Learner Corpus Research. Volume vi. Amsterdam: John Benjamins Publishing Company, 2022. Ch. 3 P. 51–80.
The study reports on the results of a corpus-based evaluation of automatically extracted syntactic complexity measures as indices of Russian as a foreign language (FL) and Russian as a heritage language (HL) writing development. A list of 12 syntactic complexity measures was tested on a set of longitudinal, classroom-based data. The analyses demonstrated that the ...
Added: November 25, 2024
Construction and Visualization of a Generalized Dialog Graph from a Corpus of Dialogs
D'yakonov A., Штыков П. А., Прикладная дискретная математика 2023 No. 59 P. 111–127
We propose a definition of a generalized dialog graph, which is used to describe the structure of a dialog over a corpus of homogeneous dialogs. The task of constructing such a graph is relevant in modern conversational artificial intelligence, but there are few works with specific results, often no full description of algorithms is given, ...
Added: March 18, 2024
Processing of Words with Frequent Spelling Errors (A Study Based on a Learner Corpus of English)
Klimova M., Viklova A., Overnikova D., Вестник Санкт-Петербургского университета. Язык и литература 2023 Vol. 20 No. 4 P. 824–837
The article presents an experimental study of the influence of the frequency of spelling errors in a word on its representation in the mental lexicon. The hypothesis that frequently misspelled words cause difficulties in reading even if they are written correctly has been proved for native speakers of Russian and English. This paper aims to check ...
Added: January 26, 2024
Annotating a Learner Corpus with a View to Its Use in Research Tasks
Klimova M., Viklova A., Overnikova D., in: Modern Linguistics: From Theory to Practice. III Kazan International Linguistic Summit (Kazan, November 14–19, 2022): Proceedings and Materials, in three volumes, vol. 1. Kazan: Kazan University Press, 2022. P. 46–50.
This article examines the error classification used in the REALEC learner corpus in terms of how well it meets requirements and how suited it is to research tasks. ...
Added: January 17, 2023
Opinion Mining for Modeling User Experience of Online Education: Sentiment Analysis and Keywords Extraction of Student Reviews
Moskvina A., Kirina M., Anastasia Gavrilyuk, in: 2022 32nd Conference of Open Innovations Association (FRUCT). IEEE, 2022. P. 187–195.
The paper discusses the possibilities of applying modern natural language processing technologies of opinion mining to investigate and improve the user experience of online course students. We analyzed 27,000 student reviews of projects within the Python programming language course. First, we applied keyword extraction algorithms as a way of semantic compression to receive a generalized ...
Added: December 9, 2022
Clausal complexity of expert and student writing: a corpus-based analysis of papers in social sciences
Smirnova E. A., Language Learning in Higher Education 2022 Vol. 12 No. 2 P. 453–475
Syntactic complexity has been extensively approached in the fields of corpus linguistics and academic discourse studies. However, works focusing on disciplinary variation in terms of linguistic complexity and comparison of professional and novice academic writing are scarce. Addressing these issues is likely to have important implications for EAP/ESP practitioners in terms of selection of target ...
Added: December 7, 2022
Review of Practices of Collecting and Annotating Texts in the Learner Corpus REALEC
Vinogradova O. I., Lyashevskaya O., in: Text, Speech, and Dialogue. 25th International Conference, TSD 2022, Brno, Czech Republic, September 6–9, 2022, Proceedings. Lecture Notes in Computer Science (LNAI), vol. 13502. Cham: Springer Publishing Company, 2022. P. 77–88.
REALEC, a learner corpus released in open access, had received 6,054 essays written in English by HSE undergraduate students in their English university-level examination by the year 2020. This paper reports on the data collection and manual annotation approaches for the texts of 2014–2019 and discusses the computer tools available for working with the corpus. ...
Added: October 5, 2022
Data Clustering, Keyword Extraction, and Lexical Diversity in Essay Texts of a Learner Corpus
Scherbakova A., in: Intercultural Space: Linguistic and Didactic Aspects. Proceedings of the sections "Intercultural Linguistics" and "Intercultural Translatology" and of the student research forum. Plenary session and section "Intercultural Didactics". Part 2. PetrSU Publishing House, 2021.
The paper focuses on the task of clustering essays produced by ESL (English as a Second Language) learners. The data was taken from the learner corpus REALEC. The division of texts by certain characteristics can be useful to speed up the analysis of a single corpus or access to the necessary sections of a large ...
Added: September 30, 2021
Automatic Detection and Correction of Derivational Errors in Written Russian as a Foreign Language
Vyrenkova A. S., Смирнов И. Ю., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2021 Vol. 19 No. 3 P. 57–68
Learner corpora serve as one of the most valuable sources of statistical data on learners' errors. For instance, data from foreign-language learners' corpora can be used for Second Language Acquisition research. However, the representativeness of a corpus strongly depends on the quality of its error markup, which is most frequently carried out manually and thus presents a ...
Added: September 24, 2021
Prediction of News Popularity via Keywords Extraction and Trends Tracking
Alexander Pugachev, Voronov A., Makarov I., in: Recent Trends in Analysis of Images, Social Networks and Texts. 9th International Conference, AIST 2020, Skolkovo, Moscow, Russia, October 15–16, 2020, Revised Supplementary Proceedings. Vol. 12602. Springer, 2021. Ch. 4 P. 37–51.
In recent years, news agencies have become more influential in various social groups. At the same time, the media industry has started to monetize articles distributed online with contextual advertising. However, the efficiency of online marketing highly depends on the popularity of news articles. In our work, we present an alternative and effective way for ...
Added: March 24, 2021
Chapter 8 Building Resilience into the Metadata-Based ETL Process Using Open Source Big Data Technologies
Panfilov P., Suleykin A., in: Resilience in the Digital Age. Vol. 12660: Lecture Notes in Computer Science. Springer, 2021. Ch. 8 P. 139–153.
Extract-transform-load (ETL) processes play a crucial role in data analysis in real-time data warehouse environments, which demand low latency and high availability features for functionality. In essence, ETL processes are becoming bottlenecks in such environments due to complexity growth, number of steps in data transformations, number of machines used for data processing and finally, increasing impact of ...
Added: February 5, 2021
Some Features of Sentiment Analysis for Russian Language Posts and Comments from Social Networks
Sidorov Nikita, Slastnikov Sergey, Journal of Physics: Conference Series 2021 Vol. 1740 P. 1–6
Sentiment analysis of texts in different languages is a very popular machine learning task. The complexity of its solution depends both on the characteristics of a particular language and on the length of the evaluated texts. In our work, we consider the task of creating a sentiment analysis software tool for Russian posts and ...
Added: February 2, 2021
Keyphrase extraction from the Russian corpus on linguistics by means of KEA and RAKE algorithms
Moskvina Anna, Sokolova E., Mitrofanova O., in: Data Analytics and Management in Data Intensive Domains. Proceedings of the XX International Conference – DAMDID/RCDL’2018, October 9-12, 2018, Moscow. M.: FRC CSC RAS, 2018. P. 369–372.
This paper is devoted to a comparison of two state-of-the-art keyphrase extraction algorithms, namely KEA, based on machine learning, and RAKE, working with morphosyntactic patterns. The comparative study deals with the peculiarities of KEA and RAKE with regard to particular research tasks. Experiments carried out on the Russian corpus on Linguistics make it possible to work out the best options ...
Added: September 29, 2020
Applying Data Science methods for structuring supply and demand of goods and services
Zhukova L., Чугунов В. Р., Кирюшина А. А. et al., in: Actual Problems of System and Software Engineering. Proceedings of the 6th International Conference Actual Problems of System and Software Engineering. Moscow, Russia, 12-14 November, 2019. Vol. 2514. CEUR Workshop Proceedings, 2019. P. 336–346.
The article describes an approach to solving the problem of structuring the supply and demand of goods and services. The proposed approach, based on the use of Data Science methods, will allow implementing a modern tool for monitoring the development of industry in Moscow. Such a tool helps to analyze a large number of structured, unstructured ...
Added: December 11, 2019
What’s in a comma: Corpus study of punctuation errors and L1 interference
Pospelova K., Viklova A., Vinogradova O. I., in: Learner Corpus Conference. LCR 2019. Book of Abstracts. [s.n.], 2019. P. 0–20.
TBC ...
Added: November 10, 2019