
 


The creation of large-scale annotated corpora of minority languages using UniParser and the EANC platform

Ch. 9. P. 83–91.
Arkhangelskiy T., Belyaev O., Vydrin A.

This paper is devoted to the use of two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus (www.eanc.net) and later adopted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages with relatively small numbers of native speakers for which the development of parsers from scratch is not feasible. It has been designed for use with the EANC platform and generates XML output in the EANC format.

UniParser and the EANC platform have already been used to create corpora of several languages: Albanian, Kalmyk, Lezgian, and Ossetic. The Ossetic corpus is the largest of these (5 million tokens, with 10 million planned for 2013). Both tools are currently being employed in the construction of corpora of Buryat and Modern Greek. This paper describes the general architecture of the EANC platform and UniParser, using the Ossetic corpus as an example of the advantages and disadvantages of the approach.

Language: English
Keywords: corpus linguistics; the Ossetic language; automated morphological analysis; language documentation; Iranian languages

In book

Proceedings of COLING 2012: Posters
Mumbai: The COLING 2012 Organizing Committee, 2012.
Similar publications
Russian sociology amid the digitalization of society: results of an analysis of a corpus of scientific texts
Smirnov A., Социологические исследования 2023 No. 4 P. 39–50
Using the analysis of a corpus of texts from eight leading Russian sociological journals, the article examines the impact of the digitalization of society on sociology in 2000–2021. Frequency analysis of 13.8 thousand scientific texts tracked the introduction of concepts related to digitalization into academic circulation. The article reveals the differences between the journals, due ...
Added: March 18, 2026
Promotional adjectives in grant proposal abstracts: a corpus study
Dmitriy S. Tulyakov, Tatiana M. Permyakova, Ekaterina A. Balezina, Вестник Волгоградского государственного университета. Серия 2: Языкознание 2025 Vol. 24 No. 6 P. 58–67
By effectively integrating promotional discourse into grant proposal abstracts, researchers can more compellingly present their ideas and increase their chances of securing funding. Implications of promotional adjectives in grant writing might differ across various research fields. This study aims to explore the use of promotional adjectives in abstracts of research grant proposals in six research ...
Added: March 2, 2026
Dynamics of the perception of squares in urban space by Russian speakers (a comparative analysis based on Russian National Corpus data)
Belova P., in: Актуальные вопросы лингвистики и литературоведения: сборник научных статей по материалам международной научной конференции памяти доктора филологических наук, профессора Л.А. Араевой (6–8 февраля 2025). Кемеровский государственный университет, 2025. P. 155–160.
This article presents research results on how the perception of squares in urban space has changed in the Russian linguistic worldview, from the second half of the 20th century to the present. Turning to the subcorpus of literary texts of the second half of the 20th century and the 21st ...
Added: February 4, 2026
Preposition drop in Russian spoken by Mari and Beserman bilinguals
Yakovleva A., Kosheliuk N., Moroz G., International Journal of Bilingualism 2025 P. 1–19
Aims and Research Questions: In this paper, we present a corpus-based study of preposition drop (p-drop) in the speech of Mari-Russian and Beserman-Russian bilinguals compared to the speech of Russian monolinguals. Based on data from spoken corpora, we demonstrate that the prepositions v ‘in’, k ‘to’, s ‘with’ are omitted in the speech of bilinguals ...
Added: November 26, 2025
Variation between godov and let in Russian dialects: a corpus study
Zemicheva S., Moroz G., Naccarato C., Вопросы языкознания 2025 No. 6 P. 7–34
The suppletive form let in the paradigm of the noun god ‘year’ distinguishes Russian from the other East Slavic languages. In Russian dialects, however, the variant godov may be used instead of let. Data from the panchronic subcorpus of the Russian National Corpus show that the form godov, first attested in the 15th century, remained peripheral throughout the history of Russian, was used in the 17th–18th centuries predominantly in non-fiction texts, and in ...
Added: November 12, 2025
Automatic Annotation of Discourse and Speech Formulas in Internet Communication: A Telegram Comment Corpus
Maslenikova A., Tatiana I. Popova, in: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I. Speech and Computer. Lecture Notes in Artificial Intelligence, Vol. 16187. Springer, 2025. P. 278–292.
This article presents a system for the automatic processing of user comments aimed at annotating speech and discourse formulas that actively function in everyday interaction, including digital communication. A Python-based program using the Telegram API was developed to automate the collection, filtering, and annotation of empirical data. In addition to building a user corpus, the ...
Added: October 19, 2025
27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part II. Speech and Computer. Lecture Notes in Artificial Intelligence 16188
Springer, 2025.
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or ...
Added: October 19, 2025
Variation in a Narrative Corpus of Mano and Kpelle: Contact-Induced or Not?
Khachaturyan M., Konoshenko M., Moroz G. et al., in: N’yng-dyuumgu, n’yng-ngafq: Festschrift for Ekaterina Gruzdeva. Vol. 126. Helsinki: Studia Orientalia, 2025. P. 35–59.
This paper explores a corpus of spontaneous narratives and narrative retellings told by children and adults in Mano and Kpelle, two contacting Mande languages. It focuses on quotative constructions as a key point of grammatical dissimilarity between Mano and Kpelle. In the Mano speech of some bilingual children, however, these constructions are found to manifest ...
Added: September 5, 2025
The correspondence between N. S. Khrushchev and F. Castro during the Caribbean Crisis: an attempt at computerized analysis
Герцен А. С., in: Четвёртая зимняя школа по гуманитарной информатике. Балтийский федеральный университет им. Иммануила Канта, 2020. P. 92–97.
The article analyzes the letters of N. S. Khrushchev, First Secretary of the Central Committee of the CPSU and Chairman of the Council of Ministers of the USSR, and F. Castro Ruz, leader of the Cuban revolution, written in the period from October 26 to 31, 1962 on the topic of the Caribbean crisis and ...
Added: July 15, 2025
An overview of morphosyntactic variation in the speech of Russian-Chuvash bilinguals: number, gender, case assignment and preposition drop
Grishanova A., Russian linguistics 2025 Vol. 49 Article 10
The purpose of this study is to present a summary of morphosyntactic variation and a detailed analysis of the phenomenon of preposition drop in the Russian speech of Chuvash bilinguals. Specifically, I investigate what underlying factors might condition the variation. I conduct a qualitative analysis of the data extracted from the corpus of Russian spoken ...
Added: July 10, 2025
Do Formal Stance Strategies Reveal Disciplinary Variation in Professional Scientific Writing?
Smirnova E. A., Pérez-Guerra J., International Journal of Applied Linguistics 2025 Vol. 35 No. 3 P. 1242–1261
Stance in academic discourse has been extensively studied, with numerous investigations indicating that its expression varies across disciplines, depending on the authors’ intention to either enhance or diminish their voice or presence (e.g. It seems fairly certain versus This is based on the belief that...). This paper hypothesises that stance can be viewed as a ...
Added: April 10, 2025
The Russian language in contact: Turkic-Russian language interaction. Part 1. A sociolinguistic and corpus study
Резанова З. И., Artemenko E., Диброва В. С. et al., Томск: Издательство Томского государственного университета, 2024.
The monograph presents the linguistic, sociolinguistic, and psycholinguistic aspects of the interaction between Russian and three Turkic languages: Shor, Khakas, and Tatar (its Siberian variety). It characterizes the ways in which the Turkic languages influence the speech practice of Russian-speaking bilinguals and their cognitive processes of speech production and perception. It presents methods for collecting data and processing them in building a sociolinguistic database and a morphologically annotated bimodal corpus of the spoken Russian of bilinguals, ...
Added: April 7, 2025
The ‘adverb-ly adjective’ construction in English: meanings, distribution and discourse functions
Taboada M., Goddard C., Trnavac R., English Language and Linguistics 2025 Vol. 29 No. 1 P. 102–131
We investigate a class of adjective phrases composed of a deadjectival adverb ending in -ly and an adjective head (e.g. staggeringly incompetent, absolutely terrific, fiscally responsible), a compact construction whereby two adjectives may jointly contribute to evaluative meaning. Using corpus methodologies on more than 1 million examples and relying on semantic analyses of about 1,000 instances, we propose that the ...
Added: April 4, 2025
A morphological guesser as a tool for analysing field data: experience of working with Naukan Yupik
Будянская Е. М., Buzanov A., Жорник Д. О. et al., Томский журнал лингвистических и антропологических исследований 2025 No. 2(48) P. 9–19
The paper presents the development and evaluation of two automated morphological analysis tools for Naukan Yupik (< Yupik < Eskimo < Eskimo-Aleut): a dictionary-based morphological analyzer and a dictionary-free morphological guesser. Both tools are implemented using a two-level approach to morphology modeling based on finite-state automata. The study examines in detail the morphological features of ...
Added: March 11, 2025
Creation and Analysis of the Multimedia Russian Corpus for Gesture Research
Rakhilina E. V., Cienki A., in: The Cambridge Handbook of Gesture Studies. Cambridge University Press, 2024. P. 249–272.
The chapter considers gesture studies in relation to corpus linguistic work. The focus is on the Multimedia Russian Corpus (MURCO), part of the Russian National Corpus. The chapter includes a brief biography of the creator of this corpus, Elena Grishina. The compilation of the corpus out of a set of Russian classic feature films and ...
Added: February 13, 2025
Non-standard numeral constructions in L2 Russian: A corpus-based study
Naccarato C., Moroz G., International Journal of Bilingualism 2026 Vol. 30 No. 2 P. 358–379
Aims and Research Questions: The paper investigates variation in numeral constructions in the L2 Russian speech of bilinguals from different regions of Russia. The main research questions are the following: What factors prompt variation in this domain of grammar? Can we argue that non-standard marking is motivated by contact? Methodology: We conduct a corpus-based study ...
Added: January 24, 2025
Using computational linguistics methods to analyse literary texts
Аванесян Н. Л., Fokina A., Chepovskiy A., in: Инжиниринг предприятий и управление знаниями (ИП&УЗ-2024): сборник научных трудов XXVII Российской научной конференции. 28–29 ноября 2024 г. / под науч. ред. Ю. Ф. Тельнова. М.: ФГБОУ ВО «РЭУ им. Г. В. Плеханова», 2024. P. 15–18.
The article is devoted to applying mathematical methods of corpus analysis to the study of literary texts. Using specially created corpora, it demonstrates how correspondence analysis and the analysis of pairwise rank correlation coefficients can be used to compare the frequency characteristics of texts from different subcorpora. The described techniques give correlated results. They can be used both for linguistic research and for creating sound training text sets for artificial intelligence tasks. ...
Added: December 19, 2024
Corpus linguistics at the present stage
Plungian V., Вестник Российской академии наук 2024 Vol. 94 No. 9 P. 787–794
The article gives a general overview of corpus linguistics, its history, its methods, and its influence on current views of language study, commonly referred to as the “corpus revolution”. ...
Added: December 16, 2024
The populist text as an object of corpus research
Галочкин А. Е., in: ЧЕЛОВЕК В СИСТЕМЕ КОММУНИКАЦИЙ: ПРОФЕССИОНАЛЬНЫЕ КОММУНИКАЦИИ В ЦИФРОВУЮ ЭПОХУ. Нижегородский государственный лингвистический университет им. Н.А. Добролюбова, 2023. P. 87–90.
This article discusses the phenomenon of populism in the context of corpus linguistics methods, which is of particular importance in the modern world. The relevance of this study is related to the growth of right-wing populism in European countries and the importance of understanding the mechanisms of populist discourse. The article analyzes studies aimed at ...
Added: November 16, 2024
Kon’yachku by, da do domu (‘Some cognac, and then home’): a chronology of the development of some forms of the second genitive case
Budennaya E., Труды института русского языка им. В.В. Виноградова 2024 No. 2(40) P. 261–282
Based on material from the Russian National Corpus, the article discusses the diachronic development of constructions with the Russian second genitive case in three types of contexts: 1) with nominal quantifiers; 2) with the preposition bez ‘without’; 3) with the preposition do ‘towards’. The Russian data are compared with data from other languages (Finnic and several Turkic), in which there is a tendency to use the partitive ...
Added: October 4, 2024