The creation of large-scale annotated corpora of minority languages using UniParser and the EANC platform

Arkhangelskiy T., Belyaev O., Vydrin A.

Ch. 9. P. 83–91.

This paper is devoted to the use of two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus (www.eanc.net) and later adopted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages with relatively small numbers of native speakers for which the development of parsers from scratch is not feasible. It has been designed for use with the EANC platform and generates XML output in the EANC format.

UniParser and the EANC platform have already been used to create corpora of several languages (Albanian, Kalmyk, Lezgian, and Ossetic), of which the Ossetic corpus is the largest (5 million tokens, with 10 million planned for 2013), and they are currently being used to build corpora of Buryat and Modern Greek. This paper describes the general architecture of the EANC platform and UniParser, using the Ossetic corpus to illustrate the advantages and disadvantages of the approach.

Language: English

Keywords: corpus linguistics; the Ossetic language; automated morphological analysis; language documentation; Iranian languages

In book

Proceedings of COLING 2012: Posters

Mumbai: The COLING 2012 Organizing Committee, 2012.

Российская социология в условиях цифровизации общества: результаты анализа корпуса научных текстов

Smirnov A., Социологические исследования 2023 № 4 С. 39–50

Using the analysis of a corpus of texts from eight leading Russian sociological journals, the article examines the impact of the digitalization of society on sociology in 2000–2021. Frequency analysis of 13.8 thousand scientific texts tracked the introduction of concepts related to digitalization into academic circulation. The article reveals the differences between the journals, due ...

Added: March 18, 2026

Promotional adjectives in grant proposal abstracts: a corpus study

Dmitriy S. Tulyakov, Tatiana M. Permyakova, Ekaterina A. Balezina, Вестник Волгоградского государственного университета. Серия 2: Языкознание 2025 Vol. 24 No. 6 P. 58–67

By effectively integrating promotional discourse into grant proposal abstracts, researchers can more compellingly present their ideas and increase their chances of securing funding. Implications of promotional adjectives in grant writing might differ across various research fields. This study aims to explore the use of promotional adjectives in abstracts of research grant proposals in six research ...

Added: March 2, 2026

Динамика восприятия площадей в пространстве города носителями русского языка (сравнительный анализ по данным НКРЯ)

Belova P., В кн.: Актуальные вопросы лингвистики и литературоведения: сборник научных статей по материалам международной научной конференции памяти доктора филологических наук, профессора Л.А. Араевой (6–8 февраля 2025). Кемеровский государственный университет, 2025. С. 155–160.

This article presents research on how the perception of city squares in the Russian linguistic worldview has changed over time, from the second half of the 20th century to the present. Turning to the subcorpus of literary texts of the second half of the 20th century and the 21st ...

Added: February 4, 2026

Preposition drop in Russian spoken by Mari and Beserman bilinguals

Yakovleva A., Kosheliuk N., Moroz G., International Journal of Bilingualism 2025 P. 1–19

Aims and Research Questions: In this paper, we present a corpus-based study of preposition drop (p-drop) in the speech of Mari-Russian and Beserman-Russian bilinguals compared to the speech of Russian monolinguals. Based on data from spoken corpora, we demonstrate that the prepositions v ‘in’, k ‘to’, s ‘with’ are omitted in the speech of bilinguals ...

Added: November 26, 2025

Вариативность годов vs. лет в русских говорах: корпусное исследование

Zemicheva S., Moroz G., Naccarato C., Вопросы языкознания 2025 № 6 С. 7–34

The suppletive form let in the paradigm of the noun god ‘year’ distinguishes Russian from the other East Slavic languages. In Russian dialects, however, the variant godov may be used instead of let. Data from the panchronic subcorpus of the Russian National Corpus (RNC) show that the form godov, first attested in the 15th century, remained peripheral throughout the history of Russian, was used in the 17th–18th centuries predominantly in non-fiction texts, and in ...

Added: November 12, 2025

Automatic Annotation of Discourse and Speech Formulas in Internet Communication: A Telegram Comment Corpus

Maslenikova A., Tatiana I. Popova, in: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I. Speech and Computer. Lecture Notes in Artificial Intelligence, Vol. 16187. Springer, 2025. P. 278–292.

This article presents a system for the automatic processing of user comments aimed at annotating speech and discourse formulas that actively function in everyday interaction, including digital communication. A Python-based program using the Telegram API was developed to automate the collection, filtering, and annotation of empirical data. In addition to building a user corpus, the ...

Added: October 19, 2025

27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part II. Speech and Computer. Lecture Notes in Artificial Intelligence 16188

Springer, 2025.

Added: October 19, 2025

Variation in a Narrative Corpus of Mano and Kpelle: Contact-Induced or Not?

Khachaturyan M., Konoshenko M., Moroz G. et al., in: N’yng-dyuumgu, n’yng-ngafq: Festschrift for Ekaterina Gruzdeva. Vol. 126. Helsinki: Studia Orientalia, 2025. P. 35–59.

This paper explores a corpus of spontaneous narratives and narrative retellings told by children and adults in Mano and Kpelle, two contacting Mande languages. It focuses on quotative constructions as a key point of grammatical dissimilarity between Mano and Kpelle. In the Mano speech of some bilingual children, however, these constructions are found to manifest ...

Added: September 5, 2025

Переписка Н. С. Хрущева и Ф. Кастро периода Карибского кризиса: опыт компьютеризованного анализа

Герцен А. С., В кн.: Четвёртая зимняя школа по гуманитарной информатике. Балтийский федеральный университет им. Иммануила Канта, 2020. С. 92–97.

The article analyzes the letters exchanged between N. S. Khrushchev, First Secretary of the Central Committee of the CPSU and Chairman of the Council of Ministers of the USSR, and F. Castro Ruz, leader of the Cuban revolution, written between October 26 and 31, 1962, on the topic of the Caribbean crisis and ...

Added: July 15, 2025

An overview of morphosyntactic variation in the speech of Russian-Chuvash bilinguals: number, gender, case assignment and preposition drop

Grishanova A., Russian linguistics 2025 Vol. 49 Article 10

The purpose of this study is to present a summary of morphosyntactic variation and a detailed analysis of the phenomenon of preposition drop in the Russian speech of Chuvash bilinguals. Specifically, I investigate what underlying factors might condition the variation. I conduct a qualitative analysis of the data extracted from the corpus of Russian spoken ...

Added: July 10, 2025

Do Formal Stance Strategies Reveal Disciplinary Variation in Professional Scientific Writing?

Smirnova E. A., Pérez-Guerra J., International Journal of Applied Linguistics 2025 Vol. 35 No. 3 P. 1242–1261

Stance in academic discourse has been extensively studied, with numerous investigations indicating that its expression varies across disciplines, depending on the authors’ intention to either enhance or diminish their voice or presence (e.g. It seems fairly certain versus This is based on the belief that...). This paper hypothesises that stance can be viewed as a ...

Added: April 10, 2025

Русский язык в условиях контактирования: тюркско-русское языковое взаимодействие. Часть 1. Социолингвистическое и корпусное исследование

Резанова З. И., Artemenko E., Диброва В. С. et al., Томск: Издательство Томского государственного университета, 2024.

The monograph presents the linguistic, sociolinguistic, and psycholinguistic aspects of the interaction between Russian and three Turkic languages: Shor, Khakas, and Tatar (the Siberian variety). It characterizes the ways in which the Turkic languages influence the speech practice of Russian-speaking bilinguals and the cognitive processes of their speech production and perception. It presents the methods used to collect data and to process them in building a sociolinguistic database and a morphologically annotated bimodal corpus of the spoken Russian of bilinguals, ...

Added: April 7, 2025

The ‘adverb-ly adjective’ construction in English: meanings, distribution and discourse functions

Taboada M., Goddard C., Trnavac R., English Language and Linguistics 2025 Vol. 29 No. 1 P. 102–131

We investigate a class of adjective phrases composed of a deadjectival adverb ending in -ly and an adjective head (e.g. staggeringly incompetent, absolutely terrific, fiscally responsible), a compact construction whereby two adjectives may jointly contribute to evaluative meaning. Using corpus methodologies on more than 1 million examples and relying on semantic analyses of about 1,000 instances, we propose that the ...

Added: April 4, 2025

Морфологический гессер как инструмент анализа полевых данных: опыт работы с науканским языком

Будянская Е. М., Buzanov A., Жорник Д. О. et al., Томский журнал лингвистических и антропологических исследований 2025 № 2(48) С. 9–19

The paper presents the development and evaluation of two automated morphological analysis tools for Naukan Yupik (< Yupik < Eskimo < Eskimo-Aleut): a dictionary-based morphological analyzer and a dictionary-free morphological guesser. Both tools are implemented using a two-level approach to morphology modeling based on finite-state automata. The study examines in detail the morphological features of ...

Added: March 11, 2025

Creation and Analysis of the Multimedia Russian Corpus for Gesture Research

Rakhilina E. V., Cienki A., in: The Cambridge Handbook of Gesture Studies. Cambridge University Press, 2024. P. 249–272.

The chapter considers gesture studies in relation to corpus linguistic work. The focus is on the Multimedia Russian Corpus (MURCO), part of the Russian National Corpus. The chapter includes a brief biography of the creator of this corpus, Elena Grishina. The compilation of the corpus out of a set of Russian classic feature films and ...

Added: February 13, 2025

Non-standard numeral constructions in L2 Russian: A corpus-based study

Naccarato C., Moroz G., International Journal of Bilingualism 2026 Vol. 30 No. 2 P. 358–379

Aims and Research Questions: The paper investigates variation in numeral constructions in the L2 Russian speech of bilinguals from different regions of Russia. The main research questions are the following: What factors prompt variation in this domain of grammar? Can we argue that non-standard marking is motivated by contact? Methodology: We conduct a corpus-based study ...

Added: January 24, 2025

ИСПОЛЬЗОВАНИЕ МЕТОДОВ КОМПЬЮТЕРНОЙ ЛИНГВИСТИКИ ДЛЯ АНАЛИЗА ЛИТЕРАТУРНЫХ ТЕКСТОВ

Аванесян Н. Л., Fokina A., Chepovskiy A., В кн.: Инжиниринг предприятий и управление знаниями (ИП&УЗ-2024): сборник научных трудов XXVII Российской научной конференции. 28–29 ноября 2024 г. / под науч. ред. Ю. Ф. Тельнова. М.: ФГБОУ ВО «РЭУ им. Г. В. Плеханова», 2024. С. 15–18.

The article is devoted to the application of mathematical methods of corpus analysis to the study of literary texts. Using the corpora the authors created as an example, it demonstrates how correspondence analysis and the analysis of pairwise rank correlation coefficients can be used to compare the frequency characteristics of texts from different subcorpora. The described methods yield correlated results. They can be used both for linguistic research and for building sound training text sets for artificial intelligence tasks. ...

Added: December 19, 2024

Корпусная лингвистика на современном этапе

Plungian V., Вестник Российской академии наук 2024 Т. 94 № 9 С. 787–794

The article gives a general overview of corpus linguistics, its history, its methods, and its influence on current views of the study of language, an influence usually referred to as the “corpus revolution”. ...

Added: December 16, 2024

Популистский текст как объект корпусного исследования

Галочкин А. Е., В кн.: ЧЕЛОВЕК В СИСТЕМЕ КОММУНИКАЦИЙ: ПРОФЕССИОНАЛЬНЫЕ КОММУНИКАЦИИ В ЦИФРОВУЮ ЭПОХУ. Нижегородский государственный лингвистический университет им. Н.А. Добролюбова, 2023. С. 87–90.

This article examines the phenomenon of populism through the methods of corpus linguistics, a topic of particular importance in the modern world. The relevance of this study is related to the growth of right-wing populism in European countries and the importance of understanding the mechanisms of populist discourse. The article analyzes studies aimed at ...

Added: November 16, 2024

Коньячку бы, да до дому: хронология развития некоторых форм второго родительного падежа

Budennaya E., Труды института русского языка им. В.В. Виноградова 2024 № 2(40) С. 261–282

Based on material from the Russian National Corpus, the article discusses the diachronic development of constructions with the Russian second genitive case in three types of contexts: 1) with nominal quantifiers; 2) with the preposition bez ‘without’; 3) with the preposition do ‘towards’. The data obtained from Russian are compared with data from other languages (Finnic and several Turkic), in which there is a tendency to use the partitive ...

Added: October 4, 2024