A Reusable Tagset for the Morphologically Rich Language in Change: a Case of Middle Russian

O. Lyashevskaya

P. 422–434.

The paper discusses the standardization efforts to create a morphological standard for the Middle Russian corpus, which is part of the historical collection of the Russian National Corpus (RNC). To meet the needs of different categories of corpus researchers as well as NLP developers, we consider two styles of morphological annotation (the RNC schema and the Universal Dependencies schema). A number of specifications of the feature list are proposed to facilitate data reusability, linking, and conversion.

In book

Computational Linguistics and Intellectual Technologies

Issue 18. M.: Russian State University for the Humanities, 2019.

Взiaлъ, възялъ, вьзял: Handling orthographic variation in the lexico-grammatical annotation of the Middle Russian corpus of the 15th–17th centuries

Гаврилова Т. С., Шалганова Т. А., Lyashevskaya O., Вестник Православного Свято-Тихоновского гуманитарного университета. Серия 3: Филология 2017 Vol. 51 P. 11–20

The highly unstable orthography of the Middle Russian texts poses a challenge for their automatic processing. The Middle Russian subcorpus of the Russian National Corpus (RNC) includes documents written mainly between 1400 and 1700, when variation in spelling was still the norm. The task of lexico-grammatical analysis is to assign a dictionary form (lemma), ...

Added: December 14, 2016

On the task of automatic lexico-grammatical annotation of the Middle Russian corpus of the 15th–17th centuries

Гаврилова Т. С., Шалганова Т. А., Lyashevskaya O., Вестник Православного Свято-Тихоновского гуманитарного университета. Серия 3: Филология 2016 Vol. 47 No. 2 P. 7–25

The paper discusses two approaches to the automatic lexico-grammatical tagging of the Middle Russian texts (1400–1700), included in the Russian National Corpus (RNC). The task is to assign each token a part of speech label, a tuple of grammatical features, and a lemma (without disambiguation). Middle Russian combines, on the one hand, features of ...

Added: December 14, 2016

Disambiguation in context in the Russian National Corpus: 20 years later

Lyashevskaya O., Afanasev I., Stefan Rebrikov et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue". Issue 22. [s.n.], 2023. P. 307–318.

An updated annotation of the Main, Media, and some other corpora of the Russian National Corpus (RNC) features the part-of-speech and other morphological information, lemmas, dependency structures, and constituency types. Transformer-based architectures are used to resolve the homonymy in context according to a schema based on the manually disambiguated subcorpus of the Main corpus (morphology ...

Added: September 15, 2023

An HMM-based PoS tagger for Old Church Slavonic

Lyashevskaya O., Afanasev I., Jazykovedny Casopis 2021 Vol. 72 No. 2 P. 556–567

We present a hybrid HMM-based PoS tagger for Old Church Slavonic. The training corpus is a portion of one text, Codex Marianus (40k) annotated with the Universal Dependencies UPOS tags in the UD-PROIEL treebank. We perform a number of experiments in within-domain and out-of-domain settings, in which the remaining part of Codex Marianus serves as ...

Added: October 21, 2021

MorphoRuEval-2017: an Evaluation Track for the Automatic Morphological Analysis Methods for Russian

Sorokin A., Shavrina T., Lyashevskaya O. et al., in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2017" Proceedings. Vol. 1. Issue 16 (23). M.: -, 2017. P. 297–313.

MorphoRuEval-2017 is an evaluation campaign designed to stimulate the development of the automatic morphological processing technologies for Russian, both for normative texts (news, fiction, nonfiction) and those of less formal nature (blogs and other social media). This article compares the methods participants used to solve the task of morphological analysis. It also discusses the problem ...

Added: October 9, 2018

Automating the adaptation of texts for electronic textbooks: problems and prospects (the case of Russian)

Sibirtseva V., Karpov N., Nová rusistika / Новая русистика 2014 No. 1 P. 19–35

The paper is intended to describe the experience of using the authentic linguistic corpus materials within the project "Creating an electronic textbook of Russian as a foreign language". Special attention is paid to the fundamental principles of the new project – automatic adaptation of RNC’s linguistic material. Worked out by means of information technologies, the ...

Added: December 3, 2013

Birchbark letters from the 2023 excavations in Veliky Novgorod and Staraya Russa

Gippius A., Вопросы языкознания 2024 No. 4 P. 7–26

The article contains a preliminary publication of nineteen birchbark letters found during the archaeological season of 2023 in Veliky Novgorod (Nos. 1158–1172) and Staraya Russa (Nos. 55–58). The published documents date from the 12th to the early 16th century. From the historical point of view, three 14th-century documents are of the greatest value: No. 1164 is ...

Added: September 7, 2024

Stem initial alternation in Russian third person pronouns: variation in grammar

Daniel M., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (2015). M.: Russian State University for the Humanities Publishing House, 2015. P. 95–103.

The paper discusses the present stage of the evolution of the initial [n]/[j] stem alternation in Russian third person pronouns. After providing a short overview of the origins of the forms, I focus on their category status, discuss Zalizniak’s ‘adpositionality’ in some detail, and then proceed to considering the cases where the ‘n’-forms are induced ...

Added: October 9, 2015

Text collections for evaluation of Russian morphological taggers

Lyashevskaya O., Bocharov V., Sorokin A. et al., Jazykovedny Casopis 2017 Vol. 68 No. 2 P. 258–267

The paper describes the preparation and development of the text collections within the framework of the MorphoRuEval-2017 shared task, an evaluation campaign designed to stimulate development of the automatic morphological processing technologies for Russian. The main challenge for the organizers was to standardize all available Russian corpora with the manually verified high-quality tagging to a single ...

Added: January 30, 2018

Birchbark letters from the 2022 excavations in Veliky Novgorod and Staraya Russa

Gippius A., Вопросы языкознания 2023 No. 5 P. 7–28

The article contains a preliminary publication of twelve birchbark letters dating from the twelfth to the first half of the fifteenth century, found in the archaeological season of 2022 in Veliky Novgorod (Nos. 1146–1157), and letters Nos. 52 and 53 from Staraya Russa. Letters Nos. 1142 and 1143 from the excavations of 2021, which were not included ...

Added: February 13, 2024

On the chronology of the loss of predicativity by the past active participle in the history of Russian

Maria Ermolova, Russian Linguistics 2023 Vol. 47 No. 3 P. 323–342

The paper discusses the use of the short past active participles (PAP) in the Russian language of the 17th c. The data was collected from private letters from the 17th c. and the first Russian newspaper, Vesti-Kuranty. The function of PAP in the 17th c. is compared with their use in both the earlier and ...

Added: November 21, 2023

A hybrid lemmatiser for Old Church Slavonic

Afanasev I. / NRU HSE. Series WP BRP "Linguistics". 2021.

The article considers a lemmatiser that is developed specifically for Old Church Slavonic (OCS). The introduction underlines the problem of the lack of lemmatisers that might deal with different datasets of the OCS. The review gives a short description of previous attempts and current trends in lemmatisation. The lemmatiser is hybrid-based and uses the advantages ...

Added: December 28, 2021

Data visualization for a catalogue of Russian lexical constructions (based on the RNC)

Митрофанова О. А., Паничева П. В., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Bekasovo, May 29 - June 2, 2013). In 2 vols. Vol. 1: Main conference program. Issue 12 (19). M.: Russian State University for the Humanities, 2013. P. 465–477.

Our research aims at automatic identification of constructions associated with particular lexical items and its subsequent use in building the catalogue of Russian lexical constructions. The study is based on the data extracted from the Russian National Corpus (RNC, http://ruscorpora.ru). The main emphasis is on extensive use of morphological and lexico-semantic data drawn from ...

Added: September 23, 2013

Standard-shifting in the adjectival domain: Corpus evidence and discussion

Zevakhina N., in: http://spe6conference.wordpress.com/materials/. [s.n.], 2013.

Link to the poster: http://spe6conference.files.wordpress.com/2013/07/zevakhina.pdf The poster presents a corpus study of Russian adjectives carried out using the Russian National Corpus. The study supports the hypothesis that adjectives have no preset semantic standard; the standard is determined by context. ...

Added: October 14, 2013

"Migrant" and "migration" according to dictionaries and linguistic corpora of Russian, Czech, and German

Sibirtseva V., Крылова Л. К., in: Multiculturalism or Interculturalism? The Experience of Austria, Russia, Europe. Vol. 9. Nizhny Novgorod: Деком, 2013. P. 78–86.

The topic of the article reflects the relationship to the concepts of "migration" and "worker" in Russia, the Czech Republic and in German-speaking countries over the past 30 years. Frequency of use of these words is confirmed by the fact that migration is a very difficult and complex problem to solve. Language is sensitive to ...

Added: October 4, 2013

Corpus tools in grammatical studies of Russian

Lyashevskaya O., M.: Языки славянской культуры, 2016.

Corpus linguistics can be broadly defined in terms of two partially overlapping research dimensions. On the one hand, corpus linguistics is knowledge of how to compile and annotate linguistic corpora. On the other hand, corpus linguistics is a family of qualitative and quantitative methods of language study based on corpus data. The book presents ...

Added: March 26, 2015

On the participial functioning of the l-form in the history of Russian in the light of Russian dialectal data and other Slavic material

Ermolova M., Slovĕne 2022 Vol. 11 No. 1 P. 245–280

The article analyzes the hypothesis about the participial functioning of the l-form in the history of the Russian language in the light of Russian dialectal data and the material of the other Slavic languages. Many facts that confirm this hypothesis are found both in Russian dialects and Slavic languages. The first part of the paper ...

Added: January 27, 2023

Shestniki: on the meaning and origin of a social term

Gippius A., Шаги/Steps 2021 Vol. 7 No. 3 P. 67–81

The social term shestnik, known from Novgorod-Pskov sources of the 13th–16th centuries, despite repeated attempts to interpret it, has not yet received a convincing explanation either in terms of its content or in terms of etymology. The article shows that the widespread understanding of this term as a designation of various kinds of newcomers, connected ...

Added: October 27, 2021

Building a Dictionary-Based Lemmatizer for Old Irish

Dereza O., in: Actes de la conférence conjointe JEP-TALN-RECITAL. Vol. 6: Celtic Language Technology Workshop. P.: [s.n.], 2016. P. 12–17.

This paper explores the problem of developing NLP tools for morphologically rich and orthographically inconsistent classical languages. It is a case study of building a lemmatizer for Old Irish using only a dictionary and an unlabeled corpus as sources of data. At the current stage, the lemmatizer shows a 76.31% average recall score on a corpus ...

Added: October 5, 2017

Names of the Hebrew months in medieval Slavonic-Russian literature: translations from Greek and direct borrowings from Semitic sources

Grishchenko A., Die Welt der Slaven. Internationale Halbjahresschrift für Slavistik 2018 Vol. LXIII No. 2 P. 189–214

This article collects and analyzes all forms of the names of the Hebrew months in the medieval Slavonic-Russian literature. The first list of these names appeared in the multilingual set of names by Pseudo-John of Damascus, translated from Greek into Old Bulgarian and preserved in the Izbornik of 1073. Then other lists of Hebrew months, translated ...

Added: October 21, 2020

In search of a trigger: bookish and non-bookish texts as markers of different aspects of Russian referential evolution

Budennaya E., Slovĕne 2020 Vol. 9 No. 2 P. 210–243

The article deals with the diachronic path of Russian pronoun expansion, which affected the period of the 11th–17th centuries: paki li ∅pro soromit ∅pro sebe svobodna > jesli on osramit — ona svobodna ‘if he rapes [the slave], she is freed’ (the treaty of 1191–1192 between Novgorod, Gotland, and the German Cities, and its modern ...

Added: March 1, 2021

Машьякъ the Antichrist, the Jewish messiah, and his company: early Hebraisms of East Slavic literature in an eschatological context

Grishchenko A., in: "The Last Times" in the Slavic and Jewish Cultural Tradition. M.: Научно-гуманитарный центр «Сэфер», 2023. P. 85–123.

The paper reviews the manuscript tradition of three Hebraisms from the Early East Slavic literature, as follows: Mašliakh occurred in the Palaea Interpretata (that was connected to earlier Mašika / Mašiaak from the Addresses to a Jew on the Incarnation of the Son of God of the Miscellany from the 13th century, i.e., resp. Hebrew Māšîaḥ ...

Added: January 11, 2024

A cross-genre morphological tagging and lemmatization of the Russian poetry: distinctive test sets and evaluation

Starchenko A., Lyashevskaya O., in: Digital Transformation and Global Society. Fourth International Conference, DTGS 2019, St. Petersburg, Russia, June 19–21, 2019, Revised Selected Papers. Springer, 2019. P. 732–743.

The poetic texts pose a challenge to full morphological tagging and lemmatization since the authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic ...

Added: June 12, 2019

Birchbark letters from the 2018 excavations in Veliky Novgorod and Staraya Russa

Gippius A., Вопросы языкознания 2019 No. 4 P. 47–71

The article is a preliminary publication of the birchbark letters found in Veliky Novgorod and Staraya Russa during the archaeological season of 2018. ...

Added: October 11, 2019