Automatic Disambiguation in the Corpora of Modern Greek and Yiddish

E. Kuzmenko; E. Mustakimova

P. 388-398.

Morphological ambiguity is a widely addressed problem in modern NLP. Ambiguity is mostly resolved using large manually annotated corpora and machine learning. However, such methods are not always applicable, as good training data is not available for all languages. In this paper we present a method of disambiguation without gold-standard corpora that uses several statistical models, namely the Brill algorithm (Brill 1995) and unambiguous n-grams from the automatically annotated corpus. All the methods were tested on the Corpus of Modern Greek and on the Corpus of Modern Yiddish. As a result, more than half of the words with ambiguous analyses were disambiguated in both corpora, with high precision (>80%). Our method demonstrates that some of the ambiguous analyses in a corpus can be eliminated without specific linguistic resources, using only raw data in which all possible morphological analyses are indicated for every word.
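
The unambiguous n-gram idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes bigrams, left context only, and a simple "strictly most frequent" decision rule, and the function names, tag labels, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def collect_unambiguous_bigrams(corpus):
    """Count tag bigrams where BOTH neighbouring tokens have exactly one analysis.

    `corpus` is a list of sentences; each sentence is a list of
    (word, set_of_candidate_tags) pairs (an assumed, illustrative format).
    """
    counts = Counter()
    for sentence in corpus:
        for (_, left), (_, right) in zip(sentence, sentence[1:]):
            if len(left) == 1 and len(right) == 1:
                counts[(next(iter(left)), next(iter(right)))] += 1
    return counts


def disambiguate(sentence, counts, min_count=1):
    """Resolve ambiguous tokens using the tag of an unambiguous left neighbour.

    A candidate is chosen only if its bigram with the left tag was seen at
    least `min_count` times and strictly more often than any other candidate;
    otherwise the token is left ambiguous (favouring precision over coverage).
    """
    result = []
    prev_tags = None
    for word, tags in sentence:
        chosen = set(tags)
        if len(tags) > 1 and prev_tags is not None and len(prev_tags) == 1:
            left = next(iter(prev_tags))
            # Rank candidate tags by how often they follow the left tag.
            scored = sorted(((counts[(left, t)], t) for t in tags), reverse=True)
            best_count, best_tag = scored[0]
            if best_count >= min_count and best_count > scored[1][0]:
                chosen = {best_tag}
        result.append((word, chosen))
        prev_tags = chosen  # a resolved token can serve as context for the next one
    return result
```

Counts are gathered only from positions where no choice had to be made, so no gold-standard annotation is needed; tokens without clear evidence stay ambiguous, which trades coverage for precision.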

Language: English

Keywords: morphological analysis; corpus linguistics; Greek; Yiddish; disambiguation; morphological tagging; morphological disambiguation

In book

Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (2015)

М. : Изд-во РГГУ, 2015

Automatic part-of-speech tagging for Russian using transformation-based learning

Kitov V. V., Научные труды Вольного экономического общества России 2014 Т. 186 С. 228-235

This paper describes the application of the well-known transformation-based learning algorithm for automatic rule generation to the task of part-of-speech tagging. The algorithm is applied to corpora of annotated Russian texts, and its accuracy as well as the most significant rules are reported. ...

Added: March 16, 2016

Adverbial phrases in Hasidic Yiddish

Arkhangelskiy T., Panova T., International Journal of the Sociology of Language 2014

The purpose of our study is to investigate the lexicalization of so-called adverbial phrases, such as fun a mol, in modern Hasidic Yiddish in comparison with written literary Yiddish of the 20th century. The phenomenon in question is a historical process in which several lexemes forming a frequent collocation (including nouns, adjectives, adverbs, prepositions and ...

Added: December 11, 2014

Daba: a model and tools for Manding corpora

Kirill Maslinsky, in : TALN-RECITAL 2014 Workshop TALAf 2014 : Traitement Automatique des Langues Africaines (TALAf 2014: African Language Processing). : Marseille : Association pour le Traitement Automatique des Langues, 2014. P. 114-122.

This article provides a brief overview of the Daba software package, created in the course of building corpora for Manding languages. Key software features are motivated by the tasks and problems characteristic of many African languages. The corpus-building model proposed here was initially developed for the Bambara Reference Corpus, which is available online and freely accessible. ...

Added: March 26, 2015

The spatial relations “right/left” in Katharevousa: a corpus study

Yakovleva A., Вестник Православного Свято-Тихоновского гуманитарного университета. Серия 3: Филология 2019 Т. 58 № 1 С. 43-58

The paper deals with the encoding of “right” and “left” in Katharevousa Greek, which offers data worth exploring on an intentionally archaizing, artificial language of the 19th-20th centuries. The research is carried out on the basis of the Corpus of Modern Greek and the translations of two Classical Greek texts (“Anabasis” by Xenophon and “The ...

Added: October 1, 2018

Grammatical profiles and formal differentiation of Russian biaspectual verbs

Piperski A., В кн. : Двенадцатая Конференция по типологии и грамматике для молодых исследователей. Тезисы докладов (Санкт-Петербург, 19–21 ноября 2015 г.). : СПб. : Издательство Нестор-История, 2015. С. 69-72.

A study of the properties of Russian biaspectual verbs using corpus methods ...

Added: November 22, 2015

Verbs of animal sounds in Yiddish

Luchina E., Baranova S., В кн. : «Глаголы звуков животных: типология метафор». : М. : Языки славянских культур, 2015.

The study belongs to the field of lexical typology and follows its comprehensive approach, drawing on dictionaries, corpora, and questionnaires administered to informants. The first stage of the research, as usual, was the collection of material from lexicographic sources. An additional intermediate result is a ranking of dictionaries by their suitability for lexical-typological research. The metaphorical models and combinations of meanings found in the Yiddish material, despite their small ...

Added: December 12, 2014

Looking for contextual cues to differentiating modal meanings: A corpus-based study

Lyashevskaya O., Ovsjannikova M., Szymor N. et al., in : Quantitative approaches to the Russian language. : Abingdon : Routledge, 2018. P. 51-78.

The domain of modality is structurally diverse and may be described in multiple ways (for example, see Perkins, 1983; Wierzbicka, 1987; Hengeveld, 1988/2004; Sweetser, 1990; Bondarko, 1990; Bybee et al., 1994; van der Auwera and Plungian, 1998; Palmer, 2001; Hansen, 2004; Nuyts, 2006; Khrakovsky, 2007). The article reports on the Russian part of a larger survey ...

Added: October 24, 2017

Corpus analysis of Russian verse

М. : Азбуковник, 2013

This volume contains articles prepared using materials from the poetic corpus of the Russian National Corpus. Drawing on extensive material, the authors trace the history of individual words in the language of poetry, analyze various aspects of poetic grammar and semantics, and examine some formal parameters of Russian verse. The volume is intended for specialists in linguistic poetics and verse studies, as well as for those interested in modern ...

Added: September 28, 2013

The corpus in foreign language teaching (based on English)

Gorina O. G., СПб. : Свое Издательство, 2014

This publication illustrates the broad linguodidactic potential of corpus linguistics in teaching profession-oriented communication in English. Extensive language material from a specially developed corpus of professional discourse and from other corpus resources formed the basis of varied exercises, tasks, and studies that were used to develop the lexical skills, in speech and in writing, of students majoring in Regional Studies. Recommended for specialists: philologists, language-teaching methodologists, ...

Added: February 20, 2017

The corpus as a tool and as an ideology: on some lessons of modern corpus linguistics

Plungian V., Русский язык в научном освещении 2008 № 16 (2) С. 7-20

Added: November 12, 2023

On the ways and means of expressing fear in the Russian linguistic picture of the world

Botchkarev A., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2016 Т. 14 № 3 С. 5-14

This article explores the ways fear is expressed in the Russian linguistic picture of the world. According to the Russian National Corpus, in its most usual manifestation fear seizes and paralyzes; this distressing emotion is caused by somebody, by the apprehension of losing something or somebody, as well as by exposure to an imminent ...

Added: November 28, 2016

Disyllabic comparative conjunctions in Russian poetry

Piperski A., В кн. : Труды Международной научной конференции "Корпусная лингвистика-2015". : СПб. : Издательство СПбГУ, 2015. С. 374-381.

The paper deals with the use of disyllabic comparative conjunctions budto, slovno and točno ‘like’ in the texts of fifteen Russian poets. I study the frequency of their use in cases where these conjunctions are mutually interchangeable and show that their total frequency increases after the end of the Golden Age of Russian poetry (approx. ...

Added: March 15, 2017

Pitfalls of the Geographic Population Structure (GPS) Approach Applied to Human Genetic History: A Case Study of Ashkenazi Jews

Flegontov P., Kassian A., Thomas M. et al., Genome Biology and Evolution 2016 Vol. 8 No. 7 P. 2259-2265

In a recent interdisciplinary study, Das et al. have attempted to trace the homeland of Ashkenazi Jews and of their historical language, Yiddish (Das et al. 2016. Localizing Ashkenazic Jews to Primeval Villages in the Ancient Iranian Lands of Ashkenaz. Genome Biol Evol. 8:1132–1149). Das et al. applied the geographic population structure (GPS) method to ...

Added: October 21, 2017

Referential Choice: Predictability and Its Limits

Kibrik A. A., Khudyakova M., Dobrov G. B. et al., Frontiers in Psychology 2016 Vol. 7 No. 1429 P. 1-21

We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, ...

Added: September 28, 2016

После, через, спустя in temporal contexts: observations on texts by Kazakh-Russian bilinguals

Rakhilina E. V., Казкенова А. К., Akhapkina Y., Вестник Томского государственного университета. Филология 2021 Т. 73 С. 93-113

The paper examines cases of non-standard use of the prepositions после, через, and спустя in temporal contexts by Kazakh-Russian bilinguals. It argues that the deviations are caused by grammatical differences between the speakers' native language and Russian. The analysis of the deviations revealed specific features of the prepositions: the ability to indicate the completion of events and time spans, both single and recurring, as well as the ambiguity of через in combinations with the names of different time intervals. ...

Added: December 1, 2021

Once more on the research potential of the poetic corpus: meter, lexicon, formula

Orekhov B., Труды института русского языка им. В.В. Виноградова 2015 № 6 С. 449-463

The article continues a line of publications by other researchers that demonstrate the possibilities of the poetic subcorpus of the Russian National Corpus. The question is which issues in the history of Russian poetry can be solved with the help of the corpus. The first part of the article presents a pilot study ...

Added: March 16, 2016

Russian Minority Languages on the Web: Descriptive Statistics

Orekhov B., Krylova I., Popov I. et al., Компьютерная лингвистика и интеллектуальные технологии 2016 No. 15 (22) P. 452-461

An article on the minority languages of Russia on the Internet ...

Added: November 7, 2017

Corpus methods for studying complex cases of polysemy

Krongauz M., В кн. : Методы когнитивного анализа семантики слова: компьютерно-корпусный подход. : Издательский дом ЯСК, 2019. С. 119-140.

This paper analyzes complex cases of polysemy in Russian using corpus methods ...

Added: December 6, 2019

Maninka Reference Corpus: A Presentation

Vydrin V., Rovenchak A., Maslinsky K. A., in : Actes de la conférence conjointe JEP-TALN-RECITAL 2016. Vol. 11: Traitement automatique des langues africaines (TALAf) .: P. : Association pour le Traitement Automatique des Langues, 2016. P. 87-94.

An annotated corpus of Guinean Maninka, Corpus Maninka de Référence (CMR), was published in April 2016. It includes two subcorpora: one contains texts originally written in a Latin-based script (792,778 words), and the other is composed of texts in the N'ko alphabet (3,105,879 words). Both subcorpora are searchable in both the Latin-based script and N'ko. In ...

Added: March 10, 2017

Computational methods of analysis for determining the gender attribution of a text: a practical study

Khomenko A., В кн. : Когнитивно-дискурсивная парадигма в лингвистике и смежных науках: современные проблемы и методология исследования: материалы Х Международного конгресса по когнитивной лингвистике. 17–20 сентября 2020 г. Т. 2(41).: Уральский государственный педагогический университет, 2020. С. 893-897.

This article discusses the application of an integrative approach to gender identification in the context of forensic linguistics tasks. The author integrates methods from cognitive science, corpus linguistics and, more broadly, computational linguistics, as well as classical structural text analysis, to identify characteristics of male and female speech. ...

Added: August 11, 2021

Corpus of Russian student texts: design and prospects

Zevakhina N., Dzhakupova S., in : Материалы 21-й Международной конференции по компьютерной лингвистике "Диалог". : М. : Изд-во РГГУ, 2015.

The Corpus of Russian Student Texts (CoRST) is a computational and research project started in 2013 at the Linguistic Laboratory for Corpora Research Technologies at HSE. It comprises a collection of Russian texts written by students from various Russian universities. Its main research goal is to examine language deviations viewed as markers of language change. ...

Added: May 20, 2015

Discovering dialectal differences based on oral corpora

Andriyanets V., Daniel M., Pakendorf B., in : Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог» (Москва, 30 мая – 2 июня 2018 г.). Вып. 17(24).: М. : Издательский центр «Российский государственный гуманитарный университет», 2018. P. 28-38.

This paper discusses a method to detect statistically significant linguistic differences between corpora while factoring in possible variability within the very corpora to be compared. Specifically, we compare two small corpora of dialects of Even, Bystraja and Lamunkhin Even, in an attempt to identify morphemes that are more frequent in either of the corpora. To ...

Added: June 19, 2018

Corpora as indicators of (non-)existence

Piperski A., in : Компьютерная лингвистика и интеллектуальные технологии. По материалам ежегодной Международной конференции "Диалог" (2015). : М. : Изд-во РГГУ, 2015. P. 494-500.

This paper discusses the notions of acceptability, occurrence, grammaticality and existence, and focuses on the relationship between corpus linguistics and the question of the existence of lexical items. Since corpora are almost exclusively samples from larger populations, it is claimed that they cannot provide evidence for non-existence of words, collocations or constructions. This is because ...

Added: March 13, 2016

Публика ("The public")

Skorinkin D., В кн. : Два века в двадцати словах. : М. : Издательский дом НИУ ВШЭ, 2016. С. 294-316.

The article traces the development and change of the meanings of the word "публика" over the 19th and 20th centuries ...

Added: May 12, 2016