Quantitative Assessment of Grammatical Ambiguity in Some European Languages

Э. С. Клышинский; Логачёва В. К.; Карпик О. В.; Бондаренко А. В.

doi:10.25205/1818-7935-2020-18-1-5-21

Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация. 2020. Vol. 18. No. 1. P. 5–21.

Grammatical ambiguity (multiple sets of grammatical features for one word form, or coinciding surface forms of different words) can be of different types. We describe six classes of grammatical ambiguity: unambiguous, ambiguous by grammatical features, by part of speech, by lemma, by lemma and part of speech, and out-of-vocabulary words. These classes are present in all languages, but the distribution of words among them may vary significantly. We calculate and analyse the statistics of these six ambiguity classes for a number of major European languages. We find that the distribution of words among the classes of ambiguity depends primarily on the linguistic features of a language. Although it is influenced by text style and the vocabulary considered, the distinctive shape of the distribution is preserved under different conditions and differs significantly from the distributions for other languages. The fact that the shape is primarily defined by linguistic properties is corroborated by our observation that linguistically related languages demonstrate similar properties of ambiguous words. Slavic languages feature a low rate of part-of-speech ambiguous words and a high rate of words that are ambiguous by grammatical features. The former is also true for French and Italian, and the latter holds for German and Swedish, but only Slavic languages show both traits.
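The six classes can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Analysis` record, the toy `LEXICON`, the function names, and the order in which the checks are applied are all assumptions made here for illustration, since the abstract does not specify how a word form with several kinds of ambiguity is assigned to a class.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, NamedTuple


class Analysis(NamedTuple):
    """One dictionary reading of a word form (hypothetical structure)."""
    lemma: str
    pos: str
    feats: FrozenSet[str]


def ambiguity_class(analyses: List[Analysis]) -> str:
    """Assign a word form to one of the six classes named in the abstract.

    The precedence of the checks is an assumption of this sketch.
    """
    if not analyses:
        return "out-of-vocabulary"
    lemmas = {a.lemma for a in analyses}
    tags = {a.pos for a in analyses}
    if len(lemmas) > 1 and len(tags) > 1:
        return "lemma and part of speech"
    if len(lemmas) > 1:
        return "lemma"
    if len(tags) > 1:
        return "part of speech"
    if len(set(analyses)) > 1:
        # Same lemma and part of speech, different feature sets.
        return "grammatical features"
    return "unambiguous"


def class_distribution(tokens: List[str],
                       lexicon: Dict[str, List[Analysis]]) -> Dict[str, float]:
    """Share of tokens that fall into each ambiguity class."""
    counts = Counter(ambiguity_class(lexicon.get(t.lower(), [])) for t in tokens)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Toy English lexicon, invented for this example only.
LEXICON = {
    "cat": [Analysis("cat", "NOUN", frozenset({"Sing"}))],
    "sheep": [Analysis("sheep", "NOUN", frozenset({"Sing"})),
              Analysis("sheep", "NOUN", frozenset({"Plur"}))],
    "walk": [Analysis("walk", "NOUN", frozenset({"Sing"})),
             Analysis("walk", "VERB", frozenset({"Pres"}))],
    "found": [Analysis("find", "VERB", frozenset({"Past"})),
              Analysis("found", "VERB", frozenset({"Pres"}))],
    "saw": [Analysis("see", "VERB", frozenset({"Past"})),
            Analysis("saw", "NOUN", frozenset({"Sing"}))],
}

if __name__ == "__main__":
    for form in ["cat", "sheep", "walk", "found", "saw", "blorf"]:
        print(form, "->", ambiguity_class(LEXICON.get(form, [])))
```

Running `class_distribution` over a tokenised corpus with a real morphological dictionary would give the kind of per-language distribution the abstract compares; the toy lexicon here only shows one word form per class.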

In our experiments, we found that reducing the grammatical feature set does not change the shape of the distribution and therefore does not create an artificial similarity among languages. On the other hand, we found that in all the languages the 1000 most frequent words have a different distribution among ambiguity classes than the rest of the words. At the same time, for the majority of the considered languages, less frequent words are less unambiguous by part of speech. In Romance and Germanic languages, the ambiguity is reduced for less frequent words. We also investigated the differences among statistics for texts of different genres in Russian. We found that fiction texts are more ambiguous by part of speech than newswire texts, which are in turn more ambiguous by grammatical features.

Our results suggest that the quality of multilingual morphological taggers should be measured on ambiguous words only, as opposed to all words. Such a comparison could help eliminate differences among languages and give a more objective picture of the performance of linguistic tools.
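The evaluation idea in the last paragraph can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' evaluation code: the function name `tagging_accuracies`, the `is_ambiguous` predicate, and the toy data are hypothetical, and accuracy is used here as one possible quality measure.

```python
from typing import Callable, Optional, Sequence, Tuple


def tagging_accuracies(
    tokens: Sequence[str],
    gold: Sequence[str],
    predicted: Sequence[str],
    is_ambiguous: Callable[[str], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (accuracy on all tokens, accuracy on ambiguous tokens only).

    A value is None when there is nothing to average over.
    """
    if not (len(tokens) == len(gold) == len(predicted)):
        raise ValueError("tokens, gold and predicted must have equal length")
    hits_all = hits_amb = n_amb = 0
    for tok, g, p in zip(tokens, gold, predicted):
        correct = g == p
        hits_all += correct
        if is_ambiguous(tok):
            n_amb += 1
            hits_amb += correct
    overall = hits_all / len(tokens) if tokens else None
    ambiguous_only = hits_amb / n_amb if n_amb else None
    return overall, ambiguous_only


if __name__ == "__main__":
    # Toy sentence; the set of ambiguous forms is assumed for the example.
    tokens = ["the", "cat", "saw", "a", "walk"]
    gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
    predicted = ["DET", "NOUN", "NOUN", "DET", "NOUN"]
    ambiguous_forms = {"saw", "walk"}
    print(tagging_accuracies(tokens, gold, predicted, ambiguous_forms.__contains__))
```

In this toy case the tagger is right on four of five tokens overall but on only one of the two ambiguous tokens, which shows how unambiguous words can inflate a score computed over all words.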

Research target: Philology and Linguistics; Computer Science

Priority areas: humanitarian; IT and mathematics

Language: Russian

Keywords: natural language processing; grammatical ambiguity; statistics of occurrence

Text, Speech and Dialogue: 17th International Conference, TSD 2014, Brno, Czech Republic, September 8-12, 2014. Proceedings

Springer, 2014.

This book constitutes the refereed proceedings of the 17th International Conference on Text, Speech and Dialogue, TSD 2014, held in Brno, Czech Republic, in September 2014. The 70 papers presented together with 3 invited papers were carefully reviewed and selected from 143 submissions. They focus on topics such as corpora and language resources; speech recognition; ...

Added: September 15, 2014

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, May 29 – June 1, 2019)

Moscow: Russian State University for the Humanities Publishing Center, 2019.

The book includes 64 papers submitted to the International conference in computer linguistics and intellectual technologies Dialogue 2019 and presents a broad spectrum of theoretical and applied research of natural language description, language simulation, and creation of applied computer technologies. ...

Added: October 16, 2019

Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science

Springer, 2015.

16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part I ISBN: 978-3-319-18110-3 (Print) 978-3-319-18111-0 (Online) ...

Added: April 23, 2015

The Tatar Language Corpus "Туган тел"

Arkhangelskiy T., Гильмуллин Р. А., Невзорова О. А. et al., Научно-техническая информация. Серия 2: Информационные процессы и системы 2013

The article describes an electronic corpus of the Tatar language created within the "Corpus Linguistics" fundamental research programme of the Presidium of the Russian Academy of Sciences, and the methods the authors used to build it. In particular, it describes the text composition and genre structure of the corpus, the authors' decisions on which morphological features to distinguish, the automatic morphological annotation of texts using a two-level morphology model and the PC-KIMMO analyzer, and the placement ...

Added: October 25, 2013

Applying statistical tagging to Russian poetry

Starchenko A., Kazakevich L., Lyashevskaya O. / NRU HSE. Series WP BRP "Linguistics". 2018. No. 76.

Poetic texts pose a challenge to full morphological tagging and lemmatization, since the authors seek to extend the vocabulary, employ morphologically and semantically deficient forms, go beyond standard syntactic templates, and use non-projective constructions and non-standard word order, among other techniques of the creative language game. In this paper we evaluate a number of probabilistic ...

Added: December 12, 2018

Problems of Natural Language Processing in Dialogue Systems

Klyshinskiy E., Жеребцова Ю., Чижик А., Системный администратор 2019 No. 10 P. 82–91

Nowadays, the field of dialogue systems and conversational agents is one of the most rapidly growing research areas in artificial intelligence applications. Business and industry are showing increasing interest in implementing intelligent conversational agents in their products. Many recent studies have tended to focus on the possibility of developing task-oriented systems which are able to have long ...

Added: October 26, 2019

Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

Stroudsburg, PA: Association for Computational Linguistics, 2017.

This volume contains the papers presented at BSNLP-2017: the Sixth Workshop on Balto-Slavic Natural Language Processing. The Workshop is organized by SIGSLAV—Special Interest Group on NLP in Slavic Languages of the Association for Computational Linguistics. The Workshops have been convening for over a decade, with a clear vision and purpose. On one hand, the languages from ...

Added: June 13, 2017

Computational Linguistics and Intellectual Technologies

M.: Russian State University for the Humanities, 2019.

The book includes 61 reports of the International conference on computer and intellectual technology "Dialogue-2019", representing a wide range of theoretical and applied research in the field of natural language description, modeling of language processes, creating practically applicable computer linguistic technologies. For specialists in the field of theoretical and applied linguistics and intellectual technologies. ...

Added: June 12, 2019

Universal Dependencies for Russian: A New Syntactic Dependencies Tagset

Lyashevskaya O., Droganova K., Zeman D. et al. / NRU HSE. Series WP BRP "Linguistics". 2016. No. 44.

This paper presents the Universal Dependencies tagset (UD v1) as a new annotation scheme for Russian treebanks. The universal list of dependency relations was adopted and extended to comply with certain language-specific syntactic constructions. The tagset was validated by converting two Russian treebanks into the UD format, UD-Russian-SynTagRus and UD-Russian-Google. ...

Added: December 14, 2016

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2019)

M.: Russian State University for the Humanities, 2019.

Added: October 16, 2019

A Study of Word Usage Ambiguity in European Languages

Klyshinskiy E., Logacheva V. K., Мансурова О. Ю. et al. / ИПМ им. М.В. Келдыша РАН. Series "ИПМ им. М.В. Келдыша РАН". 2015. No. 4.

In this paper, we investigated some properties of morphological and syntactic ambiguity in the use of natural language words in several European languages. We introduced a set of ambiguity classes differentiated by predefined features that give rise to lexical ambiguity. Syntactic ambiguity was investigated as well. In order to provide such analysis, we examined pairs of words ...

Added: January 27, 2015

Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH)

Osaka: [s.n.], 2016.

Language resources are increasingly used not only in Language Technology (LT), but also in other subject fields, such as the digital humanities (DH) and in the field of education. Applying LT tools and data for such fields implies new perspectives on these resources regarding domain adaptation, interoperability, technical requirements, documentation, and usability of user interfaces. ...

Added: November 12, 2016

Proceedings of the 4th workshop on NLP for Computer Assisted Language Learning at NODALIDA 2015, Vilnius, 11th May, 2015

Linköping University Electronic Press, 2015.

The workshop series on Natural Language Processing (NLP) for Computer-Assisted Language Learning (CALL) – NLP4CALL – is a meeting place for researchers working on the integration of Natural Language Processing and Speech Technologies in CALL systems and exploring the theoretical and methodological issues arising in this connection. ...

Added: May 31, 2015

CLLS 2016. Computational Linguistics and Language Science. Proceedings of the Workshop on Computational Linguistics and Language Science. Moscow, Russia, April 26, 2016

Aachen: CEUR Workshop Proceedings, 2017.

As the number of digital texts increases rapidly, there is a pressing need for more advanced and diverse tools of natural language processing. While purely statistical approaches proved powerful and efficient for many NLP tasks, there are many applications that would benefit from the formal models and approaches traditional language science has to offer. With ...

Added: June 25, 2017

Extracting Script Information from Texts. Part 1. Problem Statement and Review of Methods

Суворова М. И., Кобозева М. В., Toldova S. et al., Искусственный интеллект и принятие решений 2020 No. 1 P. 17–26

The article discusses the importance of automatic script analysis for understanding natural language texts. It gives a broad overview of methods and approaches to describing and extracting scripts. Theoretical approaches to formalizing scripts are considered. A list of tasks that rely on information about the script structure of a text is given. Popular approaches to the automatic extraction of scripts from texts are presented, along with methods for evaluating their ...

Added: April 22, 2020

Information Extraction Based on Deep Syntactic-Semantic Analysis

Skorinkin D.A., Budnikov E. A., Stepanova M. E. et al., Компьютерная лингвистика и интеллектуальные технологии 2016 No. 15 P. 721–733

This paper presents a rule-based approach to the Information Extraction (IE) task of the FactRuEval-2016 competition. Our system is based on ABBYY Compreno Technology. The technology uses the results of deep syntactic-semantic analysis, which significantly reduces the number of rules needed and keeps them concise. The evaluation was conducted on the FactRuEval dataset. FactRuEval is ...

Added: August 28, 2016

Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

Association for Computational Linguistics, 2019.

The 4th Workshop on Representation Learning for NLP (RepL4NLP) will be hosted by ACL 2019 and held on 2 August 2019. The workshop is being organised by Isabelle Augenstein, Spandana Gella, Sebastian Ruder, Katharina Kann, Burcu Can, Alexis Conneau, Johannes Welbl, Xian Ren and Marek Rei; and advised by Kyunghyun Cho, Edward Grefenstette, Karl Moritz ...

Added: November 1, 2019

Selected Papers of the 15th All-Russian Scientific Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections", Yaroslavl, Russia, October 14-17, 2013

CEUR Workshop Proceedings, 2013.

Selected Papers of the 15th All-Russian Scientific Conference "Digital Libraries: Advanced Methods and Technologies, Digital Collections" ...

Added: October 1, 2014

Using TXM Platform for Research on Language Changes over Time: The Dynamics of Vocabulary and Punctuation in Russian Literary Texts

Lavrentiev A. M., Sherstinova T., Chepovskiy A. et al., Vestnik Tomskogo Gosudarstvennogo Universiteta, Filologiya 2021 Vol. 70 P. 69–89

The purpose of this paper is to test the methodological tools provided by the TXM platform for research on the dynamics of vocabulary and punctuation marks in diachronic corpora. TXM is a powerful text analysis tool which provides both quantitative and qualitative features in a transparent open-source implementation. In this paper, we demonstrate how it can be ...

Added: June 24, 2021

Language Exercise Generation: Emulating Cambridge Open Cloze

Malafeev A., International Journal of Conceptual Structures and Smart Applications (IJCSSA) 2014 Vol. 2 No. 2 P. 20–35

This article presents an approach to the automatic generation of open cloze exercises based on arbitrary English text. The exercise format is similar to the open cloze test used in Cambridge English certificate exams (FCE, CAE, CPE). The presented method also makes it possible to adjust the difficulty of the resulting exercises to better suit ...

Added: November 29, 2014

Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (Moscow, May 31 – June 3, 2017). Issue 16 (23): In 2 vols.

Moscow: RSUH Publishing House, 2017.

The 16th issue of the annual report “Computational Linguistics and Intellectual Technologies” contains the selected materials of the 23rd international conference “Dialogue”. The presented works reflect the areas of research in computational modelling and analysis of natural language that are traditionally represented at the conference. ...

Added: March 15, 2017

Вестник молодых ученых ПГНИУ [Electronic resource]: collection of scientific papers

Perm: Perm State National Research University, 2014.

The collection brings together articles by students and young researchers of Perm State National Research University (ПГНИУ) that reflect the results of research carried out at the university. The articles address current problems in the natural sciences and the humanities. The collection is published following the university's student research paper competition (April – November 2014), in which all faculties of the university took part. ...

Added: December 30, 2014

A Language as a Self-Organized Critical System

Gromov V., Migrina A., Complexity 2017 Vol. 2017 Article ID 9212538 P. 1–7

A natural language (represented by texts generated by native speakers) is considered as a complex system, and the type thereof to which natural languages belong is ascertained. Namely, the authors hypothesize that a language is a self-organized critical system and that the texts of a language are “avalanches” flowing down its word cooccurrence graph. The ...

Added: September 27, 2018

Proceedings of the International Youth Scientific Forum "ЛОМОНОСОВ-2013"

Moscow: MAKS Press, 2013.

In 2013, Moscow University holds its regular International Youth Scientific Forum, the largest in Eurasia, whose central event is the jubilee 20th youth scientific conference of students, postgraduates and young scientists. The co-chairs of the Forum's organizing committee are V. A. Sadovnichy, Rector of Moscow University, Vice-President and Academician of the Russian Academy of Sciences, and D. V. Livanov, Minister of Education and Science of the Russian Federation. The Forum is traditionally supported by the Executive Committee ...

Added: April 16, 2015