A Dataset for Noun Compositionality Detection for a Slavic Language

Puzyrev D.; Shelmanov A.; Panchenko A.; Artemova E.

doi:10.18653/v1/W19-3708

P. 56–62.

The paper presents the first gold-standard resource for Russian annotated with compositionality information of noun compounds. The compound phrases are collected from the Universal Dependencies treebanks according to part-of-speech patterns, such as ADJ+NOUN or NOUN+NOUN, using the gold-standard annotations. Each compound phrase is annotated by two experts and a moderator according to the following schema: the phrase can be compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted either as compositional or as non-compositional). We conduct an experimental evaluation of models and methods for predicting the compositionality of noun compounds in unsupervised and supervised setups. We show that methods from previous work, evaluated on the proposed Russian-language resource, achieve performance comparable to results on English corpora.
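The unsupervised setup mentioned in the abstract can be illustrated with a common baseline from the compositionality literature: compare the vector of the whole compound with the sum of the vectors of its parts. This is a minimal sketch, not necessarily the method used in the paper; the function names and the three-dimensional toy vectors are hypothetical and stand in for real word embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compositionality_score(compound_vec, part_vecs):
    """Similarity between a compound's own vector and the sum of its parts.

    A high score suggests that the compound means roughly what its parts mean.
    """
    composed = [sum(components) for components in zip(*part_vecs)]
    return cosine(compound_vec, composed)


# Toy vectors (assumption): not taken from any trained model.
adj = [1.0, 0.0, 0.0]
noun = [0.0, 1.0, 0.0]
literal_compound = [0.9, 1.1, 0.0]    # lies close to adj + noun
idiomatic_compound = [0.0, 0.1, 1.0]  # points in an unrelated direction

score_literal = compositionality_score(literal_compound, [adj, noun])
score_idiom = compositionality_score(idiomatic_compound, [adj, noun])
print(f"literal: {score_literal:.3f}, idiomatic: {score_idiom:.3f}")
```

In a supervised setup, a score of this kind could serve as one input feature to a classifier trained on the annotated labels.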

Keywords: compositionality, word embeddings, noun compositionality, word embedding model

Publication based on the results of:

Development of Mathematical Models and Methods for Recommender Systems and Natural Language Processing (2019)

In book

Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 2019, Florence, Italy, Association for Computational Linguistics

Association for Computational Linguistics, 2019.

Noun Compositionality Detection using Distributional Semantics for the Russian Language

Puzyrev D. A., Shelmanov A., Panchenko A. et al., in: Analysis of Images, Social Networks and Texts. 8th International Conference AIST 2019. Springer, 2019. P. 218–229.

In this paper, we present the first gold-standard corpus of Russian noun compounds annotated with compositionality information. We used Universal Dependencies treebanks to collect noun compounds according to part-of-speech patterns, such as ADJ-NOUN or NOUN-NOUN, and annotated them according to the following schema: a phrase can be either compositional, non-compositional, or ambiguous (i.e., ...

Added: October 30, 2019

Evaluation of Vector Transformations for Russian Word2Vec and FastText Embeddings

Korogodina O., Karpik O., Klyshinsky E., in: GraphiCon 2020 - Proceedings of the 30th International Conference on Computer Graphics and Machine Vision. St. Petersburg: CEUR-WS, 2020.

The authors of Word2Vec claimed that their technology could solve the word analogy problem using vector transformations in the introduced vector space. However, practice demonstrates that this is not always true. In this paper, we investigate several Word2Vec and FastText models trained for the Russian language and identify the reasons for this inconsistency. We ...
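The word analogy task referred to above is usually posed as vector arithmetic: for "a is to b as c is to ?", compute b - a + c and return the nearest remaining word by cosine similarity. The following is a minimal sketch under stated assumptions: the vocabulary and its three-dimensional vectors are invented for illustration, and `solve_analogy` is a hypothetical helper rather than part of any library or of the paper's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vocab):
    """Answer 'a is to b as c is to ?' by the offset method (b - a + c)."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    # The three query words are excluded, as is customary for this task.
    candidates = {w: vec for w, vec in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))


# Toy vocabulary (assumption): hand-made vectors, not a trained model.
toy_vocab = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.1, -1.0],
}

answer = solve_analogy("man", "woman", "king", toy_vocab)
print(answer)
```

With hand-made vectors the offset lands exactly on the expected word; with real embeddings the nearest neighbour may be a different word, which is the kind of inconsistency the abstract describes.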

Added: October 21, 2020

Text classification with deep learning neural networks

Voronkov Ilia, Amajd M., Kaimuldenov Z., in: Actual Problems of System and Software Engineering 2017. Proceedings of the 5th International Conference on Actual Problems of System and Software Engineering Supported by Russian Foundation for Basic Research. Project #17-07-20565 Moscow, Russia, November 14-16, 2017, 408 P. Vol. 1989. Aachen: CEUR Workshop Proceedings, 2017. P. 362–370.

In this paper, we analyze the use of different neural networks for the text classification task. The accuracy of the studied text classifiers can be changed by a small number of previously classified texts. This is important due to the fact that in many applications of text classification a large number of unlabeled texts are easily accessible, while ...

Added: August 16, 2018

Language Interference in Heritage Russian: Constructional Violations

Rakhilina E. V., Vyrenkova A. S. / NRU HSE. Series WP BRP "Linguistics". 2014. No. 11.

The problem of incomplete language acquisition and heritage languages is approached from several perspectives: who heritage speakers are, how they differ from native speakers and L2 learners, and whether a heritage language is a particular system. This paper aims at answering these and other questions focusing on constructional deviations in the output of heritage speakers and linguistic ...

Added: October 23, 2014

A resource-light method for cross-lingual semantic textual similarity

Glavas G., Franco-Salvador M., Ponzetto S. et al., Knowledge-Based Systems 2018 Vol. 143 P. 1–9

Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many ...

Added: October 29, 2020

The House That Carroll Built: Regress and Adoption in Formal Justification

Dragalina-Chernaya E., Logical Investigations 2022 Vol. 28 No. 1 P. 27–49

The article addresses the regress of justification described by Carroll in the essay "What the Tortoise Said to Achilles". Discussions of the regress of justification, which began long before Carroll described it and are rooted in the problems of topical justification in ancient and medieval logic, continue to this day. Most researchers agree, however, that the key cause of the infinite regress is the transformation of a rule of inference into an additional ...

Added: May 29, 2022

NPNtool: Modelling and Analysis Toolset for Nested Petri Nets

Dworzanski L. W., Frumin D. I., in: Proceedings of the 7th Spring/Summer Young Researchers’ Colloquium on Software Engineering, SYRCoSE 2013. Kazan: -, 2013. P. 9–14.

Nested Petri nets are an extension of the Petri net formalism with net tokens for modelling multi-agent distributed systems with complex structure. While having a number of interesting properties, NP-nets have been lacking tool support. In this paper we present the NPNtool toolset for NP-nets, which can be used to edit NP-net models and check liveness ...

Added: June 18, 2013

Syntactic Idioms across Languages: Corpus Evidence from Russian and English

Apresyan V., Russian linguistics 2014 Vol. 38 No. 2 P. 187–203

This paper considers the issues of compositionality, concessive meaning, negative polarity, scalarity, linguistic anthropocentricity, and semantics-syntax interaction in a corpus study of the concessive syntactic idiom pri vsjom X-e ‘with all X’ in Russian and its non-idiomatic counterpart with all X in English. The study demonstrates (a) both compositional and non-compositional components in the Russian ...

Added: October 9, 2014

Rotations and Interpretability of Word Embeddings: The Case of the Russian Language

Zobnin A., in: Analysis of Images, Social Networks and Texts. 6th International Conference, 2017, Revised Selected Papers. Vol. 10716. Cham: Springer, 2018. Ch. 11 P. 116–128.

Consider a continuous word embedding model. Usually, the cosines between word vectors are used as a measure of similarity of words. These cosines do not change under orthogonal transformations of the embedding space. We demonstrate that, using some canonical orthogonal transformations from SVD, it is possible both to increase the meaning of some components and ...
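The invariance stated in this abstract can be checked numerically: an orthogonal transformation preserves inner products and norms, so the cosine between any two vectors stays the same while their individual coordinates change. The sketch below assumes a plain 2-D rotation as the orthogonal transformation; it does not reproduce the SVD-based canonical transformations of the paper, and the vectors and helper names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rotate(vec, theta):
    """Apply a 2-D rotation (an orthogonal transformation) by angle theta."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return [c * x - s * y, s * x + c * y]


# Toy 2-D "embeddings" (assumption): chosen only for illustration.
u = [3.0, 1.0]
v = [1.0, 2.0]
theta = 0.7  # arbitrary rotation angle in radians

before = cosine(u, v)
after = cosine(rotate(u, theta), rotate(v, theta))
print(f"before: {before:.6f}, after: {after:.6f}")
```

Because similarities are unaffected, the choice of rotation is free, which leaves room to pick one that makes individual components easier to interpret.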

Added: November 26, 2017

Extracting social networks from literary text with word embedding tools

Wohlgenannt G., Artemova E., Ilvovsky D., in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). Osaka: [s.n.], 2016. Ch. 4 P. 18–26.

In this paper a social network is extracted from a literary text. The social network shows how frequently the characters interact and how similar their social behavior is. Two types of similarity measures are used: the first applies co-occurrence statistics, while the second exploits cosine similarity on different types of word embedding vectors. The results ...

Added: March 6, 2017

Three sources of head effects

Yury Lander, in: Headedness and/or Grammatical Anarchy? Berlin: Language Science Press, 2022. P. 27–51.

This paper develops the claim that head properties arise (at least) due to one of the three factors: (i) the higher position of an element in a compositional structure, (ii) the informational prominence, and (iii) the development of a construction from an appositive(-like) structure. These factors are logically independent and may lead to the assignment of head properties ...

Added: August 21, 2020

Redefining part-of-speech classes with distributional semantic models

Kutuzov A. B., Velldal E., Øvrelid L., in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin: Association for Computational Linguistics, 2016. P. 115–125.

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The ...

Added: November 12, 2016

Pri vsjom X-e: a Corpus Study of a Russian Syntactic Phraseme

Apresyan V., in: Meaning Text Theory: Current Developments. Issue 85. Muenchen: Wiener Slawistischer Almanach, 2013. Ch. 2.1 P. 132–141.

The paper presents a corpus study of the concessive syntactic phraseme pri vsjom X-e ‘with all X’ in Russian. The study demonstrates (a) a strong correlation between the semantics of the phraseme and its other linguistic properties; (b) pragmatic properties that are typical of syntactic phrasemes in general; (c) language-specific phraseological status. In particular, the ...

Added: October 13, 2013

Automatic Mining of Discourse Connectives for Russian

Toldova S., Pisarevskaya D., Kobozeva M., in: Artificial Intelligence and Natural Language, 7th International Conference, AINL 2018, St. Petersburg, Russia, October 17–19, 2018, Proceedings. Issue 930. Switzerland: Springer, 2018. P. 79–87.

The identification of discourse connectives plays an important role in many discourse processing approaches. Among them are functional words usually enumerated in grammars (iz-za ‘due to’, blagodarya ‘thanks to’) and non-grammaticalized expressions (X vedet k Y ‘X leads to Y’, prichina etogo ‘the cause is’). Both types of connectives signal certain relations between ...

Added: October 26, 2018

Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase

Gharavi E., Veisi H., Rosso P., Neural Computing and Applications 2020 Vol. 32 No. 14 P. 10593–10607

The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of available textual data in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, algorithms designed to detect plagiarism should be robust ...

Added: October 29, 2020

Exploration of register-dependent lexical semantics using word embeddings

Kutuzov A. B., Kuzmenko E., Marakasova A., in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH). Osaka: [s.n.], 2016. P. 26–34.

We present an approach to detect differences in lexical semantics across English language registers, using word embedding models from the distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC corpus are employed to compare lists of nearest associates for particular words and draw conclusions about their semantic shifts depending on register in which they ...

Added: November 12, 2016

Ontologies for Abelard and Heloise

Dragalina-Chernaya E., Moscow: HSE Publishing House, 2012.

The monograph is devoted to the ontology of standard and deviant quantification. It compares the heuristic potential and ontological commitments of two paradigms for interpreting quantifiers, as second-order predicates and as choice functions, from their origins (G. Frege and C. S. Peirce) to their current state (abstract logics and IF-logic). The novelty of the study lies in a philosophical assessment of recent technical results on the expressive and deductive ...

Added: May 12, 2012

Contextuality and Compositionality: From "Frege's Principle" to Cognitive Semantics

Dragalina-Chernaya E., in: Models of Reasoning 3: The Cognitive Approach. Kaliningrad: Immanuel Kant Russian State University Press, 2010. P. 59–75.

The paper examines the possibility of a coherent semantic theory that accepts the principles of contextuality and compositionality, both of which go back to G. Frege but are oriented toward opposite interpretation procedures: from the meaning of the whole to the meaning of the parts, or from the meaning of the parts to the meaning of the whole. It considers the range of variation of these principles, from the strong version of the principle of compositionality implemented by generative grammars to weaker variants. The prospects of cognitive semantics are discussed ...

Added: November 15, 2012

How much does a word weight? Weighting word embeddings for word sense induction

Arefyev N., Ermolaev P., Panchenko A., in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2018" Proceedings. Moscow: Conference Proceedings Editorial board, 2018. P. 68–84.

The paper describes our participation in the first shared task on word sense induction and disambiguation for the Russian language, RUSSE'2018 [Panchenko et al., 2018]. For each of several dozen ambiguous words, the participants were asked to group text fragments containing it according to the senses of this word, which were not provided beforehand, ...

Added: October 9, 2020

Learning Word Embeddings without Context Vectors

Zobnin A., Elistratova E., in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). Issue W19-43. Association for Computational Linguistics, 2019. P. 244–249.

Most word embedding algorithms, such as word2vec or fastText, construct two sorts of vectors: for words and for contexts. Naive use of vectors of only one sort leads to poor results. We suggest using an indefinite inner product in the skip-gram negative sampling algorithm. This allows us to use only one sort of vectors without loss of ...

Added: November 9, 2019

Word Embedding for Semantically Related Words: An Experimental Study

Karyaeva M., Braslavski P., Sokolov V., Automatic Control and Computer Sciences 2019 Vol. 53 P. 638–643

The ability to identify semantic relations between words has made the word2vec model widely used in NLP tasks. The idea of word2vec is based on a simple rule that a higher similarity can be reached if two words have a similar context. Each word can be represented as a vector, so the closest coordinates of vectors can be interpreted ...

Added: April 10, 2020

On how compositionality relates to syntactic prototypes and grammaticalization

Lander Yu., in: Donum semanticum: Opera linguistica et logica in honorem Barbarae Partee a discipulis amicisque Rossicis oblata. Moscow: Languages of Slavic culture, 2015. P. 146–155.

Added: June 22, 2015

Data-driven models and computational tools for neurolinguistics: a language technology perspective

Ekaterina Artemova, Bakarov A., Artemov A. et al., Journal of Cognitive Science 2020 Vol. 1 No. 21 P. 15–52

In this paper, our focus is the connection and influence of language technologies on research in neurolinguistics. We present a review of brain imaging-based neurolinguistics studies with a focus on natural language representations, such as word embeddings and pre-trained language models. Mutual enrichment of neurolinguistics and language technologies leads to development of brain-aware natural ...

Added: January 17, 2020

Improving Distributional Semantic Models Using Anaphora Resolution during Linguistic Preprocessing

Kutuzov A. B., Kozlova O. S., in: Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Moscow, July 1–4, 2016). Issue 15. Moscow: RSUH Publishing House, 2016. P. 288–300.

In natural language processing, distributional semantic models are known as an efficient data-driven approach to word and text representation, which allows computing meaning directly from large text corpora into word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing the performance of distributional models, and particularly studies pronominal anaphora ...

Added: November 12, 2016