• A
  • A
  • A
  • АБВ
  • АБВ
  • АБВ
  • A
  • A
  • A
  • A
  • A
Обычная версия сайта
  • RU
  • EN
  • HSE University
  • Publications
  • Book chapter
  • Automatic Mining of Discourse Connectives for Russian
  • RU
  • EN
Расширенный поиск
Высшая школа экономики
Национальный исследовательский университет
Priority areas
  • business informatics
  • economics
  • engineering science
  • humanitarian
  • IT and mathematics
  • law
  • management
  • mathematics
  • sociology
  • state and public administration
by year
  • 2027
  • 2026
  • 2025
  • 2024
  • 2023
  • 2022
  • 2021
  • 2020
  • 2019
  • 2018
  • 2017
  • 2016
  • 2015
  • 2014
  • 2013
  • 2012
  • 2011
  • 2010
  • 2009
  • 2008
  • 2007
  • 2006
  • 2005
  • 2004
  • 2003
  • 2002
  • 2001
  • 2000
  • 1999
  • 1998
  • 1997
  • 1996
  • 1995
  • 1994
  • 1993
  • 1992
  • 1991
  • 1990
  • 1989
  • 1988
  • 1987
  • 1986
  • 1985
  • 1984
  • 1983
  • 1982
  • 1981
  • 1980
  • 1979
  • 1978
  • 1977
  • 1976
  • 1975
  • 1974
  • 1973
  • 1972
  • 1971
  • 1970
  • 1969
  • 1968
  • 1967
  • 1966
  • 1965
  • 1964
  • 1963
  • 1958
  • More
Subject
News
June 11, 2026
Doctoral Student at HSE University Reveals Hidden Layout of Ancient Parion
İdil Malgil, a researcher at HSE University, conducted a UAV-based LiDAR survey of the ancient Roman city of Parion in present-day Turkey. The high density of the scans allowed the team to detect subtle terrain features concealed beneath the ground and vegetation. The survey revealed traces of entire neighbourhoods, terraced structures, and walls that had remained invisible during routine excavations and could not be identified through aerial photography. The findings have been published in Ancient Civilizations from Scythia to Siberia.
June 11, 2026
Mathematicians from Nizhny Novgorod and Shanghai Study System Stability
Mathematicians at HSE University–Nizhny Novgorod, in collaboration with colleagues from Tongji University in Shanghai, are investigating the fundamental causes of structural stability in systems and the mechanisms underlying its disruption. In this interview with the HSE News Service, Prof. Olga Pochinka, Head of the International Laboratory of Dynamical Systems and Applications at HSE University–Nizhny Novgorod and leader of the project ‘Qualitative Theory of Systems of Ordinary and Partial Differential Equations,’ discusses the project, which is being implemented as part of HSE University's International Academic Cooperation programme.
June 11, 2026
Neurolinguists Assist in Awake Surgery on 11-Year-Old Patient with Epilepsy
Researchers at the HSE Centre for Language and Brain took part in a rare awake neurosurgical procedure performed on an 11-year-old patient with drug-resistant epilepsy. Working alongside surgeons at the Voyno-Yasenetsky Centre of Specialised Medical Care for Children in Solntsevo, they monitored the resection of a portion of the left temporal lobe, where the epileptic focus had been identified.

 

Have you spotted a typo?
Highlight it, click Ctrl+Enter and send us a message. Thank you for your help!

Publications
  • Books
  • Articles
  • Chapters of books
  • Working papers
  • Report a publication
  • Research at HSE

?

Automatic Mining of Discourse Connectives for Russian

P. 79–87.
Toldova S., Pisarevskaya D., Kobozeva M.

The identification of discourse connectives plays an important role in many discourse processing approaches. Among them there are functional words usually enumerated in grammars (iz-za ‘due to’, blagodarya ‘thanks to’,) and not grammaticalized expressions (X vedet k Y ‘X leads to Y’, prichina etogo ‘the cause is’). Both types of connectives signal certain relations between discourse units. However, there are no ready-made lists of the second type of connectives. We suggest a method for expanding a seed list of connectives based on their vector representations by candidates for not grammaticalized connectives for Russian. Firstly, we compile a list of patterns for this type of connectives. These patterns are based on the following heuristics: the connectives are often used with anaphoric expressions substituting discourse units (thus, some patterns include special anaphoric elements); the connectives more frequently occur at the sentence beginning or after a comma. Secondly, we build multi-word tokens that are based on these patterns. Thirdly, we build vector representations for the multi-word tokens that match these patterns. Our experiments based on distributional semantics give quite reasonable list of the candidates for connectives.

Language: English
Full text
DOI
Keywords: word embeddingsdiscourse connectivesRhetotical structure theory

In book

Artificial Intelligence and Natural Language, 7th International Conference, AINL 2018, St. Petersburg, Russia, October 17–19, 2018, Proceedings
Issue 930. , Switzerland: Springer, 2018.
Similar publications
Пунктуация в предложениях с союзом "то есть" как отражение развития дискурсивных функций
Kuvshinskaya Y. M., Аксенова А. А., В кн.: Язык и метод: Русский язык в лингвистических исследованиях XXI векаVol. 7.: Kraków: Wydawnictwo Uniwersytetu Jagiellońskiego, 2021. С. 213–222.
Работа посвящена пунктуации в предложениях с союзом "то есть". Данные Национального корпуса русского языка показываютрост частотности употребления этого коннектора в начале независимого предложения, после точки, несмотря на то, что такое употребление пока еще не закреплено в качестве кодифицированной нормы. В работе обсуждаются различия в употреблении "то есть" в середине и в начале предложения: возможность выражать ...
Added: October 25, 2023
A resource-light method for cross-lingual semantic textual similarity
Glavas G., Franco-Salvador M., Ponzetto S. et al., Knowledge-Based Systems 2018 Vol. 143 P. 1–9
Recognizing semantically similar sentences or paragraphs across languages is beneficial for many tasks, ranging from cross-lingual information retrieval and plagiarism detection to machine translation. Recently proposed methods for predicting cross-lingual semantic similarity of short texts, however, make use of tools and resources (e.g., machine translation systems, syntactic parsers or named entity recognition) that for many ...
Added: October 29, 2020
Scalable and language-independent embedding-based approach for plagiarism detection considering obfuscation type: no training phase
Gharavi E., Veisi H., Россо П., Neural Computing and Applications 2020 Vol. 32 No. 14 P. 10593–10607
The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of available textual data in several languages over the Internet. Plagiarism occurs in different levels of obfuscation, ranging from the exact copy of original materials to text summarization. Consequently, designed algorithms to detect plagiarism should be robust ...
Added: October 29, 2020
Evaluation of Vector Transformations for Russian Word2Vec and FastText Embeddings
Korogodina O., Karpik O., Klyshinsky E., , in: GraphiCon 2020 - Proceedings of the 30th International Conference on Computer Graphics and Machine Vision.: St. Petersburg: CEUR-WS, 2020.
Authors of Word2Vec claimed that their technology could solve the word analogy problem using the vector transformation in the introduced vector space. However, the practice demonstrates that it is not always true. In this paper, we investigate several Word2Vec and FastText model trained for the Russian language and find out reasons of such inconsistency. We ...
Added: October 21, 2020
Word2vec not dead: predicting hypernyms of co-hyponyms is better than reading definitions
Arefyev N V., Fedoseev M., Kabanov A. et al., , in: Компьютерная лингвистика и интеллектуальные технологии: по материалам ежегодной международной конференции «Диалог» (Москва, 17–20 июня 2020 г.)Issue 19(26): дополнительный том.: -, 2020. P. 13–32.
Expert-built lexical resources are known to provide information of good quality for the cost of low coverage. This property limits their applicability in modern NLP applications. Building descriptions of lexical-semantic relations manually in sufficient volume requires a huge amount of qualified human labour. However, given some initial version of a taxonomy is already built, automatic ...
Added: October 9, 2020
How much does a word weight? Weighting word embeddings for word sense induction
Arefyev, N., Ermolaev P., Panchenko A., , in: Computational Linguistics and Intellectual Technologies. International Conference "Dialogue 2018" Proceedings.: M.: Conference Proceedings Editorial board, 2018. P. 68–84.
The paper describes our participation in the first shared task on word sense induction and disambiguation for the Russian language RUSSE'2018 [Panchenko et al., 2018]. For each of several dozens of ambiguous words, the participants were asked to group text fragments containing it according to the senses of this word, which were not provided beforehand, ...
Added: October 9, 2020
Automatic Mining of Cause-Effect Discourse Connectives for Russian
Pisarevskaya D., Kobozeva M., Petukhova Y. et al., , in: Digital Transformation and Global Society: 4th International Conference, DTGS 2019, St. Petersburg, Russia, June 19–21, 2019, Revised Selected Papers.: Springer, 2019. P. 708–718.
The identification of discourse connectives plays an important role in many discourse processing approaches. Among them, there are functional words usually enumerated in grammars (iz-za ‘due to’, blagodarya ‘thanks to’,) and not grammaticalized expressions (X vedet k Y ‘X leads to Y’, prichina etogo ‘the cause is’). Both types of connectives signal certain relations between discourse units. However, there are no ...
Added: April 22, 2020
Proceedings of DISRPT 2019 - The Workshop on Discourse Relation Parsing and Treebanking. NAACL HLT 2019
Association for Computational Linguistics, 2019.
This book summarizes the main topics at the 2019 workshop on Discourse Relation Parsing and Treebanking (DISRPT 2019). Co-located with NAACL 2019 in Minneapolis, the workshop’s aim was to bring together researchers working on corpus-based and computational approaches to discourse relations. In addition to an invited talk, eighteen papers outlined below were presented, four of which ...
Added: April 22, 2020
Contrast and comparison relations in RST framework
Toldova S., Davydova T., Kobozeva M. et al., , in: Computational Linguistics and Intellectual Technologies Papers from the Annual International Conference “Dialogue” (2019)Issue 18.: M.: Russian State University for the Humanitie, 2019. P. 714–727.
The paper is devoted to a corpus study of the Contrast relation between discourse units in Russian. It is based on the data of the Ru-RSTreebank annotated within the framework of the Rhetorical Structure theory [Mann, Thompson 1988]. The research question is what cue phrases and lexical and grammatical patterns are used to express the ...
Added: April 22, 2020
Word Embedding for Semantically Related Words: An Experimental Study
Karyaeva M., Braslavski P., Sokolov V., Automatic Control and Computer Sciences 2019 Vol. 53 P. 638–643
The ability to identify semantic relations between words has made a word2vec model widely used in NLP tasks. The idea of word2vec is based on a simple rule that a higher similarity can be reached if two words have a similar context. Each word can be represented as a vector, so the closest coordinates of vectors can be interpreted ...
Added: April 10, 2020
Data-driven models and computational tools for neurolinguistics: a language technology perspective
Ekaterina Artemova, Bakarov A., Artemov A. et al., Journal of Cognitive Science 2020 Vol. 1 No. 21 P. 15–52
In this paper, our focus is the connection and influence of language technologies on the research in neurolinguistics. We present a review of ​brain imaging-based neurolinguistics studies with a focus on the natural language representations, such as word embeddings and pre-trained language model. Mutual enrichment of neurolinguistics and language technologies leads to development of brain-aware natural ...
Added: January 17, 2020
Learning Word Embeddings without Context Vectors
Zobnin A., Elistratova E., , in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)Issue W19-43.: Association for Computational Linguistics, 2019. P. 244–249.
Most word embedding algorithms such as word2vec or fastText construct two sort of vectors: for words and for contexts. Naive use of vectors of only one sort leads to poor results. We suggest using indefinite inner product in skip-gram negative sampling algorithm. This allows us to use only one sort of vectors without loss of ...
Added: November 9, 2019
A Dataset for Noun Compositionality Detection for a Slavic Language
Puzyrev D., Shelmanov A., Panchenko A. et al., , in: Proceedings of the 7th Workshop on Balto-Slavic Natural Language Processing, 2019, Florence, Italy, Association for Computational Linguistics.: Association for Computational Linguistics, 2019. P. 56–62.
aper presents the first gold-standard resource for Russian annotated with compositionality information of noun compounds. The compound phrases are collected from the Universal Dependency treebanks according to part of speech patterns, such as ADJ+NOUN or NOUN+NOUN, using the gold-standard annotations. Each compound phrase is annotated by two experts and a moderator according to the following ...
Added: October 30, 2019
Noun Compositionality Detection using Distributional Semantics for the Russian Language
Puzyrev D. A., Shelmanov A., Panchenko A. et al., , in: Analysis of Images, Social Networks and Texts. 8th International Conference AIST 2019.: Springer, 2019. P. 218–229.
In this paper, we present the first gold-standard corpus of Russian noun compounds annotated with compositionality information. We used Universal Dependency treebanks to collect noun compounds according to part of speech patterns, such as ADJ-NOUN or NOUN-NOUN and annotated them according to the following schema: a phrase can be either compositional, non-compositional, or ambiguous (i.e., ...
Added: October 30, 2019
The cues for rhetorical relations in Russian: "Cause-Effect" relation in Russian Rhetorical Structure Treebank
Toldova S., Pisarevskaya D., Vasilyeva M. et al., , in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог» (Москва, 30 мая — 2 июня 2018 г.)Вып. 17(24).: М.: Издательский центр «Российский государственный гуманитарный университет», 2018. P. 747–761.
The purpose of the paper is to investigate cues signalling the relations between discourse units in Russian. Building a lexicon of discourse connectives is an indispensable subtask in many discourse parsing applications as well as an essential issue in theoretical researches of text coherence. In order to develop such a resource for Russian, we have ...
Added: September 1, 2018
Text classification with deep learning neural networks
Voronkov Ilia, Amajd M., Kaimuldenov Z., , in: Actual Problems of System and Software Engineering 2017. Proceedings of the 5th International Conference on Actual Problems of System and Software Engineering Supported by Russian Foundation for Basic Research. Project #17-07-20565 Moscow, Russia, November 14-16, 2017, 408 P.Vol. 1989.: Aachen: CEUR Workshop Proceedings, 2017. P. 362–370.
In this paper, we analyze the use of different neural networks for the text classification task. The accuracy of the studied text classifiers can be changed by a small number of previously classified texts. This is important due to the fact that in many applications of text classification a large number of unlabeled texts are easily accessible, while ...
Added: August 16, 2018
Rotations and Interpretability of Word Embeddings: The Case of the Russian Language
Zobnin A., , in: Analysis of Images, Social Networks and Texts. 6th International Conference, 2017, Revised Selected PapersVol. 10716.: Cham: Springer, 2018. Ch. 11 P. 116–128.
Consider a continuous word embedding model. Usually, the cosines between word vectors are used as a measure of similarity of words. These cosines do not change under orthogonal transformations of the embedding space. We demonstrate that, using some canonical orthogonal transformations from SVD, it is possible both to increase the meaning of some components and ...
Added: November 26, 2017
Extracting social networks from literary text with word embedding tools
Wohlgenannt G., Artemova E., Ilvovsky D., , in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH).: Osaka: [б.и.], 2016. Ch. 4 P. 18–26.
In this paper a social network is extracted from a literary text. The social network shows, how frequent the characters interact and how similar their social behavior is. Two types of similarity measures are used: the first applies co-occurrence statistics, while the second exploits cosine similarity on different types of word embedding vectors. The results ...
Added: March 6, 2017
Exploration of register-dependent lexical semantics using word embeddings
Kutuzov A. B., Kuzmenko E., Marakasova A., , in: Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH).: Osaka: [б.и.], 2016. P. 26–34.
We present an approach to detect differences in lexical semantics across English language registers, using word embedding models from distributional semantics paradigm. Models trained on register-specific subcorpora of the BNC corpus are employed to compare lists of nearest associates for particular words and draw conclusions about their semantic shifts depending on register in which they ...
Added: November 12, 2016
Redefining part-of-speech classes with distributional semantic models
Kutuzov A. B., Velldal E., Øvrelid L., , in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning.: Berlin: Association for Computational Linguistics, 2016. P. 115–125.
This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work  targets  the  Universal  PoS  tag  set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The ...
Added: November 12, 2016
Improving Distributional Semantic Models Using Anaphora Resolution during Linguistic Preprocessing
Kutuzov A. B., Козлова О. С., , in: Компьютерная лингвистика и интеллектуальные технологии: По материалам ежегодной международной конференции «Диалог» (Москва,1–4 июля 2016 г.)Вып. 15.: М.: Изд-во РГГУ, 2016. P. 288–300.
In natural language processing, distributional semantic models are known as an efficient data driven approach to word and text representation, which allows computing meaning directly from large text corpora into word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing performance of distributional models, and particularly studies pronominal anaphora ...
Added: November 12, 2016
  • About
  • About
  • Key Figures & Facts
  • Sustainability at HSE University
  • Faculties & Departments
  • International Partnerships
  • Faculty & Staff
  • HSE Buildings
  • HSE University for Persons with Disabilities
  • Public Enquiries
  • Studies
  • Admissions
  • Programme Catalogue
  • Undergraduate
  • Graduate
  • Exchange Programmes
  • Summer University
  • Summer Schools
  • Semester in Moscow
  • Business Internship
  • Research
  • International Laboratories
  • Research Centres
  • Research Projects
  • Monitoring Studies
  • Conferences & Seminars
  • Academic Jobs
  • Yasin (April) International Academic Conference on Economic and Social Development
  • Media & Resources
  • Publications by staff
  • HSE Journals
  • Publishing House
  • iq.hse.ru: commentary by HSE experts
  • Library
  • Economic & Social Data Archive
  • Video
  • HSE Repository of Socio-Economic Information
  • HSE1993–2026
  • Contacts
  • Copyright
  • Privacy Policy
  • Site Map
Edit