
 


An Unsupervised Method for Weighting Finite-state Morphological Analyzers

P. 3842–3850.
Tyers F. M., Keleg A., Pirinen T.

Morphological analysis has been studied for years, and a range of techniques have been used to build models for it. Models based on finite-state transducers have proved more suitable for low-resource languages. In this paper, we develop a method for weighting a morphological analyzer built with finite-state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way on raw untagged corpora and is able to capture the semantic meaning of words. Most methods for disambiguating the output of a morphological analyzer rely on tagged corpora that need to be built manually. Additionally, the method uses information about the token irrespective of its context, unlike most other techniques, which rely heavily on the word’s context to disambiguate its set of candidate analyses.
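The abstract does not say how embedding information is turned into analyzer weights, so the following is only a hedged sketch of the general idea, not the authors' algorithm. It assumes toy hand-written vectors in place of a trained word2vec model, and a hypothetical prototype vector per tag (`TAG_PROTOTYPES`). Each candidate analysis of a token is scored by cosine similarity, the scores are normalized with a softmax, and the result is converted to negative log probabilities, the convention commonly used for weighted finite-state transducers, where a lower weight means a better analysis.

```python
import math

# Toy 3-dimensional "embeddings". Real vectors would come from a word2vec
# model trained on raw text; these values are made up for illustration.
TOKEN_VECTORS = {
    "walks": [0.9, 0.1, 0.2],
}

# Hypothetical prototype vector for each analysis tag (an assumption of
# this sketch, not something described in the abstract).
TAG_PROTOTYPES = {
    "V": [1.0, 0.0, 0.0],
    "N": [0.0, 1.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weight_analyses(token, analyses):
    """Return (analysis, weight) pairs sorted best-first.

    Each analysis is a (lemma, tag) pair. Weights are negative log
    probabilities obtained from a softmax over cosine scores, so a
    lower weight marks a more plausible analysis.
    """
    vec = TOKEN_VECTORS[token]
    scores = [cosine(vec, TAG_PROTOTYPES[tag]) for _, tag in analyses]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weighted = [(a, -math.log(e / total)) for a, e in zip(analyses, exps)]
    return sorted(weighted, key=lambda pair: pair[1])


# Two candidate analyses of the same surface form, ranked by weight.
ranked = weight_analyses("walks", [("walk", "N"), ("walk", "V")])
for analysis, weight in ranked:
    print(analysis, round(weight, 3))
```

Note that the score depends only on the token itself and not on its neighbours, which mirrors the context-independence the abstract describes; how the prototype vectors would be obtained in practice is left open here.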

Language: English
Keywords: word2vec, constraint grammar, FST weighting, FSTs

In book

Proceedings of The 12th Language Resources and Evaluation Conference
Vol. 12. European Language Resources Association (ELRA), 2020.
Similar publications
High-level semantic interpretation of the structure of static models for the Russian language
Serikov O., Ganeeva V., Аксенова А. А. et al., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2023 Vol. 21 No. 1 P. 67–82
Since its inception, the Word2vec vector space has become a universal tool both for scientific and practical activities. Over time, it became clear that there is a lack of new methods for interpreting the location of words in vector spaces. The existing methods included consideration of analogies or clustering of a vector space. In recent ...
Added: April 28, 2025
Adjectivization in Russian: Analyzing participles by means of lexical frequency and constraint grammar
Petrunina U., “Doktorgradsavhandlinger (HSL-fak)” Collections, 2021.
This dissertation explores the factors that restrict and facilitate adjectivization in Russian, an affixless part-of-speech change leading to ambiguity between participles and adjectives. I develop a theoretical framework based on major approaches to adjectivization, and assess the effect of the factors on ambiguity in the empirical data. I build a linguistic model using the Constraint ...
Added: October 2, 2024
You shall know a piece by the company it keeps. Chess plays as a data for word2vec models
Orekhov B., Series Computer Science "arxiv.org", 2024.
In this paper, I apply linguistic methods of analysis to non-linguistic data, chess plays, metaphorically equating one with the other and seeking analogies. Chess game notations are also a kind of text, and one can consider the records of moves or positions of pieces as words and statements in a certain language. In this article ...
Added: August 8, 2024
Effectiveness of ELMo embeddings, and semantic models in predicting review helpfulness
Malik M. S., Nawaz A., Jamjoom M. M. et al., Intelligent Data Analysis 2024 Vol. 28 No. 4 P. 1045–1065
Online product reviews (OPR) are a commonly used medium for consumers to communicate their experiences with products during online shopping. Previous studies have investigated the helpfulness of OPRs using frequency-based, linguistic, meta-data, readability, and reviewer attributes. In this study, we explored the impact of robust contextual word embeddings, topic, and language models in predicting the ...
Added: February 26, 2024
Constructing the image of a city in official and everyday communication: a comparative analysis (based on social media data)
Matkin N., Коммуникации. Медиа. Дизайн 2025 Vol. 10 No. 3 P. 89–110
The article offers an analysis and visualization of Russian city images that emerge in the comments of urban community subscribers and posts from administrative press services. The city image is regarded as a frame structure that develops through political and interpersonal communication in the network. The social component of the city image is identified as ...
Added: November 15, 2023
Identifying emerging trends and hot topics through intelligent data mining: the case of clinical psychology and psychotherapy
Sokolova A., Lobanova P., Kuzminov I., Foresight 2024 Vol. 26 No. 1 P. 155–180
Purpose: The purpose of the paper is to present an integrated methodology for identifying trends in a particular subject area based on a combination of advanced text mining and expert methods. The authors aim to test it in an area of clinical psychology and psychotherapy in 2010–2019. Design/methodology/approach: The authors demonstrate the way of applying text-mining and the ...
Added: October 12, 2023
How to detect propaganda from social media? Exploitation of semantic and fine-tuned language models
Malik M. S., Imran T., Mona Mamdouh J., PeerJ Computer Science 2023 Vol. 9 Article e1248
Online propaganda is a mechanism to influence the opinions of social media users. It is a growing menace to public health, democratic institutions, and public society. The present study proposes a propaganda detection framework as a binary classification model based on a news repository. Several feature models are explored to develop a robust model such ...
Added: September 4, 2023
Automated defect identification for cell phones using language context, linguistic and smoke-word models
Muhammad Z. Y., Malik M. S., Ignatov D. I., Expert Systems with Applications 2023 Vol. 227 Article 120236
Product defects are a widespread concern for manufacturers when conducting quality and customer relationship management. Prior approaches addressed many electronic products; however, cell phones are still unexplored. Moreover, prior work mainly focused on the lexicon, probabilistic graphic, failure mode, and effect analysis models, but the utilization of word embeddings and language models has not been explored. State-of-the-art contextual word embeddings and language models generate automated features and ...
Added: June 13, 2023
Detection of semantic changes in Russian nouns with distributional models and grammatical features
Ryzhova A., Ryzhova D., Sochenkov I., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2021). Issue 20: Main volume. 2021. P. 597–606.
Added: October 30, 2021
Automated Analysis of Discourse Coherence in Schizophrenia: Approximation of Manual Measures
Ryazanskaya G., Khudyakova M., in: Proceedings of the LREC 2020 Workshop on: Resources and Processing of Linguistic, Para-linguistic and Extra-linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments (RaPID-3). European Language Resources Association (ELRA), 2020. P. 98–107.
Disorganized, or incoherent, speech is one of the important criteria for diagnosing schizophrenia. However, there is still a lack of a rather quick objective method of measuring speech coherence. Automated discourse analysis is a possible solution to this problem. We analyzed discourse coherence in a set of spoken narratives by people with schizophrenia and neurotypical speakers ...
Added: February 2, 2021
Learning Word Embeddings without Context Vectors
Zobnin A., Elistratova E., in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). Issue W19-43. Association for Computational Linguistics, 2019. P. 244–249.
Most word embedding algorithms such as word2vec or fastText construct two sorts of vectors: for words and for contexts. Naive use of vectors of only one sort leads to poor results. We suggest using an indefinite inner product in the skip-gram negative sampling algorithm. This allows us to use only one sort of vectors without loss of ...
Added: November 9, 2019
WORD VECTOR MODELS AS AN OBJECT OF LINGUISTIC RESEARCH
Shavrina T., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, May 29 – June 1, 2019). Issue 18(25). [s.n.], 2019. P. 576–588.
This article launches a series of studies in which popular vector word2vec models are considered not as an element of the architecture of an NLP application, but as an independent object of linguistic research. The linguist's view on the surrogate of contexts on the corpus, as which vector models can be considered, makes it possible ...
Added: September 5, 2019
Extraction of Hypernyms from Dictionaries with a Little Help from Word Embeddings
Karyaeva M., Braslavski P., Kiselev Y., in: Analysis of Images, Social Networks and Texts. 7th International Conference AIST 2018. Springer, 2018. P. 76–87.
The paper investigates several techniques for hypernymy extraction from a large collection of dictionary definitions in Russian. First, definitions from different dictionaries are clustered, then single words and multiwords are extracted as hypernym candidates. A classification-based approach on pre-trained word embeddings is implemented as a complementary technique. In total, we extracted about 40K unique hypernym ...
Added: March 11, 2019
Webvectors: A toolkit for building web interfaces for vector semantic models
Kutuzov A., Kuzmenko E., in: Supplementary Proceedings of the 5th International Conference on Analysis of Images, Social Networks and Texts (AIST-SUP 2016), Yekaterinburg, Russia, April 7-9, 2016. Vol. 1710. Aachen: CEUR Workshop Proceedings, 2016. P. 155–161.
The paper presents a free and open source toolkit whose aim is to quickly deploy web services handling distributed vector models of semantics. It fills in the gap between training such models (many tools are already available for this) and dissemination of the results to the general public. Our toolkit, WebVectors, provides all the necessary routines for ...
Added: April 20, 2017
Improving Distributional Semantic Models Using Anaphora Resolution during Linguistic Preprocessing
Kutuzov A. B., Козлова О. С., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, July 1–4, 2016). Issue 15. Moscow: RGGU Publishing House, 2016. P. 288–300.
In natural language processing, distributional semantic models are known as an efficient data driven approach to word and text representation, which allows computing meaning directly from large text corpora into word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing performance of distributional models, and particularly studies pronominal anaphora ...
Added: November 12, 2016
Automated Word Sense Frequency Estimation for Russian Nouns
Lopukhina A., Лопухин К. А., Носырев Г. В., in: Quantitative approaches to the Russian language. Abingdon: Routledge, 2018. P. 79–94.
According to G. K. Zipf’s observation, there is a strong correlation between word frequency and polysemy. Yet word sense frequency distribution is a neglected area in computational linguistics. Furthermore, the study of sense frequency has theoretical interest and practical applications for lexicography and word sense disambiguation. Although WordNet and SemCor contain some information about sense frequency ...
Added: October 11, 2016
Word Sense Disambiguation for Russian Verbs Using Semantic Vectors and Dictionary Entries
Lopukhina A., Лопухин К. А., Компьютерная лингвистика и интеллектуальные технологии 2016 No. 15 P. 393–405
Word sense disambiguation (WSD) methods are useful for many NLP tasks that require semantic interpretation of input. Furthermore, such methods can help estimate word sense frequencies in different corpora, which is important for lexicographic studies and language learning resources. Although previous research on Russian polysemous verbs disambiguation established some important and interesting results, it was mostly ...
Added: October 11, 2016
Texts in, meaning out: neural language models in semantic similarity task for Russian
Kutuzov A. B., Andreev I., in: Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue” (2015). Issue 14(21). Moscow: Russian State University for the Humanities, 2015. P. 143–154.
Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of Russian Semantic Similarity Evaluation track, where our models took from 2nd ...
Added: May 31, 2015
Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian
Kutuzov A. B., Kuzmenko E., in: Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science. Vol. 9041. Springer, 2015. P. 47–58.
In this paper we compare the Russian National Corpus to a larger Russian web corpus composed in 2014; the assumption behind our work is that the National corpus, being limited by the texts it contains and their proportions, presents lexical contexts (and thus meanings) which are different from those found ‘in the wild’ or in ...
Added: April 23, 2015