An Unsupervised Method for Weighting Finite-state Morphological Analyzers

P. 3842–3850.

Tyers F. M., Keleg A., Pirinen T.

Morphological analysis is a task that has been studied for years, and a variety of techniques have been used to build models for it. Models based on finite-state transducers have proved to be more suitable for low-resource languages. In this paper, we develop a method for weighting a morphological analyzer built using finite-state transducers in order to disambiguate its results. The method is based on a word2vec model that is trained in a completely unsupervised way on raw untagged corpora and is able to capture the semantic meaning of words. By contrast, most methods for disambiguating the output of a morphological analyzer rely on tagged corpora that need to be built manually. Additionally, the method uses information about the token irrespective of its context, unlike most other techniques, which rely heavily on the word’s context to disambiguate its set of candidate analyses.
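
The abstract does not spell out how the word2vec model yields weights, so the following is only a minimal sketch of the final ranking step. It assumes that some unsupervised model has already produced a non-negative score for each candidate analysis of a token. The scores, the tag strings, and the function names are hypothetical and not taken from the paper. The scores are converted to costs in the negative-log convention commonly used for weighted finite-state transducers, so the lowest-weight analysis ranks first.

```python
import math


def scores_to_weights(scores):
    """Map non-negative candidate scores to -log(probability) costs.

    `scores` is a dict {analysis_string: score}. Lower weight = more likely.
    This is an illustrative convention, not the paper's exact procedure.
    """
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("at least one candidate needs a positive score")
    weights = {}
    for analysis, score in scores.items():
        p = score / total
        # A zero-probability candidate gets an infinite cost.
        weights[analysis] = math.inf if p == 0 else -math.log(p)
    return weights


def rank_analyses(weights):
    """Return candidate analyses ordered from lowest to highest weight."""
    return sorted(weights, key=weights.get)


if __name__ == "__main__":
    # Made-up scores for two hypothetical analyses of the token "walks".
    candidate_scores = {
        "walk<n><pl>": 1.0,
        "walk<vblex><pri><p3><sg>": 3.0,
    }
    weights = scores_to_weights(candidate_scores)
    for analysis in rank_analyses(weights):
        print(f"{analysis}\t{weights[analysis]:.3f}")
```

Because the weights depend only on the token's own candidate set, this kind of ranking needs no sentence context, which matches the context-independent setting the abstract describes.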

Language: English

In book

Proceedings of The 12th Language Resources and Evaluation Conference

Vol. 12. European Language Resources Association (ELRA), 2020.

High-level semantic interpretation of the structure of static models for the Russian language

Serikov O., Ganeeva V., Аксенова А. А. et al., Вестник Новосибирского государственного университета. Серия: Лингвистика и межкультурная коммуникация 2023 Vol. 21 No. 1 P. 67–82

Since its inception, the Word2vec vector space has become a universal tool both for scientific and practical activities. Over time, it became clear that there is a lack of new methods for interpreting the location of words in vector spaces. The existing methods included consideration of analogies or clustering of a vector space. In recent ...

Added: April 28, 2025

Adjectivization in Russian: Analyzing participles by means of lexical frequency and constraint grammar

Petrunina U., “Doktorgradsavhandlinger (HSL-fak)” Collections, 2021.

This dissertation explores the factors that restrict and facilitate adjectivization in Russian, an affixless part-of-speech change leading to ambiguity between participles and adjectives. I develop a theoretical framework based on major approaches to adjectivization, and assess the effect of the factors on ambiguity in the empirical data. I build a linguistic model using the Constraint ...

Added: October 2, 2024

You shall know a piece by the company it keeps. Chess plays as a data for word2vec models

Orekhov B., Series Computer Science "arxiv.org", 2024.

In this paper, I apply linguistic methods of analysis to non-linguistic data, chess plays, metaphorically equating one with the other and seeking analogies. Chess game notations are also a kind of text, and one can consider the records of moves or positions of pieces as words and statements in a certain language. In this article ...

Added: August 8, 2024

Effectiveness of ELMo embeddings, and semantic models in predicting review helpfulness

Malik M. S., Nawaz A., Jamjoom M. M. et al., Intelligent Data Analysis 2024 Vol. 28 No. 4 P. 1045–1065

Online product reviews (OPR) are a commonly used medium for consumers to communicate their experiences with products during online shopping. Previous studies have investigated the helpfulness of OPRs using frequency-based, linguistic, meta-data, readability, and reviewer attributes. In this study, we explored the impact of robust contextual word embeddings, topic, and language models in predicting the ...

Added: February 26, 2024

Constructing the image of a city in official and everyday communication: a comparative analysis (based on social media data)

Matkin N., Коммуникации. Медиа. Дизайн 2025 Vol. 10 No. 3 P. 89–110

The article offers an analysis and visualization of Russian city images that emerge in the comments of urban community subscribers and posts from administrative press services. The city image is regarded as a frame structure that develops through political and interpersonal communication in the network. The social component of the city image is identified as ...

Added: November 15, 2023

Identifying emerging trends and hot topics through intelligent data mining: the case of clinical psychology and psychotherapy

Sokolova A., Lobanova P., Kuzminov I., Foresight 2024 Vol. 26 No. 1 P. 155–180

Purpose: The purpose of the paper is to present an integrated methodology for identifying trends in a particular subject area based on a combination of advanced text mining and expert methods. The authors aim to test it in an area of clinical psychology and psychotherapy in 2010–2019. Design/methodology/approach: The authors demonstrate the way of applying text-mining and the ...

Added: October 12, 2023

How to detect propaganda from social media? Exploitation of semantic and fine-tuned language models

Malik M. S., Imran T., Mona Mamdouh J., PeerJ Computer Science 2023 Vol. 9 Article e1248

Online propaganda is a mechanism to influence the opinions of social media users. It is a growing menace to public health, democratic institutions, and public society. The present study proposes a propaganda detection framework as a binary classification model based on a news repository. Several feature models are explored to develop a robust model such ...

Added: September 4, 2023

Automated defect identification for cell phones using language context, linguistic and smoke-word models

Muhammad Z. Y., Malik M. S., Ignatov D. I., Expert Systems with Applications 2023 Vol. 227 Article 120236

Product defects are a widespread concern for manufacturers when conducting quality and customer relationship management. Prior approaches addressed many electronic products; however, cell phones are still unexplored. Moreover, prior work mainly focused on lexicon, probabilistic graphic, and failure mode and effect analysis models, but the use of word embeddings and language models has not been explored. State-of-the-art contextual word embeddings and language models generate automated features and ...

Added: June 13, 2023

Detection of semantic changes in Russian nouns with distributional models and grammatical features

Ryzhova A., Ryzhova D., Sochenkov I., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2021). Issue 20: Main volume. 2021. P. 597–606.

Added: October 30, 2021

Automated Analysis of Discourse Coherence in Schizophrenia: Approximation of Manual Measures

Ryazanskaya G., Khudyakova M., in: Proceedings of the LREC 2020 Workshop on: Resources and Processing of Linguistic, Para-linguistic and Extra-linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments (RaPID-3). European Language Resources Association (ELRA), 2020. P. 98–107.

Disorganized, or incoherent, speech is one of the important criteria for diagnosing schizophrenia. However, there is still a lack of a rather quick objective method of measuring speech coherence. Automated discourse analysis is a possible solution to this problem. We analyzed discourse coherence in a set of spoken narratives by people with schizophrenia and neurotypical speakers ...

Added: February 2, 2021

Learning Word Embeddings without Context Vectors

Zobnin A., Elistratova E., in: Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). Issue W19-43. Association for Computational Linguistics, 2019. P. 244–249.

Most word embedding algorithms such as word2vec or fastText construct two sorts of vectors: for words and for contexts. Naive use of vectors of only one sort leads to poor results. We suggest using an indefinite inner product in the skip-gram negative sampling algorithm. This allows us to use only one sort of vectors without loss of ...

Added: November 9, 2019

WORD VECTOR MODELS AS AN OBJECT OF LINGUISTIC RESEARCH

Shavrina T., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, May 29 to June 1, 2019). Issue 18(25). [s.n.], 2019. P. 576–588.

This article launches a series of studies in which popular vector word2vec models are considered not as an element of the architecture of an NLP application, but as an independent object of linguistic research. The linguist's view on the surrogate of contexts on the corpus, as which vector models can be considered, makes it possible ...

Added: September 5, 2019

Extraction of Hypernyms from Dictionaries with a Little Help from Word Embeddings

Karyaeva M., Braslavski P., Kiselev Y., in: Analysis of Images, Social Networks and Texts. 7th International Conference AIST 2018. Springer, 2018. P. 76–87.

The paper investigates several techniques for hypernymy extraction from a large collection of dictionary definitions in Russian. First, definitions from different dictionaries are clustered, then single words and multiwords are extracted as hypernym candidates. A classification-based approach on pre-trained word embeddings is implemented as a complementary technique. In total, we extracted about 40K unique hypernym ...

Added: March 11, 2019

Webvectors: A toolkit for building web interfaces for vector semantic models

Kutuzov A., Kuzmenko E., in: Supplementary Proceedings of the 5th International Conference on Analysis of Images, Social Networks and Texts (AIST-SUP 2016), Yekaterinburg, Russia, April 7-9, 2016. Vol. 1710. Aachen: CEUR Workshop Proceedings, 2016. P. 155–161.

The paper presents a free and open source toolkit whose aim is to quickly deploy web services handling distributed vector models of semantics. It fills in the gap between training such models (many tools are already available for this) and dissemination of the results to the general public. Our toolkit, WebVectors, provides all the necessary routines for ...

Added: April 20, 2017

Improving Distributional Semantic Models Using Anaphora Resolution during Linguistic Preprocessing

Kutuzov A. B., Козлова О. С., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (Moscow, July 1–4, 2016). Issue 15. Moscow: Russian State University for the Humanities Publishing House, 2016. P. 288–300.

In natural language processing, distributional semantic models are known as an efficient data driven approach to word and text representation, which allows computing meaning directly from large text corpora into word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing performance of distributional models, and particularly studies pronominal anaphora ...

Added: November 12, 2016

Automated Word Sense Frequency Estimation for Russian Nouns

Lopukhina A., Лопухин К. А., Носырев Г. В., in: Quantitative approaches to the Russian language. Abingdon: Routledge, 2018. P. 79–94.

According to G. K. Zipf’s observation, there is a strong correlation between word frequency and polysemy. Yet word sense frequency distribution is a neglected area in computational linguistics. Furthermore, the study of sense frequency has theoretical interest and practical applications for lexicography and word sense disambiguation. Although WordNet and SemCor contain some information about sense frequency ...

Added: October 11, 2016

Word Sense Disambiguation for Russian Verbs Using Semantic Vectors and Dictionary Entries

Lopukhina A., Лопухин К. А., Computational Linguistics and Intellectual Technologies 2016 No. 15 P. 393–405

Word sense disambiguation (WSD) methods are useful for many NLP tasks that require semantic interpretation of input. Furthermore, such methods can help estimate word sense frequencies in different corpora, which is important for lexicographic studies and language learning resources. Although previous research on Russian polysemous verbs disambiguation established some important and interesting results, it was mostly ...

Added: October 11, 2016

Texts in, meaning out: neural language models in semantic similarity task for Russian

Kutuzov A. B., Andreev I., in: Computational Linguistics and Intellectual Technologies. Papers from the Annual International Conference “Dialogue” (2015). Issue 14(21). M.: Russian State University for the Humanities, 2015. P. 143–154.

Distributed vector representations for natural language vocabulary get a lot of attention in contemporary computational linguistics. This paper summarizes the experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of Russian Semantic Similarity Evaluation track, where our models took from 2nd ...

Added: May 31, 2015

Comparing Neural Lexical Models of a Classic National Corpus and a Web Corpus: The Case for Russian

Kutuzov A. B., Kuzmenko E., in: Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science. Vol. 9041. Springer, 2015. P. 47–58.

In this paper we compare the Russian National Corpus to a larger Russian web corpus composed in 2014; the assumption behind our work is that the National corpus, being limited by the texts it contains and their proportions, presents lexical contexts (and thus meanings) which are different from those found ‘in the wild’ or in ...

Added: April 23, 2015