Authorship Attribution in Russian with New High-Performing and Fully Interpretable Morpho-Syntactic Features

Pimonova E., Durandin O., Malafeev A.

doi:10.1007/978-3-030-37334-4_18

P. 193–204.

This work tackles the problem of modeling author style in Russian. In particular, we address authorship attribution on a dataset we collected: 1506 texts by 30 authors, written between the 18th and 21st centuries. We apply three classifiers to the attribution problem: Random Forest, Logistic Regression, and SVM. For text representation, we use seven models at three language levels: lexis, morphology, and syntax. Most importantly, we propose our own set of morpho-syntactic features that perform at about the same level as doc2vec but are fully interpretable. Our experiments show that these features are effective on their own and that classification quality improves when they are combined with the classic doc2vec-based approach. All code, including feature extraction, is made freely available. Additionally, we analyze the performance of individual features as style markers. Finally, we study classification errors to identify patterns in the misattribution of specific authors.
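
The idea of attributing texts from interpretable features can be illustrated with a minimal sketch. It is not the authors' released code. The three features below are hypothetical stand-ins (the paper's morpho-syntactic features need a morphological and syntactic parser and are not reproduced here). To stay dependency-free, a nearest-centroid rule is swapped in for the Random Forest, Logistic Regression, and SVM classifiers named in the abstract. The toy texts and author labels are invented for illustration.

```python
import math
import re
from collections import defaultdict

# Hypothetical, human-readable style features (NOT the paper's feature set).
FEATURE_NAMES = ["mean_sentence_length", "commas_per_word", "mean_word_length"]


def extract_features(text):
    """Map a text to a small vector whose every coordinate has a plain meaning."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    n_words = max(len(words), 1)
    return [
        len(words) / max(len(sentences), 1),   # words per sentence
        text.count(",") / n_words,             # commas per word
        sum(len(w) for w in words) / n_words,  # characters per word
    ]


def fit_centroids(texts, authors):
    """Average the feature vectors of each author's training texts."""
    grouped = defaultdict(list)
    for text, author in zip(texts, authors):
        grouped[author].append(extract_features(text))
    return {
        author: [sum(column) / len(column) for column in zip(*vectors)]
        for author, vectors in grouped.items()
    }


def predict(centroids, text):
    """Attribute a text to the author whose centroid is closest (Euclidean).

    Features are left unscaled, so sentence length dominates the distance;
    a real system would standardize them first.
    """
    vector = extract_features(text)
    return min(centroids, key=lambda author: math.dist(vector, centroids[author]))


# Invented toy corpus: author "A" writes short sentences, author "B" long ones.
train_texts = [
    "I came. I saw. I won.",
    "We ran. We hid. We slept.",
    "When the long evening finally settled over the town, the lamps, one after another, began to glow.",
    "Although nobody had expected the visitors, the household, somewhat reluctantly, prepared a generous supper.",
]
train_authors = ["A", "A", "B", "B"]

centroids = fit_centroids(train_texts, train_authors)
short_guess = predict(centroids, "He sat. He ate. He left.")
long_guess = predict(
    centroids,
    "Because the rain, which had started at dawn, refused to stop, the travellers, tired and silent, stayed indoors.",
)
print(short_guess, long_guess)
```

Because each coordinate corresponds to a named quantity in `FEATURE_NAMES`, a misattribution can be traced to the feature that pulled the text toward the wrong author, which is the kind of analysis that opaque doc2vec dimensions do not allow.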

Language: English

Keywords: machine learning, natural language processing, authorship attribution, author style, text representation, text classification, morpho-syntactic features, language feature engineering

In book

Analysis of Images, Social Networks and Texts. 8th International Conference, AIST 2019, Lecture Notes in Computer Science, Revised Selected Papers

Vol. 11832. Cham: Springer, 2019.

Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

Berlin: Association for Computational Linguistics, 2016.

The 2016 Conference on Computational Natural Language Learning is the twentieth in the series of annual meetings organized by SIGNLL, the ACL special interest group on natural language learning. CoNLL 2016 will be held on August 11-12, 2016, and is co-located with the 54th annual meeting of the Association for Computational Linguistics (ACL) in Berlin, ...

Added: November 12, 2016

8th Russian Summer School in Information Retrieval (RuSSIR 2014)

Braslavski P., Karpov Nikolay, Worring M. et al., ACM SIGIR Forum 2014 Vol. 48 No. 2 P. 105–110

The 8th Russian Summer School in Information Retrieval (RuSSIR 2014) was held on August 18-22, 2014 in Nizhniy Novgorod, Russia. The school was co-organized by the National Research University Higher School of Economics and the Russian Information Retrieval Evaluation Seminar (ROMIP) ...

Added: August 22, 2015

Redefining part-of-speech classes with distributional semantic models

Kutuzov A. B., Velldal E., Øvrelid L., in: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin: Association for Computational Linguistics, 2016. P. 115–125.

This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The ...

Added: November 12, 2016

Pulse of the Nation: Observable Subjective Well-Being in Russia Inferred from Social Network Odnoklassniki

Sergey Smetanin, Mathematics 2022 Vol. 10 No. 16 Article 2947

Policymakers and researchers worldwide are interested in measuring the subjective well-being (SWB) of populations. In recent years, new approaches to measuring SWB have begun to appear, using digital traces as the main source of information, and show potential to overcome the shortcomings of traditional survey-based methods. In this paper, we propose the formal model for ...

Added: August 15, 2022

NRU-HSE at SemEval-2017 Task 4: Tweet Quantification Using Deep Learning Architecture

Karpov N., in: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Vancouver: Association for Computational Linguistics, 2017. P. 683–688.

In many areas, such as social science, politics, or market research, people need to deal with dataset shift over time. Distribution drift typically appears in sentiment analysis, when the proportions of instances change over time. In this case, the task is to correctly estimate the proportions of each sentiment expressed in the ...

Added: November 14, 2017

Current Issues and Trends in Computational Linguistics

Toldova S., Lyashevskaya O., Вопросы языкознания 2014 No. 1 P. 120–145

This paper is an overview of current issues and tendencies in computational linguistics. The overview is based on the materials of the COLING 2012 conference on computational linguistics. Modern approaches to traditional NLP domains such as POS tagging, syntactic parsing, and machine translation are discussed. The highlights of automated information extraction, such as fact extraction, ...

Added: October 15, 2013

Applying Machine Learning Methods to the Automatic Classification of Articles by UDC

Romanov A., Ломотин К. Е., Козлова Е. С., Информационные технологии 2017 Vol. 23 No. 6 P. 418–423

The paper deals with the applicability of modern machine learning methods to the problem of automatic generation of UDC for scientific articles. As the classifiers, such models as artificial neural networks, logistic regression and boosting are considered. Graph algorithms and a prototype software module to generate UDC are designed. ...

Added: July 30, 2017

Analysis of Images, Social Networks and Texts. Third International Conference, AIST 2014, Yekaterinburg, Russia, April 10-12, 2014, Revised Selected Papers

Berlin: Springer, 2014.

This book constitutes the proceedings of the Third International Conference on Analysis of Images, Social Networks and Texts, AIST 2014, held in Yekaterinburg, Russia, in April 2014. The 11 full and 10 short papers were carefully reviewed and selected from 74 submissions. They are presented together with 3 short industrial papers, 4 invited papers and ...

Added: November 13, 2014

Texterra: An Infrastructure for Text Analysis

Денис Турдаков, Астраханцев Н. А., Недумов Я. Р. et al., Труды Института системного программирования РАН 2014 Vol. 26 P. 421–438

The paper presents a framework for fast text analytics developed during the Texterra project. Texterra is a technology for multilingual text mining based on novel text processing methods that exploit knowledge extracted from user-generated content. It delivers a fast, scalable solution for text mining without expensive customization. Depending on the use case, Texterra could be utilized ...

Added: November 6, 2017

Supplementary Proceedings of the 3rd International Conference on Analysis of Images, Social Networks and Texts (AIST 2014)

Ekaterinburg: CEUR Workshop Proceedings, 2014.

AIST'2014 is an international data science conference on Analysis of Images, Social Networks, and Texts. Traditionally, the conference is held annually in Yekaterinburg, Russia. The conference is intended for computer scientists and practitioners whose research interests involve Internet mathematics and other related fields of data science. LIST OF TOPICS (NON EXHAUSTIVE) Applications of Data Mining and Machine ...

Added: August 28, 2014

Share of Toxic Comments among Different Topics: The Case of Russian Social Networks

Smetanin S., Komarov M. M., in: IEEE 23rd Conference on Business Informatics (CBI). IEEE Computer Society, 2021. P. 65–70.

With the widespread use of online social networks, it is becoming more and more difficult to monitor and analyse all the user-generated content. Toxic speech in online conversations should be treated as a matter with serious social gravity, since it may result in both negative impacts on mental health and violent actions in the physical ...

Added: September 14, 2021

Automatic Morpheme Segmentation for Russian: Can an Algorithm Replace Experts?

Morozov D., Garipov T., Lyashevskaya O. et al., Journal of Language and Education 2024 Vol. 10 No. 4 P. 71–84

Introduction: Numerous algorithms have been proposed for the task of automatic morpheme segmentation of Russian words. Due to the differences in task formulation and datasets utilized, comparing the quality of these algorithms is challenging. It is unclear whether the errors in the models are due to the ineffectiveness of algorithms themselves or to errors and inconsistencies ...

Added: January 7, 2025

Comparative analysis of classification methods for text in UDC code generation problem for scientific articles

Lomotin K. E., Kozlova E. S., Romanov A., in: Information Innovative Technologies: Materials of the International scientific-practical conference. Moscow: Association of graduates and employees of AFEA named after prof. Zhukovsky, 2017. P. 359–363.

The research studies the applicability of the most relevant modern classification methods to the problem of automatically generating universal decimal classification (UDC) codes for arbitrary scientific articles. The following methods are considered as classifiers: artificial neural network, logistic regression, naive Bayesian classifier and metrical ...

Added: July 30, 2017

9th Russian Summer School in Information Retrieval (RuSSIR 2015)

Braslavski P., Markov I., Pardalos P. M. et al., ACM SIGIR Forum 2016 Vol. 49 No. 2 P. 72–79

This paper provides the reader with a report on the 9th Russian Summer School in Information Retrieval (RuSSIR 2015). ...

Added: February 27, 2017

Breaking Sticks and Ambiguities with Adaptive Skip-gram

Bartunov S., Кондрашкин Д. А., Osokin A. et al., Series arXiv:1502.07257 "Computation and language". 2015.

The recently proposed Skip-gram model is a powerful method for learning high-dimensional word representations that capture rich semantic relationships between words. However, Skip-gram, like most prior work on learning word representations, does not take word ambiguity into account and maintains only a single representation per word. Although a number of Skip-gram modifications were proposed to ...

Added: November 5, 2015

A Deep Learning Method Study of User Interest Classification

Malafeev A., Nikolaev K., in: Analysis of Images, Social Networks and Texts. 8th International Conference, AIST 2019, Kazan, Russia, July 17–19, 2019, Revised Selected Papers. Communications in Computer and Information Science. Vol. 1086. Springer, 2020. P. 154–159.

In this paper, a deep learning method study is conducted to solve a new multiclass text classification problem, identifying user interests by text messages. We used an original dataset of almost 90 thousand forum text messages, labeled for ten interests. We experimented with different modern neural network architectures: recurrent and convolutional, as well as simpler ...

Added: November 7, 2019

Alexander Kotov, Elena Treshcheva, Leonid Bessonov, Dmitry I. Ignatov, Yana Volkovich, Maria Eskevich, Pavel Braslavski: 10th Russian Summer School in Information Retrieval (RuSSIR 2016)

Kotov A., Treshcheva E., Bessonov L. et al., SIGIR Forum (ACM Special Interest Group on Information Retrieval) 2016 Vol. 50 No. 2 P. 28–35

This paper provides the reader with a report on the 10th Russian Summer School in Information Retrieval (RuSSIR 2016). ...

Added: February 27, 2017

Identifying and Visualizing Trends in Science, Technology, and Innovation Using SciBERT

Lobanova P., Bakhtin P., Sergienko Y., IEEE Transactions on Engineering Management 2024 No. 71 P. 11898–11906

Identification of science, technology, and innovation trends is a critical topic both for the scientific community and for companies that develop technologies, work on science and technology policy, or invest in high tech. In this research, the authors demonstrate a novel approach implemented in the iFORA system (developed by National Research University Higher School of Economics) using ...

Added: September 8, 2023

Classification of Short Scientific Texts

I. K. Kusakin, Fedorets O. V., A. Y. Romanov, Scientific and Technical Information Processing 2023 Vol. 50 No. 3 P. 176–183

This paper discusses modern approaches to natural language processing and the application of machine learning models to the task of classifying short scientific texts in Russian. This study is devoted to the analysis of methods for vectorization of textual information, selection of a model for scientific paper classification, and training of linguistic model BERT ...

Added: November 4, 2023

Referential Choice: Predictability and Its Limits

Kibrik A. A., Khudyakova M., Dobrov G. B. et al., Frontiers in Psychology 2016 Vol. 7 No. 1429 P. 1–21

We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, ...

Added: September 28, 2016

Faster variational inducing input Gaussian process classification

Izmailov P., Kropotov D., Journal of machine learning and data analysis 2017 Vol. 3 No. 1 P. 20–35

Background: Gaussian processes (GP) provide an elegant and effective approach to learning in kernel machines. This approach leads to a highly interpretable model and allows using the Bayesian framework for model adaptation and incorporating the prior knowledge about the problem. The GP framework is successfully applied to regression, classification, and dimensionality reduction problems. Unfortunately, the ...

Added: December 6, 2018

Connectome Classification Based on Local Metrics on Stochastic Matrices

Ivanov A., Petrov D., in: Proceedings of the conference "Information Technologies and Systems" (ИТиС'16). Moscow: ИППИ РАН, 2016. P. 509–516.

Many graph metrics rest on the assumption that the weights of a graph represent distances between vertices, which can be added. If these metrics are computed for the stochastic matrices of a random walk on the graph, the physical meaning of the transition probabilities between vertices is lost (since transition probabilities are multiplied, not added). We propose to solve this problem by using the negative logarithms of the edge weights. Using this ...

Added: December 15, 2016

Real Estate Valuation Based on Big Data

Mamedli M., Умнов А. В., Вопросы экономики 2022 No. 12 P. 118–136

The paper considers the application of web scraping and machine learning algorithms to the assessment of real estate prices on the secondary housing market in Moscow. For this, we collect and process the data from the CIAN website and the data from “Reforma GKH”. To evaluate real estate objects, we consider such machine ...

Added: January 11, 2023

What Is in My Profile for You: VKontakte Data as a Tool for Studying the Interests of Modern Teenagers

Polivanova K. N., Smirnov I., Вопросы образования 2017 No. 2 P. 134–152

Children’s interests play a key role in their psychological development. However, research in this field is associated with serious methodological problems, as it has traditionally used questionnaire surveys that cannot adequately describe the diverse and dynamic world of interests of a developing person. The article suggests using the information on VKontakte communities followed by teenagers, ...

Added: July 21, 2017