An Opinion Word Lexicon and a Training Dataset for Russian Sentiment Analysis of Social Media

Koltsova O. Yu; Alexeeva S. V.; Kolcov S. N.

P. 277–287.

Automatic assessment of sentiment in large text corpora is an important goal in the social sciences. This paper describes the methodology and results of developing a system for Russian-language sentiment analysis that includes a publicly available sentiment lexicon, a publicly available test collection with sentiment markup, and a crowdsourcing website for such markup. The lexicon is aimed at detecting sentiment in user-generated content (blogs, social media) related to social and political issues. Its prototype was formed from other dictionaries and from topic modeling performed on a large collection of blog posts. Topic modeling revealed relevant (social and political) topics and, as a result, relevant words for the lexicon prototype and relevant texts for the training collection. Each word was assessed by at least three volunteers in the context of three different texts in which the word occurred, and the texts received their sentiment scores from the same volunteers. Both texts and words were scored from −2 (negative) to +2 (positive). Of 7,546 candidate words, 2,753 received non-neutral sentiment scores. The quality of the lexicon was assessed with the SentiStrength software by comparing human text scores with the scores obtained automatically from the created lexicon. 93% of texts were classified correctly at the error level of ±1 class, which closely matches the result of SentiStrength's initial application to English-language tweets. Negative classes were much larger and better predicted. The lexicon and the text collection are publicly available at http://linis-crowd.org.
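The scoring and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the volunteer ratings were aggregated, so averaging them, the "mean not equal to 0" rule for non-neutral words, and all function names and sample data below are assumptions made for illustration. SentiStrength's own scoring is not reproduced here; only the ±1 class accuracy measure is shown.

```python
from statistics import mean


def aggregate_word_score(ratings):
    """Average volunteer ratings (each from -2 to +2) into one word score.

    Averaging is an assumption; the abstract only states that each word
    was assessed by at least three volunteers.
    """
    if len(ratings) < 3:
        raise ValueError("expected at least three ratings per word")
    return mean(ratings)


def build_lexicon(word_ratings):
    """Keep only words whose aggregated score is non-neutral (not 0).

    The exact neutrality rule is hypothetical.
    """
    lexicon = {}
    for word, ratings in word_ratings.items():
        score = aggregate_word_score(ratings)
        if score != 0:
            lexicon[word] = score
    return lexicon


def within_one_accuracy(human, predicted):
    """Share of texts whose predicted class differs from the human
    class by at most one step on the -2..+2 scale."""
    if len(human) != len(predicted) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    hits = sum(1 for h, p in zip(human, predicted) if abs(h - p) <= 1)
    return hits / len(human)


if __name__ == "__main__":
    # Toy ratings, invented for illustration only.
    ratings = {"good": [1, 2, 1], "table": [0, 0, 0], "awful": [-2, -2, -1]}
    print(build_lexicon(ratings))

    # Toy human vs. automatic text scores, invented for illustration only.
    print(within_one_accuracy([-2, -1, 0, 1, 2], [-1, -1, 2, 1, 0]))
```

Under this measure, a text scored −2 by humans and −1 automatically counts as correct, while one scored 0 by humans and +2 automatically does not.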

Language: English

Keywords: crowdsourcing; topic modeling; LiveJournal; Russian blogosphere; sentiment lexicon; web interface; sentiment markup; test collection

Publication based on the results of:

Development of a publicly available database and a crowdsourcing web resource for creating sentiment analysis tools (2014)

In book

Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (Moscow, June 1–4, 2016)

Issue 15. Moscow: RGGU Publishing House, 2016.

Optimizing Modality Weights in Topic Models of Transactional Data

Khrylchenko K., Vorontsov K. V., Automation and Remote Control 2022 Vol. 83 No. 12 P. 1908–1922

Added: November 19, 2025

Human agency as a factor in corporate success

Sorokin P. S., Afanaseva I., Мониторинг общественного мнения: Экономические и социальные перемены 2025 No. 4 P. 202–224

The article is devoted to the study of manifestations and methods of supporting agentic (i.e. transforming the environment in a direction not determined by it) behavior as a factor of success of contemporary corporations in the condition of neo-structuration, that is, a new phase of societal evolution, which assumes a change in the relationship between ...

Added: September 5, 2025

From productivity to wellbeing? Topic modelling of doctoral education research

Smirnov N., Higher Education 2025

Doctoral education has undergone significant transformations over the past two decades, driven by massification, internationalization, and the diversification of training models. These shifts have led to a growing body of research on doctoral education, yet little is known about the overarching thematic and geographical trends shaping this field. This study applies computational natural language processing ...

Added: May 26, 2025

Digital modeling of the thematic field of research on the social capital of generations in organizations

Volkova N., Бордунос А. К., Чикер В. А. et al., Социальная психология и общество 2025 Vol. 16 No. 1 P. 5–27

Objective. Identify key topics presented in contemporary research on the relationship between social capital and generational differences in organizations, utilizing digital processing approaches on a dataset of scientific publications. Background. The emergence of new technologies, labor migration, and the involvement of representatives of different generations in labor activities have highlighted the process of continuous socialization of individuals in ...

Added: May 5, 2025

Log in via Gosuslugi? Factors shaping attitudes toward e-government services in social media

Егоров В. Ю., Philippov I., Akhremenko A. S., Мониторинг общественного мнения: Экономические и социальные перемены 2025 No. 1 P. 214–239

The focus of the work is related to the public perception of government practices within the framework of digitalization policy. Electronic practices of interaction with the government have long been widespread among most Russians. This is confirmed by both public opinion polls and Russia’s high positions in the world rankings of e-government development. In this ...

Added: May 1, 2025

Censorship as a Dissociative Force: A Case of Sovremennik Magazine, 1847–1866

Vozhik E., Maslinsky K., Lisiukov R., CEUR Workshop Proceedings 2024 P. 938–949

The article focuses on the systemic effects of censorship that manifest themselves in the content of published materials that successfully passed the censorship filters. We understand censorship as a special kind of collective imagination about the (in)acceptable, inherent in a particular political context and influencing the decision-making logic by different actors. The idea is that ...

Added: April 3, 2025

Using topic modeling for communities clusterization in the VKontakte social network

Gorshkov S., Ilyushin E., Chernysheva A. et al., International Journal of Open Information Technologies 2021 Vol. 9 No. 5 P. 12–17

Topic modeling is one of the most widely used methods in text analysis. It can be used to select topics as well as to find the topics distributed in each document from the corpus. In this article, we present a method for clustering communities in the social network VKontakte (the most popular Russian social network) ...

Added: December 25, 2024

TEXTS OF DIFFERENT EMOTIONAL CLASSES AND THEIR TOPIC MODELING

Kolmogorova A., Qiuhua S., Вестник Волгоградского государственного университета. Серия 2: Языкознание 2024 Vol. 23 No. 5 P. 60–71

The article is devoted to studying verbalization specifics of various emotional states in the texts in Russian with the purpose to confirm or refute the hypothesis that texts of different emotional classes reflect the denotative situation not identically, which is reflected in thematic specifics and lexical content. The research material consisted of eight corpus texts ...

Added: November 29, 2024

Contest design and solvers' engagement behaviour in crowdsourcing: The neo-configurational perspective

Tekic A., Alfonzo Pacheco D. V., Technovation 2024 Vol. 132 Article 102986

Companies face the challenges of attracting solvers and motivating them to dedicate their time and effort to develop solutions in crowdsourcing contests. Previous research emphasizes the importance of crowdsourcing contest design for fostering solvers' engagement. However, even though contests are designed as a combination of various design elements, such as seeker's identity disclosure, seeker's status, ...

Added: March 5, 2024

Topic models with elements of neural networks: investigation of stability, coherence, and determining the optimal number of topics

Sergei Koltcov, Surkov A., Filippov V. et al., PeerJ Computer Science 2024 Vol. 10 P. 41

Topic modeling is a widely used instrument for the analysis of large text collections. In the last few years, neural topic models and models with word embeddings have been proposed to increase the quality of topic solutions. However, these models were not extensively tested in terms of stability and interpretability. Moreover, the question of selecting the number of topics ...

Added: February 16, 2024

Strength and weakness: the dynamics of the representation of hegemonic masculinity in Russian-language rap

Zhuchkova S., Бойченко А. Е., Smirnov N., Журнал социологии и социальной антропологии 2024 Vol. 27 No. 1 P. 103–138

In public and academic debate, rap is often presented as one of the most aggressive music genres, depicting violence and cruelty in various ways. One of the reasons for that is rap’s social background. It emerged in the criminal area of New York first created by the deprived Black population. Using the notion of hegemonic ...

Added: February 11, 2024

About the past, but at different times: computer analysis of USSR/Russian history textbook texts for six generations of students

Kolmogorova A., Колмогорова П. А., Куликова Е. Р., Вестник Томского государственного университета. Филология 2024 No. 89 P. 73–103

In this article, we focus on the analysis of the texts of three history textbooks for university students published at different times: in 1946, in 1983 and in 2006. As a material, we use texts devoted in each of the textbooks to seven historical topics since the beginnings of Kiev principality till the Reforms of ...

Added: December 10, 2023

Topic modeling for short texts: a comparative analysis

Vashchenko V., Социология: методология, методы, математическое моделирование 2023 No. 56 P. 69–112

The steady increase in the popularity of social media as a means of communication actualizes methodological issues related to processing of short texts with less semantic context than large corpora, which are widely used for training and testing machine learning models for textual data. Topic modeling, an unsupervised machine learning technique aimed at aggregating texts ...

Added: December 7, 2023

Individual "agency" as an element of human potential: types, manifestations, and effects in the corporate sector. Scientific Digest No. 10 (27)

Sorokin P. S., Afanaseva I., Шмаевка В. К. et al., Moscow: HSE University Publishing House, 2023.

The issue of agency (enterprise, initiative) is one of the central ones for the corporate sector. The key factor determining the importance of this issue is the processes of ‘destructuration’, that is, the growth of variability in the forms of social organization in various spheres of public life. The authors identified three levels of proactive behavior ...

Added: November 16, 2023

Constructing the image of the city in official and everyday communication: a comparative analysis (based on social media)

Matkin N., Коммуникации. Медиа. Дизайн 2025 Vol. 10 No. 3 P. 89–110

The article offers an analysis and visualization of Russian city images that emerge in the comments of urban community subscribers and posts from administrative press services. The city image is regarded as a frame structure that develops through political and interpersonal communication in the network. The social component of the city image is identified as ...

Added: November 15, 2023

Computer modeling as a tool for analyzing literary text

Kolmogorova A., Залевская Е. Д., Филологический класс 2023 Vol. 28 No. 2 P. 22–33

The article investigates the issue of heuristic productivity of using the method of computer-assisted topic modeling for philological analysis of fiction text. The study analyzes the results of applying the Latent Dirichlet Allocation (LDA) algorithm for searching intertextual connections of motifs in two sub-corpora of fiction texts: 62 texts of different genres (stories, essays, ...

Added: October 31, 2023

ENGINEERING LINGUISTIC TECHNOLOGIES IN TEXT RESEARCH

Kolmogorova A., Terra Linguistica 2023 Vol. 14 No. 1 P. 7–10

The publication is devoted to the analysis of the current state of engineering linguistics, its main directions and research challenges. The definition of language technologies and their typology are formulated according to the criterion of the tasks solved with their help. It is noted that the national school of engineering linguistics manages to maintain a ...

Added: October 31, 2023

Quantifying local and mesoscale drivers of the urban heat island of Moscow with reference and crowdsourced observations

Varentsov Mikhail, Fenner D., Meier F. et al., Frontiers in Environmental Science 2021 Vol. 9 Article 716968

Urban climate features, such as the urban heat island (UHI), are determined by various factors characterizing the modifications of the surface by the built environment and human activity. These factors are often attributed to the local spatial scale (hundreds of meters up to several kilometers). Nowadays, more and more urban climate studies utilize the concept ...

Added: October 4, 2023