Semantic Feature Aggregation for Gender Identification in Russian Facebook
The goal of the current work is to evaluate semantic feature aggregation techniques for gender classification of public social media texts in Russian. We collect Facebook posts of Russian-speaking users and use them as a dataset for two topic modelling techniques and a distributional clustering approach. The output of these algorithms serves as a feature aggregation method for gender classification on a smaller Facebook sample. The classification performance of the best model compares favorably against the lemma baseline and against state-of-the-art results reported for other genres and languages. We exemplify the most successful features and discuss how the three techniques differ in classification performance and feature content, with the best technique clearly outperforming the others.
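The aggregation step can be illustrated with a minimal sketch. The lemma-to-cluster mapping below is invented for illustration; in the study such mappings come from topic modelling or distributional clustering. The idea is simply to collapse sparse lemma counts into dense per-cluster counts that serve as classifier features.

```python
from collections import Counter

# Hypothetical mapping from lemmas to semantic clusters
# (in practice produced by a topic model or distributional clustering).
lemma_to_topic = {
    "car": "vehicles", "engine": "vehicles",
    "dress": "fashion", "shoes": "fashion",
}

def aggregate_features(lemmas):
    """Collapse a bag of lemmas into per-cluster counts used as features."""
    topics = Counter()
    for lemma in lemmas:
        if lemma in lemma_to_topic:
            topics[lemma_to_topic[lemma]] += 1
    return dict(topics)

post = ["car", "engine", "dress", "car"]
features = aggregate_features(post)
# features == {"vehicles": 3, "fashion": 1}
```

The resulting feature vectors are far lower-dimensional than raw lemma counts, which is what makes classification on a small labelled sample feasible.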
This study explores the relationship between the Internet and the Russian national elections of 2011-2012. In contrast to other studies, we focus on the blogosphere as a political factor. Our conclusions are based on a study of the LiveJournal blogging platform, represented by a sample of political posts from the top 2,000 bloggers over 13 week-long periods. Sampling from the population of about 180,000 posts was performed automatically with a topic modelling algorithm, while the analysis of the resulting 3,690 texts was carried out manually by five coders. We found that the most influential Russian blogs perform the role of a media “stronghold” of the political opposition. Moreover, we established a relationship between the weekly pre-election ratings of the opposition parties and presidential candidates and the indicators of political activity in the blogosphere. Our results cautiously suggest that political activity on the Internet is not simply an online projection of offline political activity: it can itself provoke activity in offline political life.
Distributed vector representations of natural language vocabulary have attracted much attention in contemporary computational linguistics. This paper summarizes our experience of applying neural network language models to the task of calculating semantic similarity for Russian. The experiments were performed in the course of the Russian Semantic Similarity Evaluation track, where our models took 2nd to 5th positions, depending on the task. We introduce the tools and corpora used, comment on the nature of the evaluation track, and describe the achieved results. We found that Continuous Skip-gram and Continuous Bag-of-Words models, previously applied successfully to English material, can be used for semantic modeling of Russian as well. Moreover, we show that texts in the Russian National Corpus (RNC) provide excellent training material for such models, outperforming other, much larger corpora. This is especially true for semantic relatedness tasks (although stacking models trained on larger corpora on top of RNC models improves performance even further). High-quality semantic vectors learned in this way can be used in a variety of linguistic tasks and promise an exciting field for further study.
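The core operation in such similarity evaluations is ranking candidate words by cosine similarity between their embeddings. A minimal sketch with toy three-dimensional vectors (real Skip-gram/CBOW vectors have hundreds of dimensions and are learned from a corpus; the words and values here are invented):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy "embeddings" for illustration only.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def most_similar(word):
    """Return the nearest neighbour of `word` by cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))
```

Here `most_similar("king")` returns `"queen"`, because its vector points in nearly the same direction, while `"apple"` is nearly orthogonal.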
The paper considers the problem of visual cluster analysis for multidimensional textual datasets. To analyze clusters in the original data space, elastic maps are used as a method of mapping the original data points onto embedded manifolds of lower dimensionality. By decreasing the elasticity parameters, one can construct a map surface that approximates the multidimensional textual dataset in question much more closely. The elastic map approach requires no a priori information about the data and does not depend on the data's nature or origin. The probabilistic algorithm t-SNE (t-distributed stochastic neighbor embedding) has similar properties and is conceptually close to elastic maps. The paper describes the results of applying both approaches (elastic maps and t-SNE) to visual cluster analysis of multidimensional textual datasets. For elastic maps, a «Quasi-Zoom» technique is proposed, which improves the results of cluster analysis in regions where data points concentrate. The presented results illustrate the efficiency and applicability of both approaches to cluster analysis of natural language terms.
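The first step of t-SNE, converting high-dimensional distances into pairwise affinities, can be sketched as follows. This is a simplified illustration, not the paper's implementation: real t-SNE chooses a per-point bandwidth via a perplexity search, while here a single fixed `sigma` is used for brevity.

```python
import math

def gaussian_affinities(points, sigma=1.0):
    """Symmetrized high-dimensional similarities used as input to t-SNE:
    p_ij proportional to exp(-||x_i - x_j||^2 / (2 * sigma^2)),
    normalized so that all affinities sum to 1."""
    n = len(points)
    p = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                p[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    total = sum(sum(row) for row in p)
    return [[v / total for v in row] for row in p]
```

Nearby points receive much larger affinities than distant ones; the low-dimensional embedding is then optimized so that a Student-t distribution over map distances matches these affinities.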
The present paper deals with word sense induction from lexical co-occurrence graphs. We construct such graphs on large Russian corpora and then apply the data to cluster the results of Mail.ru search according to the meanings of the query. We compare different methods of performing such clustering and different source corpora, and describe models for applying distributional semantics to big linguistic data.
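The graph-based induction idea can be sketched in a few lines. This is a deliberately simplified illustration (real systems weight edges and use more robust graph clustering): words that co-occur with an ambiguous target are linked, and the connected components of the resulting context graph approximate the target's senses.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(sentences, target):
    """Link words that co-occur with `target` within the same sentence."""
    graph = defaultdict(set)
    for sent in sentences:
        if target in sent:
            context = [w for w in sent if w != target]
            for a, b in combinations(set(context), 2):
                graph[a].add(b)
                graph[b].add(a)
    return graph

def connected_components(graph):
    """Each component of the context graph approximates one induced sense."""
    seen, components = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            cur = stack.pop()
            if cur in comp:
                continue
            comp.add(cur)
            stack.extend(graph[cur] - comp)
        seen |= comp
        components.append(comp)
    return components
```

For a toy corpus where "jaguar" appears once with {fast, car} and once with {jungle, cat}, the graph splits into two components, one per sense, and search results can then be grouped by which component their context words fall into.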
In natural language processing, distributional semantic models are known as an efficient data-driven approach to word and text representation, which computes word meanings directly from large text corpora in the form of embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing the performance of distributional models, and particularly studies pronominal anaphora resolution as a way to exploit more co-occurrence data without directly increasing the size of the training corpus. We replace three different types of anaphoric pronouns with their antecedents in the training corpus and evaluate the extent to which this affects the performance of the resulting models in lexical similarity tasks. CBOW and SkipGram distributed models trained on the Russian National Corpus are the focus of our research, although the results are potentially applicable to other distributional semantic frameworks and languages as well. The trained models are evaluated against the RUSSE '15 and SimLex-999 gold-standard datasets. We find that models trained on corpora with pronominal anaphora resolved perform significantly better than their counterparts trained on the baseline corpora.
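The preprocessing idea is simple to sketch: before training, each anaphoric pronoun is replaced by its antecedent, so the antecedent word gains extra co-occurrence contexts. In this illustration (English example, hand-specified antecedent link) the resolver's output is modelled as a position-to-antecedent mapping; the paper uses an actual anaphora resolver on Russian text.

```python
def resolve_anaphora(tokens, antecedents):
    """Replace pronouns with their antecedents in a tokenized sentence.
    `antecedents` maps token positions of pronouns to antecedent lemmas."""
    return [antecedents.get(i, tok) for i, tok in enumerate(tokens)]

tokens = ["the", "cat", "slept", "because", "it", "was", "tired"]
resolved = resolve_anaphora(tokens, {4: "cat"})
# resolved == ["the", "cat", "slept", "because", "cat", "was", "tired"]
```

After this substitution, "cat" co-occurs with "tired" within a typical context window, a pairing the embedding model would otherwise never observe from this sentence.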
Purpose – The paper addresses the question of what drives the formation of latent discussion communities, if any, in the blogosphere: the topical composition of posts or their authorship? The purpose of this paper is to contribute to knowledge about the structure of co-commenting.
Design/methodology/approach – The research is based on a dataset of 17,386 full-text posts written by the top 2,000 LiveJournal bloggers and over 520,000 comments that result in about 4.5 million edges in the network of co-commenting, where posts are vertices. The Louvain algorithm is used to detect communities of co-commenting. Cosine similarity and topic modeling based on latent Dirichlet allocation are applied to study topical coherence within these communities.
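The construction of the co-commenting network can be sketched as follows. This is an illustrative simplification with invented user and post identifiers: posts are vertices, and two posts are linked with a weight equal to the number of commenters they share; community detection (e.g. Louvain) is then run on this weighted graph.

```python
from collections import defaultdict
from itertools import combinations

def cocommenting_edges(comments):
    """Build weighted co-commenting edges from (commenter, post_id) pairs.
    Two posts gain one unit of edge weight for every commenter they share."""
    posts_by_user = defaultdict(set)
    for user, post in comments:
        posts_by_user[user].add(post)
    weights = defaultdict(int)
    for posts in posts_by_user.values():
        for a, b in combinations(sorted(posts), 2):
            weights[(a, b)] += 1
    return dict(weights)

comments = [("u1", "p1"), ("u1", "p2"),
            ("u2", "p1"), ("u2", "p2"),
            ("u3", "p3")]
edges = cocommenting_edges(comments)
# edges == {("p1", "p2"): 2}
```

Note how the number of edges grows quadratically in the number of posts each user comments on, which is why half a million comments can yield millions of edges.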
Findings – Bloggers unite into moderately manifest communities by commenting on roughly the same sets of posts. The graph of co-commenting is sparse and connected by a minority of active non-top commenters. Communities are centered mainly around blog authors as opinion leaders and, to a lesser extent, around a shared topic or topics.
Research limitations/implications – The research has to be replicated on other datasets with more thorough hand coding to ensure the reliability of results and to reveal average proportions of topic-centered communities.
Practical implications – Knowledge about the factors around which co-commenting communities emerge, in particular the clustered opinion leaders that often attract such communities, can be used by policy makers in marketing and/or political campaigning when individual leadership is not enough or not applicable.
Originality/value – The research contributes to the social studies of online communities. It is the first study of communities based on co-commenting that combines examination of the content of commented posts and their topics.