Authorship Attribution of Russian Forum Posts with Different Types of N-gram Features

Litvinova T., Litvinova O., Panicheva P.

doi:10.1145/3342827.3342834

Ch. 3. P. 9–14.

Authorship attribution is an important field in online security. Numerous recent studies have reported successful authorship attribution in various European languages. Character n-grams are reported to be the best choice in authorship attribution, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task on a real-world noisy dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features capturing syntactic and discourse patterns. We perform authorship attribution in a single-topic and a cross-topic setting, since our research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful in authorship attribution of Russian forum posts. However, there is no clear distinction between style and content n-grams, as the same types of n-grams work well in both single-topic and cross-topic settings. In our experiments, the generalized simple n-gram features, which reveal syntactic and discourse patterns, also proved very important for authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to character n-grams in authorship attribution of Russian-language forum texts.
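To make the idea of character n-gram features concrete, the following is a minimal sketch of profile-based attribution. It is not the authors' pipeline: the function names, the choice of trigrams, the cosine-similarity classifier, and the toy posts are all assumptions made for illustration, and the abstract does not specify which n-gram types or classifier the paper uses.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Return a Counter of overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_profiles(posts_by_author, n=3):
    """Sum the n-gram counts of each author's known posts into one profile."""
    profiles = {}
    for author, posts in posts_by_author.items():
        profile = Counter()
        for post in posts:
            profile.update(char_ngrams(post, n))
        profiles[author] = profile
    return profiles


def attribute(post, profiles, n=3):
    """Return the author whose profile is most similar to `post`."""
    grams = char_ngrams(post, n)
    return max(profiles, key=lambda author: cosine(grams, profiles[author]))


if __name__ == "__main__":
    # Hypothetical toy data; real experiments need many posts per author.
    known = {
        "author_a": ["well... again!!!", "so what... again!!!"],
        "author_b": ["I believe this is incorrect.", "I believe it is debatable."],
    }
    profiles = build_profiles(known)
    print(attribute("well then... again!!!", profiles))
```

Because the n-grams are taken over raw characters, punctuation habits such as repeated dots or exclamation marks end up in the profile alongside word fragments, which is one informal way to see why such features can carry both style and content signals.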

Language: English

Keywords: Russian language, authorship attribution, n-gram, extremist forum

In book

NLPIR 2019: Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval

ACM, 2019.

Detecting Ethnic Conflict in Social Media with Transformers and Augmented Data

Koltsova O., Surkov A., Procedia Computer Science 2025 Vol. 258 P. 2382–2390

Detection of ethnic conflict mentioning, discussion, or verbal participation therein in user-generated content is a socially important task, as such content has been proven related to ethnic clashes on the ground. Yet this task has not been ...

Added: November 28, 2025

Speech acts with polite diminutives: genre and discourse features

Fufaeva I., Вестник Волгоградского государственного университета. Серия 2: Языкознание 2025 Vol. 24 No. 4 P. 78–90

This study delves into speech acts utilizing diminutives for politeness, focusing on their discursive and genre-related aspects. It draws on authorial recordings of spoken discourse, data from the National Corpus of the Russian Language, and recordings of urban speech from the 1970s and late twentieth century. The research highlights the potential usage of polite diminutives in ...

Added: November 25, 2025

Interpretation of complex sentences with different types of matrix predicates in the context of negation and modal operators

Letuchiy A., Russian Linguistics 2025 Vol. 49 No. 2

The article discusses the types of interpretation that Russian complex sentences with factive, implicative and interpretation verbs get under negation and modal operators. By default, the external negative and modal context affects only the main situation. However, one finds exceptions to this rule. We call ‘transparent readings’ those readings in which the external context affects semantically both the matrix ...

Added: November 5, 2025

Gender stereotypes in agreement processing with role nouns: a study on Russian

Slioussar N., Antropova D., Frontiers in Psychology 2025 Vol. 16 Article 1619505

The majority of Russian nouns denoting professions and social roles are grammatically masculine. Some of them have feminine pairs, the others do not, but in modern Russian, most nouns in this group can be used to refer to women — either with masculine or with feminine agreement. This option has some interesting limitations that have ...

Added: September 22, 2025

New nominations of men in youth slang

Krongauz M., Труды института русского языка им. В.В. Виноградова 2025 No. 3(45) P. 159–167

The article is devoted to modern youth slang, namely to the nominations of men that have appeared most recently: ank, masik, normis, sigma, skuf, tubik, chechik, shtrikh. It is noted that the words masik, tubik, chechik, shtrikh are often discussed together on the Internet and have common semantic and pragmatic characteristics. They denote types of ...

Added: September 17, 2025

A new quantitative model of the Platonic corpus 2. Phylogenetic methods in stylometry

Alieva O., Вестник Православного Свято-Тихоновского гуманитарного университета. Серия 3: Филология 2025 Vol. 84 P. 55–83

Despite the criticism, the standard chronology of Plato’s works continues to hold sway not only over “developmentalists”, but also over various types of “unitarians”. The authority of the standard chronology rests on the confidence that the division of the dialogues into three groups has been “proven” with quantitative methods. In addition to the general theoretical ...

Added: August 28, 2025

Cultural Evaluation of LLMs in Russian: Catchphrases and Cultural Types

Громенко Е. С., Калачева Д. С., Klokova K. et al., in: Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue" (2025). [s.n.], 2025.

This study addresses the gap in evaluating large language models' (LLMs) cultural awareness and alignment within the Russian sociocultural context by introducing a structured framework comprising 8 Cultural Types (e.g., Spiritual Practitioner, Soviet Intellectual) and 5 catchphrase groups (e.g., memes, proverbs). A 400-question evaluation dataset was developed to probe 10 multilingual LLMs, including GPT-4o, ...

Added: May 10, 2025

Control in the infinitival purpose construction with the verbs prinesti ‘bring’ and vz’at’ ‘take’ in Russian

Fedorov D., Вопросы языкознания 2025 No. 4 P. 77–96

In the article I look at conjunctionless purpose infinitive usages with verbs prinesti ‘bring’ and vz’at’ ‘take’ in the matrix position in Russian. At first, it is unclear whether the expressed object is a dependent of the matrix verb or the embedded verb, and whether the two verbs form a single predicative complex or each ...

Added: April 21, 2025

The history of the idiom ne zanimat': a reanalysis that strayed from the path

Баркова Л. А., Русский язык в научном освещении 2024 No. 2(48) P. 103–128

The article explores the history of an idiom ne zanimat' 'lit. not to borrow' in the context of DCxG. The source of this idiom is the negative matrix clause modal infinitive. This is why the idiom in the earliest contexts was the head of clauses, the syntax of which was identical to the clauses with ...

Added: March 9, 2025

A new comprehensive Serbian-Russian dictionary (general concept and problems of lexicographic description)

Драгичевич Р., Королькова М. Д., Ryzhova D. et al., Вопросы лексикографии 2024 No. 32 P. 43–60

Added: January 31, 2025

Dynamics of linguistic and cultural processes in modern Russia. Issue 8. Proceedings of the VIII Congress of ROPRYAL (Krasnoyarsk, September 10–14, 2024)

ROPRYAL, 2024.

The book includes the texts of reports and scientific presentations of the participants of the VIII Congress of ROPRYAL (Krasnoyarsk, September 10-14, 2024), devoted to topical aspects of the study of Russian language and literature. Special attention is paid to new trends in the description of the Russian language, to the issues of interaction between ...

Added: January 14, 2025

Written vs generated text: “naturalness” as a textual and psycholinguistic category

Kolmogorova A. V, Margolina A. V., Научный результат. Серия: Вопросы теоретической и прикладной лингвистики 2024 Vol. 10 No. 2 P. 71–99

In the context of the development of text generation technologies, the opposition “naturalness − unnaturalness of text” has been transformed into a new dichotomy: “naturalness – artificiality”. The aim of this article is to investigate the phenomenon of naturalness in this context from two perspectives: analyzing the linguistic characteristics of a natural text against a ...

Added: November 29, 2024

TEXTS OF DIFFERENT EMOTIONAL CLASSES AND THEIR TOPIC MODELING

Kolmogorova A., Qiuhua S., Вестник Волгоградского государственного университета. Серия 2: Языкознание 2024 Vol. 23 No. 5 P. 60–71

The article is devoted to studying verbalization specifics of various emotional states in the texts in Russian with the purpose to confirm or refute the hypothesis that texts of different emotional classes reflect the denotative situation not identically, which is reflected in thematic specifics and lexical content. The research material consisted of eight corpus texts ...

Added: November 29, 2024

Bog vest', chto chert znaet ('God knows what the devil knows'): on the development of X znaet ‘X knows’ constructions in a diachronic perspective

Budennaya E., Litvintseva K., Yakovleva A., Русский язык в научном освещении 2024 No. 2(48) P. 31–69

This paper presents a diachronic corpus study of constructions of the X znaet ‘X knows’ type. Such constructions have a long written history, so revealing the peculiarities of their constructionalization process is of great importance also for the theory of idiomatization. First, we focus on the semantics and compatibility of the construction with the anchor Bog ...

Added: October 30, 2024

Kon'yachku by, da do domu ('Some cognac, then home'): a chronology of the development of some forms of the second genitive case

Budennaya E., Труды института русского языка им. В.В. Виноградова 2024 No. 2(40) P. 261–282

Based on material from the Russian National Corpus, the article discusses the diachronic development of structures with the Russian second genitive case in three types of contexts: 1) with nominal quantifiers; 2) with the preposition bez ‘without’; 3) with the preposition do ‘towards’. The data obtained from Russian are compared with data from other languages (Finnic and several Turkic), in which there is a tendency to use the partitive ...

Added: October 4, 2024

Legal regulation of the implementation of the state policy of preserving civic unity and interethnic harmony of peoples

Титор С. Е., Мышко Ф. Г., Шагиева Р. В. et al., Moscow: Русайнс, 2023.

The study analyzes the current legislation regulating measures for the implementation of national policies aimed at preserving citizenship and national identity. Within the framework of the study, a sociological analysis of the expert opinion of a wide range of people on the issues of solving problems of national policy was carried out. The analysis of ...

Added: August 11, 2024

How does Burrows' Delta work on medieval Chinese poetic texts?

Orekhov B., Series Computer Science "arxiv.org". 2024.

Burrows' Delta was introduced in 2002 and has proven to be an effective tool for author attribution. Despite the fact that it was applied to different languages, they mostly belong to the same grammatical type and use the same graphic principle to convey speech in writing: a phonemic alphabet with word separation using spaces. The question ...

Added: August 8, 2024

Does Delta really confirm that Rowling and Galbraith are the same author?

Orekhov B., Series Computer Science "arxiv.org". 2024.

Added: August 8, 2024

Russian National Corpus 2.0: new opportunities and development prospects

Савчук С. О., Архангельский Т. А., Bonch-Osmolovskaya A. A. et al., Вопросы языкознания 2024 No. 2 P. 7–34

The paper provides an overview of the results of the fundamental reconstruction and modernization project of the National Corpus of the Russian Language platform, carried out from 2020 to 2023. The focus of the paper is on the new opportunities that are opening up for linguists and a wider audience. This includes improving the representativeness ...

Added: March 21, 2024

Concord in Russian close appositional constructions: a quantitative study

Logvinova N., Russian linguistics 2024 Vol. 48 No. 1 Article 4

The paper discusses case concord in Russian appositional constructions, which manifests itself in optional case concord of the proper name (v rek-e(LOC) Don-e(LOC) / v rek-e(LOC) Don(NOM) ‘in the river Don’). The study provides an in-depth corpus analysis of more than 15,000 examples, using a logistic regression statistical model to predict the choice between presence and ...

Added: March 17, 2024