Using large language models for extracting and pre-annotating texts on mental health from noisy data in a low-resource language

Sergei Koltcov; A. Surkov; O. Koltsova; V. Ignatenko

doi:10.7717/peerj-cs.2395


PeerJ Computer Science, USA. 2024. Vol. 10. Article e2395.


Recent advancements in large language models (LLMs) have opened new possibilities for developing conversational agents (CAs) in various subfields of mental healthcare. However, this progress is hindered by limited access to high-quality training data, often due to privacy concerns and high annotation costs for low-resource languages. A potential solution is to create human-AI annotation systems that draw on the extensive public-domain user-to-user and user-to-professional discussions on social media. These discussions, however, are extremely noisy, so LLMs must be adapted for fully automatic cleaning and pre-classification to reduce human annotation effort. To date, research on LLM-based annotation in the mental health domain is extremely scarce. In this article, we explore the potential of zero-shot classification using four LLMs to select and pre-classify texts into topics representing psychiatric disorders, in order to facilitate the future development of CAs for disorder-specific counseling. We use 64,404 Russian-language texts from online discussion threads labeled with the seven most commonly discussed disorders: depression, neurosis, paranoia, anxiety disorder, bipolar disorder, obsessive-compulsive disorder, and borderline personality disorder. Our research shows that while preliminary data filtering with zero-shot classification slightly improves classification, LLM fine-tuning makes a far larger contribution to its quality. Both standard and natural language inference (NLI) modes of fine-tuning more than triple classification accuracy compared with non-fine-tuned models applied to preliminarily filtered data. Although NLI fine-tuning achieves slightly higher accuracy (0.64) than the standard approach, it is six times slower, indicating a need for further experimentation with NLI hypothesis engineering.
Additionally, we demonstrate that lemmatization does not affect classification quality and that multilingual models using texts in their original language perform slightly better than English-only models using automatically translated texts. Finally, we introduce our dataset and model as the first openly available Russian-language resource for developing conversational agents in the domain of mental health counseling.
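The NLI-style zero-shot setup mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hypothesis template, the 0.5 threshold, and the function names (`build_pairs`, `classify`, `toy_scorer`) are assumptions made for illustration, and `toy_scorer` is a keyword-overlap stand-in for a real NLI model's entailment probability. The sketch shows the general pattern only: each text is paired with one hypothesis per disorder label, the label with the highest entailment score is chosen, and texts that score low on every label are discarded as noise.

```python
from typing import Callable, List, Optional, Tuple

# The seven disorder labels named in the abstract.
LABELS: List[str] = [
    "depression",
    "neurosis",
    "paranoia",
    "anxiety disorder",
    "bipolar disorder",
    "obsessive-compulsive disorder",
    "borderline personality disorder",
]

# Hypothetical hypothesis template; the wording used in the paper is not given here.
TEMPLATE = "This text is about {label}."

# A scorer maps (premise, hypothesis) to an entailment score in [0, 1].
Scorer = Callable[[str, str], float]


def build_pairs(text: str) -> List[Tuple[str, str]]:
    """Build one (premise, hypothesis) pair per candidate label."""
    return [(text, TEMPLATE.format(label=label)) for label in LABELS]


def classify(text: str, scorer: Scorer, threshold: float = 0.5) -> Optional[str]:
    """Return the best-scoring label, or None if the text looks off-topic."""
    scores = {
        label: scorer(premise, hypothesis)
        for label, (premise, hypothesis) in zip(LABELS, build_pairs(text))
    }
    best = max(scores, key=scores.get)
    # Assumed filtering rule: drop texts whose best score is below the threshold.
    return best if scores[best] >= threshold else None


def toy_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for an NLI model: share of label words that occur in the premise."""
    prefix, suffix = TEMPLATE.split("{label}")
    label = hypothesis[len(prefix): len(hypothesis) - len(suffix)]
    words = label.lower().split()
    text = premise.lower()
    return sum(word in text for word in words) / len(words)


if __name__ == "__main__":
    print(classify("I think my bipolar disorder is getting worse", toy_scorer))
    print(classify("Selling a used bike, message me", toy_scorer))
```

In practice `toy_scorer` would be replaced by a call to a multilingual NLI model. Note that this pattern needs one scorer call per label for every text, which makes it more expensive than a single-pass classifier; the abstract reports NLI mode as slower but does not state the cause.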

Research target: Computer Science; Psychology

Keywords: natural language inference; large language model (LLM); zero-shot classification; psychological text data

Publication based on the results of:

Modelling information and communication behaviour in computer-mediated environments and improving algorithms for behavioural data analysis (2024)
