HSE University (National Research University Higher School of Economics)

Enhancing Emotion Recognition in Speech Based on Self-Supervised Learning: Cross-Attention Fusion of Acoustic and Semantic Features

IEEE Access. 2025. Vol. 13. P. 56283–56295.
Deeb B., Savchenko A. V., Makarov I.

Speech emotion recognition has gained considerable attention in speech processing and machine learning because of its potential applications in human-computer interaction, mental health monitoring, and customer service. However, state-of-the-art models for speech emotion recognition have many parameters, which makes them computationally expensive. In this paper, we introduce a novel deep-learning model that improves the accuracy of emotional content detection in speech signals while remaining lightweight compared to state-of-the-art models. The proposed model combines a feature encoder, which significantly improves the emotional representation of acoustic features, with a cross-attention mechanism that fuses acoustic features, such as spectrograms, with semantic features extracted from a pre-trained self-supervised learning framework, enriching the emotional representation of speech. An extensive experimental study demonstrates that the proposed model achieves a weighted accuracy of 74.6% on the IEMOCAP dataset, which is competitive with state-of-the-art baselines. In addition, the proposed model achieves a latency of 24 milliseconds on moderately powered devices while containing up to three times fewer parameters than those baselines.
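The cross-attention fusion mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which modality supplies the queries, so using acoustic frames as queries and semantic frames as keys and values is an assumption, as are the toy vectors, the function names, and the absence of learned projection matrices. The sketch only shows generic scaled dot-product cross-attention in plain Python.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Queries come from one modality; keys and values come from the other.
    No learned projections are applied (illustrative simplification).
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values), weights

# Hypothetical toy inputs: 3 acoustic frames and 2 semantic frames, dimension 2.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[2.0, 0.0], [0.0, 2.0]]

# Assumption: acoustic frames act as queries over the semantic frames.
fused, weights = cross_attention(acoustic, semantic, semantic)
```

Each row of `weights` sums to 1 and says how strongly one acoustic frame attends to each semantic frame; `fused` has one row per acoustic frame. A trained model would add learned query, key, and value projections and typically several attention heads.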

Research target: Computer Science
Language: English
Keywords: emotion recognition; speech emotion recognition; cross-attention mechanism; attention mechanism; feature fusion
Similar publications
Exploring New Frontiers in Vertical Federated Learning: the Role of Saddle Point Reformulation
Beznosikov A., Kormakov G., Grigorievskiy A. et al., Journal of Optimization Theory and Applications 2026 Vol. 209 Article 18
The objective of Vertical Federated Learning (VFL) is to collectively train a model using features available on different devices while sharing the same users. This paper focuses on the saddle point reformulation of the VFL problem via the classical Lagrangian function. We first demonstrate how this formulation can be solved using deterministic methods. More importantly, we explore various stochastic modifications to ...
Added: June 17, 2026
Supervised Learning in Critical Phenomena—Statistical and Systematic Accuracy
Chertenkov V. I., Shchur L., Lobachevskii Journal of Mathematics 2026 Vol. 47 No. 2 P. 720–727
Supervised machine learning is successfully applied to the study of critical phenomena and allows us to obtain a numerical estimate of the phase transition temperature and the correlation length exponent. We discuss the influence of possible systematic errors, as well as statistical errors, on the accuracy of such numerical estimates. Errors in the training and ...
Added: June 16, 2026
Automated detection of wolf howls using audio spectrogram transformers
Makarov N., Savchenko A., Zemtsova I. et al., Scientific Reports 2025 Vol. 15 Article 26641
The grey wolf (Canis lupus) is a pivotal species for ecological studies. As a key participant in ecosystem processes, it also serves as a model for investigating social structure formation and ecological adaptation. However, the species’ complex social behavior, spatial dynamics, and expansive habitats make monitoring and population assessments across large areas particularly challenging. In recent years, audio traps ...
Added: June 16, 2026
Artificial intelligence framework for multi-pathology risk assessment from retinal fundus images: deep learning approach to 15-disease screening
Vasilev R., Savchenko A., Blinov P. et al., Frontiers in Medicine 2026 Vol. 13
Automated disease screening systems face challenges when applied to multi-class medical image analysis, particularly under severe class imbalance inherent in clinical datasets. Retinal fundus imaging enables non-invasive screening for multiple ocular and systemic diseases simultaneously, yet current automated systems typically assess risk for only a single pathology or a limited disease range. We developed a ...
Added: June 16, 2026
From Data to Signs: A Foundation Model for Multilingual Sign Language Recognition
Novopoltsev M., Tulenkov A., Murtazin R. et al., IEEE Access 2025 Vol. 13 P. 188170–188181
Video-based Isolated Sign Language Recognition (ISLR) problem presents significant challenges in scaling across diverse languages due to data scarcity and the computational costs associated with training of language-specific models. In this paper, we introduce a novel training pipeline that leverages self-supervised learning on a large-scale sign language dataset. To obtain the foundation model, we utilize ...
Added: June 16, 2026
B3Emo: Quantifying Affect as a Double-Edged Sword in Strategic LLM Interactions
Stepin A., Mozikov M., Kabanov A. et al., IEEE Access 2026 Vol. 14 P. 48127–48144
The deployment of large language models (LLMs) in interactive roles such as automated negotiators, customer service agents, and strategic partners requires them to handle not only logical tasks but also the socio-emotional dimensions of interaction. In these situations, success often relies on understanding social cues, building trust, and using persuasion effectively. These skills are closely ...
Added: June 16, 2026
ESQA: Event Sequences Question Answering
Abdullaeva I., Karpukhin I., Filatov A. et al., IEEE Access 2026 Vol. 14 P. 59390–59408
Event sequences, a specialized type of tabular data annotated with timestamps, are prevalent across practical domains such as finance, retail, social networks, and healthcare. Despite the importance of event sequence modeling and analysis, there has been little effort to adapt Large Language Models (LLMs) to this domain. In this paper, we propose a novel solution ...
Added: June 16, 2026
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Association for Computational Linguistics, 2026.
Added: June 14, 2026
Proceedings of the 6th Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences (CODI 2025)
Strube M., Braud C., Hardmeier C. et al., Suzhou: Association for Computational Linguistics, 2025.
Added: June 11, 2026
TreeDQN: Sample-efficient off-policy reinforcement learning for combinatorial optimization
Sorokin D., Kostin A., Savchenko L. et al., Knowledge-Based Systems 2026 Vol. 348 Article 116258
A convenient approach to optimally solving combinatorial optimization tasks is the Branch-and-Bound method. Its branching heuristic can be learned to solve a large set of similar tasks. The promising results here are achieved by the recently appeared on-policy reinforcement learning method based on the tree Markov Decision Process. To overcome its main disadvantages, namely, very large training time ...
Added: June 10, 2026
Microbial diversity and production of milk spirit using traditional Buryat fermentation and distillation technologies
Namsaraev Z., Nanzatov B., Kozlova A. et al., Scientific Reports 2026 Vol. 16 No. 1 Article 17769
Distilled fermented milk beverages are rare in food technology, despite the global prevalence of plant-based spirits. Currently, the production of distilled strong alcoholic beverages from fermented milk using traditional technologies is known only among Mongolic-speaking peoples and their Siberian neighbors. This study provides the first interdisciplinary analysis of darasun, a traditional Buryat spirit made from fermented ...
Added: June 10, 2026
Artificial intelligence and digital twins for failure prediction in data center cooling systems: a comprehensive literature review (2018–2026)
Butorova A., Bobakov V., Sergeev A. et al., European Physical Journal: Special Topics 2026 P. 1–19
This paper presents a review of artificial intelligence (AI) methods for failure prediction in data center cooling systems, with a focus on the integration of digital twins (DTs), physics-informed learning, and graph-based models. Positioned within complex network science, this review addresses a limitation of conventional graph approaches—their reliance on pairwise connectivity—whereas real-world failures often arise ...
Added: June 10, 2026
Innovations in Information and Decision Sciences. Proceedings of the 13th International Conference on Frontiers in Intelligent Computing: Theory and Applications (FICTA 2025), Volume 4
Springer, 2026.
The book presents the proceedings of the 13th International Conference on Frontiers of Intelligent Computing: Theory and Applications (FICTA 2025), held at Intelligent Systems Research Group (ISRG), London Metropolitan University, London, United Kingdom, during June 6–7, 2025. Researchers, scientists, engineers and practitioners exchange new ideas and experiences in the domain of intelligent computing theories with ...
Added: June 8, 2026
A Method for Recognizing Sentiment and Emotions in Transcriptions of Russian Speech Using Machine Translation
Dvoynikova A., Кагиров И. А., Карпов А. А., Информатика и автоматизация (Труды СПИИРАН) 2024
The article addresses the problem of recognizing users' sentiment and emotions in Russian-language text transcriptions of speech using lexicon-based methods and machine translation. The number of available information resources for sentiment analysis of text messages in Russian is very limited, which significantly hinders the application of basic sentiment analysis methods, namely text preprocessing, vectorization with sentiment lexicons, and traditional classifiers. For ...
Added: April 25, 2026
An Analytical Review of Multimodal Corpora for Emotion Recognition
Dvoynikova A., in: Альманах научных работ молодых ученых Университета ИТМО.: Университет ИТМО, 2023.
The article discusses the advantages and disadvantages of categorical and dimensional models for describing emotions. Dimensional models cover a wider range of human emotions, which makes it possible to develop the most effective emotion recognition system. The paper provides an analytical review of existing multimodal corpora annotated for emotion valence and intensity. In conclusion, the most representative corpus is identified for automatic ...
Added: April 25, 2026
An Approach to Automatic Emotion Recognition in Speech Transcriptions
Dvoynikova A., Кондратенко К. О., Известия высших учебных заведений. Приборостроение 2023 Vol. 66 No. 10 P. 818–827
The problem of emotion recognition in speech transcriptions, which is relevant in various fields, is investigated. The influence of preprocessing methods (stop-word removal, lemmatization, stemming) on the accuracy of emotion recognition in Russian and English text data is analyzed. The experiments used orthographic transcriptions of dialogues from the multimodal corpora RAMAS and CMU-MOSEI, in Russian and English respectively. The annotation of these corpora ...
Added: April 25, 2026
Automatic Detection of the Emotional State of Participants in Topical Conversations from Speech Transcriptions
Dvoynikova A., Мамонтов Д. Ю., Карпов А. А., in: Альманах научных работ молодых ученых Университета ИТМО. Vol. 3.: Университет ИТМО, 2021. P. 63–68.
The paper presents experimental studies on determining the level of emotional expression in text transcriptions from the K-EmoCon database. The effect of class balancing during classifier training on emotion recognition accuracy is examined. The article establishes a baseline for classifying speakers' emotion levels in text transcriptions. ...
Added: April 24, 2026
A Bimodal Approach for Speech Emotion Recognition using Audio and Text
Verkholyak O., Dvoynikova A., Karpov A., Journal of Internet Services and Information Security 2021 No. 1 P. 80–96
This paper presents a novel bimodal speech emotion recognition system based on analysis of acoustic and linguistic information. We propose a novel decision-level fusion strategy that leverages both emotions and sentiments extracted from audio and text transcriptions of extemporaneous speech utterances. We perform an experimental study to prove the effectiveness of the proposed methods using emotional ...
Added: April 24, 2026
A Method for Improving Presentation Attack Detection in a Biometric Face Recognition System Using a Convolutional Network with an Attention Mechanism
Pikul A. S., in: Альманах научных работ молодых ученых университета ИТМО. Материалы Пятьдесят третьей (LIII) научной и учебно-методической конференции Том 1.: St. Petersburg: Университет ИТМО, 2024. P. 338–342.
A new approach is proposed for improving the detection of presentation attacks on a biometric face recognition system using a convolutional network with an attention mechanism. The central hypothesis, that an attention mechanism can improve the results of the original convolutional neural network, was tested and confirmed experimentally. The largest quality gain was achieved on the dataset ...
Added: December 13, 2025
An Ensemble of Modern Computer Vision Models for the Deepfake Detection Task
Pikul A. S., Безопасность информационных технологий 2024 Vol. 31 No. 4 P. 116–127
This article explores the potential use of modern computer vision architectures for the task of deepfake detection. The following architectures are considered: EfficientNet, Vision Transformer (ViT), VisionLSTM (ViL), Vision KAN, and Mamba Vision. The novelty of the approach lies in the application and comparison of these architectures, as well as their combination into paired ensembles ...
Added: December 12, 2025
CA-SER: Cross-Attention Feature Fusion for Speech Emotion Recognition
Deeb B., Savchenko A., Makarov I., in: ECAI 2024. 27th European Conference on Artificial Intelligence, 19–24 October 2024, Santiago de Compostela, Spain – Including 13th Conference on Prestigious Applications of Intelligent Systems (PAIS 2024).: IOS Press, 2024. P. 4479–4482.
In this paper, we introduce a novel tool for speech emotion recognition, CA-SER, that borrows self-supervised learning to extract semantic speech representations from a pre-trained wav2vec 2.0 model and combine them with spectral audio features to improve speech emotion recognition. Our approach involves a self-attention encoder on MFCC features to capture meaningful patterns in audio ...
Added: February 15, 2025
A Non-Classical Approach to Creating a Database of Emotional Faces: Beyond the Theory of Basic Emotions
Petrakova A., Anikudimova E., Лебедева Е. И., in: Лицо человека в системах коммуникации.: Moscow: Московский институт психоанализа, 2024. Ch. 10 P. 138–147.
Added: January 7, 2025
Creating a Russian Database of Faces Expressing Various Emotions: The First Stage
Petrakova A., Лебедева Е. И., Kuzmina Y. et al., Психология. Журнал Высшей школы экономики 2024 Vol. 21 No. 2 P. 423–431
This article presents a pilot study with the objective to create and test stimulus material, which consists of photographic portraits of adults and children expressing various emotions. The uniqueness of this work is due to the approach to organizing the creation of stimulus material, in which the models demonstrated emotions not according to an established ...
Added: December 26, 2024
Emotion Recognition in Relation to «Emotional Families»
Petrakova A., Лебедева Е. И., Anikudimova E., Экспериментальная психология 2024 Vol. 17 No. 3 P. 4–15
The work studies how well people recognize emotions of people of different sex and age, expressed without specified criteria, in association with «emotional families». It draws on the materials of an empirical online study conducted with the help of the crowdsourcing service «Yandex.Toloka», in which 3,590 testers took part. The subjects guessed one ...
Added: December 26, 2024