Bimodal sentiment and emotion classification with multi-head attention fusion of acoustic and linguistic information
This article addresses two problems: preprocessing the CMU-MOSEI database to improve data quality, and bimodal multitask classification of emotions and sentiment. Experimental studies identify the most representative acoustic and linguistic features among pretrained neural networks with the Transformer architecture: EmotionHuBERT for the audio modality and RoBERTa for the text modality. The article establishes a baseline for bimodal multitask recognition of sentiment and emotions, with macro F-scores of 61.85% and 59.88%, respectively. Experiments are also conducted with two approaches to combining modalities: concatenation and multi-head attention. The most effective system architecture, which combines early concatenation of the audio and text modalities with late multi-head attention for emotion and sentiment recognition, is proposed. With the proposed approach, F-scores of 61.10% and 60.76% are achieved on bimodal (audio and text) multitask recognition of 3 sentiment classes and 6 binary emotion classes.
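To make the reported metric concrete, the following is a minimal sketch of how a macro F-score can be computed for a single-label, multi-class task such as 3-class sentiment. The function name `macro_f1`, the label encoding, and the toy data are illustrative assumptions and are not taken from the article's evaluation code; how the scores are averaged over the 6 binary emotion classes is not stated in the abstract and is not modeled here.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    f1_scores = []
    for label in labels:
        # Count true positives, false positives, and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Assumed convention: a class with no true or predicted samples scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data with a hypothetical encoding: 0 = negative, 1 = neutral, 2 = positive.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print(f"macro F1 = {score:.4f}")  # per-class F1: 0.5, 0.8, 0.6667 -> 0.6556
```

Because every class contributes equally to the mean, macro averaging does not let a frequent class dominate the score, which is one reason it is commonly preferred on imbalanced label distributions.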