Detecting Ethnic Conflict in Social Media with Transformers and Augmented Data
Detecting mentions of ethnic conflict, discussion of it, or verbal participation in it in user-generated content is a socially important task, as such content has been shown to be related to ethnic clashes on the ground. Yet this task has not been studied. One reason is the lack of relevant datasets, which calls for data augmentation techniques that are still uncommon in NLP. We propose a solution for the Russian language: fine-tuning a pretrained transformer encoder enhanced with several standard and novel data augmentation approaches. The highest quality, F1-macro = 0.8, is obtained with a fine-tuned RoBERTa model combined with our novel augmentation technique, which generates new training data by randomly swapping ethnonyms. This reduces the classifier's over-reliance on rare ethnonyms and prevents overfitting. Although the contribution of augmentation is modest, when exposed to a relevant adversarial attack our model turns out to be the most robust, with its quality advantage over the baseline reaching 0.05 on the target class. This advantage is achieved by training the model on texts with randomly replaced ethnonyms, which eliminates the model's over-reliance on ethnonyms that occur exclusively or mostly in a single class in the training set. Our approach is therefore expected to help eliminate similar effects in tasks such as aspect-based sentiment analysis with large numbers of aspects. We also conduct an error analysis and identify the categories of texts that most often cause inaccurate predictions.
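The ethnonym-swapping augmentation can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon `ETHNONYMS`, the functions `swap_ethnonyms` and `augment`, and the exact-match replacement rule are all assumptions made for illustration. English placeholders are used for readability; Russian text would additionally require lemmatization and inflection agreement, which this sketch ignores.

```python
import random
import re

# Hypothetical placeholder lexicon (an assumption, not the authors' list).
ETHNONYMS = ["armenian", "georgian", "tatar", "uzbek"]


def swap_ethnonyms(text, ethnonyms=ETHNONYMS, rng=None):
    """Replace every ethnonym found in `text` with a different random one."""
    rng = rng or random.Random()
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, ethnonyms)) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        found = match.group(0).lower()
        candidates = [e for e in ethnonyms if e != found]
        # With a one-item lexicon there is nothing to swap to.
        return rng.choice(candidates) if candidates else match.group(0)

    return pattern.sub(replace, text)


def augment(samples, n_copies=1, seed=0):
    """Return the original (text, label) pairs plus swapped copies.

    Labels are kept unchanged, so the classifier sees the same label
    with different ethnonyms.
    """
    rng = random.Random(seed)
    augmented = list(samples)
    for text, label in samples:
        for _ in range(n_copies):
            new_text = swap_ethnonyms(text, rng=rng)
            if new_text != text:  # skip texts that contain no ethnonym
                augmented.append((new_text, label))
    return augmented
```

Because each swapped copy keeps its original label, an ethnonym that appeared in only one class of the training data is spread across contexts from both classes, which is the effect the abstract attributes to the technique.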