Computational Linguistics and Intellectual Technologies Papers from the Annual International Conference “Dialogue” (2020)
In this paper, we present a shared task on core information extraction problems: named entity recognition and relation extraction. In contrast to popular shared tasks on related problems, we move away from strictly academic rigor and instead model a business case. As a source of textual data we chose a corpus of Russian strategic documents, which we annotated according to our own annotation scheme. To speed up the annotation process, we exploited various active learning techniques. In total, we annotated more than two hundred documents, which allowed us to create a high-quality data set in a short time. The shared task consisted of three tracks, devoted to 1) named entity recognition, 2) relation extraction and 3) joint named entity recognition and relation extraction. We provided the annotated texts as well as a set of unannotated texts, which could have been used in any way to improve solutions. In the paper we overview and compare the solutions submitted by the shared task participants. We release both raw and annotated corpora along with annotation guidelines, evaluation scripts and results at https://github.com/dialogue-evaluation/RuREBus.
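The abstract mentions using active learning to speed up annotation but does not name the specific techniques. One common heuristic is least-confidence sampling: annotate the documents the current model is least sure about. A minimal sketch under that assumption (the function name and data layout are illustrative, not from the paper):

```python
def least_confidence_pick(doc_probs, k):
    """Pick the k documents the model is least confident about.

    doc_probs: one list of per-class probabilities per unlabelled document.
    The confidence of a document is its highest class probability; we sort
    ascending and return the indices of the k least confident documents,
    which would be sent to annotators next.
    """
    scored = sorted(range(len(doc_probs)), key=lambda i: max(doc_probs[i]))
    return scored[:k]
```

In an annotation loop, the model is retrained after each batch of newly labelled documents, so the pool of "uncertain" documents shifts over iterations.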
We present the ShiftRy web service. It helps to analyze temporal changes in the usage of words in news texts from Russian mass media. For that, we employ diachronic word embedding models trained on large Russian news corpora from 2010 up to 2019. Users can explore the usage history of any given query word, or browse lists of words ranked by the degree of their semantic drift between any pair of years. Visualizations of the words' trajectories through time are provided. Importantly, users can obtain corpus examples with the query word before and after the semantic shift (if any). The aim of ShiftRy is to ease the task of studying word history on short-term time spans, and the influence of social and political events on word usage. The service will be updated with new data yearly.
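The abstract does not state how drift between two years is scored; a common measure for aligned diachronic embeddings is the cosine distance between a word's vectors in the two yearly models. A plain-Python sketch under that assumption (function names are hypothetical):

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity of two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def semantic_drift(vec_then, vec_now):
    # Distance between a word's vectors in two (already aligned) yearly
    # models; 0 = no change, values near 1 = strong semantic shift
    return 1.0 - cosine(vec_then, vec_now)

def rank_by_drift(model_then, model_now):
    # Rank the shared vocabulary of two yearly models, most-shifted first
    shared = model_then.keys() & model_now.keys()
    return sorted(shared,
                  key=lambda w: semantic_drift(model_then[w], model_now[w]),
                  reverse=True)
```

Note that this only makes sense if the two yearly models live in a common space (e.g. via Procrustes alignment or incremental training), which is a standard prerequisite for diachronic comparisons.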
Expert-built lexical resources are known to provide information of good quality at the cost of low coverage. This property limits their applicability in modern NLP applications. Building descriptions of lexical-semantic relations manually in sufficient volume requires a huge amount of qualified human labour. However, given that an initial version of a taxonomy is already built, automatic or semi-automatic taxonomy enrichment systems can greatly reduce the required effort. We propose and experiment with two approaches to taxonomy enrichment, one utilizing information from word definitions and another from word usages, as well as a combination of them. The first method retrieves co-hyponyms for the target word from distributional semantic models (word2vec) or language models (XLM-R), then looks for hypernyms of the co-hyponyms in the taxonomy. The second method tries to extract hypernyms directly from Wiktionary definitions. The proposed methods were evaluated on the Dialogue-2020 shared task on taxonomy enrichment. We found that predicting hypernyms of co-hyponyms achieves better results in this task. The combination of both methods improves results further and is among the three best-performing systems for verbs. An important part of the work is a detailed qualitative and error analysis of the proposed methods, which provides interesting observations of their behaviour and ideas for future work.
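The first method described above — treat the target word's nearest neighbours in an embedding space as co-hyponym candidates and let their taxonomy hypernyms vote — can be sketched as follows. The data layout (similarity-scored neighbours, a word-to-hypernyms dictionary) is a hypothetical stand-in for the actual models and taxonomy:

```python
from collections import Counter

def predict_hypernyms(neighbours, taxonomy_hypernyms, top_k=2):
    """Sketch of hypernym prediction via co-hyponyms.

    neighbours: list of (word, similarity) pairs retrieved from an
        embedding model for the target word (assumed co-hyponyms).
    taxonomy_hypernyms: dict mapping a word to the set of its hypernyms
        in the existing taxonomy.
    Each neighbour votes for its hypernyms, weighted by similarity;
    the top_k hypernyms by total vote are returned.
    """
    votes = Counter()
    for word, sim in neighbours:
        for hyper in taxonomy_hypernyms.get(word, ()):
            votes[hyper] += sim
    return [h for h, _ in votes.most_common(top_k)]
```

The similarity weighting is one plausible aggregation choice; unweighted counting or rank-based discounting would fit the same scheme.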
In this work we present our system for the RuREBus shared task held together with the Dialogue 2020 conference. The task consisted of three subtasks: named entity recognition, relation extraction with provided named entity tags, and end-to-end relation extraction. Our system took first place in the first subtask and second place in the second. For the third subtask we submitted our solution only in the post-evaluation phase; nevertheless, it was among the two best-performing systems. The systems for all tasks are based on Transformer models. Relation extraction was cast as a sequence labelling problem. We also experimented with joint learning of named entity recognition and relation extraction.
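The abstract does not detail how relation extraction was encoded as sequence labelling. One common formulation fixes a head entity and tags tokens with BIO labels carrying the relation type of the tail span. A hypothetical decoding step under that assumption (the tag scheme and names are illustrative, not the authors' exact design):

```python
def decode_relations(head, tokens, tags):
    """Turn per-token relation tags (relative to a fixed head entity)
    into (head, relation, tail_text) triples.

    tags use a BIO-style scheme: 'B-REL' opens a tail span for relation
    REL, 'I-REL' continues it, 'O' is outside any tail span.
    """
    triples, span, rel = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if span:  # close the previous span before opening a new one
                triples.append((head, rel, " ".join(span)))
            rel, span = tag[2:], [tok]
        elif tag.startswith("I-") and rel == tag[2:]:
            span.append(tok)
        else:
            if span:
                triples.append((head, rel, " ".join(span)))
            span, rel = [], None
    if span:  # flush a span that runs to the end of the sentence
        triples.append((head, rel, " ".join(span)))
    return triples
```

With this encoding, the sequence labeller is run once per head entity, and the decoder above assembles the final relation triples.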
Currently, social network sites tend to be among the major communication platforms in both offline and online space. Freedom of expression of various points of view, including toxic, aggressive, and abusive comments, might have a long-term negative impact on people's opinions and social cohesion. As a consequence, the ability to automatically identify and moderate toxic content on the Internet in order to eliminate its negative consequences is one of the necessary tasks for modern society. This paper aims at the automatic detection of toxic comments in the Russian language. As a source of data, we utilized an anonymously published Kaggle dataset and additionally validated its annotation quality. To build a classification model, we performed fine-tuning of two versions of the Multilingual Universal Sentence Encoder, Bidirectional Encoder Representations from Transformers (BERT), and ruBERT. Fine-tuned ruBERT achieved F1 = 92.20%, demonstrating the best classification score. We made the trained models and code samples publicly available to the research community.
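The reported score is a binary F1. As a reminder of how that metric is computed (a plain-Python sketch, not the authors' evaluation code):

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the positive (here: toxic) class.

    F1 is the harmonic mean of precision (share of predicted positives
    that are correct) and recall (share of true positives recovered).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For a class-imbalanced task like toxicity detection, F1 on the toxic class is a more informative headline number than plain accuracy, which is presumably why it is reported here.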