Recent Trends in Analysis of Images, Social Networks and Texts. 9th International Conference, AIST 2020. Revised Supplementary Proceedings
In recent years, neural networks have been widely used for the task of diacritics restoration. Different authors use different neural network architectures for the languages they select. In this paper, we demonstrate that the architecture should be chosen according to the language at hand. It also depends on the task one states: low- and full-resourced languages may call for different architectures. We also demonstrate that the commonly used accuracy metric should be replaced in this task by precision and recall due to the heavily unbalanced nature of the input data. The paper contains results for seven languages: Croatian, Slovak, Romanian, French, German, Latvian, and Turkish.
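To make the metric point concrete, the toy sketch below (our illustration, not code from the paper) shows how accuracy can look deceptively good on heavily unbalanced diacritics data while precision and recall expose a useless model; the 95/5 class split is an assumed example.

```python
# gold/pred: 1 = character carries a diacritic, 0 = it does not.
# Most characters need no diacritic, so a trivial "never restore"
# model scores high accuracy but zero recall.
gold = [0] * 95 + [1] * 5       # hypothetical 95/5 class imbalance
pred = [0] * 100                # trivial model: never adds a diacritic

tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)

accuracy  = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
precision = tp / (tp + fp) if tp + fp else 0.0
recall    = tp / (tp + fn) if tp + fn else 0.0

print(accuracy, precision, recall)   # 0.95 0.0 0.0
```

The 0.95 accuracy hides the fact that not a single diacritic was restored, which is exactly why the abstract argues for precision and recall on this task.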
The complexity of software systems is constantly growing, which is further aggravated by the concurrency of processes in these systems, so modeling and validating such systems is necessary for detecting and eliminating failures. One of the most well-known formalisms for solving this problem is Petri nets and their extensions, such as colored Petri nets, reference nets, and similar models. Many software systems use databases for storing persistent data. However, Petri nets and the mentioned extensions are not designed for modeling persistent data manipulation, since these formalisms aim at modeling the control flow of the considered systems. DB-nets, a novel formalism, aim to solve this problem by providing the following three layers: (1) the control layer, represented by a colored Petri net with extensions; (2) the data logic layer, which allows retrieving and updating the persistent data; and (3) the persistence layer, a relational database for storing the persistent data. To date, there are no publicly available software tools that implement simulation of db-net models with reference semantics support. The paper presents a novel software tool for db-net simulation. The simulator is developed as a pure plugin for the Renew (Reference Net Workshop) software tool, without modifying the existing Renew source code. This approach to developing the simulator makes it possible to reuse the existing Renew reference semantics. The SQLite embeddable relational DBMS is used as the base tool to implement the db-net persistence layer. The paper describes the theoretical foundations and architecture of the developed simulator. The results of this work can be used in research projects that involve modeling complex software systems with persistent data, for both academic and industry-oriented applications.
Event logs of information systems consist of recorded traces describing executed activities and involved resources (e.g., users, data objects). Conformance checking is a family of process mining techniques that leverage such logs to detect whether observed traces deviate with respect to some specification model (e.g., a Petri net). In this paper, we present a conformance checking method using colored Petri nets (CPNs) and event logs. CPN models make it possible not only to specify a causal ordering between system activities, but also to describe how resources must be processed upon activity executions. By replaying each trace of an event log on top of a CPN, we show how this method detects: (1) control-flow deviations due to unavailable resources, (2) rule violations, and (3) differences between modeled and real produced resources. We illustrate our method in detail using the case study of trading systems, where orders from traders must be correctly processed by a platform. We describe experimental evaluations of our method to showcase its practical value.
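The replay idea can be sketched in a few lines. This is our simplified token-replay illustration over a plain (uncolored) net with hypothetical trading activities, not the paper's CPN-based method, which additionally handles resources and rules.

```python
# net: transition -> (tokens consumed per place, tokens produced per place).
# A transition fires only if every input place holds enough tokens; a trace
# step that cannot fire is flagged as a control-flow deviation.
net = {
    "receive_order": ({"start": 1}, {"pending": 1}),
    "execute_order": ({"pending": 1}, {"done": 1}),
}

def replay(trace, marking):
    deviations = []
    for act in trace:
        consume, produce = net[act]
        if all(marking.get(p, 0) >= n for p, n in consume.items()):
            for p, n in consume.items():
                marking[p] -= n
            for p, n in produce.items():
                marking[p] = marking.get(p, 0) + n
        else:
            deviations.append(act)   # required token/resource unavailable
    return deviations

ok  = replay(["receive_order", "execute_order"], {"start": 1})   # []
bad = replay(["execute_order"], {"start": 1})   # order executed before it exists
```

In the second trace, `execute_order` appears before any order was received, so the replay reports it as a deviation.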
DaNetQA, a new question-answering corpus, follows the BoolQ design: it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer derived from that paragraph. The task is to take both the question and the paragraph as input and produce a yes/no answer, i.e., a binary output. In this paper we present a reproducible approach to creating DaNetQA and investigate transfer learning methods for both task and language transfer. For task transfer we leverage three similar sentence modelling tasks: 1) a corpus of paraphrases, Paraphraser; 2) an NLI task, for which we use the Russian part of XNLI; 3) another question answering task, SberQUAD. For language transfer we use English-to-Russian translation together with multilingual language fine-tuning.
The paper examines the efficiency of topic models as features for computational identification and conceptual analysis of linguistic metaphor on Russian data. We train topic models using three algorithms (LDA and ARTM – sparse and dense) and evaluate their quality. We compute topic vectors for the sentences of a metaphor-annotated Russian corpus and train several classifiers to identify metaphor with these vectors. We compare the performance of the topic modeling classifiers with other state-of-the-art features (lexical, morphosyntactic, semantic coherence, and concreteness-abstractness) and their different combinations to see how topics contribute to metaphor identification. We show that some topics are more frequent in metaphoric contexts while others are more characteristic of non-metaphoric sentences, thus constituting topic predictors of metaphoricity, and discuss whether these predictors align with the conceptual mappings attested in the literature. We also compare the topical heterogeneity of metaphoric and non-metaphoric contexts in order to test the hypothesis that metaphoric discourse should display greater topical variability due to the presence of Source and Target domains.
The article discusses the influence of temperament on the academic performance of first-year students at HSE-Nizhny Novgorod, using the example of the Faculty of Informatics, Mathematics and Computer Science (IM&CS). The analysis was carried out with the help of statistics and educational data mining. The baseline data for the study is information about students obtained through a survey: temperament, degree of extraversion, stability, and other personality traits. The study involved first- and second-year students of the IM&CS faculty in the 2017-2018 academic year. Psychological factors affecting the average score and the probability of re-examination for students with different temperaments were then identified. A certain connection between temperament and academic success was found, which makes it possible to predict "risky" students. Various machine learning methods are used: the kNN method and decision trees. The best results were shown by decision trees. As a result, first-year students are classified into three groups (Good, Medium, Bad) according to the degree of risk of getting academic debt. The practical result of the research is a set of recommendations to the educational office of the IM&CS faculty to pay attention to risky students and assist them in the educational process. After the end of the summer session, the classification results were verified. The article also presents an algorithm for finding risky students that takes temperament into account.
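As an illustration of the kind of rule set a trained decision tree can yield here, the sketch below assigns a student to one of the three risk groups; the feature names, thresholds, and rules are hypothetical examples of ours, not taken from the study.

```python
# Toy decision rules of the shape a decision tree might learn:
# classify a first-year student into Good / Medium / Bad risk of academic debt
# from personality traits (0..1 scale) and the current average score (0..10).
def risk_group(extraversion, stability, avg_score):
    if avg_score >= 7.0:
        return "Good"                     # high score: low risk regardless of traits
    if stability >= 0.5 or extraversion < 0.7:
        return "Medium"                   # stable or introverted: moderate risk
    return "Bad"                          # unstable extravert with a low score

groups = [
    risk_group(0.9, 0.2, 4.5),   # unstable extravert, low score  -> "Bad"
    risk_group(0.3, 0.8, 6.0),   # stable introvert, middling     -> "Medium"
    risk_group(0.5, 0.5, 8.2),   # high score                     -> "Good"
]
```

A real tree would be fit to the survey data (e.g., with a standard decision-tree learner) rather than hand-written, but its output is exactly such a cascade of threshold rules, which is what makes the approach interpretable for the educational office.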
In recent years, news agencies have become more influential in various social groups. At the same time, the media industry has started to monetize online articles with contextual advertising. However, the efficiency of online marketing highly depends on the popularity of news articles. In our work, we present an alternative and effective way to forecast article popularity with a two-step approach: article keyword extraction and keyword-based article popularity prediction. We show the benefits of this technique and compare it with widely used methods, such as text embeddings and BERT-based methods. Moreover, the work provides an architecture of a model for dynamic keyword tracking, trained for popularity forecasting purposes on a new dataset of Russian news articles with more than 280k articles and 22k keywords.
This article deals with the principles of automatic label assignment for e-hypertext markup. We identified 40 topics characteristic of hypertext media and then used an ensemble of two graph-based methods that rely on external sources for candidate label generation: candidate label extraction from the Yandex search engine (Labels-Yandex), and candidate label extraction from Wikipedia through operations on word vector representations in Explicit Semantic Analysis (Labels-ESA). Each algorithm outputs a triplet of labels for each topic, after which we carried out a two-step evaluation of the algorithms' results: at the first stage, two experts assessed each triplet's relevance to the topic on a 3-value scale (non-conformity to the topic / partial compliance with the topic / full compliance with the topic); at the second stage, single labels were evaluated by 10 assessors who were asked to weight each label: «0» – the label does not match the topic; «1» – the label matches the topic. Our experiments show that in most cases the Labels-Yandex algorithm predicts correct labels but frequently relates the topic to a label that is relevant to the current moment rather than to the set of keywords, while Labels-ESA produces labels with more generalized content. Thus, a combination of these methods makes it possible to mark up e-hypertext topics and create a semantic network theory of e-hypertext.
This article is devoted to the implementation of a federated approach to named entity recognition. The novel federated approach is designed to solve data privacy issues. The classic BiLSTM-CNNs-CRF model and its modifications, trained on a single machine, are taken as baselines, and federated training is then conducted for them. The influence of pretrained embeddings and of various architecture blocks on training and on the quality of the final model is considered. Besides, other important questions arising in practice are considered and solved, for example, the creation of distributed private dictionaries and the selection of a base model for federated learning.
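A minimal sketch of the federated idea, assuming simple FedAvg-style weight averaging: each party trains locally on its private data, and only model weights, never the data itself, leave the machine. Plain Python lists stand in for the neural tagger weights here; the numbers are illustrative.

```python
# FedAvg-style aggregation: the server averages the clients' weight
# vectors element-wise to obtain the next global model.
def federated_average(client_weights):
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

# Three hypothetical clients after one round of local training; each list
# is that client's (tiny) weight vector. Private training data never leaves
# the clients -- only these weights are sent to the server.
clients = [[0.2, 1.0], [0.4, 0.8], [0.6, 0.6]]
global_weights = federated_average(clients)   # approximately [0.4, 0.8]
```

The averaged model is then redistributed to the clients for the next local training round, which is how quality comparable to single-machine training can be approached without pooling the private corpora.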