Book
Recent Trends in Analysis of Images, Social Networks and Texts. 9th International Conference, AIST 2020. Revised Supplementary Proceedings
In recent years, neural networks have been widely used for the task of diacritics restoration. Different authors use different neural network architectures for particular languages. In this paper, we demonstrate that the architecture should be selected according to the language at hand. It also depends on the task one states: low- and fully-resourced languages may call for different architectures. We also demonstrate that the commonly used accuracy metric should be replaced in this task by precision and recall, due to the heavily unbalanced nature of the input data. The paper contains results for seven languages: Croatian, Slovak, Romanian, French, German, Latvian, and Turkish.
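As a toy illustration of the metric argument (the numbers and the trivial one-character "model" below are invented, not taken from the paper): on heavily unbalanced data, accuracy stays high even for a model that restores no diacritics at all, while precision and recall on the diacritic class expose the failure.

```python
# Suppose only 10% of characters actually carry a diacritic.
gold = ["c"] * 90 + ["č"] * 10   # 90% of characters need no diacritic
pred = ["c"] * 100               # trivial model: never adds diacritics

# Accuracy over all characters looks deceptively good.
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Precision/recall computed only for the diacritic class "č".
tp = sum(1 for g, p in zip(gold, pred) if g == "č" and p == "č")
fp = sum(1 for g, p in zip(gold, pred) if g != "č" and p == "č")
fn = sum(1 for g, p in zip(gold, pred) if g == "č" and p != "č")
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0

print(accuracy)            # 0.9 — looks good
print(precision, recall)   # 0.0 0.0 — the model restores nothing
```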
The complexity of software systems is constantly growing, which is further aggravated by the concurrency of processes in such systems, so modeling and validating these systems is necessary for detecting and eliminating failures. One of the best-known formalisms for solving this problem is Petri nets and their extensions, such as colored Petri nets, reference nets, and similar models. Many software systems use databases for storing persistent data. However, Petri nets and the mentioned extensions are not designed for modeling persistent data manipulation, since these formalisms aim at modeling the control flow of the considered systems. DB-nets, a novel formalism, aim to solve this problem by providing the following three layers: (1) the control layer, represented by a colored Petri net with extensions; (2) the data logic layer, which allows retrieving and updating the persistent data; and (3) the persistence layer, a relational database for storing the persistent data. To date, there are no publicly available software tools that implement simulation of db-net models with reference semantics support. The paper presents a novel software tool for db-net simulation. The simulator is developed as a pure plugin for the Renew (Reference Net Workshop) software tool, without modifying the existing Renew source code. This approach to developing the simulator makes it possible to reuse the existing Renew reference semantics. The SQLite embeddable relational DBMS is used as the base tool to implement the db-net persistence layer. The paper describes the theoretical foundations and architecture of the developed simulator. The results of this work can be used in research projects that involve modeling complex software systems with persistent data, in both academic and industry-oriented applications.
Event logs of information systems consist of recorded traces describing executed activities and the involved resources (e.g., users, data objects). Conformance checking is a family of process mining techniques that leverage such logs to detect whether observed traces deviate with respect to some specification model (e.g., a Petri net). In this paper, we present a conformance checking method based on colored Petri nets (CPNs) and event logs. CPN models make it possible not only to specify a causal ordering of system activities, but also to describe how resources must be processed during activity execution. By replaying each trace of an event log on top of a CPN, our method detects: (1) control-flow deviations due to unavailable resources, (2) rule violations, and (3) differences between modeled and actually produced resources. We illustrate our method in detail using the case study of trading systems, where orders from traders must be correctly processed by a platform. We describe experimental evaluations of our method to showcase its practical value.
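The core replay mechanism can be sketched minimally as follows. This is a simplified, uncolored token replay over an invented toy net (the activity names and the net are illustrative, not the paper's trading-system model): a trace conforms if every event's transition is enabled when replayed and the final marking is reached.

```python
from collections import Counter

# Toy order-processing net: transition name -> (input places, output places).
net = {
    "receive":  (["start"],   ["pending"]),
    "validate": (["pending"], ["valid"]),
    "execute":  (["valid"],   ["done"]),
}

def replay(trace, initial=("start",), final=("done",)):
    """Replay a trace on the net; return (conforms, deviations)."""
    marking = Counter(initial)
    deviations = []
    for event in trace:
        inputs, outputs = net[event]
        if all(marking[p] >= 1 for p in inputs):
            for p in inputs:
                marking[p] -= 1        # consume input tokens
            for p in outputs:
                marking[p] += 1        # produce output tokens
        else:
            # Transition not enabled at this point: control-flow deviation.
            deviations.append(event)
    # Unary + drops zero counts so the marking comparison is exact.
    conforms = not deviations and +marking == Counter(final)
    return conforms, deviations

print(replay(["receive", "validate", "execute"]))  # (True, [])
print(replay(["receive", "execute"]))              # (False, ['execute'])
```

A full CPN replay would additionally carry typed token values and guard expressions on each transition; this sketch keeps only the control-flow part.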
DaNetQA, a new question-answering corpus, follows the design of BoolQ [1]: it comprises natural yes/no questions. Each question is paired with a paragraph from Wikipedia and an answer derived from the paragraph. The task is to take both the question and the paragraph as input and come up with a yes/no answer, i.e. to produce a binary output. In this paper, we present a reproducible approach to the creation of DaNetQA and investigate transfer learning methods for task and language transfer. For task transfer, we leverage three similar sentence modelling tasks: 1) a corpus of paraphrases, Paraphraser; 2) an NLI task, for which we use the Russian part of XNLI; 3) another question answering task, SberQUAD. For language transfer, we use English-to-Russian translation together with multilingual language model fine-tuning.
The paper examines the efficiency of topic models as features for computational identification and conceptual analysis of linguistic metaphor on Russian data. We train topic models using three algorithms (LDA and ARTM, sparse and dense) and evaluate their quality. We compute topic vectors for the sentences of a metaphor-annotated Russian corpus and train several classifiers to identify metaphor with these vectors. We compare the performance of the topic modeling classifiers with other state-of-the-art features (lexical, morphosyntactic, semantic coherence, and concreteness-abstractness) and their different combinations to see how topics contribute to metaphor identification. We show that some of the topics are more frequent in metaphoric contexts while others are more characteristic of non-metaphoric sentences, thus constituting topic predictors of metaphoricity, and discuss whether these predictors align with the conceptual mappings attested in the literature. We also compare the topical heterogeneity of metaphoric and non-metaphoric contexts in order to test the hypothesis that metaphoric discourse should display greater topical variability due to the presence of Source and Target domains.
The article discusses the influence of temperament on the academic performance of first-year students at HSE Nizhny Novgorod, using the example of the Faculty of Informatics, Mathematics and Computer Science (IM&CS). The analysis was done with the help of statistics and educational data mining. The baseline data for the study is information about students obtained by a survey: temperament, degree of extraversion, stability, and other personality traits. The study involved first- and second-year students of the IM&CS faculty in the 2017-2018 academic year. Psychological factors affecting the average score and the probability of re-examination for students with different temperaments were then identified. A certain connection between temperament and academic success was found, which makes it possible to predict "risky" students. Various machine learning methods were used, namely the kNN method and decision trees; the best results were shown by decision trees. As a result, first-year students are classified into three groups (Good, Medium, Bad) according to the degree of risk of incurring academic debt. The practical result of the research is a set of recommendations to the educational office of the IM&CS faculty to pay attention to risky students and assist them in the educational process. After the end of the summer session, the classification results were verified. The article also presents an algorithm for finding risky students that takes temperament into account.
In recent years, news agencies have become more influential in various social groups. At the same time, the media industry has started to monetize online distributed articles with contextual advertising. However, the efficiency of online marketing highly depends on the popularity of news articles. In our work, we present an alternative and effective way of forecasting article popularity with a two-step approach: article keyword extraction and keyword-based article popularity prediction. We show the benefits of this technique and compare it with widely used methods such as text embeddings and BERT-based methods. Moreover, the work provides an architecture of a model for dynamic keyword tracking, trained for popularity forecasting purposes on a new dataset of Russian news articles with more than 280k articles and 22k keywords.
This article deals with the principles of automatic label assignment for e-hypertext markup. We identified 40 topics that are characteristic of hypertext media, after which we used an ensemble of two graph-based methods that rely on external sources for candidate label generation: candidate label extraction from the Yandex search engine (Labels-Yandex) and candidate label extraction from Wikipedia by operations on word vector representations in Explicit Semantic Analysis (Labels-ESA). Each algorithm produces a triplet of labels for each topic, after which we carried out a two-step evaluation of the algorithms' results: at the first stage, two experts assessed each triplet's relevance to the topic on a 3-value scale (non-conformity to the topic / partial compliance / full compliance); at the second stage, single labels were evaluated by 10 assessors who were asked to mark each label with a weight: «0» if the label does not match the topic, «1» if it does. Our experiments show that in most cases the Labels-Yandex algorithm predicts correct labels but frequently relates the topic to a label that is relevant to the current moment rather than to the set of keywords, while Labels-ESA produces labels with generalized content. Thus, a combination of these methods will make it possible to mark up e-hypertext topics and create a semantic network theory of e-hypertext.
This article is devoted to the implementation of a federated approach to named entity recognition. The novel federated approach is designed to address data privacy issues. The classic BiLSTM-CNNs-CRF architecture and its modifications, trained on a single machine, are taken as baselines, and federated training is then conducted for them. The influence of pretrained embeddings and of various architectural blocks on training and on the quality of the final model is considered. In addition, other important questions arising in practice are considered and solved, for example, the creation of distributed private dictionaries and the selection of a base model for federated learning.
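The server-side aggregation step that makes such training federated can be sketched as standard FedAvg-style weighted averaging (a generic illustration of the technique; the paper's exact training setup, model, and aggregation details may differ).

```python
# FedAvg-style aggregation: each client trains locally on its private
# data and sends only parameters; the server averages them, weighting
# each client by the size of its local dataset.

def fed_avg(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors (lists of floats)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two clients with different amounts of private data:
print(fed_avg([[1.0, 2.0], [3.0, 4.0]], [10, 30]))  # [2.5, 3.5]
```

The raw training data never leaves the clients; only the parameter vectors are shared, which is what addresses the privacy concern the abstract mentions.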
We consider a model for organizing cargo transportation between two node stations connected by a railway line that contains a certain number of intermediate stations. The cargo moves in one direction. Such a situation may occur, for example, if one of the node stations is located in a region that produces raw materials for a manufacturing industry located in the other region, where the second node station is situated. Freight traffic is organized by means of a number of technologies. These technologies determine the rules for taking on cargo at the initial node station, the rules of interaction between neighboring stations, and the rule of distribution of cargo to the final node stations. The process of cargo transportation follows the given control rule. For such a model, one must determine the possible modes of cargo transportation and describe their properties. The model is described by a finite-dimensional system of differential equations with nonlocal linear restrictions. The class of solutions satisfying the nonlocal linear restrictions is extremely narrow. This results in the need for a “correct” extension of solutions of the system of differential equations to a class of quasi-solutions whose distinctive feature is gaps at a countable number of points. Using the fourth-order Runge–Kutta method, we were able to construct these quasi-solutions numerically and determine their rate of growth. Note that, technically, the main difficulty consisted in obtaining quasi-solutions satisfying the nonlocal linear restrictions. Furthermore, we investigated the dependence of the quasi-solutions and, in particular, of the sizes of the gaps (jumps) of the solutions on a number of model parameters characterizing the control rule, the cargo transportation technologies, and the intensity of cargo arrival at a node station.
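The fourth-order Runge–Kutta scheme the authors mention can be sketched as follows; the test equation y' = y is a placeholder for illustration, not the transportation model itself.

```python
# One step of the classical fourth-order Runge-Kutta (RK4) method
# for y' = f(t, y), with local error O(h^5) and global error O(h^4).

def rk4_step(f, t, y, h):
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h / 2 * k1)
    k3 = f(t + h / 2, y + h / 2 * k2)
    k4 = f(t + h, y + h * k3)
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# Example: y' = y, y(0) = 1, integrated to t = 1, so y(1) ≈ e.
y, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    y = rk4_step(lambda t, y: y, t, y, h)
    t += h
print(y)  # ≈ 2.71828
```

In the paper's setting the stepped integration would additionally be restarted at each gap point so that the quasi-solution satisfies the nonlocal linear restrictions, which is the difficulty the authors highlight.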
Event logs collected by modern information and technical systems usually contain enough data for automated process model discovery. A variety of algorithms has been developed for process model discovery, conformance checking, log-to-model alignment, comparison of process models, etc.; nevertheless, quick analysis of ad-hoc selected parts of a log still lacks a full-fledged implementation. This paper describes a ROLAP-based method of multidimensional event log storage for process mining. The result of the log analysis is visualized as a directed graph representing the union of all possible event sequences, ranked by their occurrence probability. Our implementation allows the analyst to discover process models for sublogs defined by an ad-hoc selection of criteria and an occurrence probability value.
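The ranked graph described above can be sketched as a directly-follows graph with edge probabilities (the toy log and activity names below are invented; the actual tool computes this over a ROLAP store rather than an in-memory list).

```python
from collections import Counter

# A toy event log: each trace is a sequence of activity names.
log = [
    ["register", "check", "approve"],
    ["register", "check", "reject"],
    ["register", "check", "approve"],
]

# Count every directly-follows pair (a, b) across all traces.
edges = Counter((a, b) for trace in log for a, b in zip(trace, trace[1:]))

# Per-source totals, to turn counts into occurrence probabilities.
total_per_source = Counter()
for (a, _), n in edges.items():
    total_per_source[a] += n

for (a, b), n in sorted(edges.items()):
    print(f"{a} -> {b}: p = {n / total_per_source[a]:.2f}")
```

Filtering this graph by a probability threshold corresponds to the "value of occurrence probability" selection the abstract mentions: rare edges drop out, leaving the dominant behavior.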
The geographic information system (GIS) is based on the first and only Russian Imperial Census of 1897 and the First All-Union Census of the Soviet Union of 1926. The GIS features vector data (shapefiles) of all provinces of the two states. For the 1897 census, there is information about linguistic, religious, and social estate groups. The part based on the 1926 census features nationality. Both shapefiles include information on gender and on rural and urban population. The GIS allows for producing any maps needed for individual studies of the period that require the administrative boundaries and demographic information.
Existing approaches suggest that IT strategy should be a reflection of business strategy. However, in practice organisations often do not follow their business strategy even if it is formally declared. Under these conditions, IT strategy can be viewed not as a plan, but as an organisational shared view on the role of information systems. This approach reflects only a top-down perspective on IT strategy, so it can be supplemented by a strategic behaviour pattern (i.e., a more or less standard response to changes, formed as a result of previous experience) to implement a bottom-up approach. Two components that can help establish an effective reaction to new IT initiatives are proposed here: a model of IT-related decision making, and an efficiency measurement metric to estimate the maturity of business processes and of the appropriate IT. The use of the proposed tools is demonstrated in practical cases.