Industrial track: Architecting railway KPI data processing with Big Data technologies
In this research we built a data processing pipeline for storing railway KPI data based on open-source Big Data technologies: Apache Hadoop, Kafka, the Kafka HDFS Connector, Spark, Airflow, and PostgreSQL. The methodology we created for data load testing allowed us to iteratively run load tests with increasing data sizes, evaluate the cluster software and hardware resources needed, and, finally, detect the bottlenecks of the solution. As a result of the research, we propose an architecture for data processing and storage and give recommendations on data pipeline optimization. In addition, we calculate an approximate sizing of cluster machines for the current dataset volume for the data processing and storage services.
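To make such an architecture concrete, the sketch below shows what one Spark stage of this kind of pipeline could look like: KPI events landed on HDFS by the Kafka HDFS Connector are aggregated and loaded into the PostgreSQL serving layer (an Airflow DAG would schedule the job). All paths, column, and table names are our own illustrative assumptions; the paper does not publish its code.

```python
# A minimal sketch of one Spark stage of such a pipeline; every name below
# (HDFS path, columns, JDBC URL, table) is hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("railway-kpi-aggregation").getOrCreate()

# KPI events are assumed to have been landed on HDFS by the Kafka HDFS Connector.
raw = spark.read.parquet("hdfs:///data/railway/kpi/")  # hypothetical path

# Aggregate KPI values per sensor and day before loading them into PostgreSQL.
daily = (raw
         .groupBy("sensor_id", F.to_date("event_time").alias("day"))
         .agg(F.avg("value").alias("avg_value"),
              F.max("value").alias("max_value")))

# Write the aggregates to the PostgreSQL serving layer over JDBC.
(daily.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/kpi")  # hypothetical DSN
      .option("dbtable", "kpi_daily")
      .option("driver", "org.postgresql.Driver")
      .mode("append")
      .save())
```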
In large-scale research, data are usually collected at many sites, have a huge volume, and new data are constantly generated. Since it is often impossible to collect all the relevant data on a single computer, much attention is paid to algorithms that accumulate information sequentially or in parallel and do not need to store all the original data. As an example of information accumulation, the Bayesian updating procedure for linear experiments is analyzed. The corresponding information spaces are defined and the relations between them are studied. It is shown that processing can be unified and simplified by introducing a special canonical form of information representation and transforming all the data and the original prior information into this form. Thanks to the rich algebraic properties of the canonical information space, the sequential Bayesian procedure allows various parallelization options that are ideally suited to distributed data processing platforms such as Hadoop MapReduce. This opens up the possibility of flexible and efficient scaling of information accumulation in distributed data processing systems.
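As one standard illustration of such a canonical form (the paper's own definitions may be more general), consider a linear Gaussian experiment y = H theta + epsilon with noise covariance R and a Gaussian prior with mean mu_0 and covariance Sigma_0. In the information parametrization the Bayesian update is purely additive:

```latex
% Canonical (information) form of the Gaussian linear model: one standard
% instance of the idea; the paper's canonical space may differ in detail.
\[
  \Lambda = \Sigma^{-1}, \qquad \eta = \Sigma^{-1}\mu ,
\]
\[
  \Lambda_{\mathrm{post}} = \Lambda_{0} + H^{\top}R^{-1}H, \qquad
  \eta_{\mathrm{post}} = \eta_{0} + H^{\top}R^{-1}y .
\]
```

Because the update is a commutative, associative sum, contributions from independent data blocks can be computed separately and combined in any order, which is exactly the structure a MapReduce job exploits.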
The process of Bayesian information update is essentially sequential: as a result of an observation, prior information is transformed into a posterior, which is later interpreted as the prior for the next observation, and so on. It is shown that this procedure can be unified and parallelized by converting both the measurement results and the original prior information to a special form. Various forms of information representation and the relations between them are studied. The rich algebraic properties of the introduced canonical information space make it possible to efficiently scale the Bayesian procedure and adapt it to processing large amounts of distributed data.
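A minimal sketch of what the parallelized update could look like for the Gaussian linear case above (illustrative only; the paper's canonical space is more general, and the function names are our own):

```python
# Map each data block to canonical form, reduce by summation, then recover
# the posterior. Because `combine` is associative and commutative, the fold
# can be distributed across workers in any order.
from functools import reduce
import numpy as np

def to_information(H, R_inv, y):
    """Map step: convert one block of linear measurements to canonical form."""
    return H.T @ R_inv @ H, H.T @ R_inv @ y

def combine(a, b):
    """Reduce step: information from independent blocks simply adds."""
    return a[0] + b[0], a[1] + b[1]

def posterior(prior, blocks):
    """Fold all measurement blocks into the prior."""
    lam, eta = reduce(combine, (to_information(*b) for b in blocks), prior)
    return np.linalg.solve(lam, eta), np.linalg.inv(lam)  # mean, covariance

# Example: two independent unit-noise observations of a scalar parameter
# with a standard normal prior, written in canonical form (Lambda, eta).
H, R_inv = np.ones((1, 1)), np.eye(1)
prior = (np.eye(1), np.zeros(1))
blocks = [(H, R_inv, np.array([2.0])), (H, R_inv, np.array([4.0]))]
mean, cov = posterior(prior, blocks)
print(mean)  # [2.] -- i.e. (0 + 2 + 4) / 3
```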
Big Data has challenged traditional storage and analysis systems in several new ways. In this paper we try to figure out how these challenges can be overcome, why traditional systems cannot handle them efficiently, and describe three modern approaches to Big Data handling: NoSQL, MapReduce, and real-time stream processing. The first section of the paper is the introduction. The second section discusses the main issues of Big Data: volume, diversity, velocity, and value. The third section describes different approaches to solving the problem of Big Data. Traditionally one might use a relational DBMS. The paper proposes some steps that allow an RDBMS to remain in use when its capacity is no longer sufficient. Another way is to use a NoSQL approach. The basic ideas of the NoSQL approach are simplification, high throughput, and unlimited scaling out. Different kinds of NoSQL stores make such systems suitable for different Big Data applications. MapReduce and its free implementation, Hadoop, may be used to scale out Big Data analytics. Finally, several data management products support real-time stream processing of Big Data. The paper briefly overviews these products. The final section of the paper is the conclusion.
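To make the MapReduce programming model mentioned above concrete, here is a minimal in-process imitation of its map, shuffle, and reduce phases (a real Hadoop job distributes these phases across a cluster; the word-count task is the customary illustration, not taken from the paper):

```python
# A single-process imitation of the MapReduce model: map emits (key, value)
# pairs, the shuffle groups them by key, reduce aggregates each group.
from collections import defaultdict

def map_phase(record):
    # Emit one (word, 1) pair per word in a line of text.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Aggregate all values emitted for one key.
    return key, sum(values)

def map_reduce(records):
    groups = defaultdict(list)
    for record in records:                      # map + shuffle
        for key, value in map_phase(record):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())  # reduce

print(map_reduce(["big data", "Big Data analytics"]))
# {'big': 2, 'data': 2, 'analytics': 1}
```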
Apache Spark is one of the most popular Big Data frameworks. Performance evaluation of Big Data frameworks is a topic of interest due to the increasing number and importance of data analytics applications in the context of HPC and Big Data convergence. In this paper we present an early performance evaluation of a typical supervised graph anomaly detection problem implemented using the GraphX and MLlib libraries of Apache Spark on a cluster.
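A hedged sketch of what such a Spark workload could look like is given below. GraphX itself exposes a Scala API, so here the graph features (vertex degrees) are derived with plain DataFrame operations and fed to an MLlib classifier; the column names, feature set, and paths are our own illustrative assumptions, not the paper's implementation.

```python
# Supervised anomaly detection on a graph with Spark: compute simple
# structural features per vertex, join with labels, train a classifier.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("graph-anomaly-sketch").getOrCreate()

edges = spark.read.parquet("hdfs:///graph/edges")    # columns: src, dst
labels = spark.read.parquet("hdfs:///graph/labels")  # columns: id, label

# Simple structural features: out-degree and in-degree of each vertex.
out_deg = edges.groupBy(F.col("src").alias("id")).agg(F.count("*").alias("out_deg"))
in_deg = edges.groupBy(F.col("dst").alias("id")).agg(F.count("*").alias("in_deg"))
features = out_deg.join(in_deg, "id", "outer").na.fill(0).join(labels, "id")

assembled = VectorAssembler(inputCols=["out_deg", "in_deg"],
                            outputCol="features").transform(features)
model = RandomForestClassifier(labelCol="label").fit(assembled)
```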
The use of Big Data technology has been a modern trend in the travel industry over the last 10 years. At present, almost all travel companies that want to stay profitable and customer-oriented use Big Data technology. Therefore, we have several questions to answer: should we use Big Data in tourism or not? How should it be used? What kinds of risks should be considered in order to achieve effective results? These research problems were examined through a thorough analysis of the Russian and world travel markets using statistical data from several sites, programs, and organizations associated with the tourism industry (e.g., Booking.com, Trivago). The main result of this study is a substantiation of the importance of Big Data technology for the travel industry. Big Data technology helps companies and clients of the sector connect personally, so that their interaction leads to mutual benefit. The net result of this interaction is economic growth of the sector and thus of the country.
Process mining is a relatively new research field offering methods for the analysis and improvement of business processes based on studying their execution history (event logs). Conformance checking is one of the main sub-fields of process mining. Conformance checking algorithms aim to assess how well a given process model, typically represented by a Petri net, and a corresponding event log fit each other. Alignment-based conformance checking is the most advanced and frequently used type of such algorithms. This paper deals with the high computational complexity of the alignment-based conformance checking algorithm. Currently, alignment-based conformance checking is quite inefficient in terms of memory consumption and computation time. Solving this problem is of high importance for checking conformance between real-life business process models and event logs, which can be quite problematic with existing approaches. MapReduce is a popular model of parallel computing which allows for a simple implementation of efficient and scalable distributed calculations. In this paper, a MapReduce version of the alignment-based conformance checking algorithm is described and evaluated. We show that conformance checking can be distributed using MapReduce and benefits from it. Moreover, we demonstrate that computation time scales linearly with the growth of event log size.
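The key property behind this distribution is that traces in an event log are aligned independently, so alignments can be computed in a map phase and their costs aggregated in a reduce phase. The sketch below illustrates only that structure; `align_trace` is a hypothetical stand-in for a real alignment algorithm (such as an A*-based search over the synchronous product net) and is not the paper's implementation.

```python
# Map phase: compute one alignment cost per trace, in parallel.
# Reduce phase: aggregate the per-trace costs into a single measure.
from multiprocessing import Pool

MODEL_ACTIVITIES = {"register", "check", "approve", "archive"}

def align_trace(trace):
    # Hypothetical stand-in for optimal alignment: count events that the
    # model cannot mimic at all; 0 means the trace fits perfectly.
    return sum(1 for event in trace if event not in MODEL_ACTIVITIES)

def conformance(event_log, workers=4):
    with Pool(workers) as pool:             # map: one alignment per trace
        costs = pool.map(align_trace, event_log)
    return sum(costs) / len(event_log)      # reduce: aggregate the costs

if __name__ == "__main__":
    log = [["register", "check", "approve", "archive"],
           ["register", "approve", "ship"]]  # "ship" deviates from the model
    print(conformance(log))                  # 0.5
```

Since each trace is processed independently, adding traces adds map tasks without changing the per-task work, which is consistent with the linear scaling in event log size reported above.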