Industrial track: Architecting railway KPIs data processing with Big Data technologies

Suleykin, A.; Panfilov, P.; Bakhtadze, N.

doi:10.1109/BigData47090.2019.9006196

Chapter

P. 2047–2056.

In this research, we built a data processing pipeline for storing railway KPI data based on open-source Big Data technologies: Apache Hadoop, Kafka, Kafka HDFS Connector, Spark, Airflow, and PostgreSQL. The data load testing methodology we created allowed us to perform load tests iteratively with increasing data size, evaluate the cluster software and hardware resources required, and, finally, detect the bottlenecks of the solution. As a result of the research, we proposed an architecture for data processing and storage and gave recommendations on data pipeline optimization. In addition, we calculated the approximate sizing of cluster machines for the data processing and storage services at the current dataset volume.
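The iterative load-testing idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' methodology or code: `process_batch` is a hypothetical stand-in for a real pipeline stage (such as a Spark job), `make_records` generates synthetic data, and the throughput-drop rule in `find_bottleneck` is an assumed heuristic chosen only for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class LoadTestResult:
    records: int      # number of records processed in this step
    seconds: float    # wall-clock time taken by the workload

    @property
    def throughput(self) -> float:
        # Records per second; guard against a zero-duration run.
        return self.records / self.seconds if self.seconds > 0 else float("inf")


def process_batch(records):
    # Hypothetical stand-in for a real pipeline stage:
    # sums a numeric KPI value per station.
    totals = {}
    for station, value in records:
        totals[station] = totals.get(station, 0.0) + value
    return totals


def make_records(n, stations=100):
    # Synthetic (station_id, kpi_value) pairs; a real test would replay actual data.
    return [(i % stations, float(i)) for i in range(n)]


def run_load_tests(start=10_000, factor=2, steps=5, workload=process_batch):
    # Run the workload repeatedly, multiplying the data size by `factor` each step.
    results = []
    size = start
    for _ in range(steps):
        data = make_records(size)
        t0 = time.perf_counter()
        workload(data)
        elapsed = time.perf_counter() - t0
        results.append(LoadTestResult(size, elapsed))
        size *= factor
    return results


def find_bottleneck(results, drop=0.5):
    # Assumed heuristic: return the first data size at which throughput falls
    # below `drop` times the best throughput seen so far, or None otherwise.
    best = 0.0
    for r in results:
        if best and r.throughput < drop * best:
            return r.records
        best = max(best, r.throughput)
    return None


if __name__ == "__main__":
    for r in run_load_tests():
        print(f"{r.records:>8} records  {r.seconds:.4f} s  {r.throughput:,.0f} rec/s")
```

In a real cluster test, the workload would submit a job and the measurements would also cover CPU, memory, disk, and network usage, which is what sizing estimates would draw on.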

Language: English

Keywords: Hadoop, Spark, Big data technologies, distributed data processing, railway KPI

In book

2019 IEEE International Conference on Big Data (Big Data)

IEEE, 2019.

Metadata-Driven Industrial-Grade ETL System

Suleykin A., Panfilov P., in: 2020 IEEE International Conference on Big Data (Big Data 2020). IEEE, 2020. P. 2433–2442.

Digital transformation of a railway system based on big data technologies relies on integrating large volumes of streaming data into digitally enabled enterprise systems to form a comprehensive and efficient intelligent transportation system. The data requirements of smart railway transportation involve a large amount of unstructured and semi-structured data, including railway KPI data. Traditional ...

Added: April 16, 2021

Unsupervised Graph Anomaly Detection Algorithms Implemented in Apache Spark

Semenov A., Mazeev A., Dmitry D. et al., Lobachevskii Journal of Mathematics 2018 Vol. 39 No. 9 P. 1262–1269.

The graph anomaly detection problem occurs in many application areas and can be solved by spotting outliers in unstructured collections of multi-dimensional data points, which can be obtained by graph analysis algorithms. We implement the algorithm for the small community analysis and the approximate LOF algorithm based on Locality-Sensitive Hashing, apply the algorithms to a ...

Added: June 10, 2019

Big Data: Modern Approaches to Storage and Processing

Клеменков П. А., Kuznetsov S. D., Proceedings of the Institute for System Programming of the RAS 2012 Vol. 23 P. 143–158.

Big data has challenged traditional storage and analysis systems in several new ways. In this paper we try to figure out how to overcome these challenges and why this cannot be done efficiently, and we describe three modern approaches to big data handling: NoSQL, MapReduce and real-time stream processing. The first section of the paper is ...

Added: October 31, 2017

Observations of the connection of positive and negative leaders in meter-scale electric discharges generated by clouds of negatively charged water droplets

Kostinskiy A., Syssoev V. S., Bogatov N. A. et al., JOURNAL OF GEOPHYSICAL RESEARCH-ATMOSPHERES 2016 Vol. 121 No. 16 P. 9756–9766.

Detailed observations of the connection between positive and negative leaders in meter-scale electric discharges generated by clouds of negatively charged water droplets are presented, and their possible implications for the attachment process in lightning are discussed. Optical images obtained with three different high-speed cameras (visible range with image enhancement, visible-range regular, and infrared) and corresponding ...

Added: October 29, 2016

Early Performance Evaluation of Supervised Graph Anomaly Detection Problem Implemented in Apache Spark

Mazeev A., Semenov A., Dmitry D. et al., in: Proceedings of the 3rd Ural Workshop on Parallel, Distributed, and Cloud Computing for Young Scientists. Vol. 1990. CEUR Workshop Proceedings, 2017. P. 84–91.

Apache Spark is one of the most popular Big Data frameworks. Performance evaluation of Big Data frameworks is a topic of interest due to the increasing number and importance of data analytics applications within the context of HPC and Big Data convergence. In the paper we present an early performance evaluation of a typical supervised graph ...

Added: October 30, 2019

Applying MapReduce to Conformance Checking

Shugurov I., Mitsyuk A. A., Proceedings of the Institute for System Programming of the RAS 2016 Vol. 28 No. 3 P. 103–122.

Process mining is a relatively new research field offering methods for business process analysis and improvement that are based on studying process execution history (event logs). Conformance checking is one of the main sub-fields of process mining. Conformance checking algorithms aim to assess how well a given process model, typically represented by a Petri ...

Added: September 12, 2016

Big Data and Its Applications in the Electric Power Industry: From Business Analytics to Virtual Power Plants

Krylov V., Крылов С. В., Moscow: Нобель Пресс, 2014.

Intended for students and specialists in information systems development, including systems for the electric power industry, for heads of enterprise IT departments, and for everyone who works on planning the development of the electric power industry or is simply interested in progress in this field. The book examines the area of data processing known as Big Data and describes its techniques and technologies. The main focus ...

Added: October 10, 2015

Modeling of information attacks, and security risk assessment facilities

Nazarov A., Nguyen Xuan T., Tran Minh H., T-Comm: Telecommunications and transport 2016 Vol. 10 No. 8 P. 69–78.

Logical-probabilistic models for assessing the information security of an attacked object are developed on the basis of a logical-probabilistic approach. The models are based on the current level of knowledge about countering attacks and make it possible to take into account technological features, in particular the functioning of the attacked object, regulations, and any requirements. The properties ...

Added: September 14, 2016

Information spaces: optimizing sequential and parallel processing in big data

Golubtsov P., in: 7th International conference "Problems of Mathematical Physics and Mathematical Modelling" (2018) Book of abstracts. M.: National Research Nuclear University "MEPhI", 2018. P. 173–176.

The process of Bayesian information update is essentially sequential: as a result of an observation, prior information is transformed into posterior information, which is later interpreted as the prior for the next observation, and so on. It is shown that this procedure can be unified and parallelized by converting both the measurement results and the original prior ...

Added: January 23, 2019

Big Data and travel industry

Булгаков А. Л., Financial and Economic Tools Used in the World Hospitality Industry: Proceedings of the 5th International Conference on Management and Technology in Knowledge, Service, Tourism & Hospitality 2017 (SERVE 2017), 21-22 October 2017 & 30 November 2017 2018 Vol. 1 P. 265–270.

The use of Big Data technology has been a modern trend in the travel industry over the last 10 years. At present, almost all travel companies that desire to stay profitable and be customer-oriented use Big Data technology. Therefore, we have several questions to answer: should we use Big Data in tourism or should ...

Added: October 30, 2018

RePlay: a Recommendation Framework for Experimentation and Production Use

Vasilev A., Volodkevich Anna, Kulandin D. et al., in: RecSys '24: Proceedings of the 18th ACM Conference on Recommender Systems. Association for Computing Machinery (ACM), 2024. P. 1191–1194.

Added: November 24, 2024

Information Spaces for Big Data Processing: Unification and Parallelization of Sequential Information Accumulation Procedures

Golubtsov P., in: 21st IEEE Conference on Business Informatics (CBI). IEEE Computer Society, 2019. P. 212–220.

In large-scale research, data are usually collected on many sites, have a huge volume, and new data are constantly generated. Since it is often impossible to collect all the relevant data on a single computer, much attention is paid to the algorithms that provide sequential or parallel accumulation of information and do not need to ...

Added: July 31, 2019