Understanding join strategies in distributed systems

Tyryshkina Y.

In this paper, we consider the problem of reducing computation time costs by developing and implementing a method that accelerates the join of distributed datasets by a given criterion. The following tasks were solved: the architecture of distributed data stores and parallel computing algorithms was studied; on the basis of this study, the limiting stages that slow down processing were identified; a method that eliminates these limiting stages was developed; on the basis of this method, an algorithm and a utility were created that extend the functionality of the selected software product; and experimental studies were carried out.

Language: English

In book

International Seminar on Electron Devices Design and Production, SED 2021

Сигов А. С. [s.n.], 2021.

Triclustering in Big Data Setting

Egurnov D., Точилкин Д. С., Ignatov D. I., in: Complex Data Analytics with Formal Concept Analysis. Springer, 2022. P. 239–258.

In this paper, we describe versions of triclustering algorithms adapted for efficient calculations in distributed environments with the MapReduce model or the parallelisation mechanisms provided by modern programming languages. The OAC family of triclustering algorithms shows good parallelisation capabilities due to the independent processing of triples of a triadic formal context. We provide time and space complexity of the ...

Added: November 1, 2022

Accelerating join of distributed datasets by a given criterion

Tyryshkina Y., in: Proceedings of 2022 IEEE Moscow Workshop on Electronic and Networking Technologies (MWENT). M.: IEEE, 2022.

Added: May 31, 2022

Method for accelerating the operation of joining distributed datasets by a given criterion

Tyryshkina Y., in: International Scientific and Practical Conference "Information Innovative Technologies", 2022. [s.n.], 2022.

Added: May 31, 2022

Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect

Agarkov A., Semenov A., in: Proceedings of the 3rd Ural Workshop on Parallel, Distributed, and Cloud Computing for Young Scientists. Vol. 1990. CEUR Workshop Proceedings, 2017. P. 92–101.

In this paper we consider an association problem with constraints for two dynamically enlarging tables. We consider a base full association algorithm and propose a partial association algorithm that improves efficiency of the base algorithm. We implement and evaluate the algorithms in Apache Spark for a particular case on the cluster with Angara interconnect. ...

Added: October 30, 2019

Simplified Mapreduce Mechanism for Large Scale Data Processing

Ahmed Munna M. T., International Journal of Engineering and Technology 2018 Vol. 7 No. 8 P. 16–21

MapReduce has become a popular programming model for processing and running large-scale data sets with a parallel, distributed paradigm on a cluster. Hadoop MapReduce is needed especially for large scale data like big data processing. In this paper, we work to modify the Hadoop MapReduce Algorithm and implement it to reduce processing time. ...

Added: October 29, 2019

Distributed horizontally scalable solutions for data management

С.Д. Кузнецов, Посконин А. В., Proceedings of the Institute for System Programming of the RAS 2013 Vol. 24 P. 327–258

Many modern applications (such as large-scale Web-sites, social networks, research projects, business analytics, etc.) have to deal with very large data volumes (also referred to as “big data”) and high read/write loads. These applications require underlying data management systems to scale well in order to accommodate data growth and increasing workloads. High throughput, low latencies ...

Added: January 30, 2018

Creating virtual Apache Spark clusters in cloud environments using orchestration systems

Борисенко О. Д., Пастухов Р. К., С.Д. Кузнецов, Proceedings of the Institute for System Programming of the RAS 2016 Vol. 28 No. 6 P. 111–120

Apache Spark is a framework providing fast computations on Big Data using the MapReduce model. With cloud environments, Big Data processing becomes more flexible, since they allow virtual clusters to be created on demand. One of the most powerful open-source cloud environments is Openstack. The main goal of this project is to provide an ability to create virtual ...

Added: January 25, 2018

Implementing a service for running Apache Spark jobs and creating Apache Spark clusters based on Openstack Sahara

S. Kuznetsov, Борисенко О. Д., Алексиянц А. В. et al., Proceedings of the Institute for System Programming of the RAS 2015 Vol. 27 No. 5 P. 35–48

In this paper the problem of creating virtual clusters in clouds for big data analysis with Apache Hadoop and Apache Spark is discussed. Existing methods for creating Apache Spark clusters are described. The implemented solution for building Apache Spark clusters and executing Apache Spark jobs in the Openstack environment is also described. The ...

Added: January 23, 2018

Automated creation of virtual Apache Spark clusters in the Openstack cloud environment

Kuznetsov S. D., Turdakov D. Y., Борисенко О. Д., Proceedings of the Institute for System Programming of the RAS 2014 Vol. 26 No. 4 P. 33–44

This article is dedicated to the automation of cluster creation and management for the Apache Spark MapReduce implementation in Openstack environments. As a result of this project, an open-source (Apache 2.0 license) implementation of a toolchain for on-demand virtual cluster creation in Openstack environments was presented. The article contains an overview of existing solutions for clustering automation in cloud ...

Added: November 26, 2017

Big data: modern approaches to storage and processing

Клеменков П. А., Kuznetsov S. D., Proceedings of the Institute for System Programming of the RAS 2012 Vol. 23 P. 143–158

Big data has challenged traditional storage and analysis systems in several new ways. In this paper we try to figure out how to overcome these challenges and why this cannot be done efficiently, and we describe three modern approaches to big data handling: NoSQL, MapReduce and real-time stream processing. The first section of the paper is ...

Added: October 31, 2017

Gomapreduce parallel computing model implementation on a cluster of plan9 virtual machines

Leokhin, Y., Myagkov, A., Panfilov, P., in: 26th DAAAM International Symposium on Intelligent Manufacturing and Automation 2015. Vol. 1. NY: Curran Associates, Inc., 2015. P. 0656–0662.

In this paper, we present results of a computational evaluation of the goMapReduce parallel programming model approach for solving distributed data processing problems. In some applications, particularly data center problems including text processing, the programming models can aggregate a significant number of parallel processes. We first discuss the implementation of these approaches using both Linux and Plan9 ...

Added: November 26, 2016

Implementing Apache Spark jobs execution and Apache Spark cluster creation for Openstack Sahara

Turdakov D., Aleksiyants A., Borisenko O. et al., Proceedings of the Institute for System Programming of the RAS 2015 Vol. 27 No. 5 P. 35–48

In this paper the problem of creating virtual clusters in clouds for big data analysis with Apache Hadoop and Apache Spark is discussed. Both clouds and MapReduce models are popular nowadays, for their low cost and for efficient big data analysis respectively. For these reasons, having an open source solution for building clusters is ...

Added: September 13, 2016

Applying MapReduce to Conformance Checking

Shugurov I., Mitsyuk A. A., Proceedings of the Institute for System Programming of the RAS 2016 Vol. 28 No. 3 P. 103–122

Process mining is a relatively new research field offering methods for the analysis and improvement of business processes based on studying their execution history (event logs). Conformance checking is one of the main sub-fields of process mining. Conformance checking algorithms aim to assess how well a given process model, typically represented by a Petri ...

Added: September 12, 2016

Putting OAC-triclustering on MapReduce

Зудин С., Gnatyshak D. V., Ignatov D. I., in: Proceedings of the Twelfth International Conference on Concept Lattices and Their Applications, Clermont-Ferrand, France, October 13-16, 2015. Vol. 1466. Clermont-Ferrand: CEUR Workshop Proceedings, 2015. P. 47–58.

In our previous work, an efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) was proposed. This algorithm is a modified version of the basic algorithm for the OAC-triclustering approach; it has linear time and memory complexities. In this paper we parallelise it via the MapReduce framework in order to make it suitable for big datasets. The results of ...

Added: October 23, 2015