Understanding join strategies in distributed systems
Tyryshkina Y.
In this paper, we consider the problem of reducing the cost of computer time by developing and implementing a method that accelerates the joining of distributed data arrays on a given criterion. The following tasks were solved: the architecture of distributed data stores and parallel computing algorithms was studied; on the basis of this study, the limiting stages that slow down processing were identified; a method was developed that eliminates the identified limiting stages; on the basis of this method, an algorithm and a utility were created that extend the functionality of the selected software product; and experimental studies were carried out.
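The abstract does not specify the method itself, so the following is only a minimal sketch of the general operation it refers to: joining two distributed data sets on a key. It assumes a plain hash-partitioned join simulated in a single process; the function names (`hash_partition`, `local_hash_join`, `partitioned_join`) are hypothetical and do not come from the paper. The partitioning step stands in for the data-redistribution (shuffle) stage, which is commonly the expensive part of a distributed join.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Simulate the shuffle stage: route each row to a partition by its join key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def local_hash_join(left, right, key):
    """Join two co-located partitions: build an index on the left, probe with the right."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    # index.get avoids creating empty entries for keys that have no match
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


def partitioned_join(left, right, key, num_partitions=4):
    """Inner join: rows with equal keys land in the same partition, so each
    pair of partitions can be joined independently (in parallel on a real cluster)."""
    left_parts = hash_partition(left, key, num_partitions)
    right_parts = hash_partition(right, key, num_partitions)
    result = []
    for lp, rp in zip(left_parts, right_parts):
        result.extend(local_hash_join(lp, rp, key))
    return result


# Hypothetical sample data, for illustration only
users = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
orders = [{"id": 1, "item": "x"}, {"id": 1, "item": "y"}, {"id": 3, "item": "z"}]
joined = partitioned_join(users, orders, "id")
```

In this sketch every row of both inputs is moved once during partitioning; approaches that speed up a distributed join typically try to reduce or avoid that movement, for example by keeping one side in place.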
Language:
English
In book
Сигов А. С. [s.n.], 2021.
Egurnov D., Точилкин Д. С., Ignatov D. I., , in: Complex Data Analytics with Formal Concept Analysis. Springer, 2022. P. 239–258.
In this paper, we describe versions of triclustering algorithms adapted for efficient computation in distributed environments, using either the MapReduce model or the parallelisation mechanisms provided by modern programming languages. The OAC family of triclustering algorithms parallelises well because the triples of a triadic formal context are processed independently. We provide time and space complexity of the ...
Added: November 1, 2022
Tyryshkina Y., , in: Proceedings of 2022 IEEE Moscow Workshop on Electronic and Networking Technologies (MWENT).: M.: IEEE, 2022.
Added: May 31, 2022
Tyryshkina Y., , in: International Scientific and Practical Conference «Information Innovative Technologies», 2022. [s.n.], 2022.
In this paper, we consider the problem of reducing the cost of computer time by developing and implementing a method that accelerates the joining of distributed data arrays on a given criterion. The following tasks were solved: the architecture of distributed data stores and parallel computing algorithms was studied; on ...
Added: May 31, 2022
Agarkov A., Semenov A., , in: Proceedings of the 3rd Ural Workshop on Parallel, Distributed, and Cloud Computing for Young Scientists. Vol. 1990. CEUR Workshop Proceedings, 2017. P. 92–101.
In this paper, we consider an association problem with constraints for two dynamically growing tables. We consider a base full association algorithm and propose a partial association algorithm that improves the efficiency of the base algorithm. We implement and evaluate the algorithms in Apache Spark for a particular case on a cluster with the Angara interconnect. ...
Added: October 30, 2019
Ahmed Munna M. T., International Journal of Engineering and Technology 2018 Vol. 7 No. 8 P. 16–21
MapReduce has become a popular programming model for processing large-scale data sets in a parallel, distributed fashion on a cluster. Hadoop MapReduce is needed especially for large-scale workloads such as big data processing. In this paper, we modify the Hadoop MapReduce algorithm and implement the modification to reduce processing time. ...
Added: October 29, 2019
С.Д. Кузнецов, Посконин А. В., Proceedings of the Institute for System Programming of the RAS 2013 Vol. 24 P. 327–258
Many modern applications (such as large-scale websites, social networks, research projects, business analytics, etc.) have to deal with very large data volumes (also referred to as “big data”) and high read/write loads. These applications require the underlying data management systems to scale well in order to accommodate data growth and increasing workloads. High throughput, low latencies ...
Added: January 30, 2018
Борисенко О. Д., Пастухов Р. К., С.Д. Кузнецов, Proceedings of the Institute for System Programming of the RAS 2016 Vol. 28 No. 6 P. 111–120
Apache Spark is a framework that provides fast computations on Big Data using the MapReduce model. In cloud environments, Big Data processing becomes more flexible, since they allow virtual clusters to be created on demand. One of the most powerful open-source cloud environments is Openstack. The main goal of this project is to provide an ability to create virtual ...
Added: January 25, 2018
S. Kuznetsov, Борисенко О. Д., Алексиянц А. В. et al., Proceedings of the Institute for System Programming of the RAS 2015 Vol. 27 No. 5 P. 35–48
In this paper, the problem of creating virtual clusters in clouds for big data analysis with Apache Hadoop and Apache Spark is discussed. Existing methods for creating Apache Spark clusters are described. The implemented solution for building Apache Spark clusters and executing Apache Spark jobs in an Openstack environment is also described. The ...
Added: January 23, 2018
Kuznetsov S. D., Turdakov D. Y., Борисенко О. Д., Proceedings of the Institute for System Programming of the RAS 2014 Vol. 26 No. 4 P. 33–44
This article is dedicated to the automation of cluster creation and management for the Apache Spark MapReduce implementation in Openstack environments. As a result of this project, an open-source (Apache 2.0 license) implementation of a toolchain for on-demand virtual cluster creation in Openstack environments was presented. The article contains an overview of existing solutions for clustering automation in cloud ...
Added: November 26, 2017
Клеменков П. А., Kuznetsov S. D., Proceedings of the Institute for System Programming of the RAS 2012 Vol. 23 P. 143–158
Big data has challenged traditional storage and analysis systems in several new ways. In this paper, we try to figure out how to overcome these challenges and why doing so efficiently is not possible, and we describe three modern approaches to big data handling: NoSQL, MapReduce and real-time stream processing. The first section of the paper is ...
Added: October 31, 2017
Leokhin, Y., Myagkov, A., Panfilov, P., , in: 26th DAAAM International Symposium on Intelligent Manufacturing and Automation 2015. Vol. 1. NY: Curran Associates, Inc., 2015. P. 0656–0662.
In this paper, we present the results of a computational evaluation of the goMapReduce parallel programming model approach for solving distributed data processing problems. In some applications, particularly data center problems, including text processing, the programming models can aggregate a significant number of parallel processes. We first discuss the implementation of these approaches using both Linux and Plan9 ...
Added: November 26, 2016
Turdakov D., Aleksiyants A., Borisenko O. et al., Proceedings of the Institute for System Programming of the RAS 2015 Vol. 27 No. 5 P. 35–48
In this paper, the problem of creating virtual clusters in clouds for big data analysis with Apache Hadoop and Apache Spark is discussed. Both clouds and MapReduce models are popular nowadays for several reasons, notably low cost and efficient big data analysis, respectively. For these reasons, having an open source solution for building clusters is ...
Added: September 13, 2016
Shugurov I., Mitsyuk A. A., Proceedings of the Institute for System Programming of the RAS 2016 Vol. 28 No. 3 P. 103–122
Process mining is a relatively new research field that offers methods for analysing and improving business processes based on their execution history (event logs). Conformance checking is one of the main sub-fields of process mining. Conformance checking algorithms aim to assess how well a given process model, typically represented by a Petri ...
Added: September 12, 2016
Зудин С., Gnatyshak D. V., Ignatov D. I., , in: Proceedings of the Twelfth International Conference on Concept Lattices and Their Applications, Clermont-Ferrand, France, October 13-16, 2015. Vol. 1466. Clermont-Ferrand: CEUR Workshop Proceedings, 2015. P. 47–58.
In our previous work, an efficient one-pass online algorithm for triclustering of binary data (triadic formal contexts) was proposed. This algorithm is a modified version of the basic algorithm for the OAC-triclustering approach; it has linear time and memory complexity. In this paper, we parallelise it via the MapReduce framework in order to make it suitable for big datasets. The results of ...
Added: October 23, 2015