Implementing Apache Spark jobs execution and Apache Spark cluster creation for Openstack Sahara

Turdakov D.; Aleksiyants A.; Borisenko O.; Sher A.; Kuznetsov S.

doi:10.15514/ISPRAS-2015-27(5)-3

Proceedings of the Institute for System Programming of the RAS. 2015. Vol. 27. No. 5. P. 35–48.

This paper discusses the problem of creating virtual clusters in clouds for big data analysis with Apache Hadoop and Apache Spark. Clouds and the MapReduce model are both popular today: clouds because they are inexpensive, MapReduce because it enables efficient big data analysis. For these reasons, an open source solution for building clusters is important. The article gives an overview of existing methods for creating Apache Spark clusters in clouds. We consider two open source cloud engines, OpenStack and Eucalyptus, and the most popular proprietary cloud service, Amazon Web Services, and examine the cloud-related features these systems provide. We then review possible ways of creating virtual clusters for big data processing in OpenStack and describe their pros and cons. In the second part we describe in detail one of these solutions, which uses the Sahara service. Sahara is a cluster management system for OpenStack that is used for setting up virtual clusters and executing MapReduce jobs. Sahara did not support contemporary versions of Apache Spark. The article presents the results of our work on modifying Sahara and describes the idea behind the modification and its implementation details. With our modification, Sahara is able to create and use virtual clusters with contemporary versions of Apache Spark in OpenStack clouds.

Priority areas: IT and mathematics

Language: English

Keywords: IaaS, Apache Spark, Openstack

ML-based Fast Simulation of FARICH Responses

Shipilov F., Barnyakov A., Ivanov A. et al., / Series Physics "arxiv.org". 2026.

A fast simulation of the detector response is a vital task in high-energy physics (HEP). Traditional Monte-Carlo methods form the backbone of modern particle physics simulation software but are computationally expensive. We present a machine-learning-based approach to fast simulation of the Focusing Aerogel Ring Imaging Cherenkov (FARICH) detector response. Given a particle track and momentum, ...

Added: May 19, 2026

Natural hazard database from Internet publications: text mining with a large language model

Derkacheva A., Sakirkina M., Kraev G. et al., /. 2026.

Comprehensive data on natural hazards and their consequences are crucial for effective risk assessment, adaptation planning, and emergency response. However, many countries face challenges with fragmented, inconsistent, and inaccessible data, particularly regarding local-scale events. To address this data gap in Russia, we developed an end-to-end processing pipeline that scrapes news from various online sources, ...

Added: April 28, 2026

Algorithmic overlaps as thermodynamic variables: from local to cluster Monte Carlo dynamics in critical phenomena

Pilé I., Deng Y., Shchur L., / Series arXiv "math". 2026. No. 2604.10254.

We investigate the spatial overlap of successive spin configurations in Markov chain Monte Carlo simulations using the local Metropolis algorithm and the Svendsen-Wang and Wolff cluster algorithms. We examine the dynamics of these algorithms for two models in different universality classes: the Ising model and the Potts model with three components. The overlap of two ...

Added: April 20, 2026

Using predefined vector systems to speed up neural network multimillion class classification

Gabdullin N., Androsov I., / Series Computer Science "arxiv.org". 2026.

Label prediction in neural networks (NNs) has O(n) complexity proportional to the number of classes. This holds true for classification using fully connected layers and cosine similarity with some set of class prototypes. In this paper we show that if NN latent space (LS) geometry is known and possesses specific properties, label prediction complexity can ...

Added: April 2, 2026

Iterative Ricci-Foster Curvature Flow with GMM-Based Edge Pruning: A Novel Approach to Community Detection

Sorokin K., Beketov M., Onuchin A. et al., / arxiv.org. Series cs.SI "Social and Information Networks". 2025.

Community detection in complex networks is a fundamental problem, open to new approaches in various scientific settings. We introduce a novel community detection method, based on Ricci flow on graphs. Our technique iteratively updates edge weights (their metric lengths) according to their (combinatorial) Foster version of Ricci curvature computed from effective resistance distance between the ...

Added: January 15, 2026

Implementing Transport Coding in OMNeT++ for Message Delay Reduction

Petrovanov I., Sergeev A., / Series Computer Science "arxiv.org". 2025. No. 2512.18332.

Transport coding reduces message delay in packet-switched networks by introducing controlled redundancy at the transport layer: original packets are encoded into coded packets, and the message is reconstructed after the first successful deliveries, effectively shifting latency from the maximum packet delay to the -th order statistic. We present a concise, reproducible discrete-event implementation of transport coding in OMNeT++, including ...

Added: December 24, 2025

Hessian-based lightweight neural network for brain vessel segmentation on a minimal training dataset

Menshikov I. A., Bernadotte A. K., Elvimov N. S., / Series arXiv "Statistical mechanics". 2025.

Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image ...

Added: December 1, 2025

Determining the boundary of dynamical chaos in the generalized Chirikov map via machine learning

Chernyshov D., Satanin A., Shchur L., / Series arXiv "math". 2025.

We investigate the boundary separating regular and chaotic dynamics in the generalized Chirikov map, an extension of the standard map with phase-shifted secondary kicks. Lyapunov maps were computed across the parameter space (K,K(α, τ)) and used to train a convolutional neural network (ResNet18) for binary classification of dynamical regimes. The model reproduces the known critical ...

Added: November 21, 2025

An efficient stock market trading algorithm: a retrospective analysis based on S&P-500 data.

Rubchinskiy A., Chubarova D., / Series WP7 "Mathematical methods for decision analysis in economics, business and politics". 2025. No. WP7/2025/01.

The article examines one of the most famous examples of socio-economic systems, characterized by significant uncertainty – the S&P-500 stock market, where shares of 500 largest US companies are traded. No assumptions are made about the probabilistic characteristics of the stock market. A flexible algorithm for daily trading has been developed, based on both known fixed data ...

Added: November 9, 2025

Diffusion on language model embeddings for protein sequence generation

Meshchaninov V., Strashnov P., Shevtsov A. et al., / Cornell University. Series CoRR, arXiv:2403.03726 "Computing Research Repository". 2025.

Protein design requires a deep understanding of the inherent complexities of the protein universe. While many efforts lean towards conditional generation or focus on specific families of proteins, the foundational task of unconditional generation remains underexplored and undervalued. Here, we explore this pivotal domain, introducing DiMA, a model that leverages continuous diffusion on embeddings derived ...

Added: October 5, 2025

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

Shabalin A., Meshchaninov V., Vetrov D., / Series cs.CL, arXiv:2505.18853 "Computation and Language". 2025.

Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic ...

Added: October 5, 2025

A Feature Engineering Framework for Computer Vision Based on Topological Data Analysis

Abramov A. S., Chernyshev V. L., Mikhaylets E. et al., / Series Social Science Research Network "Social Science Research Network". 2025.

Computer vision is one of the most relevant modern research areas with broad practical applications. However, traditional solutions based on deep learning have significant limitations and can be misleading. Topological data analysis, on the other hand, is a modern approach to solving similar problems using mathematically deterministic methods of algebraic topology that reduce the risk ...

Added: September 23, 2025

On the construction of frieze patterns from partitions of convex polygons by nonintersecting diagonals

Kochetkov Y., / Series arXiv.org e-print archive "arXiv.math". 2025. No. 07600.

We demonstrate in an elementary way how to construct a frieze pattern of width m-3 from a partition of a convex m-gon by nonintersecting diagonals. ...

Added: September 17, 2025

A two-phase heuristic algorithm for power-aware offline scheduling in IaaS clouds

Ignatov A., Maslova I., Posypkin M. et al., Journal of Parallel and Distributed Computing 2023 Vol. 178 P. 1–10

The paper aims at mitigating hot-spots during Offline Scheduling in IaaS (Infrastructure-as-a-Service) cloud systems. Unlike previous studies, the research focuses on identifying and resolving hot-spots not at servers, but at server racks. A two-phase algorithm for performing power-aware offline scheduling is proposed. The first phase aims at identifying and mitigating hot-spots at racks, while the ...

Added: May 12, 2023

Internet of Things: Analysis of Parameters and Requirements

Ebraheem A., Ivanov I., in: 2022 International Conference on Smart Applications, Communications and Networking (SmartNets). IEEE, 2022. P. 01–04.

Systems of Internet of Things are relatively complex to design and maintain. Different parameters affect their performance and outcome. A four-layer architecture is used to analyze the different parameters of an IoT system. In each layer a detailed description of the important factors that have an impact on the design is presented. Guidelines for better ...

Added: April 18, 2023

Understanding join strategies in distributed systems

Tyryshkina Y., in: International Seminar on Electron Devices Design and Production, SED 2021. [s.n.], 2021.

In this paper, we consider the problem of reducing the cost of computer time by developing and implementing a method for accelerating the join of distributed data sets on a given criterion. The following tasks were solved: a study was conducted on the architecture of distributed data storages and parallel computing algorithms; on ...

Added: June 2, 2022

Performance Evaluation of Large Table Association Problem Implemented in Apache Spark on Cluster with Angara Interconnect

Agarkov A., Semenov A., in: Proceedings of the 3rd Ural Workshop on Parallel, Distributed, and Cloud Computing for Young Scientists. Vol. 1990. CEUR Workshop Proceedings, 2017. P. 92–101.

In this paper we consider an association problem with constraints for two dynamically enlarging tables. We consider a base full association algorithm and propose a partial association algorithm that improves efficiency of the base algorithm. We implement and evaluate the algorithms in Apache Spark for a particular case on the cluster with Angara interconnect. ...

Added: October 30, 2019

Creating Apache Spark virtual clusters in cloud environments using orchestration systems

Borisenko O. D., Pastukhov R. K., Kuznetsov S. D., Proceedings of the Institute for System Programming of the RAS 2016 Vol. 28 No. 6 P. 111–120

Apache Spark is a framework providing fast computations on Big Data using the MapReduce model. With cloud environments, Big Data processing becomes more flexible, since they allow virtual clusters to be created on demand. One of the most powerful open-source cloud environments is Openstack. The main goal of this project is to provide an ability to create virtual ...

Added: January 25, 2018

Implementing Apache Spark jobs execution and Apache Spark cluster creation for Openstack Sahara

Kuznetsov S., Borisenko O. D., Aleksiyants A. V. et al., Proceedings of the Institute for System Programming of the RAS 2015 Vol. 27 No. 5 P. 35–48

In this paper the problem of creating virtual clusters in clouds for big data analysis with Apache Hadoop and Apache Spark is discussed. Existing methods for Apache Spark cluster creation are described in this work. The implemented solution for building Apache Spark clusters and executing Apache Spark jobs in an Openstack environment is also described. The ...

Added: January 23, 2018

Automated creation of Apache Spark virtual clusters in the Openstack cloud environment

Kuznetsov S. D., Turdakov D. Y., Borisenko O. D., Proceedings of the Institute for System Programming of the RAS 2014 Vol. 26 No. 4 P. 33–44

This article is dedicated to automating cluster creation and management for the Apache Spark MapReduce implementation in Openstack environments. As a result of this project, an open-source (Apache 2.0 license) implementation of a toolchain for on-demand virtual cluster creation in Openstack environments was presented. The article contains an overview of existing solutions for clustering automation in cloud ...

Added: November 26, 2017

A method for performance testing and stress testing of central identity services in cloud systems, using Openstack Keystone as an example

Avetisyan A., Bogomolov I. V., Aleksiyants A. V. et al., Proceedings of the Institute for System Programming of the RAS 2015 Vol. 27 No. 5 P. 49–58

Nowadays the OpenStack platform is a leading solution in the cloud computing field. Keystone, the OpenStack Identity Service, is one of its major components. This service is responsible for authentication and authorization of users and services of the system. Keystone is a high-load service since all interactions between services happen through it. This leads us to the ...

Added: March 22, 2017

Approaches to the development of a mediacontent delivery network based on the infrastructure of existing saas and iaas providers

Korolev D., Gorokhova-Alekseyeva A., in: Proceedings of the 2016 IEEE Conference on Quality Management, Transport and Information Security, Information Technologies (IT&MQ&IS-2016). St. Petersburg: IEEE, 2016. P. 99–102.

Video broadcasts on the Internet have become commonplace and increasingly find their audience, supported by popular video services and social networks. But some tasks require a content delivery network (CDN), which leads to extra expenses and, moreover, does not give sufficient flexibility and limits personalization of the broadcasts. This paper presents the principles ...

Added: February 26, 2017