Tools for Analysis and Development of Efficient Code for Parallel Architectures

Monakov A. V., Platonov V. A., Avetisyan A. I.

Труды Института системного программирования РАН. 2014. Vol. 26. No. 1. P. 357–374.

The article proposes methods for supporting the development of efficient programs for modern parallel architectures, including hybrid systems. First, specialized profiling methods are proposed for programmers tasked with parallelizing existing code. The first method is loop-based profiling via source-level instrumentation done with the Coccinelle tool. The second method is memory reuse distance estimation via the virtual memory protection mechanism and manual instrumentation. The third method is cache miss and false sharing estimation: a partial trace of memory accesses is collected using compiler instrumentation, and cache behavior is estimated in postprocessing from the trace and a cache model. Second, the problem of automatic parallel code generation for hybrid architectures is discussed. Our approach is to generate OpenCL code from parallel loop nests based on the GRAPHITE infrastructure in the GCC compiler. Finally, in cases where achieving high efficiency on hybrid systems requires significant rework of data structures or algorithms, one can employ auto-tuning to specialize for specific input data and hardware at run time. This is demonstrated on the problem of optimizing sparse matrix-vector multiplication for GPUs and its use for accelerating linear system solving in the OpenFOAM CFD package. We propose a variant of the “sliced ELLPACK” sparse matrix storage format with special treatment for small horizontal or diagonal blocks, where the exact parameters of the matrix structure and GPU kernel launch are tuned automatically at run time for the specific matrix and GPU hardware.
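As a rough illustration of the storage format named in the abstract, the sketch below builds a plain sliced ELLPACK representation and multiplies it by a vector. It is not the authors' variant: the special treatment of small horizontal or diagonal blocks and the run-time auto-tuning are omitted, the function names `to_sliced_ellpack` and `spmv` and the `slice_height` parameter are hypothetical, and data is stored row by row for readability, whereas GPU implementations typically store each slice column by column.

```python
from typing import List, Tuple

# One slice: (padded width, values per row, column indices per row).
Slice = Tuple[int, List[List[float]], List[List[int]]]


def to_sliced_ellpack(dense: List[List[float]], slice_height: int) -> List[Slice]:
    """Convert a dense matrix into a list of ELLPACK slices.

    Each slice covers `slice_height` consecutive rows and is padded only to
    the longest row inside that slice, not to the longest row of the matrix.
    """
    slices: List[Slice] = []
    for start in range(0, len(dense), slice_height):
        rows = dense[start:start + slice_height]
        # Keep (column, value) pairs of the nonzero entries of every row.
        nonzeros = [[(j, v) for j, v in enumerate(row) if v != 0] for row in rows]
        width = max((len(nz) for nz in nonzeros), default=0)
        # Padding entries use value 0.0 and column 0, so they contribute nothing.
        values = [[v for _, v in nz] + [0.0] * (width - len(nz)) for nz in nonzeros]
        columns = [[j for j, _ in nz] + [0] * (width - len(nz)) for nz in nonzeros]
        slices.append((width, values, columns))
    return slices


def spmv(slices: List[Slice], x: List[float]) -> List[float]:
    """Compute y = A * x for a matrix A stored as sliced ELLPACK."""
    y: List[float] = []
    for _width, values, columns in slices:
        for row_values, row_columns in zip(values, columns):
            y.append(sum(v * x[j] for v, j in zip(row_values, row_columns)))
    return y
```

In this sketch the slice height is the one tunable parameter: smaller slices reduce padding, while larger slices map more naturally onto wide GPU thread groups, which is the kind of trade-off an auto-tuner would explore for a given matrix and device.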

Language: Russian

Keywords: CUDA, program analysis and optimization, OpenCL, OpenFOAM, profiling, sparse matrices

GEMM Algorithm for Multi-GPU Platforms with Regular Uneven Data Transfer Links

Choi Y. R., Malkovsky S., Stegailov V., in: 11th Russian Supercomputing Days, RuSCDays 2025, Moscow, Russia, September 29–30, 2025, Revised Selected Papers. Springer, 2026. Ch. 3 P. 32–47.

Multi-GPU servers often exhibit uneven characteristics. For instance, the data transfer bandwidth between four NVIDIA V100 GPUs can vary due to the NVLink connecting these devices to a specific CPU in servers with IBM POWER 9 processors, which means that the communication bandwidth between other devices is comparatively slower. To address this issue, the Multi-GPU ...

Added: January 3, 2026

Problems of Exercising the Right to Freedom of Speech in the Era of Big Data

Leskina E. I., Журнал российского права 2025 Vol. 29 No. 8 P. 50–65

The understanding of freedom of speech evolves, among other things, in connection with the development of information and communication technologies and of the ways in which people actually exercise freedom of speech. The development of platforms and social networks, in which big data plays a key role, leads us to a new era of ...

Added: September 4, 2025

Big Data and Health Protection: Opportunities and Risks

Leskina E. I., in: Взаимодействие власти, бизнеса и общества в сохранении и укреплении общественного здоровья. Saratov: Издательство Саратовского университета, 2024.

Big data is a complex phenomenon. Health protection has a dual relationship with big data technology. On the one hand, information from the healthcare system forms huge data sets. On the other hand, data analytics directly affects healthcare and medical care. The directions in which big data influences healthcare are considered, and the current strategies of foreign states in the field of ...

Added: February 17, 2025

Big Data and the Fight Against Terrorism: Opportunities and Prospects

Leskina E. I., in: Вызовы информационного общества: тенденции развития правового регулирования цифровых трансформаций: Монография по материалам 3.0 международной научно-практической конференции. Saratov: ФГБОУ ВПО "Саратовская государственная юридическая академия", 2022. P. 81–88.

Over the past three years, more than six thousand crimes of a terrorist nature have been committed in the Russian Federation. Many of these crimes leave digital traces that can be used to prevent or detect such acts. Universal digitalization, the development of the information society, the active use of information technologies by the public ...

Added: October 2, 2024

Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe

Choi Y. R., Stegailov V., in: 22nd International Conference, MMST 2022, Nizhny Novgorod, Russia, November 14–17, 2022, Revised Selected Papers. Springer, 2022. Ch. 23 P. 281–292.

Modern types of multi-GPU servers combine up to 8 A100 GPUs connected by NVLink 3.0 links through NVSwitch. This connectivity provides unprecedented capabilities for multi-GPU algorithms. In this work, we analyze the performance of the matrix-matrix multiplication algorithm that we developed previously. Tuning principles and limits for maximum performance are discussed. Algorithm performance for much more ...

Added: December 26, 2022

Tuning of a Matrix-Matrix Multiplication Algorithm for Several GPUs Connected by Fast Communication Links

Choi Y. R., Nikolskiy V., Stegailov V., in: Parallel Computational Technologies: 16th International Conference, PCT 2022, Dubna, Russia, March 29–31, 2022, Revised Selected Papers. Springer, 2022. Ch. 12 P. 158–171.

Added: August 11, 2022

The Maniac from the Neighboring Canon: How Science Normalizes Cult Evil

Maria Marey, Философия. Журнал Высшей школы экономики 2022 Vol. 6 No. 2 P. 148–167

This article studies cinematic images of serial criminals in a series of relevant topics: those where scientific and quasi-scientific methods, which are called “profiling” in Russian, are used to identify and catch them. Assuming that cinema and television can change (and shape) a person's ideas about life, norms, about right or wrong, ...

Added: July 2, 2022

Algorithm for Adaptive Mesh Redistribution in Lattice Boltzmann Simulations

Ziganurova L., Shchur L., Lobachevskii Journal of Mathematics 2022 Vol. 43 No. 2 P. 513–518

The Lattice Boltzmann method (LBM) is an alternative approach to solving hydrodynamic equations. Two factors make it a favored approach nowadays. Firstly, LBM is intrinsically suited to parallel simulations due to the linear structure of the algorithm. Secondly, what makes LBM special for research is that it is well applicable to the simulations ...

Added: May 25, 2022

GPU-accelerated molecular dynamics: State-of-art software performance and porting from Nvidia CUDA to AMD HIP

Kondratyuk N., Nikolskiy V., Pavlov D. et al., International Journal of High Performance Computing Applications 2021 Vol. 35 No. 4 P. 312–324

Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As usual, the efficiency of such calculations is based on an interplay of software and hardware that are nowadays moving to hybrid GPU-based technologies. Several well-developed open-source MD codes focused on GPUs differ both in their data management ...

Added: June 25, 2021

A Study of the Applicability of Sparse Matrix Storage Methods to the Problem of Computing Transient Currents in Spacecraft Structures

Barinova S. A., Spirin D. A., Vostrikov A. V. et al., Системный администратор 2021 No. 5(222) P. 79–81

This work analyzes the applicability of storage methods for large sparse matrices in electrical circuit calculations. Conclusions are drawn about the conditions under which explicit Runge-Kutta methods are applicable. The result of the study can be integrated into the existing educational environment and be an auxiliary link in the procedure for ...

Added: May 19, 2021

Algorithm for replica redistribution in an implementation of the population annealing method on a hybrid supercomputer architecture

Russkov A., Chulkevich R., Shchur L., Computer Physics Communications 2021 Vol. 261 P. 107786

The population annealing method is a promising approach for large-scale simulations because it is potentially scalable on any parallel architecture. We present an implementation of the algorithm on a hybrid program architecture combining CUDA and MPI. The problem is to keep all general-purpose graphics processing unit devices as busy as possible by efficiently redistributing replicas. ...

Added: December 28, 2020

Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink

Choi Y. R., Nikolskiy V., Stegailov V., in: 2020 Global Smart Industry Conference (GloSIC). IEEE, 2020. P. 354–361.

Added: December 3, 2020

Evaluating OpenMP, OpenACC and CUDA parallel programming models for the GPU: Performance Analysis

Timofeev A., Khalilov M., in: Параллельные вычислительные технологии (ПаВТ'2020). Chelyabinsk, 2020. P. 40–51.

Modern supercomputers use GPUs as accelerators in computing nodes. GPUs allow scientific applications to greatly boost performance using fine-grained parallelism. The CUDA programming model is oriented toward exploiting the SIMT GPU architecture by writing low-level code. In contrast to this approach, OpenACC and OpenMP 4.5 represent a declarative model of parallel programming using compiler pragmas with support ...

Added: October 23, 2020

Performance and portability of state-of-art molecular dynamics software on modern GPUs

Kuznetsov E., Kondratyuk N., Logunov M. et al., in: PPAM 2019: Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science Vol. 12043: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part I. Springer, 2020. P. 324–334.

Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As usual, the efficiency of such calculations is based on an interplay of software and hardware that are nowadays moving to hybrid GPU-based technologies. Several well-developed MD packages focused on GPUs differ both in their data management capabilities and ...

Added: October 14, 2020

Algorithm for the replica redistribution in the implementation of parallel annealing method on the hybrid supercomputer architecture

Russkov A., Chulkevich R., Shchur L. / Series arXiv "math". 2020. No. 2006.00561.

The parallel annealing method is one of the promising approaches for large-scale simulations because it is potentially scalable on any parallel architecture. We present an implementation of the algorithm on a hybrid program architecture combining CUDA and MPI. The problem is to keep all general-purpose graphics processing unit devices as busy as possible redistributing replicas and ...

Added: June 2, 2020

Performance of Modern Computing Platforms in Molecular Dynamics Simulations of Protein-Membrane Systems

Nolde D., Krylov N., Telegin P. N. et al., Труды НИИСИ РАН 2018 Vol. 7 No. 4 P. 157–161

The performance of the molecular dynamics software package Gromacs was measured on various hardware: desktop computers, clusters based on x86_64 processors or Many Integrated Core processors, and heterogeneous systems with gaming graphics cards or general-purpose GPU systems. The optimal choice of hardware for molecular dynamics simulations is discussed. ...

Added: February 10, 2020

Implementation of an XSL block cipher with MDS-matrix linear transformation on NVIDIA CUDA

Fomin D., Математические вопросы криптографии 2015 Vol. 6 No. 2 P. 99–108

In this article we consider NVIDIA GPU implementation aspects of an XSL block cipher over the finite field with MDS-matrix linear transformation. We compare obtained results with some other block ciphers. ...

Added: May 4, 2019

A timing attack on CUDA implementations of an AES-type block cipher

Fomin D., Математические вопросы криптографии 2016 Vol. 7 No. 2 P. 121–130

A timing attack against an AES-type block cipher CUDA implementation is presented. Our experiments show that it is possible to extract a secret AES 128-bit key with complexity of 2^32 chosen plaintext encryptions. This approach may be applied to AES with other key sizes and, moreover, to any block cipher with a linear transform that is ...

Added: May 4, 2019

Profiling GATE Developer to Identify the Cause of Memory Overflow

Makarov V. V., Lanin V., in: Математика программных систем: межвуз. сб. науч. тр. Issue 15. Perm: Пермский государственный национальный исследовательский университет, 2018. P. 44–49.

The article is based on the results of processing the British National Corpus (BNC) in the linguistic research system GATE Developer. The authors faced reduced performance caused by the system's incorrect allocation of RAM. The paper investigates the problem of memory overflow, identifies possible causes of ...

Added: January 18, 2019

Optimization of Dynamic Library Loading on the ARM Architecture

Kudryashov E., Melnik D. M., Monakov A. V., Труды Института системного программирования РАН 2016 Vol. 28 No. 1 P. 63–80

The paper discusses an optimization approach for external calls in position-independent code that is based on loading the callee address immediately at the call site from the Global Offset Table (GOT), avoiding the use of the Procedure Linkage Table (PLT). Normally the Linux toolchain creates the PLT both in the main executable (which comprises position-dependent ...

Added: November 5, 2018