Evaluating OpenMP, OpenACC and CUDA parallel programming models for the GPU: Performance Analysis

A. Timofeev; M. Khalilov

doi:10.14529/pct2020

P. 40-51.

Modern supercomputers use GPUs as accelerators in their compute nodes. GPUs allow scientific applications to boost performance substantially through fine-grained parallelism. The CUDA programming model is oriented toward exploiting the SIMT GPU architecture through low-level code. In contrast, OpenACC and OpenMP 4.5 offer a declarative model of parallel programming based on compiler pragmas with support for GPU offloading. This paper considers the efficiency of matrix multiplication implemented with these programming models. It presents a comparative performance analysis of naive and hand-tuned matrix multiplication on Nvidia Tesla V100 and MX940 GPUs and on modern CPUs. An analysis of vendor-optimized BLAS libraries is also presented.
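As an illustration of the declarative, pragma-based style the abstract describes, the following is a minimal sketch of a naive matrix multiplication in C with an OpenMP 4.5 offloading directive. It is not the authors' code: the function name `matmul_naive`, the row-major square-matrix layout, and the choice of `float` are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Naive dense matrix multiplication c = a * b for square n x n matrices
 * stored in row-major order (layout and name are assumptions of this sketch).
 * The pragma asks an OpenMP 4.5 compiler to offload the loop nest to a GPU.
 * A compiler invoked without OpenMP support ignores the pragma and runs the
 * same loops sequentially on the CPU. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    /* Precondition: all three buffers must be valid. */
    assert(a != NULL && b != NULL && c != NULL);

    #pragma omp target teams distribute parallel for collapse(2) \
        map(to: a[0:n*n], b[0:n*n]) map(from: c[0:n*n])
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            /* Dot product of row i of a with column j of b. */
            for (size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Calling `matmul_naive(2, a, b, c)` with `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}` fills `c` with `{19, 22, 43, 50}`. The single directive is the only GPU-specific line, which is the contrast with a hand-written CUDA kernel that the abstract draws; whether the loop actually runs on a device depends on the compiler and its offloading flags.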

Language: English

Keywords: CUDA, GPU acceleration

In book

Parallel Computational Technologies (PCT'2020)

Chelyabinsk, 2020

A timing attack on CUDA implementations of an AES-type block cipher

Fomin D., Mathematical Aspects of Cryptography 2016 Vol. 7 No. 2 P. 121-130

A timing attack against an AES-type block cipher CUDA implementation is presented. Our experiments show that it is possible to extract a secret AES 128-bit key with complexity of 2^32 chosen plaintext encryptions. This approach may be applied to AES with other key sizes and, moreover, to any block cipher with a linear transform that is ...

Added: May 4, 2019

Comparison of old and new cryptographic hash function standards of the Russian Federation on CPUs and NVIDIA GPUs

Lebedev P. A., Mathematical Aspects of Cryptography 2013 Vol. 4 No. 2 P. 73-80

We present optimization guidelines and implementations of cryptographic hash functions GOST R 34.11-94 and GOST R 34.11-2012. Results for x86_64 CPUs and NVIDIA CUDA-capable GPUs are provided for our and several other well-known implementations. It is shown that the new standard may be twice as fast as the old one on modern CPUs, but it ...

Added: April 1, 2013

On the application of CUDA technology to image processing and graphical pattern recognition

Gostev I. M., in : Distributed Computing and Grid Technologies in Science and Education. Proceedings of the 5th International Conference, Dubna, July 16-21, 2012. : Dubna : Joint Institute for Nuclear Research, 2012. P. 274-279.

Solving image processing and graphical pattern recognition problems usually relies on a technology that comprises a sequence of operations. This work examines the time spent on processing, which depends on the number and computational cost of these operations, the size of the input image, and the rate of data transfer between the individual processing stages. ...

Added: July 19, 2013

PPAM 2019: Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science

Springer, 2020

This volume comprises the proceedings of the 13th International Conference on Parallel Processing and Applied Mathematics (PPAM 2019), which was held in Białystok, Poland, September 8–11, 2019. It was organized by the Department of Computer and Information Science of the Częstochowa University of Technology together with Białystok University of Technology, under the patronage of the Committee ...

Added: October 14, 2020

Development of a decision support system based on neural networks and a genetic algorithm

Oleg E. Bukharov, Dmitry P. Bogolyubov, Expert Systems with Applications 2015 Vol. 42 No. 15-16 P. 6177-6183

Given ever increasing information volume and complexity of engineering, social and economic systems, it has become more difficult to assess incoming data and manage such systems properly. Currently developed innovative decision support systems (DSS) aim to achieve optimum results while minimizing the risks of serious losses. The purpose of the DSS is to help the ...

Added: May 17, 2015

Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink

Choi Y. R., Nikolskiy V., Stegailov V., in : 2020 Global Smart Industry Conference (GloSIC). : IEEE, 2020. P. 354-361.

Added: December 3, 2020

Tuning of a Matrix-Matrix Multiplication Algorithm for Several GPUs Connected by Fast Communication Links

Choi Y. R., Nikolskiy V., Stegailov V., in : Parallel Computational Technologies: 16th International Conference, PCT 2022, Dubna, Russia, March 29–31, 2022, Revised Selected Papers. : Springer, 2022. Ch. 12. P. 158-171.

Added: August 11, 2022

Algorithm for the replica redistribution in the implementation of parallel annealing method on the hybrid supercomputer architecture

Russkov A., Chulkevich R., Shchur L. / Cornell University. Series arXiv "math". 2020. No. 2006.00561.

The parallel annealing method is one of the promising approaches for large scale simulations as potentially scalable on any parallel architecture. We present an implementation of the algorithm on the hybrid program architecture combining CUDA and MPI. The problem is to keep all general-purpose graphics processing unit devices as busy as possible redistributing replicas and ...

Added: June 2, 2020

Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe

Choi Y. R., Stegailov V., in : 22nd International Conference, MMST 2022, Nizhny Novgorod, Russia, November 14–17, 2022, Revised Selected Papers. : Springer, 2022. Ch. 23. P. 281-292.

Modern types of multi-GPU servers combine up to 8 A100 GPUs connected by NVLink 3.0 links through NVSwitch. This connectivity provides unprecedented capabilities for multi-GPU algorithms. In this work, we analyze the performance of the matrix-matrix multiplication algorithm that we developed previously. Tuning principles and limits for maximum performance are discussed. Algorithm performance for much more ...

Added: December 26, 2022

A parallelized self-learning decision support system based on genetic algorithms and neural networks

Bukharov O., Bogolyubov D., System Administrator 2014 No. 9 P. 88-92

This paper describes aspects of the development of a decision support system based on neural networks and a genetic algorithm. We justify the use of general-purpose computing on graphics processing units (GPGPU) for our decision support system. An example of the successful application of CUDA to increase the computing performance of the system in question is presented. ...

Added: September 12, 2014

Integrating GPGPU computations with CPU coroutines in C++

Lebedev P. A., Journal of Physics: Conference Series 2016 Vol. 681 No. 1 P. 012048-1-012048-6

We present results on the integration of two major GPGPU APIs with a reactor-based event processing model in C++ that utilizes coroutines. Given the current lack of a universally usable GPGPU programming interface that gives optimal performance, and the debates about the style of implementing asynchronous computing in C++, we present a working implementation that allows a uniform and seamless ...

Added: February 3, 2016

The PRAND library: generating parallel random number streams for Monte Carlo simulations using GPUs

Barash L. Yu., Shchur L., CUDA Almanac 2014 No. 3 P. 17

The RNGSSELIB and PRAND libraries for the parallel generation of pseudo-random numbers in Monte Carlo simulations were developed. The RNGSSELIB library contains an implementation based on the SSE extensions of modern CPUs, and the PRAND library contains generators that use CUDA version 5.0 and later. ...

Added: March 10, 2016

Tools for the analysis and development of efficient code for parallel architectures

Monakov A. V., Platonov V. A., Avetisyan A., Proceedings of the Institute for System Programming of the RAS 2014 Vol. 26 No. 1 P. 357-374

The article proposes methods for supporting the development of efficient programs for modern parallel architectures, including hybrid systems. First, specialized profiling methods designed for programmers tasked with parallelizing existing code are proposed. The first method is loop-based profiling via source-level instrumentation done with the Coccinelle tool. The second method is memory reuse distance estimation via virtual memory ...

Added: March 22, 2017

Implementation of an XSL block cipher with MDS-matrix linear transformation on NVIDIA CUDA

Fomin D., Mathematical Aspects of Cryptography 2015 Vol. 6 No. 2 P. 99-108

In this article we consider NVIDIA GPU implementation aspects of an XSL block cipher over the finite field with MDS-matrix linear transformation. We compare obtained results with some other block ciphers. ...

Added: May 4, 2019

Reduction of dense matrices over GF(2) to row echelon form on the NVIDIA CUDA platform

Lebedev P. A., Herald of the Bauman Moscow State Technical University. Series Natural Sciences 2013 No. 1 (48) P. 50-60

An approach to implementing the Method of Four Russians for reducing dense matrices over GF(2) to row echelon form on the NVIDIA CUDA platform is described. Estimates of the algorithm running time and recommendations on choosing the algorithm parameters are given. It is shown that the developed implementation is most effective in comparison ...

Added: April 1, 2013

Development of a decision support system shell using evolutionary algorithms

Bukharov O., Mizikin A. A., Bogolyubov D., Industrial Automatic Control Systems and Controllers 2013 No. 7 P. 37-45

In this article we substantiate some advantages of the evolutionary approach to problems of decision support system development. The most popular methods of forecasting and dependence detection are considered. The advantages of using neural networks to forecast and to determine dependences between system parameters are given. Advantages of interval ...

Added: November 29, 2013

Performance of modern computing platforms in molecular dynamics simulations of protein-membrane systems

Nolde D., Krylov N., Telegin P. N. et al., Proceedings of NIISI RAS 2018 Vol. 7 No. 4 P. 157-161

The performance of the molecular dynamics software package Gromacs was measured on various hardware: desktop computers, clusters based on x86_64 processors or many integrated core processors, and heterogeneous systems with gaming graphics cards or general-purpose GPU systems. The optimal choice of hardware for molecular dynamics simulations is discussed. ...

Added: February 10, 2020

GPU-accelerated molecular dynamics: State-of-art software performance and porting from Nvidia CUDA to AMD HIP

Kondratyuk N., Nikolskiy V., Pavlov D. et al., International Journal of High Performance Computing Applications 2021 Vol. 35 No. 4 P. 312-324

Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As usual, the efficiency of such calculations is based on an interplay of software and hardware that are nowadays moving to hybrid GPU-based technologies. Several well-developed open-source MD codes focused on GPUs differ both in their data management ...

Added: June 25, 2021

Algorithm for replica redistribution in an implementation of the population annealing method on a hybrid supercomputer architecture

Russkov A., Chulkevich R., Shchur L., Computer Physics Communications 2021 Vol. 261 P. 107786

The population annealing method is a promising approach for large-scale simulations because it is potentially scalable on any parallel architecture. We present an implementation of the algorithm on a hybrid program architecture combining CUDA and MPI. The problem is to keep all general-purpose graphics processing unit devices as busy as possible by efficiently redistributing replicas. ...

Added: December 28, 2020

Using CUDA technology in training a convolutional neural network for pollen grain recognition

Zamyatina E. B., Khanzhina N. E., in : High-Performance Computing on Graphics Processors: Proceedings of the 3rd All-Russian Scientific and Practical Conference with International Participation and Elements of a Scientific School for Young People (VVGP-2016). : Perm : Perm State National Research University, 2016. P. 70-81.

In this work, we describe the problem of automated pollen recognition using images from a light microscope. Automated pollen recognition is relevant to important tasks such as honey quality control, air quality monitoring to help asthma and allergy patients, paleopalynology, and forensic palynology. We describe a solution to the problem based on machine learning and CUDA. Extracted features and ...

Added: March 12, 2017

Algorithm for Adaptive Mesh Redistribution in Lattice Boltzmann Simulations

Ziganurova L., Shchur L., Lobachevskii Journal of Mathematics 2022 Vol. 43 No. 2 P. 513-518

The Lattice Boltzmann method (LBM) is an alternative approach to solving hydrodynamic equations. Two factors make it a favored approach nowadays. First, LBM is intrinsically suited to parallel simulations due to the linear structure of the algorithm. Second, what makes LBM special for research is that it is well applicable to the simulations ...

Added: May 25, 2022

Gaze Tracking Acceleration using CUDA Technology

Gostev I. M., Sibirtseva E. A., RUDN Journal of Mathematics, Information Sciences and Physics 2014 No. 4 P. 68-84

Low-cost gaze tracking systems are in great demand due to their wide range of applications. Commonly, extra devices are needed (for instance, head-mounted cameras); however, in this investigation gaze tracking is performed in real time based on the video stream from an infrared video camera. A comparative analysis of the existing analogues was carried out and ...

Added: December 7, 2014

Parallel algorithms for reducing derivation time of distinguishing experiments for nondeterministic finite state machines

El-Fakih K., Barlas G., Ali M. et al., International Journal of Parallel, Emergent and Distributed Systems 2018 Vol. 33 No. 2 P. 197-210

Many approaches have been proposed for deriving tests from finite state machine (FSM) specifications with respect to some established coverage criteria. A fundamental core problem in FSM-based testing relates to the derivation of input sequences that can distinguish states of an FSM specification, aka distinguishing sequences. A major effort in the construction of these sequences ...

Added: October 31, 2018

Performance and portability of state-of-art molecular dynamics software on modern GPUs

Kuznetsov E., Kondratyuk N., Logunov M. et al., in : PPAM 2019: Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science. Vol. 12043: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part I. : Springer, 2020. P. 324-334.

Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As usual, the efficiency of such calculations is based on an interplay of software and hardware that are nowadays moving to hybrid GPU-based technologies. Several well-developed MD packages focused on GPUs differ both in their data management capabilities and ...

Added: October 14, 2020