Performance and portability of state-of-art molecular dynamics software on modern GPUs

Kuznetsov E., Kondratyuk N., Logunov M., Nikolskiy V., Stegailov V.

doi:10.1007/978-3-030-43229-4_28

P. 324–334.

Classical molecular dynamics (MD) calculations account for a significant part of the utilization time of high-performance computing systems. The efficiency of such calculations depends on the interplay of software and hardware, both of which are now moving to hybrid GPU-based technologies. Several well-developed MD packages focused on GPUs differ both in their data management capabilities and in performance. In this paper, we present our results for porting the CUDA backend of LAMMPS to ROCm HIP, which shows considerable benefits for AMD GPUs compared to the existing OpenCL backend. We consider the efficiency of solving the same physical models using different software and hardware combinations. We analyze the performance of the LAMMPS, HOOMD, GROMACS and OpenMM MD packages with different GPU backends on modern Nvidia Volta and AMD Vega20 GPUs.

Language: English

Keywords: CUDA OpenCL Gromacs AMD ROCm HIP HOOMD OpenMM

Publication based on the results of:

Methods for the analysis of the supercomputer efficiency, novel parallel algorithms for molecular dynamics calculations and modeling of transport processes in liquids and biomembranes (2020)

In book

PPAM 2019: Parallel Processing and Applied Mathematics. Lecture Notes in Computer Science

Vol. 12043: 13th International Conference, PPAM 2019, Bialystok, Poland, September 8–11, 2019, Revised Selected Papers, Part I. Springer, 2020.

GEMM Algorithm for Multi-GPU Platforms with Regular Uneven Data Transfer Links

Choi Y. R., Malkovsky S., Stegailov V., in: 11th Russian Supercomputing Days, RuSCDays 2025, Moscow, Russia, September 29–30, 2025, Revised Selected Papers. Springer, 2026. Ch. 3 P. 32–47.

Multi-GPU servers often exhibit uneven characteristics. For instance, the data transfer bandwidth between four NVIDIA V100 GPUs can vary due to the NVLink connecting these devices to a specific CPU in servers with IBM POWER 9 processors, which means that the communication bandwidth between other devices is comparably slower. To address this issue, the Multi-GPU ...

Added: January 3, 2026

GPU-based molecular dynamics of fluid flows: Reaching for turbulence

Pavlov D., Galigerov V., Kolotinskii D. et al., International Journal of High Performance Computing Applications 2024 Vol. 38 No. 1 P. 34–49

Fluid dynamics is a ubiquitous problem that arises in different branches of science and industry. It is usually tackled by numerically solving differential equations on a finite grid. Molecular dynamics was not a feasible tool to approach fluid dynamics until very recently due to its disproportional computational complexity. In this paper we propose a new ...

Added: July 18, 2023

Multi-GPU GEMM Algorithm Performance Analysis for Nvidia and AMD GPUs Connected by NVLink and PCIe

Choi Y. R., Stegailov V., in: 22nd International Conference, MMST 2022, Nizhny Novgorod, Russia, November 14–17, 2022, Revised Selected Papers. Springer, 2022. Ch. 23 P. 281–292.

Modern multi-GPU servers combine up to 8 A100 GPUs connected by NVLink 3.0 links through NVSwitch. This connectivity provides unprecedented capabilities for multi-GPU algorithms. In this work, we analyze the performance of a matrix-matrix multiplication algorithm that we developed previously. Tuning principles and limits for maximum performance are discussed. Algorithm performance for much more ...

Added: December 26, 2022

Tuning of a Matrix-Matrix Multiplication Algorithm for Several GPUs Connected by Fast Communication Links

Choi Y. R., Nikolskiy V., Stegailov V., in: Parallel Computational Technologies: 16th International Conference, PCT 2022, Dubna, Russia, March 29–31, 2022, Revised Selected Papers. Springer, 2022. Ch. 12 P. 158–171.

Added: August 11, 2022

Algorithm for Adaptive Mesh Redistribution in Lattice Boltzmann Simulations

Ziganurova L., Shchur L., Lobachevskii Journal of Mathematics 2022 Vol. 43 No. 2 P. 513–518

The Lattice Boltzmann method (LBM) is an alternative approach to solving hydrodynamic equations. Two factors make it a favored approach nowadays. First, LBM is intrinsically suited to parallel simulations because of the linear structure of the algorithm. Second, it is well applicable to the simulations ...

Added: May 25, 2022

GPU-accelerated molecular dynamics: State-of-art software performance and porting from Nvidia CUDA to AMD HIP

Kondratyuk N., Nikolskiy V., Pavlov D. et al., International Journal of High Performance Computing Applications 2021 Vol. 35 No. 4 P. 312–324

Classical molecular dynamics (MD) calculations represent a significant part of the utilization time of high-performance computing systems. As usual, the efficiency of such calculations is based on an interplay of software and hardware that are nowadays moving to hybrid GPU-based technologies. Several well-developed open-source MD codes focused on GPUs differ both in their data management ...

Added: June 25, 2021

Algorithm for replica redistribution in an implementation of the population annealing method on a hybrid supercomputer architecture

Russkov A., Chulkevich R., Shchur L., Computer Physics Communications 2021 Vol. 261 P. 107786

The population annealing method is a promising approach for large-scale simulations because it is potentially scalable on any parallel architecture. We present an implementation of the algorithm on a hybrid program architecture combining CUDA and MPI. The problem is to keep all general-purpose graphics processing unit devices as busy as possible by efficiently redistributing replicas. ...

Added: December 28, 2020

Matrix-Matrix Multiplication Using Multiple GPUs Connected by Nvlink

Choi Y. R., Nikolskiy V., Stegailov V., in: 2020 Global Smart Industry Conference (GloSIC). IEEE, 2020. P. 354–361.

Added: December 3, 2020

Evaluating OpenMP, OpenACC and CUDA parallel programming models for the GPU: Performance Analysis

Timofeev A., Khalilov M., in: Parallel Computational Technologies (PaVT'2020). Chelyabinsk, 2020. P. 40–51.

Modern supercomputers use GPUs as accelerators in computing nodes. GPUs allow scientific applications to greatly boost performance using fine-grained parallelism. The CUDA programming model takes advantage of the SIMT GPU architecture through low-level code. In contrast, OpenACC and OpenMP 4.5 represent a declarative model of parallel programming using compiler pragmas with support ...

Added: October 23, 2020

Algorithm for the replica redistribution in the implementation of parallel annealing method on the hybrid supercomputer architecture

Russkov A., Chulkevich R., Shchur L. / Series arXiv "math". 2020. No. 2006.00561.

The parallel annealing method is a promising approach for large-scale simulations because it is potentially scalable on any parallel architecture. We present an implementation of the algorithm on a hybrid program architecture combining CUDA and MPI. The problem is to keep all general-purpose graphics processing unit devices as busy as possible by redistributing replicas and ...

Added: June 2, 2020

Performance of modern computing platforms in molecular dynamics simulations of protein-membrane systems

Nolde D., Krylov N., Telegin P. N. et al., Trudy NIISI RAN 2018 Vol. 7 No. 4 P. 157–161

The performance of the molecular dynamics software package GROMACS was measured on various hardware: desktop computers, clusters based on x86_64 processors or many-integrated-core processors, and heterogeneous systems with gaming graphics cards or general-purpose GPUs. The optimal choice of hardware for molecular dynamics simulations is discussed. ...

Added: February 10, 2020

Performance and Scalability of Materials Science and Machine Learning Codes on the State-of-Art Hybrid Supercomputer Architecture

Kondratyuk N., Smirnov G., Agarkov A. et al., in: Supercomputing. RuSCDays 2019. Communications in Computer and Information Science. Vol. 1129: Supercomputing. RuSCDays 2019. Springer, 2019. P. 597–609.

Eight of the top 10 supercomputers on the Top500 list published in November 2018 consist of computing nodes with hybrid architectures that require special programming techniques. Five of these systems are based on Nvidia GPUs. In this paper, we consider the benchmark results of the brand new hybrid supercomputer installed in March 2019 in NRU HSE. This ...

Added: December 11, 2019

Implementation of an XSL block cipher with MDS-matrix linear transformation on NVIDIA CUDA

Fomin D., Matematicheskie Voprosy Kriptografii 2015 Vol. 6 No. 2 P. 99–108

In this article we consider NVIDIA GPU implementation aspects of an XSL block cipher over the finite field with MDS-matrix linear transformation. We compare obtained results with some other block ciphers. ...

Added: May 4, 2019

A timing attack on CUDA implementations of an AES-type block cipher

Fomin D., Matematicheskie Voprosy Kriptografii 2016 Vol. 7 No. 2 P. 121–130

A timing attack against an AES-type block cipher CUDA implementation is presented. Our experiments show that it is possible to extract a secret AES 128-bit key with complexity of 2^32 chosen plaintext encryptions. This approach may be applied to AES with other key sizes and, moreover, to any block cipher with a linear transform that is ...

Added: May 4, 2019

Parallel algorithms for reducing derivation time of distinguishing experiments for nondeterministic finite state machines

El-Fakih K., Barlas G., Ali M. et al., International Journal of Parallel, Emergent and Distributed Systems 2018 Vol. 33 No. 2 P. 197–210

Many approaches have been proposed for deriving tests from finite state machine (FSM) specifications with respect to some established coverage criteria. A fundamental core problem in FSM-based testing relates to the derivation of input sequences that can distinguish states of an FSM specification, aka distinguishing sequences. A major effort in the construction of these sequences ...

Added: October 31, 2018

Tools for the analysis and development of efficient code for parallel architectures

Monakov A. V., Platonov V. A., Avetisyan A., Proceedings of the Institute for System Programming of the RAS 2014 Vol. 26 No. 1 P. 357–374

The article proposes methods for supporting the development of efficient programs for modern parallel architectures, including hybrid systems. First, specialized profiling methods designed for programmers tasked with parallelizing existing code are proposed. The first method is loop-based profiling via source-level instrumentation done with the Coccinelle tool. The second method is memory reuse distance estimation via virtual memory ...

Added: March 22, 2017

Using CUDA technology to train a convolutional neural network for pollen grain recognition

Zamyatina E. B., Khanzhina N. E., in: High-Performance Computing on Graphics Processors: Proceedings of the III All-Russian Scientific and Practical Conference with International Participation and Elements of a Scientific School for Young People (VVGP-2016). Perm: Perm State National Research University, 2016. P. 70–81.

In this work, we describe the problem of automated pollen recognition using images from a light microscope. Automated pollen recognition is related to such important tasks as honey quality control, air quality control to help asthma and allergy patients, paleopalynology, and forensic palynology. We describe a solution to the problem based on machine learning and CUDA. Extracted features and ...

Added: March 12, 2017

The PRAND library: generation of parallel random number streams for Monte Carlo calculations using GPUs

Barash L. Yu., Shchur L., CUDA Almanac 2014 No. 3 P. 17

The RNGSSELIB and PRAND libraries for the parallel generation of pseudo-random numbers in Monte Carlo simulations were developed. The RNGSSELIB library contains an implementation based on the SSE extensions of modern CPUs, and the PRAND library contains generators that use CUDA version 5.0 and later. ...

Added: March 10, 2016