Evaluating OpenMP, OpenACC and CUDA parallel programming models for the GPU: Performance Analysis
Modern supercomputers use GPUs as accelerators in their computing nodes. GPUs allow scientific applications to achieve large performance gains through fine-grained parallelism. The CUDA programming model is oriented toward exploiting the SIMT GPU architecture through low-level code. In contrast, OpenACC and OpenMP 4.5 offer a declarative model of parallel programming based on compiler pragmas with support for GPU offloading. In this paper, the efficiency of matrix multiplication under these programming models is examined. A comparative performance analysis of naive and hand-tuned matrix multiplication on Nvidia Tesla V100 and MX940 GPUs and on modern CPUs is carried out. An analysis of vendor-optimized BLAS libraries is also presented.