Multi-GPU Parallelization of the NAS Multi-Zone Parallel Benchmarks

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, p. 1
Main Authors: Gonzalez Tallada, Marc; Morancho, Enric
Format: Journal Article
Language: English
Published: IEEE, 07.08.2020
ISSN: 1045-9219
DOI: 10.1109/TPDS.2020.3015148

Summary: GPU-based computing systems have become a widely accepted solution for the high-performance-computing (HPC) domain. GPUs have shown highly competitive performance-per-watt ratios and can exploit an astonishing level of parallelism. However, exploiting the peak performance of such devices is a challenge, mainly due to the combination of two essential aspects of multi-GPU execution. On one hand, the workload should be distributed evenly among the GPUs. On the other hand, communications between GPU devices are costly and should be minimized. Therefore, a trade-off between work-distribution schemes and communication overheads will condition the overall performance of parallel applications run on multi-GPU systems. In this paper we present a multi-GPU implementation of the NAS Multi-Zone Parallel Benchmarks (whose execution alternates communication and computational phases). We propose several work-distribution strategies that try to evenly distribute the workload among the GPUs. Our evaluations show that performance is highly sensitive to this distribution strategy, as the communication phases of the applications are heavily affected by the work-distribution schemes applied in the computational phases. In particular, we consider Static, Dynamic and Guided schedulers to find a trade-off between both phases and maximize the overall performance. In addition, we compare those schedulers with an optimal scheduler computed offline using IBM CPLEX. On an evaluation environment composed of 2 x IBM Power9 8335-GTH and 4 x NVIDIA V100 (Volta) GPUs, our multi-GPU parallelization outperforms single-GPU execution by 1.48x to 1.86x (2 GPUs) and by 1.75x to 3.54x (4 GPUs).
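
The abstract describes distributing the zones of the NAS Multi-Zone benchmarks across GPUs so that per-GPU workloads stay balanced. As a minimal illustration of that load-balancing problem (not the paper's actual Static, Dynamic or Guided schedulers), the C++ sketch below assigns zones to GPUs with a greedy longest-processing-time heuristic; the zone sizes and GPU count are hypothetical examples.

```cpp
// Illustrative sketch only: greedy longest-processing-time (LPT) assignment
// of zones to GPUs, approximating a static work-distribution scheme.
// Zone sizes and GPU count below are hypothetical, not taken from the paper.
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int num_gpus = 4;                                          // e.g. 4x NVIDIA V100
    std::vector<long> zone_size = {120, 95, 80, 64, 60, 48, 32, 16}; // hypothetical per-zone workloads

    // Sort zone indices by decreasing workload.
    std::vector<int> order(zone_size.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return zone_size[a] > zone_size[b]; });

    std::vector<long> load(num_gpus, 0);
    std::vector<std::vector<int>> assignment(num_gpus);

    // Greedy LPT: give each zone to the currently least-loaded GPU.
    for (int z : order) {
        int g = static_cast<int>(std::min_element(load.begin(), load.end()) - load.begin());
        assignment[g].push_back(z);
        load[g] += zone_size[z];
    }

    for (int g = 0; g < num_gpus; ++g) {
        std::printf("GPU %d (load %ld): zones", g, load[g]);
        for (int z : assignment[g]) std::printf(" %d", z);
        std::printf("\n");
    }
    return 0;
}
```

A dynamic or guided scheme would instead hand out zones (or chunks of zones) at run time, trading tighter load balance against extra coordination, which mirrors the computation/communication trade-off the paper evaluates.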