SU3_Bench on a Programmable Integrated Unified Memory Architecture (PIUMA) and How that Differs from Standard NUMA CPUs

Bibliographic Details
Published in: High Performance Computing, pp. 65-84
Main Authors: Tithi, Jesmin Jahan; Checconi, Fabio; Doerfler, Douglas; Petrini, Fabrizio
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing
Series: Lecture Notes in Computer Science

Summary: SU3_Bench explores performance portability across multiple programming models using a simple but nontrivial mathematical kernel. The kernel is derived from the Lattice Quantum Chromodynamics (LQCD) code used in applications such as hadron physics and should therefore be of interest to the scientific community. SU3_Bench has a regular compute and data-access pattern, and on most traditional CPU- and GPU-based systems its performance is determined mainly by the achievable memory bandwidth. However, this paper shows that on the new Intel Programmable Integrated Unified Memory Architecture (PIUMA), which is designed for sparse workloads and has scalar cores with a balanced flops-to-byte ratio, SU3_Bench's performance is determined by the total number of instructions that can be executed per cycle (pipeline throughput) rather than by the usual bandwidth or flops limits. We present the performance analysis, porting, and optimization of SU3_Bench on the PIUMA architecture and discuss how they differ from standard NUMA CPUs (e.g., Xeon requires NUMA optimizations, whereas on PIUMA they are not necessary). We show iso-bandwidth and iso-power comparisons of SU3_Bench on PIUMA vs. Xeon, as well as performance-efficiency comparisons of SU3_Bench on PIUMA, Xeon, GPUs, and FPGAs based on pre-existing data. The lessons learned are generalizable to other similar kernels.
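
The operation at the heart of the benchmark is a complex 3x3 (SU(3)) matrix-matrix multiplication streamed over lattice sites. The C sketch below illustrates that operation and why its low arithmetic intensity makes it bandwidth-bound on conventional CPUs and GPUs; the data layout, function names (su3_matmul, su3_bench_kernel), and four-links-per-site loop are illustrative assumptions, not the benchmark's actual source.

```c
/* Minimal sketch of the kernel SU3_Bench times: complex 3x3 (SU(3))
 * matrix-matrix multiplications streamed over lattice sites and link
 * directions.  Layout and names are illustrative assumptions only. */
#include <complex.h>
#include <stddef.h>

typedef struct { float complex e[3][3]; } su3_matrix;

/* c = a * b for one pair of 3x3 complex matrices. */
static void su3_matmul(const su3_matrix *a, const su3_matrix *b, su3_matrix *c)
{
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            float complex sum = 0.0f;
            for (int k = 0; k < 3; ++k)
                sum += a->e[i][k] * b->e[k][j];
            c->e[i][j] = sum;
        }
}

/* Streaming loop over sites, four link directions per site.  Each 3x3
 * complex product costs 27 complex multiply-adds (roughly 216 flops)
 * while reading two and writing one 3x3 complex matrix (about 216 bytes
 * in single precision), i.e. roughly 1 flop per byte.  That is far below
 * the machine balance of typical CPUs and GPUs, which is why the loop is
 * normally memory-bandwidth-bound on those systems. */
void su3_bench_kernel(size_t nsites,
                      const su3_matrix *links,   /* nsites * 4 matrices */
                      const su3_matrix *fields,  /* nsites * 4 matrices */
                      su3_matrix *out)           /* nsites * 4 matrices */
{
    for (size_t s = 0; s < nsites; ++s)
        for (int d = 0; d < 4; ++d)
            su3_matmul(&links[4 * s + d], &fields[4 * s + d], &out[4 * s + d]);
}
```

On an architecture such as PIUMA, whose scalar cores have a flops-to-byte ratio close to this kernel's arithmetic intensity, the bound shifts from memory bandwidth to how many instructions the pipeline can retire per cycle, which is the behavior the paper analyzes.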
ISBN: 3031073118; 9783031073113
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-031-07312-0_4