Computing large 2D convolutions on GPU efficiently with the im2tensor algorithm
Published in: Journal of Real-Time Image Processing, Vol. 19, no. 6, pp. 1035–1047
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.12.2022
ISSN: 1861-8200, 1861-8219
DOI: 10.1007/s11554-022-01240-0
Summary: Attaining the best possible throughput when computing convolutions is a challenge for signal and image processing systems, be they HPC (High-Performance Computing) machines or embedded real-time targets. This importance is highlighted by the numerous methods and implementations available, often optimized for particular settings: small batched kernels or very large kernels, for example. In the meantime, GPUs (Graphics Processing Units) have become a first-class architecture for real-time and embedded processing. The power offered by these chips stems from their parallel nature, and this advantage has been exploited for convolutions in several libraries. Even more recently, the introduction of tensor cores on NVIDIA GPUs has raised the limits of attainable FLOPS (Floating-Point Operations per Second). To reach that performance, GPU applications must use GEMMs (GEneral Matrix Multiplications), which tensor cores accelerate. We therefore developed an efficient GEMM-based 2D convolution algorithm for a general setting. On relatively large kernels (30–50 pixels wide), im2tensor is, to the best of our knowledge, the fastest method for computing 2D convolutions. We provide a detailed performance analysis for different scenarios: small (1024 × 1024) and large (4096 × 4096) images, with convolution kernels ranging from 1 to 60 pixels wide, on two GPU cards: the Jetson AGX Xavier (embedded) and the Titan V (server-class). Moreover, the accuracy of im2tensor surpasses non-GEMM-based methods, thanks to the higher-precision registers used by tensor cores for intermediate results.
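The GEMM-based family of methods the abstract refers to builds on the classic im2col transformation: every kernel-sized patch of the image is unfolded into one row of a matrix, so the whole convolution collapses into a single matrix product. The sketch below is a minimal pure-Python illustration of that general idea under our own naming; it is not the paper's im2tensor algorithm or its tensor-core implementation.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of `image` (valid mode) into one row,
    so the convolution becomes a single matrix product (a GEMM)."""
    H, W = len(image), len(image[0])
    out_h, out_w = H - kh + 1, W - kw + 1
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h)
        for j in range(out_w)
    ]

def conv2d_gemm(image, kernel):
    """2D cross-correlation (valid mode) as im2col followed by a
    matrix-vector product; with batched kernels this becomes a full
    GEMM, which is the operation tensor cores accelerate."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    cols = im2col(image, kh, kw)
    # Each output pixel is the dot product of one unfolded patch row
    # with the flattened kernel.
    flat_out = [sum(c * k for c, k in zip(row, flat_k)) for row in cols]
    return [flat_out[i * out_w:(i + 1) * out_w] for i in range(out_h)]
```

In a GPU library the final product would be dispatched to a hardware GEMM (e.g. via cuBLAS), trading the memory cost of the unfolded matrix for the throughput of tensor cores.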