Computing large 2D convolutions on GPU efficiently with the im2tensor algorithm
Published in: Journal of Real-Time Image Processing, Vol. 19, no. 6, pp. 1035–1047
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.12.2022
ISSN: 1861-8200, 1861-8219
DOI: 10.1007/s11554-022-01240-0
Summary: Attaining the best possible throughput when computing convolutions is a challenge for signal and image processing systems, be they HPC (High-Performance Computing) machines or embedded real-time targets. This importance is highlighted by the numerous methods and implementations available, often optimized for particular settings: small batched kernels or very large kernels, for example. In the meantime, GPUs (Graphics Processing Units) have become a first-class architecture for real-time and embedded processing. The power offered by these chips stems from their parallel nature, and this advantage has been exploited for convolutions in several libraries. Even more recently, the introduction of tensor cores on NVIDIA GPUs has raised the limits of attainable FLOPS (Floating-Point Operations per Second). To reach that performance, GPU applications must use GEMMs (GEneral Matrix Multiplications), which tensor cores accelerate. We therefore developed an efficient GEMM-based 2D convolution algorithm for a general setting. On relatively large kernels (30–50 pixels wide), im2tensor is, to the best of our knowledge, the fastest method for computing 2D convolutions. We provide a detailed performance analysis for different scenarios: small (1024 × 1024) and large (4096 × 4096) images, with convolution kernels ranging from 1 to 60 pixels wide, on two GPU cards: the Jetson AGX Xavier (embedded) and the Titan V (server-class). Moreover, the accuracy of im2tensor surpasses non-GEMM-based methods, thanks to the higher-precision registers used by tensor cores for intermediate results.
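The GEMM-based family of methods the abstract refers to builds on the classic im2col transformation: every kernel-sized patch of the image is unfolded into one row of a matrix, so the whole convolution collapses into a single matrix product. The sketch below is a minimal pure-Python illustration of that general idea under our own naming; it is not the paper's im2tensor algorithm or its tensor-core implementation.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of `image` (valid mode) into one row,
    so the convolution becomes a single matrix product (a GEMM)."""
    H, W = len(image), len(image[0])
    out_h, out_w = H - kh + 1, W - kw + 1
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h)
        for j in range(out_w)
    ]

def conv2d_gemm(image, kernel):
    """2D cross-correlation (valid mode) as im2col followed by a
    matrix-vector product; with batched kernels this becomes a full
    GEMM, which is the operation tensor cores accelerate."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    flat_k = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    cols = im2col(image, kh, kw)
    # Each output pixel is the dot product of one unfolded patch row
    # with the flattened kernel.
    flat_out = [sum(c * k for c, k in zip(row, flat_k)) for row in cols]
    return [flat_out[i * out_w:(i + 1) * out_w] for i in range(out_h)]
```

In a GPU library the final product would be dispatched to a hardware GEMM (e.g. via cuBLAS), trading the memory cost of the unfolded matrix for the throughput of tensor cores.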