Exact Dot Product Accumulate Operators for 8-bit Floating-Point Deep Learning

Bibliographic Details
Published in: 2023 26th Euromicro Conference on Digital System Design (DSD), pp. 642-649
Main Authors: Desrentes, Oregane; de Dinechin, Benoit Dupont; Le Maire, Julien
Format: Conference Proceeding
Language: English
Published: IEEE, 06.09.2023
Summary: Low bit-width floating-point formats appear as the main alternative to 8-bit integers for quantized deep learning applications. We propose an architecture for exact dot product accumulate operators and compare its implementation costs for different 8-bit floating-point formats: FP8 with five exponent bits and two fraction bits (E5M2), FP8 with four exponent bits and three fraction bits (E4M3), and Posit8 formats with different exponent sizes. The front-ends of these exact dot product accumulate operators take 8-bit multiplicands, expand their full-precision products to fixed-point, and sum the terms into wide accumulators. The back-ends of these operators round down the wide accumulators' contents first to FP32 and then to one of the 8-bit floating-point formats. We synthesize the proposed 8-bit floating-point exact dot product accumulate operators targeting the TSMC 16FFC node and compare their area and power to a baseline of operators with FP16 and INT8 multiplicands.
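
For illustration, the front-end/back-end scheme described in the summary can be sketched in software. The Python sketch below is an assumption-laden toy model, not the paper's hardware design: decode_e4m3, PROD_SCALE, and the final rounding step are illustrative choices for one FP8 format (E4M3, bias 7), a Python integer stands in for the hardware wide accumulator, and NaN handling as well as the direct fixed-point-to-FP32 rounding of a real back-end are simplified away.

import struct

E4M3_BIAS = 7       # exponent bias assumed for the E4M3 format
FRAC_BITS = 3       # fraction bits of the E4M3 format
# The smallest E4M3 magnitude is 2**-(E4M3_BIAS - 1 + FRAC_BITS) = 2**-9, so every
# product of two E4M3 values is an integer multiple of 2**-18.
PROD_SCALE = 2 * (E4M3_BIAS - 1 + FRAC_BITS)

def decode_e4m3(byte):
    """Decode an 8-bit E4M3 encoding to an exact integer scaled by 2**9.
    (NaN encodings are ignored in this sketch.)"""
    sign = -1 if (byte >> 7) & 1 else 1
    exp = (byte >> 3) & 0xF
    frac = byte & 0x7
    if exp == 0:                              # subnormal: 2**-6 * frac/8
        return sign * frac
    return sign * ((8 + frac) << (exp - 1))   # normal: 2**(exp-7) * (1 + frac/8)

def exact_dot_fp8(a_bytes, b_bytes):
    """Front-end: exact fixed-point accumulation. Back-end: round once to FP32."""
    acc = 0                                   # unbounded int stands in for a wide accumulator
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_e4m3(a) * decode_e4m3(b)    # every product and sum is exact
    result = acc / (1 << PROD_SCALE)          # rescale (FP64 here; hardware rounds directly)
    return struct.unpack('f', struct.pack('f', result))[0]   # round to FP32

# 1.0 encodes as 0x38 and 1.5 as 0x3C, so the dot product below is exactly 3.0.
print(exact_dot_fp8([0x38, 0x3C], [0x3C, 0x38]))

Because Python integers are unbounded, the accumulation here is exact for any vector length; a hardware wide accumulator would achieve the same effect by being sized for the worst-case sum of full-precision products before the single rounding in the back-end.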
ISSN: 2771-2508
DOI: 10.1109/DSD60849.2023.00093