Exact Dot Product Accumulate Operators for 8-bit Floating-Point Deep Learning

Bibliographic Details
Published in: 2023 26th Euromicro Conference on Digital System Design (DSD), pp. 642-649
Main Authors: Desrentes, Oregane; de Dinechin, Benoit Dupont; Le Maire, Julien
Format: Conference Proceeding
Language: English
Published: IEEE, 06.09.2023
Summary: Low bit-width floating-point formats appear as the main alternative to 8-bit integers for quantized deep learning applications. We propose an architecture for exact dot product accumulate operators and compare its implementation costs for different 8-bit floating-point formats: FP8 with five exponent bits and two fraction bits (E5M2), FP8 with four exponent bits and three fraction bits (E4M3), and Posit8 formats with different exponent sizes. The front-ends of these exact dot product accumulate operators take 8-bit multiplicands, expand their full-precision products to fixed-point, and sum the terms into wide accumulators. The back-ends of these operators round down the wide accumulators' contents first to FP32 and then to one of the 8-bit floating-point formats. We synthesize the proposed 8-bit floating-point exact dot product accumulate operators targeting the TSMC 16FFC node and compare their area and power to a baseline of operators with FP16 and INT8 multiplicands.
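
For illustration, the front-end/back-end scheme described in the summary can be sketched in software. The Python sketch below is an assumption-laden toy model, not the paper's hardware design: decode_e4m3, PROD_SCALE, and the final rounding step are illustrative choices for one FP8 format (E4M3, bias 7), a Python integer stands in for the hardware wide accumulator, and NaN handling as well as the direct fixed-point-to-FP32 rounding of a real back-end are simplified away.

import struct

E4M3_BIAS = 7       # exponent bias assumed for the E4M3 format
FRAC_BITS = 3       # fraction bits of the E4M3 format
# The smallest E4M3 magnitude is 2**-(E4M3_BIAS - 1 + FRAC_BITS) = 2**-9, so every
# product of two E4M3 values is an integer multiple of 2**-18.
PROD_SCALE = 2 * (E4M3_BIAS - 1 + FRAC_BITS)

def decode_e4m3(byte):
    """Decode an 8-bit E4M3 encoding to an exact integer scaled by 2**9.
    (NaN encodings are ignored in this sketch.)"""
    sign = -1 if (byte >> 7) & 1 else 1
    exp = (byte >> 3) & 0xF
    frac = byte & 0x7
    if exp == 0:                              # subnormal: 2**-6 * frac/8
        return sign * frac
    return sign * ((8 + frac) << (exp - 1))   # normal: 2**(exp-7) * (1 + frac/8)

def exact_dot_fp8(a_bytes, b_bytes):
    """Front-end: exact fixed-point accumulation. Back-end: round once to FP32."""
    acc = 0                                   # unbounded int stands in for a wide accumulator
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_e4m3(a) * decode_e4m3(b)    # every product and sum is exact
    result = acc / (1 << PROD_SCALE)          # rescale (FP64 here; hardware rounds directly)
    return struct.unpack('f', struct.pack('f', result))[0]   # round to FP32

# 1.0 encodes as 0x38 and 1.5 as 0x3C, so the dot product below is exactly 3.0.
print(exact_dot_fp8([0x38, 0x3C], [0x3C, 0x38]))

Because Python integers are unbounded, the accumulation here is exact for any vector length; a hardware wide accumulator would achieve the same effect by being sized for the worst-case sum of full-precision products before the single rounding in the back-end.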
ISSN: 2771-2508
DOI: 10.1109/DSD60849.2023.00093