Exact Dot Product Accumulate Operators for 8-bit Floating-Point Deep Learning
| Published in | 2023 26th Euromicro Conference on Digital System Design (DSD), pp. 642-649 |
|---|---|
| Main Authors | |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 06.09.2023 |
| Summary | Low bit-width floating-point formats have emerged as the main alternative to 8-bit integers for quantized deep learning applications. We propose an architecture for exact dot product accumulate operators and compare its implementation costs for different 8-bit floating-point formats: FP8 with five exponent bits and two fraction bits (E5M2), FP8 with four exponent bits and three fraction bits (E4M3), and Posit8 formats with different exponent sizes. The front-ends of these exact dot product accumulate operators take 8-bit multiplicands, expand their full-precision products to fixed-point, and sum the terms into wide accumulators. The back-ends of these operators round the wide accumulators' contents down, first to FP32 and then to one of the 8-bit floating-point formats. We synthesize the proposed 8-bit floating-point exact dot product accumulate operators targeting the TSMC 16FFC node and compare their area and power to a baseline of operators with FP16 and INT8 multiplicands. |
| ISSN | 2771-2508 |
| DOI | 10.1109/DSD60849.2023.00093 |
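
The summary above describes a two-stage datapath: a front-end that forms each FP8 product exactly, aligns it to fixed-point, and adds it into a wide accumulator, and a back-end that rounds the accumulated sum once, first to FP32 and then to an 8-bit format. The sketch below is a minimal behavioral model of that idea in Python for the E4M3 format only; it is an illustration under my own assumptions (finite operands, no NaN or saturation handling, the wide fixed-point accumulator modeled with exact rational arithmetic, and the final FP32 rounding approximated by float64-to-float32 packing), not the paper's hardware architecture.

```python
from fractions import Fraction
import struct

EXP_BITS, FRAC_BITS, BIAS = 4, 3, 7          # FP8 E4M3 parameters

def decode_e4m3(byte: int) -> Fraction:
    """Decode an 8-bit E4M3 encoding to its exact real value (finite values only)."""
    sign = -1 if byte & 0x80 else 1
    exp  = (byte >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    frac = byte & ((1 << FRAC_BITS) - 1)
    if exp == 0:                              # subnormal: 0.frac * 2**(1 - BIAS)
        return sign * Fraction(frac, 1 << (BIAS - 1 + FRAC_BITS))
    mant = (1 << FRAC_BITS) | frac            # normal: 1.frac * 2**(exp - BIAS)
    return sign * Fraction(mant, 1 << FRAC_BITS) * Fraction(2) ** (exp - BIAS)

def exact_dot_fp32(a_bytes, b_bytes) -> float:
    """Front-end: add every exact product into a wide accumulator (modeled here
    as an exact rational) with no intermediate rounding. Back-end: round the
    final sum once to FP32 (approximated via float64 -> float32 packing)."""
    acc = Fraction(0)                         # stands in for the wide fixed-point accumulator
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_e4m3(a) * decode_e4m3(b)    # each FP8 product is exact
    return struct.unpack('f', struct.pack('f', float(acc)))[0]

# Usage: 0x38 encodes 1.0 and 0x30 encodes 0.5 in E4M3.
print(exact_dot_fp32([0x38, 0x30], [0x38, 0x38]))   # 1.0*1.0 + 0.5*1.0 = 1.5
```

Because every product is accumulated exactly and rounding happens only once at the end, the result is independent of summation order, unlike an FP16 or FP32 fused multiply-add chain that rounds after each accumulation step. Supporting E5M2 or a Posit8 variant would, in this sketch, only change the decode step and the implied accumulator width.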