A Multiple-Precision Multiply and Accumulation Design with Multiply-Add Merged Strategy for AI Accelerating

Bibliographic Details
Published in: 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 229-234
Main Authors: Zhang, Song; Gu, Jiangyuan; Yin, Shouyi; Liu, Leibo; Wei, Shaojun
Format: Conference Proceeding
Language: English
Published: ACM, 18.01.2021
Summary: Multiply-and-accumulate (MAC) operations are fundamental to domain-specific accelerators for AI applications ranging from filtering to convolutional neural networks (CNNs). This paper proposes an energy-efficient MAC design that supports a wide range of bit-widths for both signed and unsigned operands. First, building on the classic Booth algorithm, we propose a multiply-add merged strategy. The design not only supports both signed and unsigned operations but also eliminates the delay, area and power overheads of the adder in traditional MAC units. Then, using this fusion strategy, a multiply-add merged design method for flexible bit-width adjustment is proposed. In addition, treating the addend as a partial product makes the operation easy to pipeline and keeps the stages balanced. The comprehensive improvement in delay, area and power can meet the varied requirements of different applications and hardware designs. Using the proposed method, we synthesized MAC units for several operation modes with an SMIC 40-nm library. Comparison with other MAC designs shows that the proposed method achieves up to 24.1% and 28.2% improvement in power-delay product (PDP) and area-delay product (ADP), respectively, for fixed bit-width MAC designs, and 28.43% to 38.16% for bit-width adjustable ones. When pipelined, the design reduces latency by more than 13%. The improvement in power and area is up to 8.0% and 8.1%, respectively.
ISSN: 2153-697X
DOI: 10.1145/3394885.3431531
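
The summary's central idea, treating the addend as one more partial product of a Booth multiplier so that no separate adder follows the multiplier, can be illustrated arithmetically. The sketch below is a behavioural model only, assuming textbook radix-4 Booth recoding; it is not the paper's circuit, and the function names `booth_radix4_digits` and `merged_mac`, the default 8-bit width, and the way unsigned operands are handled (zero-extending the multiplier) are all assumptions made for illustration.

```python
def booth_radix4_digits(y: int, bits: int, signed: bool) -> list[int]:
    """Recode the multiplier y into radix-4 Booth digits in {-2, -1, 0, 1, 2}."""
    # Assumption: unsigned operands are zero-extended by two bits so the
    # most significant Booth digit can never be negative.
    width = bits if signed else bits + 2
    if width % 2:
        width += 1  # radix-4 recoding consumes bits in pairs
    # Two's-complement bit pattern of y, sign-extended to `width` bits.
    pattern = y & ((1 << width) - 1)
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, width, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def merged_mac(x: int, y: int, addend: int, bits: int = 8, signed: bool = True) -> int:
    """Compute addend + x * y by summing one list of partial products."""
    digits = booth_radix4_digits(y, bits, signed)
    # One partial product per Booth digit, weighted by 4**i.
    rows = [(d * x) << (2 * i) for i, d in enumerate(digits)]
    # The addend joins the same list instead of feeding a separate adder.
    rows.append(addend)
    # In hardware this sum would be a compressor tree plus one final adder;
    # here it is modelled with plain integer addition.
    return sum(rows)


if __name__ == "__main__":
    print(merged_mac(-7, 5, 100))                    # signed operands
    print(merged_mac(200, 255, 1000, signed=False))  # unsigned operands
```

Because the addend is simply another row in the partial-product list, the model has a single reduction step for both multiplication and accumulation, which is the property the summary credits for removing the adder overhead and easing pipelining.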