A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute


Bibliographic Details
Published in: IEEE Journal of Solid-State Circuits, vol. 54, no. 6, pp. 1789-1799
Main Authors: Valavi, Hossein; Ramadge, Peter J.; Nestler, Eric; Verma, Naveen
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2019

Summary: Large-scale matrix-vector multiplications, which dominate in deep neural networks (DNNs), are limited by data movement in modern VLSI technologies. This paper addresses data movement via an in-memory-computing accelerator that employs charge-domain mixed-signal operation to enhance compute SNR and, thus, scalability. The architecture supports an analog/binary input-activation (IA)/weight first layer (FL) and binary/binary IA/weight hidden layers (HLs), with batch normalization and input-output (IO) buffering circuitry to enable cascading, if desired, for realizing different DNN layers. The architecture is arranged as 8 × 8 = 64 in-memory-computing neuron tiles, supporting up to 512 HL neurons with 3×3×512 inputs each and 64 FL neurons with 3×3×3 inputs each, configurable via tile-level clock gating. In-memory computing is achieved using an 8T bit cell with an overlaying metal-oxide-metal (MOM) capacitor, yielding a structure with 1.8× the area of a standard 6T bit cell. Implemented in 65-nm CMOS, the design achieves HL/FL energy efficiency of 866/1.25 TOPS/W and throughput of 18876/43.2 GOPS (1498/3.43 GOPS/mm²) when implementing convolution layers, and 658/0.95 TOPS/W and 9438/10.47 GOPS (749/0.83 GOPS/mm²) when implementing convolution followed by batch-normalization layers. Several large-scale neural networks are demonstrated, showing performance on standard benchmarks (MNIST, CIFAR-10, and SVHN) equivalent to ideal digital computing.
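Illustrative sketch: the binary/binary hidden-layer compute described in the summary reduces, functionally, to a ±1 dot product (equivalently, XNOR plus popcount on bits) followed by batch normalization and re-binarization for cascading into the next layer. The Python model below is a software analogue of one 3×3×512-input HL neuron under those assumptions; it is not the authors' circuit, and all names in it are hypothetical.

import numpy as np

def binarize(x):
    """Map real values to {-1, +1} (the usual binarized-network convention)."""
    return np.where(x >= 0, 1, -1).astype(np.int32)

def hl_neuron(ia, w, gamma=1.0, beta=0.0, apply_bn=True):
    """One hidden-layer neuron: ia and w are 3x3x512 arrays over {-1, +1}.

    A product of +/-1 values is equivalent to XNOR on bits, and the sum is a
    popcount; the paper's capacitor-based tiles evaluate this accumulation in
    the charge domain instead of digitally.
    """
    pre_act = int(np.sum(ia * w))             # 4608-input dot product
    if apply_bn:
        pre_act = gamma * pre_act + beta      # affine batch normalization
    return 1 if pre_act >= 0 else -1          # re-binarize for cascading

rng = np.random.default_rng(0)
ia = binarize(rng.standard_normal((3, 3, 512)))  # binary input activations
w = binarize(rng.standard_normal((3, 3, 512)))   # binary stored weights
print(hl_neuron(ia, w))                          # prints +1 or -1

In hardware, the 4608-way summation happens by charge sharing across the per-bit-cell MOM capacitors, which is what gives the reported compute SNR and energy-efficiency advantage over digital accumulation.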
ISSN: 0018-9200, 1558-173X
DOI: 10.1109/JSSC.2019.2899730