A 64-Tile 2.4-Mb In-Memory-Computing CNN Accelerator Employing Charge-Domain Compute
| Published in | IEEE Journal of Solid-State Circuits, Vol. 54, No. 6, pp. 1789–1799 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2019 |
Summary: Large-scale matrix-vector multiplications, which dominate in deep neural networks (DNNs), are limited by data movement in modern VLSI technologies. This paper addresses data movement via an in-memory-computing accelerator that employs charge-domain mixed-signal operation for enhancing compute SNR and, thus, scalability. The architecture supports analog/binary input-activation (IA)/weight first layer (FL) and binary/binary IA/weight hidden layers (HLs), with batch normalization and input-output (IO) buffering circuitry to enable cascading, if desired, for realizing different DNN layers. The architecture is arranged as $8\times 8 = 64$ in-memory-computing neuron tiles, supporting up to 512, $3\times 3\times 512$-input HL neurons and 64, $3\times 3\times 3$-input FL neurons, configurable via tile-level clock gating. In-memory computing is achieved using an 8T bit cell with an overlaying metal-oxide-metal (MOM) capacitor, yielding a structure with $1.8\times$ the area of a standard 6T bit cell. Implemented in 65-nm CMOS, the design achieves HL/FL energy efficiency of 866/1.25 TOPS/W and throughput of 18,876/43.2 GOPS (1498/3.43 GOPS/mm²) when implementing convolution layers, and 658/0.95 TOPS/W and 9438/10.47 GOPS (749/0.83 GOPS/mm²) when implementing convolution followed by batch-normalization layers. Several large-scale neural networks are demonstrated, showing performance on standard benchmarks (MNIST, CIFAR-10, and SVHN) equivalent to ideal digital computing.
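To make the hidden-layer compute model concrete, the sketch below models a single binary-input/binary-weight neuron in NumPy: the XNOR of each activation/weight pair stands in for a bit cell's output, and the capacitive charge sharing performed by the 8T cell with its MOM capacitor is idealized as a simple average of those outputs. The function name, the threshold parameter (into which batch normalization could be folded), and the ideal-capacitor behavior are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def binary_neuron_charge_domain(ia_bits, w_bits, threshold=0.5):
    """Idealized model of one binary hidden-layer (HL) neuron.

    ia_bits, w_bits: arrays of {0, 1} values (binary input activations and
    weights, e.g. a flattened 3x3x512 receptive field).
    threshold: decision level; batch normalization can be folded into it.

    In the charge-domain scheme, each bit cell drives its local capacitor
    according to XNOR(ia, w); shorting the capacitors together averages the
    per-bit results, so the analog output is proportional to the fraction of
    matching bits (ideal, mismatch-free capacitors assumed here).
    """
    xnor = 1 - np.bitwise_xor(ia_bits, w_bits)   # 1 where activation == weight
    v_out = xnor.mean()                          # charge sharing ~ averaging
    return 1 if v_out >= threshold else 0        # binarized output activation

# Hypothetical usage: one 3x3x512-input HL neuron with random binary data.
rng = np.random.default_rng(0)
ia = rng.integers(0, 2, size=3 * 3 * 512)
w = rng.integers(0, 2, size=3 * 3 * 512)
print(binary_neuron_charge_domain(ia, w))
```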
ISSN: 0018-9200, 1558-173X
DOI: 10.1109/JSSC.2019.2899730