A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference

Bibliographic Details
Published in: 2018 IEEE Symposium on VLSI Circuits, pp. 35-36
Main Authors: Fleischer, Bruce, Shukla, Sunil, Ziegler, Matthew, Silberman, Joel, Jinwook Oh, Srinivasan, Vijayalakshmi, Jungwook Choi, Mueller, Silvia, Agrawal, Ankur, Babinsky, Tina, Nianzheng Cao, Chia-Yu Chen, Chuang, Pierce, Fox, Thomas, Gristede, George, Guillorn, Michael, Haynie, Howard, Klaiber, Michael, Dongsoo Lee, Shih-Hsien Lo, Maier, Gary, Scheuermann, Michael, Venkataramani, Swagath, Vezyrtzis, Christos, Naigang Wang, Fanchieh Yee, Ching Zhou, Pong-Fei Lu, Curran, Brian, Leland Chang, Gopalakrishnan, Kailash
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2018
DOI: 10.1109/VLSIC.2018.8502276

Summary: A multi-TOPS AI core is presented for acceleration of deep learning training and inference in systems from edge devices to data centers. With a programmable architecture and custom ISA, this engine achieves >90% sustained utilization across the range of neural network topologies by employing a dataflow architecture and an on-chip scratchpad hierarchy. Compute precision is optimized at 16b floating point (fp16) for high model accuracy in training and inference as well as 1b/2b (binary/ternary) integer for aggressive inference performance. At 1.5 GHz, the AI core prototype achieves 1.5 TFLOPS fp16, 12 TOPS ternary, or 24 TOPS binary peak performance in 14nm CMOS.
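As a quick sanity check on the quoted figures (this arithmetic is not taken from the paper itself), peak throughput is simply clock frequency times operations completed per cycle. The short Python sketch below inverts the abstract's numbers into per-cycle operation counts; the variable names are illustrative only.

    # Back-of-the-envelope check of the peak-performance figures in the
    # abstract: peak ops/s = clock frequency (Hz) * operations per cycle.

    CLOCK_HZ = 1.5e9  # 1.5 GHz prototype clock

    # Peak rates quoted in the abstract.
    peaks = {
        "fp16":    1.5e12,  # 1.5 TFLOPS
        "ternary": 12e12,   # 12 TOPS
        "binary":  24e12,   # 24 TOPS
    }

    for precision, ops_per_sec in peaks.items():
        ops_per_cycle = ops_per_sec / CLOCK_HZ
        ratio = ops_per_sec / peaks["fp16"]
        print(f"{precision:>7}: {ops_per_cycle:6.0f} ops/cycle "
              f"({ratio:.0f}x fp16 throughput)")

This works out to 1000 fp16 ops/cycle, 8000 ternary ops/cycle, and 16000 binary ops/cycle; the resulting 8x and 16x ratios over fp16 are consistent with the abstract's framing of 2b ternary and 1b binary modes as aggressive inference options on the same core.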