Tightly Coupled Machine Learning Coprocessor Architecture With Analog In-Memory Computing for Instruction-Level Acceleration
| Published in | *IEEE Journal on Emerging and Selected Topics in Circuits and Systems*, Vol. 9, No. 3, pp. 544–561 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | Piscataway: IEEE, 01.09.2019 (The Institute of Electrical and Electronics Engineers, Inc.) |
| Summary | Low-profile mobile computing platforms often need to execute a variety of machine learning algorithms with limited memory and processing power. To address this challenge, this work presents Coara, an instruction-level processor acceleration architecture that efficiently integrates an approximate analog in-memory computing coprocessor for accelerating general machine learning applications by exploiting an analog register file cache. The instruction-level acceleration offers true programmability beyond the degree of freedom provided by reconfigurable machine learning accelerators, and it allows the code generation stage of a compiler back-end to control coprocessor execution and data flow, so applications do not need high-level machine learning software frameworks with a large memory footprint. Conventional analog and mixed-signal accelerators suffer from the overhead of frequent data conversion between analog and digital signals. To solve this classical problem, Coara uses an analog register file cache that interfaces the analog in-memory computing coprocessor with the digital register file of the processor core. As a result, more than 90% of the ADC/DAC data conversion overhead can be eliminated by temporarily storing the result of analog computation in a switched-capacitor analog memory cell until a data dependency occurs. A cycle-accurate Verilog RTL model of the proposed architecture is evaluated with 45 nm CMOS technology parameters while executing machine learning benchmark codes generated by a customized cross-compiler, without relying on machine learning software frameworks. (A behavioral sketch of the conversion-avoidance idea follows this record.) |
| ISSN | 2156-3357, 2156-3365 |
| DOI | 10.1109/JETCAS.2019.2934929 |
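
The mechanism highlighted in the summary — holding intermediate analog results in switched-capacitor cells and converting only when the digital core actually consumes a value — can be illustrated with a small behavioral model. The Python sketch below is a hypothetical illustration under simple scoreboard assumptions, not the paper's RTL or instruction set; all names (`AnalogRegFileCache`, `analog_write`, `digital_read`, the conversion counters) are invented for this example.

```python
# Behavioral sketch (not Coara's RTL): models how an analog register file
# cache can defer ADC/DAC conversions until a true data dependency with
# the digital register file occurs. All identifiers are hypothetical.

class AnalogRegFileCache:
    """Tracks values that live only as analog charge, with no digital copy."""

    def __init__(self):
        self.analog_valid = {}   # reg index -> simulated analog value
        self.adc_count = 0       # conversions forced by digital reads
        self.dac_count = 0       # conversions forced by digital writes

    def digital_write(self, reg, value):
        # An operand arriving from the digital core costs one DAC conversion.
        self.dac_count += 1
        self.analog_valid[reg] = value

    def analog_write(self, reg, value):
        # The result of an analog operation stays in a switched-capacitor
        # cell; no conversion is performed at this point.
        self.analog_valid[reg] = value

    def analog_read(self, reg):
        # Analog-to-analog forwarding: the coprocessor consumes the stored
        # charge directly, again without any conversion.
        return self.analog_valid[reg]

    def digital_read(self, reg):
        # A data dependency from the digital core forces one ADC conversion.
        self.adc_count += 1
        return self.analog_valid.pop(reg)


if __name__ == "__main__":
    cache = AnalogRegFileCache()
    cache.digital_write(0, 0.0)            # one DAC for the initial operand
    for i in range(8):                     # eight analog MAC-style updates,
        acc = cache.analog_read(0) + 0.5 * i   # forwarded analog-to-analog
        cache.analog_write(0, acc)
    result = cache.digital_read(0)         # single ADC at the dependency
    print(result)                          # 14.0
    print(cache.dac_count, cache.adc_count)  # 1 DAC, 1 ADC instead of 8+8
```

In this toy run, a chain of eight analog updates costs one DAC and one ADC rather than a conversion pair per instruction — the same bookkeeping intuition behind the large conversion-overhead reduction the abstract reports, though the actual circuit-level mechanism is of course far more involved.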