A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs

The proliferation of personal artificial intelligence (AI) -assistant technologies with speech-based conversational AI interfaces is driving the exponential growth in the consumer Internet of Things (IoT) market. As these technologies are being applied to keyword spotting (KWS), automatic speech rec...

Full description

Saved in:
Bibliographic Details
Published inIEEE journal of solid-state circuits Vol. 58; no. 2; pp. 569 - 581
Main Authors Tambe, Thierry, Yang, En-Yu, Ko, Glenn G., Chai, Yuji, Hooper, Coleman, Donato, Marco, Whatmough, Paul N., Rush, Alexander M., Brooks, David, Wei, Gu-Yeon
Format Journal Article
LanguageEnglish
Published New York IEEE 01.02.2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:The proliferation of personal artificial intelligence (AI) -assistant technologies with speech-based conversational AI interfaces is driving the exponential growth in the consumer Internet of Things (IoT) market. As these technologies are being applied to keyword spotting (KWS), automatic speech recognition (ASR), natural language processing (NLP), and text-to-speech (TTS) applications, it is of paramount importance that they provide uncompromising performance for context learning in long sequences, which is a key benefit of the attention mechanism, and that they work seamlessly in polyphonic environments. In this work, we present a 25-mm2 system-on-chip (SoC) in 16-nm FinFET technology, codenamed SM6, which executes end-to-end speech-enhancing attention-based ASR and NLP workloads. The SoC includes: 1) FlexASR, a highly reconfigurable NLP inference processor optimized for whole-model acceleration of bidirectional attention-based sequence-to-sequence (seq2seq) deep neural networks (DNNs); 2) a Markov random field source separation engine (MSSE), a probabilistic graphical model accelerator for unsupervised inference via Gibbs sampling, used for sound source separation; 3) a dual-core Arm Cortex A53 CPU cluster, which provides on-demand single Instruction/multiple data (SIMD) fast fourier transform (FFT) processing and performs various application logic (e.g., expectation-maximization (EM) algorithm and 8-bit floating-point (FP8) quantization); and 4) an always-ON M0 subsystem for audio detection and power management. Measurement results demonstrate the efficiency ranges of 2.6-7.8 TFLOPs/W and 4.33-17.6 Gsamples/s/W for FlexASR and MSSE, respectively; MSSE denoising performance allowing 6<inline-formula> <tex-math notation="LaTeX">\times </tex-math></inline-formula> smaller ASR model to be stored on-chip with negligible accuracy loss; and 2.24-mJ energy consumption while achieving real-time throughput, end-to-end, and per-frame ASR latencies of 18 ms.
ISSN:0018-9200
1558-173X
DOI:10.1109/JSSC.2022.3179303