A 510-nW Wake-Up Keyword-Spotting Chip Using Serial-FFT-Based MFCC and Binarized Depthwise Separable CNN in 28-nm CMOS


Bibliographic Details
Published in IEEE Journal of Solid-State Circuits, Vol. 56, no. 1, pp. 151-164
Main Authors Shan, Weiwei; Yang, Minhao; Wang, Tao; Lu, Yicheng; Cai, Hao; Zhu, Lixuan; Xu, Jiaming; Wu, Chengjun; Shi, Longxing; Yang, Jun
Format Journal Article
Language English
Published New York IEEE 01.01.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)

Summary: We propose a sub-μW always-ON keyword spotting (μKWS) chip for audio wake-up systems. It is mainly composed of a neural network (NN) and a feature extraction (FE) circuit. To significantly reduce the memory footprint and computational load, four techniques are used to achieve ultra-low power consumption: 1) a serial-FFT-based Mel-frequency cepstrum coefficient (MFCC) circuit is designed for FE, instead of the common parallel FFT; 2) a small-sized binarized depthwise separable convolutional NN (DSCNN) is designed as the classifier; 3) a framewise incremental computation technique is devised, in contrast to conventional whole-word processing; and 4) the reduced computation allows a low system clock frequency, which enables near-threshold voltage operation, and low-leakage memory blocks are designed to minimize leakage power. Implemented in 28-nm CMOS technology, this μKWS consumes 0.51 μW at a 40-kHz frequency and a 0.41-V supply, with an area of 0.23 mm². Using the Google speech command data set, 97.3% accuracy is reached for a one-word KWS task and 94.6% for a two-word task.
ISSN: 0018-9200, 1558-173X
DOI:10.1109/JSSC.2020.3029097
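
The figures in the summary lend themselves to a quick back-of-envelope check. The sketch below is illustrative only and is not taken from the article. The power and clock values come from the summary. The layer shape (3x3 kernel, 32 input and 32 output channels) and all function names are hypothetical, chosen to show why a binarized depthwise separable layer needs far less weight storage than a standard full-precision convolution.

```python
# Back-of-envelope sketch. Power and clock are from the summary;
# the layer dimensions below are hypothetical, not from the article.

def energy_per_cycle_pj(power_w: float, clock_hz: float) -> float:
    """Average energy per clock cycle in picojoules (power / frequency)."""
    return power_w / clock_hz * 1e12


def standard_conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k: int, c_in: int, c_out: int) -> int:
    """Weight count of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def weight_storage_bytes(n_weights: int, bits_per_weight: int) -> float:
    """Storage needed for the weights at a given precision."""
    return n_weights * bits_per_weight / 8


if __name__ == "__main__":
    # 0.51 uW at 40 kHz, as stated in the summary.
    print("energy per cycle (pJ):", energy_per_cycle_pj(0.51e-6, 40e3))

    # Hypothetical layer: 3x3 kernel, 32 input channels, 32 output channels.
    std = standard_conv_weights(3, 32, 32)
    dsc = depthwise_separable_weights(3, 32, 32)
    print("standard conv weights:", std)
    print("depthwise separable weights:", dsc)
    print("32-bit standard storage (bytes):", weight_storage_bytes(std, 32))
    print("1-bit separable storage (bytes):", weight_storage_bytes(dsc, 1))
```

With the stated 0.51 μW and 40 kHz, the first function gives roughly 12.75 pJ per clock cycle. The storage comparison depends entirely on the assumed layer shape and should be read as an order-of-magnitude illustration, not as the chip's actual memory budget.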