DiagNNose: Towards Error Localization in Deep Learning Hardware based on VTA-TVM Stack

Bibliographic Details
Published in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, p. 1
Main Authors Kundu, Shamik; Banerjee, Suvadeep; Raha, Arnab; Natarajan, Suriyaprakash; Basu, Kanad
Format Journal Article
Language English
Published IEEE 08.08.2023
Summary: Low-level hardware faults manifested in a deep learning (DL) accelerator cause graceless degradation of high-level classification accuracy, which can lead to catastrophic circumstances. This violates the Functional Safety (FuSa) of the DL accelerator, which must be maintained in high-assurance applications. Conventional techniques for error localization incur high test effort and do not address the unique challenges posed by DL systems. In this direction, we propose DiagNNose, a two-tier machine learning-based error localization framework for on-line fault management in DL accelerators. We develop a novel diagnostic pattern selection algorithm to obtain a minimal subset of functional test patterns that are executed on the accelerator in mission mode. By extracting and analyzing dataflow-based features from the intermediate computations of the General Matrix Multiply (GEMM) core, a lightweight multi-layer perceptron accomplishes bit-level error localization in 8-bit, 16-bit and 32-bit datapath units with high fidelity. We evaluate the proposed DiagNNose framework on a single accelerator design, the Versatile Tensor Accelerator (VTA) architecture. When executing state-of-the-art deep neural networks trained on ImageNet, error localization using only 30 diagnostic functional test patterns achieves up to 98.4% diagnosability, an improvement of 54.63% over a random test pattern set, with as little as 4.95% overhead on the DL accelerator in mission mode.
ISSN: 0278-0070, 1937-4151
DOI: 10.1109/TCAD.2023.3303851