R2D3: A Reliability Engine for 3D Parallel Systems

Bibliographic Details
Published in: 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1-6
Main Authors: Bagherzadeh, Javad; Amarnath, Aporva; Tan, Jielun; Pal, Subhankar; Dreslinski, Ronald G.
Format: Conference Proceeding
Language: English
Published: IEEE, 01.07.2020
Summary: This paper proposes R2D3, a holistic reliability management engine for parallel 3D systems built on post-Moore technologies, which suffer from low yield and high failure rates. The engine, comprising a controller, reconfigurable crossbars, and detection circuitry, provides concurrent single-replay detection and diagnosis, fault-mitigating repair, and aging-aware lifetime management at runtime. We show that R2D3 achieves 96% coverage of defects, repairs faulty cores, and reduces threshold-voltage (V_th) degradation by 53%. This leads to a 78% performance improvement over 8 years and a 2.16× longer mean time to failure compared with a baseline 8-core 3D processor with no reliability management.
DOI: 10.1109/DAC18072.2020.9218497