R2D3: A Reliability Engine for 3D Parallel Systems
Published in | 2020 57th ACM/IEEE Design Automation Conference (DAC) pp. 1 - 6 |
---|---|
Main Authors | , , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.07.2020 |
Summary: | This paper proposes a holistic reliability management engine, R2D3, for parallel 3D systems based on post-Moore technologies, which have low yield and high failure rates. The proposed engine, comprising a controller, reconfigurable crossbars, and detection circuitry, provides concurrent single-replay detection and diagnosis, fault-mitigating repair, and aging-aware lifetime management at runtime. We show that R2D3 achieves 96% coverage of defects, repairs faulty cores, and reduces V_th degradation by 53%. This leads to a 78% performance improvement over 8 years and a 2.16× longer mean-time-to-failure compared with a baseline 8-core 3D processor with no reliability management. |
---|---|
DOI: | 10.1109/DAC18072.2020.9218497 |