Implications of accelerated self-healing as a key design knob for cross-layer resilience

Bibliographic Details
Published in: Integration (Amsterdam), Vol. 56, pp. 167-180
Main Authors: Guo, Xinfei; Stan, Mircea R.
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.01.2017
Summary: In this paper we propose a cross-layer accelerated self-healing (CLASH) system that “repairs” its wearout in a physical sense through accelerated and active recovery: wearout is reversed by actively applying accelerated self-healing techniques such as high temperature and negative voltage. Previous solutions cope with wearout (e.g. BTI) by “tolerating”, “slowing down” or “compensating”, all of which leave the irreversible (permanent) wearout component unchecked. The proposed solution instead avoids irreversible wearout entirely through periodic rejuvenation. It is inspired by the frequency-dependent behavior of wearout and of (accelerated and active) recovery, as observed in measurements on FPGAs. We demonstrate a case in which the chip can always be brought back to its fresh state by a pattern of 31 h of regular operation (at room temperature and nominal voltage) followed by 1 h of accelerated self-healing (at high temperature and negative voltage).

The proposed system integrates accelerated self-healing across multiple layers of the system stack. At the circuit level, a negative voltage generator and heating elements are designed and implemented. At the architecture level, cores can be allocated so that dark silicon or redundant resources are healed by active elements. At the system level, the scheduler can apply the right balance of stress and accelerated/active recovery to fully mitigate wearout. Various wearout sensors act as the interface between the layers. Together, these techniques let the whole system spend more of its time at higher levels of performance and power efficiency by taking full advantage of the extra opportunities that accelerated self-healing enables.

• Wearout (e.g. BTI) can be reversed significantly by accelerated and active recovery (accelerated self-healing) techniques.
• The irreversible component of wearout can be fully avoided through optimal active and sleep scheduling.
• A cross-layer accelerated self-healing system that integrates the accelerated and active recovery techniques across the system stack is proposed.
• Accelerated self-healing should be introduced as a key design knob for cross-layer resilience.
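
As a rough illustration of the 31 h / 1 h pattern quoted in the summary, the sketch below works out the cycle length and the fraction of time left for regular operation. Only the two durations come from the record; the class name `HealingSchedule`, its fields, and the `phase_at` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealingSchedule:
    """Hypothetical model of a periodic operate/heal cycle (not the paper's scheduler)."""

    operate_hours: float = 31.0  # regular operation, from the summary
    heal_hours: float = 1.0      # accelerated self-healing, from the summary

    @property
    def period_hours(self) -> float:
        # One full cycle is an operation phase followed by a healing phase.
        return self.operate_hours + self.heal_hours

    @property
    def availability(self) -> float:
        # Fraction of each cycle spent in regular operation.
        return self.operate_hours / self.period_hours

    def phase_at(self, t_hours: float) -> str:
        # Position inside the current cycle decides the phase.
        offset = t_hours % self.period_hours
        return "operate" if offset < self.operate_hours else "heal"


schedule = HealingSchedule()
print(schedule.period_hours)    # 32.0
print(schedule.availability)    # 0.96875
print(schedule.phase_at(31.5))  # heal
```

Under these assumptions the healing phase takes 1/32 of each cycle, so about 96.9% of the time remains available for regular operation.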
ISSN: 0167-9260, 1872-7522
DOI: 10.1016/j.vlsi.2016.10.008