Warped-RE: Low-Cost Error Detection and Correction in GPUs
Published in | 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks pp. 331 - 342 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.06.2015 |
Summary: | Graphics processing units (GPUs) are now the dominant computing fabric within many supercomputers. As such, many mission-critical applications run on GPUs and demand stringent reliability and computational correctness guarantees from them. Prior approaches to GPU reliability have tackled either error detection alone, or error correction assuming error detection is already present. In this paper we present Warped Redundant Execution (Warped-RE), a unified framework capable of detecting and then correcting transient and non-transient errors in the GPU execution lanes. Our work exploits two critical properties of applications running on GPUs. First, we observe that neighboring execution lanes in GPUs may operate on the same values. Thus, when neighboring lanes execute the same instruction using the same values, these lanes provide inherent DMR (dual modular redundancy) or even inherent TMR (triple modular redundancy) opportunities. The second property we exploit is that, due to insufficient parallelism or branch divergence, applications do not fully utilize all the available execution lanes. In this case, it is possible to force DMR or TMR on unused execution lanes when inherent redundancy is insufficient. During error-free execution, Warped-RE uses a combination of inherent and forced DMR to guarantee that every thread computation within every warp instruction is verified. When an error is detected in a warp instruction, the instruction is re-executed in TMR mode in order to correct the error and identify execution lanes with potential non-transient errors. Our evaluations show 8.4% and 29% average performance overhead during the DMR and TMR operation modes, respectively. Compared to traditional DMR and TMR, Warped-RE reduces the power overhead by 42% and 40%, respectively. |
---|---|
ISSN: | 1530-0889 2158-3927 |
DOI: | 10.1109/DSN.2015.55 |
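
The detection and correction flow described in the summary can be illustrated with a small software model. The sketch below is only an illustration under stated assumptions, not the hardware mechanism from the paper: the function names `plan_dmr` and `tmr_vote`, the representation of a lane's operands as a tuple, and the `"unverified"` fallback (standing in for whatever extra work the real design performs when neither inherent nor forced redundancy is available) are all hypothetical. It shows how active lanes with identical operands might check each other (inherent DMR), how an idle lane might be borrowed to duplicate a lone computation (forced DMR), and how a three-way majority vote could both correct a result and flag a suspect lane (TMR).

```python
from collections import defaultdict


def plan_dmr(operands, active):
    """Pick a checker lane for every active lane (hypothetical model).

    operands: list of hashable operand tuples, one per lane.
    active:   list of bools, True if the lane executes this instruction.
    Returns {lane: (mode, checker_lane)} where mode is 'inherent',
    'forced', or 'unverified' (assumption: no redundancy left this pass).
    """
    # Group active lanes by operand values: equal operands imply equal results.
    groups = defaultdict(list)
    for lane, (ops, on) in enumerate(zip(operands, active)):
        if on:
            groups[ops].append(lane)

    # Idle lanes can be borrowed to duplicate a computation.
    idle = [lane for lane, on in enumerate(active) if not on]

    plan = {}
    for lanes in groups.values():
        if len(lanes) >= 2:
            # Inherent DMR: compare against another lane with the same inputs.
            for lane in lanes:
                partner = next(other for other in lanes if other != lane)
                plan[lane] = ("inherent", partner)
        elif idle:
            # Forced DMR: replicate the computation on an unused lane.
            plan[lanes[0]] = ("forced", idle.pop(0))
        else:
            plan[lanes[0]] = ("unverified", None)
    return plan


def tmr_vote(results):
    """Majority vote over three redundant results.

    Returns (value, suspects): the agreed value (None if all three differ)
    and the indices of copies that disagreed with the majority.
    """
    a, b, c = results
    if a == b or a == c:
        value = a
    elif b == c:
        value = b
    else:
        return None, [0, 1, 2]
    suspects = [i for i, r in enumerate(results) if r != value]
    return value, suspects


if __name__ == "__main__":
    # Lanes 0 and 1 share operands; lane 2 is alone; lane 3 is idle.
    print(plan_dmr([(1, 2), (1, 2), (3, 4), (0, 0)], [True, True, True, False]))
    # Copy 2 disagrees with the other two, so it is flagged as suspect.
    print(tmr_vote([7, 7, 9]))
```

In this model a lane that is repeatedly reported in `suspects` across re-executions would be a candidate for a non-transient fault, while a one-off disagreement would be consistent with a transient error.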