Variation-Aware Reliable Many-Core System Design by Exploiting Inherent Core Redundancy

Reliability issues are more severe in multi/many-core systems because of the integration of more devices in advanced technology nodes. To achieve robust computing in nanoscale designs, many circuit-level and architecture-level redundancy techniques had been proposed, which pose large fixed silicon a...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on very large scale integration (VLSI) systems Vol. 25; no. 10; pp. 2803 - 2816
Main Authors Li, Huai-Ting, Chou, Ching-Yao, Hsieh, Yuan-Ting, Chu, Wei-Ching, Wu, An-Yeu
Format Journal Article
LanguageEnglish
Published IEEE 01.10.2017
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Reliability issues are more severe in multi/many-core systems because of the integration of more devices in advanced technology nodes. To achieve robust computing in nanoscale designs, many circuit-level and architecture-level redundancy techniques had been proposed, which pose large fixed silicon area overhead and a lack of flexibility. In recent years, some methods have exploited the "inherent core redundancy" of many-core systems to implicitly implement N-modular redundant (NMR) subsystems to achieve area-efficient fault-tolerant computing. However, while facing the different levels of soft error rate, task vulnerability, and task significance in the many-core system, existing core-level redundancy methods become ineffective. To achieve robust computation in many-core systems with intercore variations and mixed workloads, we propose a variation-aware core-level redundancy scheme. Two novel approaches are presented in this scheme: 1) we construct NMR tables that store the degree of redundancy using mathematical models for systems affected by these variations and 2) we dynamically allocate each replicated task to a proper core with variation-aware mapping algorithms to achieve high reliability. Based on a modified multicore simulator, Sniper-Transient Error Process Variation (TEVR), the experimental results show that the proposed scheme can increase the reliability by 47.92% and achieve the energy saving of 39% compared with conventional core-level redundancy methods.
ISSN:1063-8210
1557-9999
DOI:10.1109/TVLSI.2017.2715803