Full Fault Resilience and Relaxed Synchronization Requirements at the Cache-Memory Interface

Bibliographic Details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 19, No. 11, pp. 1996-2009
Main Authors: Chengmo Yang, Alex Orailoglu
Format: Journal Article
Language: English
Published: New York, NY: IEEE, 1 November 2011
Summary: While multicore platforms promise significant speedup for many current applications, they also suffer from increased reliability problems as a result of continued device scaling. The projected rise in fault rates, together with the diverse ways in which faults manifest, argues for highly efficient solutions for full fault resilience. Traditional duplication and checkpointing strategies typically impose sizable overhead, either in checkpointing execution results or in constantly synchronizing two threads for value checking. To reduce this overhead while still delivering full fault resilience, we propose an integrated fault detection and checkpointing framework in which comparison and checkpointing are performed at the cache-memory interface. By sharing a single cache between two duplicated threads, execution results can be verified directly in the cache before being written back, strictly protecting memory against execution faults. Meanwhile, because unconfirmed data are allowed to be written into the cache, one thread can run well ahead of the other, relaxing the straitjacket of the strict execution synchronization model. If a cache block is constantly updated, further synchronization relaxation can be achieved by extending the cache design to duplicate the cache block and skip comparison of the intermediate values.
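The following minimal sketch illustrates the core idea summarized above: the leading thread's stores are held in the shared cache as unverified, and a value only reaches memory once the trailing thread produces a matching result. It is an illustrative software model under assumed names (cache_line, store_leading, store_trailing), not the authors' hardware design, and it omits checkpointing, rollback, and the block-duplication extension.

/* Illustrative software model of comparison at the cache-memory interface.
 * All structure and function names here are assumptions for exposition. */
#include <stdio.h>
#include <stdbool.h>

#define NLINES 8u

typedef struct {
    unsigned addr;        /* address tag                                  */
    unsigned value;       /* value produced by the leading thread         */
    bool     valid;
    bool     unverified;  /* true until the trailing thread confirms it   */
} cache_line;

static cache_line cache[NLINES];
static unsigned   memory[NLINES];   /* "safe" state: verified data only   */

static cache_line *lookup(unsigned addr) { return &cache[addr % NLINES]; }

/* Leading thread: the store lands in the shared cache but stays unverified,
 * so the leading thread can run ahead instead of stalling for the check. */
static void store_leading(unsigned addr, unsigned value)
{
    cache_line *l = lookup(addr);
    l->addr = addr;
    l->value = value;
    l->valid = true;
    l->unverified = true;
}

/* Trailing thread: its store is compared against the buffered value.
 * A match marks the block verified and allows write-back; a mismatch flags
 * a fault, where the full scheme would roll back to the last checkpoint. */
static bool store_trailing(unsigned addr, unsigned value)
{
    cache_line *l = lookup(addr);
    if (!l->valid || l->addr != addr || l->value != value) {
        printf("fault detected at address %u: rollback required\n", addr);
        return false;
    }
    l->unverified = false;            /* result confirmed by both threads */
    memory[addr % NLINES] = l->value; /* safe to propagate to memory      */
    return true;
}

int main(void)
{
    store_leading(3, 42);    /* leading thread writes ahead               */
    store_trailing(3, 42);   /* trailing thread confirms -> write-back    */
    store_leading(5, 7);
    store_trailing(5, 8);    /* value mismatch -> fault detected          */
    printf("memory[3] = %u\n", memory[3]);
    return 0;
}

In this model, only verified values ever update memory[], which mirrors the paper's claim that memory is strictly protected against execution faults, while the unverified flag captures how the leading thread may run ahead of the comparison.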
ISSN: 1063-8210, 1557-9999
DOI: 10.1109/TVLSI.2010.2067230