Tests and tolerances for high-performance software-implemented fault detection

Bibliographic Details
Published in	IEEE transactions on computers Vol. 52; no. 5; pp. 579 - 591
Main Authors	Turmon, M., Granat, R., Katz, D.S., Lou, J.Z.
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2003
Subjects	Algorithms; Computation; Computer fault tolerance; Error analysis; Errors; Fault detection; Fault tolerance; Floating point arithmetic; Mathematical models; Parallel algorithms; Roundoff errors; Software fault tolerance; Tolerances

More Information
Summary:	We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
ISSN:	0018-9340 1557-9956
DOI:	10.1109/TC.2003.1197125
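
Illustration:	The checksum idea in the summary can be sketched for one of the matrix operations it mentions. For a claimed product C = AB, the linear condition w^T C = (w^T A) B must hold for any vector w, so comparing the two sides up to a roundoff tolerance flags a corrupted result. The sketch below is not taken from the paper: the function names, the all-ones choice of w, and the tolerance rule (a conservative, norm-scaled bound of worst-case flavor, whereas the paper emphasizes average-case behavior) are all assumptions made for illustration.

```python
import random
import sys


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_times_matrix(w, m):
    """Row vector times matrix, i.e. w^T M."""
    return [sum(w[i] * m[i][j] for i in range(len(m)))
            for j in range(len(m[0]))]


def checksum_test(a, b, c, factor=None):
    """Check a claimed product c = a b via w^T c = (w^T a) b with w = ones.

    Returns (passed, discrepancy, threshold).
    """
    n, k = len(a), len(b)
    w = [1.0] * n
    lhs = row_times_matrix(w, c)                       # w^T C
    rhs = row_times_matrix(row_times_matrix(w, a), b)  # (w^T A) B
    discrepancy = max(abs(x - y) for x, y in zip(lhs, rhs))

    # Assumed tolerance (illustrative, not from the paper): machine epsilon
    # times a size factor times the largest entry of w^T |A| |B|, which
    # scales the allowed roundoff to the magnitude of the data.
    abs_a = [[abs(x) for x in row] for row in a]
    abs_b = [[abs(x) for x in row] for row in b]
    scale = max(row_times_matrix(row_times_matrix(w, abs_a), abs_b))
    if factor is None:
        factor = n + k
    threshold = factor * sys.float_info.epsilon * scale
    return discrepancy <= threshold, discrepancy, threshold


if __name__ == "__main__":
    rng = random.Random(0)
    a = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(8)]
    b = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(6)]
    c = matmul(a, b)
    print("fault-free:", checksum_test(a, b, c))
    c[2][3] += 1e-3  # simulated fault in a single output entry
    print("corrupted: ", checksum_test(a, b, c))
```

The check costs two vector-matrix products, which is small next to the full product it validates. A tighter threshold would catch smaller faults at the price of occasional false alarms from ordinary roundoff, which is the accuracy tradeoff the summary refers to.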