Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer

Bibliographic Details
Published in IEEE Transactions on Device and Materials Reliability, Vol. 5, no. 3, pp. 329-335
Main Authors Michalak, S.E., Harris, K.W., Hengartner, N.W., Takala, B.E., Wender, S.A.
Format Magazine Article
Language English
Published New York IEEE 01.09.2005
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Abstract Early in the deployment of the Advanced Simulation and Computing (ASC) Q supercomputer, a higher-than-expected number of single-node failures was observed. The elevated rate of single-node failures was hypothesized to be caused primarily by fatal soft errors, i.e., board-level cache (B-cache) tag (BTAG) parity errors caused by cosmic-ray-induced neutrons that led to node crashes. A series of experiments was undertaken at the Los Alamos Neutron Science Center (LANSCE) to ascertain whether fatal soft errors were indeed the primary cause of the elevated rate of single-node failures. Observed failure data from Q are consistent with the results from some of these experiments. Mitigation strategies have been developed, and scientists successfully use Q for large computations in the presence of fatal soft errors and other single-node failures.
Author Affiliation Michalak, S.E.: Stat. Sci. Group, Los Alamos Nat. Lab., NM, USA
CODEN ITDMA2
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2005
DOI 10.1109/TDMR.2005.855685
Discipline Engineering
EISSN 1558-2574
ISSN 1530-4388
IsPeerReviewed true
IsScholarly true
SubjectTerms Computation
Computational modeling
Computer errors
Cosmic-ray-induced neutron
Crashes
Elevated
Error correction codes
Failure
Fatal
Laboratories
life estimation
Life testing
linear accelerators
memory testing
neutron beam
neutron radiation effects
neutron-induced soft error
Neutrons
Random access memory
Runtime
Semiconductor device testing
semiconductor-device radiation effects
single-event upset
Soft errors
soft-error rate
static random access memory (SRAM) chips
Strategy
Supercomputers
URI https://ieeexplore.ieee.org/document/1545893