Predicting the number of fatal soft errors in Los Alamos National Laboratory's ASC Q supercomputer
Published in | IEEE Transactions on Device and Materials Reliability, Vol. 5, no. 3, pp. 329-335 |
---|---|
Main Authors | Michalak, S.E.; Harris, K.W.; Hengartner, N.W.; Takala, B.E.; Wender, S.A. |
Format | Magazine Article |
Language | English |
Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2005-09-01 |
Abstract | Early in the deployment of the Advanced Simulation and Computing (ASC) Q supercomputer, a higher-than-expected number of single-node failures was observed. The elevated rate of single-node failures was hypothesized to be caused primarily by fatal soft errors, i.e., board-level cache (B-cache) tag (BTAG) parity errors caused by cosmic-ray-induced neutrons that led to node crashes. A series of experiments was undertaken at the Los Alamos Neutron Science Center (LANSCE) to ascertain whether fatal soft errors were indeed the primary cause of the elevated rate of single-node failures. Observed failure data from Q are consistent with the results from some of these experiments. Mitigation strategies have been developed, and scientists successfully use Q for large computations in the presence of fatal soft errors and other single-node failures. |
---|---|
Author | Michalak, S.E. (Stat. Sci. Group, Los Alamos Nat. Lab., NM, USA); Harris, K.W.; Hengartner, N.W.; Takala, B.E.; Wender, S.A. |
CODEN | ITDMA2 |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2005 |
DOI | 10.1109/TDMR.2005.855685 |
Discipline | Engineering |
EISSN | 1558-2574 |
Genre | orig-research |
ISSN | 1530-4388 |
IsPeerReviewed | true |
IsScholarly | true |
SubjectTerms | Computation; Computational modeling; Computer errors; Cosmic-ray-induced neutron; Crashes; Elevated; Error correction codes; Failure; Fatal; Laboratories; life estimation; Life testing; linear accelerators; memory testing; neutron beam; neutron radiation effects; neutron-induced soft error; Neutrons; Random access memory; Runtime; Semiconductor device testing; semiconductor-device radiation effects; single-event upset; Soft errors; soft-error rate; static random access memory (SRAM) chips; Strategy; Supercomputers |
URI | https://ieeexplore.ieee.org/document/1545893 https://www.proquest.com/docview/920319140 https://search.proquest.com/docview/901683011 |