To repair or not to repair: Assessing fault resilience in MPI stencil applications

With the increasing size of HPC computations, faults are becoming more and more relevant in the HPC field. The MPI standard does not define the application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This tren...

Bibliographic Details
Published in Journal of parallel and distributed computing Vol. 205; p. 105156
Main Authors Rocco, Roberto; Boella, Elisabetta; Gregori, Daniele; Palermo, Gianluca
Format Journal Article
Language English
Published Elsevier Inc 01.11.2025
Abstract With the increasing size of HPC computations, faults are becoming more and more relevant in the HPC field. The MPI standard does not define the application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This trend is especially true in stencil applications, as their regular pattern simplifies the selection of checkpoint locations. However, checkpoint and restart mechanisms introduce non-negligible overhead, disk load, and scalability concerns. In this paper, we show an alternative through fault resilience, enabled by the features provided by the User Level Fault Mitigation extension and shipped within the Legio fault resilience framework. Through fault resilience, we continue executing only the non-failed processes, thus sacrificing result accuracy for faster fault recovery. Our experiments on some specimen stencil applications show that, despite the fault impact visible in the result, we produced meaningful values usable for scientific research, proving the possibilities of a fault resilience approach in a stencil scenario.
• Faults are becoming a critical issue in HPC executions, as MPI cannot handle them.
• Checkpointing, while widespread, is time-consuming, disk demanding and poorly scalable.
• Through fault resilience, we sacrifice result accuracy for faster recovery.
• Experiments show that the loss of accuracy does not compromise result usability.
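The trade-off described in the abstract can be illustrated with a small, self-contained sketch. It is not taken from the paper and does not use MPI, ULFM, or Legio: it simulates a 1D averaging stencil in plain Python, splits the grid into blocks owned by hypothetical ranks, and assumes that the cells of a failed rank simply stop being updated while the surviving blocks continue. The function names, grid size, and failure step are all illustrative assumptions.

```python
def jacobi_step(u, frozen):
    """One sweep of a 1D 3-point averaging stencil; frozen cells keep their value."""
    new = u[:]
    for i in range(1, len(u) - 1):
        if not frozen[i]:
            new[i] = 0.5 * (u[i - 1] + u[i + 1])
    return new


def run(n_cells=32, n_ranks=4, steps=200, fail_rank=None, fail_step=None):
    """Run the stencil; optionally lose one rank's block of cells at fail_step."""
    u = [0.0] * n_cells
    u[0], u[-1] = 1.0, 0.0              # fixed boundary values
    frozen = [False] * n_cells
    block = n_cells // n_ranks          # contiguous block of cells per hypothetical rank
    for step in range(steps):
        if fail_rank is not None and step == fail_step:
            # Assumption of this sketch: a failed rank's cells are never updated again.
            for i in range(fail_rank * block, (fail_rank + 1) * block):
                frozen[i] = True
        u = jacobi_step(u, frozen)
    return u


def max_abs_error(a, b):
    """Largest pointwise difference between two grids of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))


reference = run()                            # fault-free execution
degraded = run(fail_rank=2, fail_step=50)    # survivors continue after the fault
error = max_abs_error(reference, degraded)
print(f"max abs error after losing one rank: {error:.4f}")
```

Comparing `degraded` with `reference` gives a rough feel for how a lost block perturbs the result without any rollback; the paper's own experiments on real stencil applications are what support its conclusions about result usability.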
Author_xml – sequence: 1
  givenname: Roberto
  orcidid: 0000-0002-0223-2900
  surname: Rocco
  fullname: Rocco, Roberto
  email: roberto.rocco@e4company.com
  organization: E4 Computer Engineering Spa, Viale Martiri della Libertà, 66, Scandiano RE, Italy
– sequence: 2
  givenname: Elisabetta
  orcidid: 0000-0003-1970-6794
  surname: Boella
  fullname: Boella, Elisabetta
  email: elisabetta.boella@e4company.com
  organization: E4 Computer Engineering Spa, Viale Martiri della Libertà, 66, Scandiano RE, Italy
– sequence: 3
  givenname: Daniele
  surname: Gregori
  fullname: Gregori, Daniele
  email: daniele.gregori@e4company.com
  organization: E4 Computer Engineering Spa, Viale Martiri della Libertà, 66, Scandiano RE, Italy
– sequence: 4
  givenname: Gianluca
  orcidid: 0000-0001-7955-8012
  surname: Palermo
  fullname: Palermo, Gianluca
  email: gianluca.palermo@polimi.it
  organization: DEIB - Politecnico di Milano, Via Giuseppe Ponzio, 34, Milan, Italy
Copyright 2025 Elsevier Inc.
DOI 10.1016/j.jpdc.2025.105156
Discipline Computer Science
ISSN 0743-7315
IsPeerReviewed true
IsScholarly true
Keywords Fault resilience
Checkpoint and restart
iPiC3D
MPI
Stencil applications
Legio
User level fault mitigation extension
Language English
URI https://dx.doi.org/10.1016/j.jpdc.2025.105156