To repair or not to repair: Assessing fault resilience in MPI stencil applications
Published in | Journal of parallel and distributed computing Vol. 205; p. 105156 |
---|---|
Main Authors | Rocco, Roberto; Boella, Elisabetta; Gregori, Daniele; Palermo, Gianluca |
Format | Journal Article |
Language | English |
Published | Elsevier Inc, 01.11.2025 |
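Subjects | Checkpoint and restart; Fault resilience; iPiC3D; Legio; MPI; Stencil applications; User level fault mitigation extension |
Online Access | https://dx.doi.org/10.1016/j.jpdc.2025.105156 |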
Abstract | With the increasing size of HPC computations, faults are becoming more and more relevant in the HPC field. The MPI standard does not define the application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This trend is especially true in stencil applications, as their regular pattern simplifies the selection of checkpoint locations. However, checkpoint and restart mechanisms introduce non-negligible overhead, disk load, and scalability concerns. In this paper, we show an alternative through fault resilience, enabled by the features provided by the User Level Fault Mitigation extension and shipped within the Legio fault resilience framework. Through fault resilience, we continue executing only the non-failed processes, thus sacrificing result accuracy for faster fault recovery. Our experiments on some specimen stencil applications show that, despite the fault impact visible in the result, we produced meaningful values usable for scientific research, proving the possibilities of a fault resilience approach in a stencil scenario.
• Faults are becoming a critical issue in HPC executions, as MPI cannot handle them.
• Checkpointing, while widespread, is time-consuming, disk demanding and poorly scalable.
• Through fault resilience, we sacrifice result accuracy for faster recovery.
• Experiments show that the loss of accuracy does not compromise result usability. |
---|---|
ArticleNumber | 105156 |
Author | Boella, Elisabetta; Palermo, Gianluca; Rocco, Roberto; Gregori, Daniele |
Author_xml | (1) Rocco, Roberto; ORCID: 0000-0002-0223-2900; email: roberto.rocco@e4company.com; organization: E4 Computer Engineering Spa, Viale Martiri della Libertà, 66, Scandiano RE, Italy. (2) Boella, Elisabetta; ORCID: 0000-0003-1970-6794; email: elisabetta.boella@e4company.com; organization: E4 Computer Engineering Spa, Viale Martiri della Libertà, 66, Scandiano RE, Italy. (3) Gregori, Daniele; email: daniele.gregori@e4company.com; organization: E4 Computer Engineering Spa, Viale Martiri della Libertà, 66, Scandiano RE, Italy. (4) Palermo, Gianluca; ORCID: 0000-0001-7955-8012; email: gianluca.palermo@polimi.it; organization: DEIB - Politecnico di Milano, Via Giuseppe Ponzio, 34, Milan, Italy |
ContentType | Journal Article |
Copyright | 2025 Elsevier Inc. |
DOI | 10.1016/j.jpdc.2025.105156 |
DatabaseName | CrossRef |
Discipline | Computer Science |
ExternalDocumentID | 10_1016_j_jpdc_2025_105156 S0743731525001236 |
ISSN | 0743-7315 |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | Fault resilience; Checkpoint and restart; iPiC3D; MPI; Stencil applications; Legio; User level fault mitigation extension |
Language | English |
ORCID | 0000-0003-1970-6794 0000-0002-0223-2900 0000-0001-7955-8012 |
PublicationCentury | 2000 |
PublicationDate | November 2025 2025-11-00 |
PublicationDateYYYYMMDD | 2025-11-01 |
PublicationDecade | 2020 |
PublicationTitle | Journal of parallel and distributed computing |
PublicationYear | 2025 |
Publisher | Elsevier Inc |
StartPage | 105156 |
SubjectTerms | Checkpoint and restart; Fault resilience; iPiC3D; Legio; MPI; Stencil applications; User level fault mitigation extension |
Title | To repair or not to repair: Assessing fault resilience in MPI stencil applications |
URI | https://dx.doi.org/10.1016/j.jpdc.2025.105156 |
Volume | 205 |