Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints

Alvaro Wong*, Elisa Heymann, Dolores Rexachs, Emilio Luque

*Corresponding author for this work

Research output: Contribution to journal › Article › Research › peer-review

1 Citation (Scopus)

Abstract

Compute node failures are becoming a normal event for many long-running and scalable MPI applications. Staying within the MPI standard and applying some of the fault tolerance methods developed so far, we developed a methodology that allows applications to tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. To do this, we developed the ULSC2-RADIC middleware, which divides the application into independent MPI worlds, each corresponding to a compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments with scientific applications and the NAS Parallel Benchmarks to assess the overhead and to verify the functionality in case of a node failure. We also evaluated the computational cost of semi-coordinated checkpoints compared with coordinated checkpoints.

Original language: English
Article number: 9165191
Pages (from-to): 254-268
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 32
Issue number: 2
Publication status: Published - 11 Aug 2020

Keywords

  • checkpoint scalability
  • checkpoint-restart libraries
  • Fault tolerance
  • MPI
  • DESIGN
  • COMPUTATIONS
  • Middleware
  • ROLLBACK
  • Standards
  • RECOVERY
  • Fault tolerant systems
  • MPI APPLICATIONS
  • Computer architecture
  • Libraries
  • Computational efficiency
