Abstract
Compute node failures are becoming a normal event for many long-running and scalable MPI applications. Staying within the MPI standard and building on existing fault-tolerance methods, we developed a methodology that lets applications tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. To this end, we developed the ULSC2-RADIC middleware, which divides the application into independent MPI worlds, each corresponding to a compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments with scientific applications and the NAS Parallel Benchmarks to assess the overhead and to verify functionality in the event of a node failure. We also evaluated the computational cost of semi-coordinated checkpoints compared with coordinated checkpoints.
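The idea of one MPI world per compute node can be illustrated with a small sketch. It is not the ULSC2-RADIC implementation and uses neither MPI nor DMTCP. It is plain Python that only models the grouping. The function names, the host names, and the rank placement are hypothetical. It also assumes that "semi-coordinated" means checkpoint coordination stays inside one node's world, while a fully coordinated checkpoint involves every rank.

```python
from collections import defaultdict


def group_ranks_by_node(rank_to_host):
    """Partition global ranks into one group ("world") per compute node."""
    worlds = defaultdict(list)
    for rank in sorted(rank_to_host):
        worlds[rank_to_host[rank]].append(rank)
    return dict(worlds)


def checkpoint_participants(worlds, host, semi_coordinated=True):
    """Ranks that synchronize for a checkpoint triggered on `host`.

    Assumption: semi-coordinated keeps coordination inside one node's
    world; fully coordinated requires every rank to take part.
    """
    if semi_coordinated:
        return list(worlds[host])
    return sorted(r for ranks in worlds.values() for r in ranks)


# Hypothetical placement: 6 ranks spread over 3 nodes.
placement = {0: "node-a", 1: "node-a", 2: "node-b",
             3: "node-b", 4: "node-c", 5: "node-c"}
worlds = group_ranks_by_node(placement)
semi = checkpoint_participants(worlds, "node-b")
full = checkpoint_participants(worlds, "node-b", semi_coordinated=False)
print(semi)  # ranks coordinating in the semi-coordinated case
print(full)  # ranks coordinating in the fully coordinated case
```

Under these assumptions, the number of processes that must synchronize for a checkpoint depends on the ranks per node, not on the total job size.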
Original language | English |
---|---|
Article number | 9165191 |
Pages (from-to) | 254-268 |
Number of pages | 15 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 32 |
Issue number | 2 |
Publication status | Published - 11 Aug 2020 |
Keywords
- checkpoint scalability
- checkpoint-restart libraries
- Fault tolerance
- MPI
- DESIGN
- COMPUTATIONS
- Middleware
- ROLLBACK
- Standards
- RECOVERY
- Fault tolerant systems
- MPI APPLICATIONS
- Computer architecture
- Libraries
- Computational efficiency