Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints

Alvaro Wong*, Elisa Heymann, Dolores Rexachs, Emilio Luque

*Corresponding author for this work

Research output: Contribution to journal › Article › Research › Peer-reviewed

4 Citations (Scopus)

Abstract

Compute node failures are becoming a normal event for many long-running and scalable MPI applications. Keeping within the MPI standards and applying some of the fault tolerance methods developed so far, we developed a methodology that allows applications to tolerate failures through the creation of semi-coordinated checkpoints within the RADIC architecture. To do this, we developed the ULSC2-RADIC middleware, which divides the application into independent MPI worlds, where each MPI world corresponds to a compute node, and makes use of the DMTCP checkpoint library in a semi-coordinated environment. We ran experiments using scientific applications and the NAS Parallel Benchmarks to assess the overhead and the functionality in case of a node failure. We also evaluated the computational cost of the semi-coordinated checkpoints compared with that of coordinated checkpoints.

Original language: English
Article number: 9165191
Pages (from-to): 254-268
Number of pages: 15
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 32
Issue number: 2
Publication status: Published - 11 Aug 2020
