TY - JOUR
T1 - Analysis of parallel application checkpoint storage for system configuration
AU - León, Betzabeth
AU - Franco, Daniel
AU - Rexachs, Dolores
AU - Luque, Emilio
N1 - Publisher Copyright: © 2020, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2021/5
Y1 - 2021/5
AB - The use of fault tolerance strategies such as checkpoints is essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can impact the performance and scalability of parallel applications that use message passing. In the present work, a study is carried out on the elements that can impact the storage of the checkpoint and how these can influence the scalability of an application with fault tolerance. A methodology has been designed based on predicting the size of the checkpoint when the number of processes, the application workload or the mapping varies, using a reduced number of resources. By following this methodology, the system administrator will be able to make decisions about what should be done with the number of processes used and the number of appropriate nodes, adjusting the process mapping in applications that use checkpoints.
KW - Checkpoint
KW - Fault tolerance
KW - HPC systems
KW - MPI application
KW - Scalability
UR - http://www.scopus.com/inward/record.url?scp=85092557163&partnerID=8YFLogxK
U2 - 10.1007/s11227-020-03445-1
DO - 10.1007/s11227-020-03445-1
M3 - Article
AN - SCOPUS:85092557163
SN - 0920-8542
VL - 77
SP - 4582
EP - 4617
JO - Journal of Supercomputing
JF - Journal of Supercomputing
IS - 5
ER -