TY - JOUR
T1 - Exploring energy saving opportunities in fault tolerant HPC systems
AU - Morán, Marina
AU - Balladini, Javier
AU - Rexachs, Dolores
AU - Rucci, Enzo
N1 - Publisher Copyright: © 2023 Elsevier Inc.
PY - 2024/3
Y1 - 2024/3
N2 - Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities it offers to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid the need for every process to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communicating indirectly with the failed process. The simulations show that savings were negligible in the worst case but significant in some scenarios; the maximum saving achieved was 90% over a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
AB - Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities it offers to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid the need for every process to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communicating indirectly with the failed process. The simulations show that savings were negligible in the worst case but significant in some scenarios; the maximum saving achieved was 90% over a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
KW - Checkpoint
KW - DVFS
KW - Energy saving
KW - Fault tolerance methods
KW - HPC
UR - https://portalrecerca.uab.cat/en/publications/97192776-09f9-4e06-b5e6-666d9e7d09c3
UR - http://www.scopus.com/inward/record.url?scp=85177064632&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/8bb4cc06-577f-3c93-9e47-1bb273ba86de/
U2 - 10.1016/j.jpdc.2023.104797
DO - 10.1016/j.jpdc.2023.104797
M3 - Article
SN - 0743-7315
VL - 185
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
M1 - 104797
ER -