Towards management of energy consumption in HPC systems with fault tolerance

Marina Morán, Javier Balladini, Dolores Rexachs, Enzo Rucci

Research output: Chapter in Book › Chapter › Research › peer-review

Abstract

High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption continues to rise, and finding ways to limit or decrease it is a crucial point in current research. For high-performance MPI applications, there are fault tolerance methods based on rollback recovery, such as uncoordinated checkpoints. These methods allow only some processes to roll back when a failure occurs, while the rest of the processes continue to run. In this article, we focus on the processes that continue execution, and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate the strategies, and through simulation we analyze the behavior of an application under different configurations and failure times. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.

Original language: English
Title of host publication: 2020 IEEE Congreso Bienal de Argentina, ARGENCON 2020 - 2020 IEEE Biennial Congress of Argentina, ARGENCON 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728159577
Publication status: Published - 1 Dec 2020

Keywords

  • ACPI
  • Distributed memory
  • DVFS
  • Energy consumption
  • Energy saving
  • Fault tolerance
  • HPC
  • MPI
  • Power management
  • Uncoordinated checkpoint
