Prediction of Energy Consumption by Checkpoint/Restart in HPC

M. Moran, J. Balladini, D. Rexachs, E. Luque

Research output: Contribution to journal › Article › Research

4 Citations (Scopus)

Abstract

Coordinated checkpointing is the most widely used fault tolerance method in high-performance computing (HPC) today. Like any other fault tolerance method, it adds energy consumption on top of that of the application's execution. Knowing and minimizing this energy consumption is currently a challenge. This paper proposes a model to estimate the energy consumption of checkpoint and restart operations, together with a method for constructing it. These estimates allow different scenarios to be evaluated in order to minimize energy consumption. We focus on coordinated checkpoint/restart at the system level, in single-program multiple-data (SPMD) applications, on homogeneous clusters. We study the behavior of the power dissipated by the compute node during a checkpoint/restart operation, as well as the operation's execution time, considering different parameters of the system and the application. Experiments carried out on two platforms show the validity of the proposal. We also evaluate the impact on power and energy consumption of the processor's C-states, the configuration of the Network File System (NFS) where the checkpoint files are stored, and the compression of the checkpoint files. This paper contributes to the objective of predicting the energy consumption of applications that use checkpoint/restart. Excluding outliers, we can estimate the energy consumed by checkpoint/restart operations with errors lower than 7.5%.
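
The abstract relates energy to the power dissipated by a compute node and the execution time of a checkpoint/restart operation, and reports estimation errors relative to measurements. As a rough illustration only, and not the model proposed in the paper, the sketch below assumes energy can be approximated as mean power times duration and compares an estimate against a measured value. The function names and all numbers are hypothetical.

```python
def energy_joules(mean_power_watts: float, duration_seconds: float) -> float:
    """Approximate energy of one operation as mean power times duration.

    Assumption: a single mean power value represents the whole operation;
    the paper's actual model is not reproduced here.
    """
    if mean_power_watts < 0 or duration_seconds < 0:
        raise ValueError("power and duration must be non-negative")
    return mean_power_watts * duration_seconds


def relative_error_percent(estimated: float, measured: float) -> float:
    """Relative error of an estimate with respect to a measurement, in percent."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(estimated - measured) / abs(measured) * 100.0


# Hypothetical values for one checkpoint operation on one compute node.
estimated_energy = energy_joules(mean_power_watts=210.0, duration_seconds=40.0)
measured_energy = 8000.0  # joules, made up for illustration

error = relative_error_percent(estimated_energy, measured_energy)
print(f"estimated {estimated_energy:.0f} J, error {error:.1f}%")
```

A check of this kind, repeated over many operations, is one way an error bound such as the 7.5% quoted above could be verified against measurements.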
Original language: English
Article number: 8727526
Pages (from-to): 71791-71803
Number of pages: 13
Journal: IEEE Access
Volume: 7
Publication status: Published - 1 Jan 2019

Keywords

  • Checkpointing
  • energy consumption
  • fault tolerance
  • high performance computing
