Gestión del Almacenamiento para Tolerancia a Fallos en Computación de Altas Prestaciones

Student thesis: Doctoral thesis


In HPC environments, applications with long execution times must be kept running without interruption. Redundancy is one of the strategies used in HPC to protect against failures, but the redundant information introduces overhead: additional time and resources are needed to ensure that the system functions correctly. Fault tolerance has become fundamental to ensuring system availability in high-performance computing environments. One of the strategies used is rollback recovery, which consists of returning to a previously saved correct state. Checkpoints allow the state of a process to be saved periodically to stable storage. However, checkpointing incurs considerable latency because all processes access the file system concurrently. Checkpoint storage can also affect the performance and scalability of parallel applications that use message passing. It is therefore important to know which elements affect checkpoint storage and how they influence the scalability of a fault-tolerant application. For example, characterizing the files generated when checkpointing a parallel application helps determine the resources consumed and their impact on the I/O system. It is also important to characterize the application that performs the checkpoint, because the checkpoint I/O depends mainly on it. This research proposes a methodology that helps configure stable storage for the I/O files generated by fault tolerance, taking into account the access patterns to those files and the user requirements. The methodology has three phases: first, the I/O patterns of the checkpoint are characterized; then the stable storage requirements are analyzed; and finally the behavior of the fault tolerance strategy is modeled. A model for predicting checkpoint scalability is proposed as part of the last phase.
This methodology can be useful when selecting the most appropriate checkpoint configuration based on the characteristics of the applications and the available resources. The user will thus know how much storage space the checkpoint consumes and how much the application consumes, and can establish policies that help improve the distribution of resources.
Date of Award: 9 Mar 2023
Original language: Spanish
Supervisor: Daniel Franco Puntes (Director) & Dolores Isabel Rexachs del Rosario (Director)
