TY - CHAP
T1 - A Methodical Approach to Parallel IO Analysis in Distributed Deep Learning Applications
AU - Parraga Pinzon, Edixon Alexander
AU - Leon Otero, Betzabeth del Carmen
AU - Mendez, Sandra Adriana
AU - Rexachs, Dolores
AU - Suppi, Remo
AU - Luque, Emilio
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025/3/26
Y1 - 2025/3/26
N2 - Deep learning applications have become crucially important for the analysis and prediction of massive volumes of data. However, these applications impose substantial input/output (I/O) loads on computing systems. Specifically, when running on distributed memory systems, they manage large amounts of data that must be accessed from parallel file systems during the training stage through the available I/O software stack. These accesses are inherently intensive and highly concurrent, which can saturate systems and adversely impact application performance. Consequently, the challenge lies in efficiently utilizing the I/O system so that these applications can scale. As the volume of data increases, data access can introduce high training latency and add significant overhead, particularly when the data exceeds main memory capacity. It is therefore essential to analyze the I/O patterns generated by reading the dataset during the training stage, in order to understand how the application behaves as it scales and how many resources it will require. In this context, this paper presents a methodology to analyze parallel I/O patterns in deep learning applications. Our methodological approach mainly aims at providing users with complete and accurate information: a thorough understanding of how the application, the dataset, and the system parameters can significantly influence the parallel I/O of their deep learning application. We seek to empower users to make informed decisions through a structured methodology that allows them to identify and modify configurable elements effectively.
AB - Deep learning applications have become crucially important for the analysis and prediction of massive volumes of data. However, these applications impose substantial input/output (I/O) loads on computing systems. Specifically, when running on distributed memory systems, they manage large amounts of data that must be accessed from parallel file systems during the training stage through the available I/O software stack. These accesses are inherently intensive and highly concurrent, which can saturate systems and adversely impact application performance. Consequently, the challenge lies in efficiently utilizing the I/O system so that these applications can scale. As the volume of data increases, data access can introduce high training latency and add significant overhead, particularly when the data exceeds main memory capacity. It is therefore essential to analyze the I/O patterns generated by reading the dataset during the training stage, in order to understand how the application behaves as it scales and how many resources it will require. In this context, this paper presents a methodology to analyze parallel I/O patterns in deep learning applications. Our methodological approach mainly aims at providing users with complete and accurate information: a thorough understanding of how the application, the dataset, and the system parameters can significantly influence the parallel I/O of their deep learning application. We seek to empower users to make informed decisions through a structured methodology that allows them to identify and modify configurable elements effectively.
KW - Distributed Deep Learning
KW - HPC cluster
KW - I/O Analysis
KW - I/O behavior patterns
KW - Parallel I/O
UR - https://portalrecerca.uab.cat/en/publications/c88e6cbd-8989-4a33-8e2b-a85fa26058af
UR - http://www.scopus.com/inward/record.url?scp=105002010156&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/80c194a2-8419-39c3-8173-c942048bc552/
U2 - 10.1007/978-3-031-85638-9_1
DO - 10.1007/978-3-031-85638-9_1
M3 - Chapter
SN - 978-3-031-85638-9
SN - 978-3-031-85637-2
VL - 2256
T3 - Communications in Computer and Information Science
SP - 3
EP - 19
BT - Communications in Computer and Information Science. CSCE 2024.
ER -