A Methodical Approach to Parallel IO Analysis in Distributed Deep Learning Applications

Research output: Chapter in Book › Chapter › Research › peer-review


Abstract

Deep learning applications have become crucial for analyzing and making predictions from massive volumes of data. However, these applications impose substantial input/output (I/O) loads on computing systems. In particular, when running on distributed-memory systems, they handle large amounts of data that must be read from parallel file systems during the training stage through the available I/O software stack. These accesses are inherently intensive and highly concurrent, which can saturate the system and degrade application performance. The challenge, therefore, lies in using the I/O system efficiently so that these applications can scale. As the volume of data grows, data access can cause high training latency, and it adds significant overhead when the dataset exceeds main memory capacity. It is therefore essential to analyze the I/O patterns generated while the dataset is read during training, in order to understand how the application behaves as it scales and what resources it will need. In this context, the paper presents a methodology for analyzing parallel I/O patterns in deep learning applications. The methodology aims primarily to give users complete and accurate information about how the application, the dataset, and the system parameters can significantly influence the parallel I/O of their deep learning application. Through this structured methodology, we seek to enable users to make informed decisions by identifying and modifying the configurable elements effectively.
Original language: English
Title of host publication: Communications in Computer and Information Science. CSCE 2024.
Pages: 3-19
Number of pages: 17
Volume: 2256
ISBN (Electronic): 978-3-031-85638-9
Publication status: Published - 26 Mar 2025

Publication series

Name: Communications in Computer and Information Science
Volume: 2256 CCIS

Keywords

  • Distributed Deep Learning
  • HPC cluster
  • I/O Analysis
  • I/O behavior patterns
  • Parallel I/O
