TY - CHAP
T1 - Performance Optimization using Multimodal Modeling and Heterogeneous GNN
AU - Dutta, Akash
AU - Alcaraz, Jordi
AU - Tehranijamsaz, Ali
AU - Cesar, Eduardo
AU - Sikora, Anna
AU - Jannesari, Ali
N1 - Funding Information:
This research was supported by the National Science Foundation under Grant number 2211982. We would also like to thank the ResearchIT team at Iowa State University for their constant support. This work was also supported by the Ministerio de Ciencia e Innovación MCIN AEI/10.13039/501100011033 under contract PID2020-113614RB-C21 and by the Catalan government under contract 2021 SGR 00574.
Publisher Copyright:
© 2023 Owner/Author.
PY - 2023/7
Y1 - 2023/7
N2 - Growing heterogeneity and configurability in HPC architectures have made auto-tuning applications and runtime parameters on these systems very complex. Users are presented with a multitude of options to configure parameters. In addition to application-specific solutions, a common approach is to use general-purpose search strategies, which often fail to identify the best configurations or whose time to convergence is a significant barrier. There is, thus, a need for a general-purpose and efficient tuning approach that can be easily scaled and adapted to various tuning tasks. We propose a technique for tuning parallel code regions that is general enough to be adapted to multiple tasks. In this paper, we analyze IR-based programming models to make task-specific performance optimizations. To this end, we propose the Multimodal Graph Neural Network and Autoencoder (MGA) tuner, a multimodal deep-learning-based approach that adapts Heterogeneous Graph Neural Networks and Denoising Autoencoders for modeling IR-based code representations that serve as separate modalities. This approach is used as part of our pipeline to model a syntax-, semantics-, and structure-aware IR-based code representation for tuning parallel code regions/kernels. We extensively experiment on OpenMP and OpenCL code regions/kernels obtained from the PolyBench, Rodinia, STREAM, DataRaceBench, AMD SDK, NPB, NVIDIA SDK, Parboil, SHOC, LULESH, XSBench, RSBench, miniFE, miniAMR, and Quicksilver benchmarks and applications. We apply our multimodal learning techniques to the tasks of (i) optimizing the number of threads, scheduling policy, and chunk size in OpenMP loops and (ii) identifying the best device for heterogeneous device mapping of OpenCL kernels. Our experiments show that this multimodal-learning-based approach outperforms the state-of-the-art in almost all experiments.
AB - Growing heterogeneity and configurability in HPC architectures have made auto-tuning applications and runtime parameters on these systems very complex. Users are presented with a multitude of options to configure parameters. In addition to application-specific solutions, a common approach is to use general-purpose search strategies, which often fail to identify the best configurations or whose time to convergence is a significant barrier. There is, thus, a need for a general-purpose and efficient tuning approach that can be easily scaled and adapted to various tuning tasks. We propose a technique for tuning parallel code regions that is general enough to be adapted to multiple tasks. In this paper, we analyze IR-based programming models to make task-specific performance optimizations. To this end, we propose the Multimodal Graph Neural Network and Autoencoder (MGA) tuner, a multimodal deep-learning-based approach that adapts Heterogeneous Graph Neural Networks and Denoising Autoencoders for modeling IR-based code representations that serve as separate modalities. This approach is used as part of our pipeline to model a syntax-, semantics-, and structure-aware IR-based code representation for tuning parallel code regions/kernels. We extensively experiment on OpenMP and OpenCL code regions/kernels obtained from the PolyBench, Rodinia, STREAM, DataRaceBench, AMD SDK, NPB, NVIDIA SDK, Parboil, SHOC, LULESH, XSBench, RSBench, miniFE, miniAMR, and Quicksilver benchmarks and applications. We apply our multimodal learning techniques to the tasks of (i) optimizing the number of threads, scheduling policy, and chunk size in OpenMP loops and (ii) identifying the best device for heterogeneous device mapping of OpenCL kernels. Our experiments show that this multimodal-learning-based approach outperforms the state-of-the-art in almost all experiments.
KW - auto-tuning
KW - heterogeneous graph neural networks
KW - multimodal learning
KW - OpenCL
KW - OpenMP
UR - https://www.scopus.com/pages/publications/85169596772
UR - https://www.mendeley.com/catalogue/cac46688-b4fb-312a-bf9d-38675a829f68/
UR - https://portalrecerca.uab.cat/en/publications/209f8964-b63a-4874-a18c-a4594acb28dd
U2 - 10.1145/3588195.3592984
DO - 10.1145/3588195.3592984
M3 - Chapter
AN - SCOPUS:85169596772
SN - 9798400701559
T3 - HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
SP - 45
EP - 57
BT - HPDC 2023 - Proceedings of the 32nd International Symposium on High-Performance Parallel and Distributed Computing
ER -