TY - JOUR
T1 - Predictive and distributed routing balancing, an application-aware approach
AU - Castillo, Carlos Núñez
AU - Lugones, Diego
AU - Franco, Daniel
AU - Luque, Emilio
AU - Collier, Martin
N1 - Funding Information:
This research has been supported by the MEC-MICINN Spain under contract TIN2007-64974. Furthermore, we thank OPNET Technologies, Inc. for providing us with the OPNET Modeler licenses.
PY - 2013
Y1 - 2013
AB - The interconnection design in computing clusters and data centers is expected to change significantly in the near future to sustain the increasing communication demand at controlled capital and operational cost. In particular, a shift from typical and expensive full-bisection bandwidth interconnects (which safely cover the worst communication cases) to application-oriented designs (which may provide cost-efficient data movement at larger system scales) is envisioned in academic research and industry initiatives. Having information about the communication dynamics of applications (e.g., repetitiveness, computing and communication phases, traffic pattern, and bandwidth) allows network resources to be managed and provisioned efficiently at reduced cost. This paper presents an Application-Aware Predictive and Distributed Routing Balancing technique (PR-DRB), a new method that controls network inefficiencies based on the communication patterns of applications and speculative routing. PR-DRB monitors increases in communication latency and then dynamically redistributes the network traffic over multiple paths (path expansion) to deal with load imbalances. Additionally, PR-DRB stores the number of paths used to balance the traffic (the solution) and links it to the application pattern that caused the imbalance (the problem). This information allows PR-DRB to respond to similar situations in repetitive patterns, quickly converging to a stable solution. Evaluation results show latency and completion time reductions of up to 37% for experiments conducted on 64 nodes executing the NAS benchmarks and the LAMMPS application.
KW - Application-aware routing
KW - High performance computing
KW - HPC clusters
KW - Interconnection networks
KW - Parallel scientific applications
KW - Predictive routing
UR - http://www.scopus.com/inward/record.url?scp=84897003551&partnerID=8YFLogxK
U2 - 10.1016/j.procs.2013.05.181
DO - 10.1016/j.procs.2013.05.181
M3 - Article
AN - SCOPUS:84897003551
SN - 1877-0509
VL - 18
SP - 179
EP - 188
JO - Procedia Computer Science
JF - Procedia Computer Science
ER -