Evaluating Modern GPU Interconnect
Ang Li et al.
This highly cited 2019 paper presents a thorough evaluation of the latest GPU interconnect technologies.
The authors recognise the increasing importance of multi-GPU computing in various domains such as deep learning, big data, and large-scale simulations.
However, they highlight the lack of understanding about how modern GPUs are interconnected and the impact of these interconnects on multi-GPU application performance.
The paper evaluates five GPU interconnect technologies:
PCIe (Peripheral Component Interconnect Express)
NVLink-V1 (NVIDIA NVLink Version 1)
NVLink-V2 (NVIDIA NVLink Version 2)
NVLink-SLI (NVIDIA NVLink with Scalable Link Interface)
NVSwitch (NVIDIA NVSwitch)
These interconnects are evaluated on six platforms:
NVIDIA P100-DGX-1
V100-DGX-1
DGX-2
OLCF's SummitDev supercomputer
OLCF's Summit supercomputer
An SLI-linked system with two NVIDIA Turing RTX-2080 GPUs
For each interconnect and platform, the evaluation covers:
Raw startup latency
Sustainable uni-directional and bi-directional bandwidth
Network topology
Communication efficiency
Routing
NUMA (Non-Uniform Memory Access) effects
Two communication patterns are measured (a minimal timing sketch follows this list):
Peer-to-Peer (P2P): direct communication between two GPUs
Collective (CL): communication involving multiple GPUs
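As an illustration of how such numbers are obtained, the sketch below times repeated uni-directional peer-to-peer copies between two GPUs with CUDA events. It is a minimal example written for this summary, not the authors' benchmark code; the payload size, iteration count, and device IDs are arbitrary, and both devices are assumed to be P2P-capable.

```cpp
// Minimal sketch: uni-directional P2P bandwidth from GPU 0 to GPU 1.
// Error handling omitted for brevity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 256ull << 20;   // 256 MiB payload (arbitrary choice)
    const int iters = 20;

    void *src = nullptr, *dst = nullptr;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);    // let GPU 0 reach GPU 1 directly
    cudaMalloc(&src, bytes);
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&dst, bytes);

    cudaSetDevice(0);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpyPeerAsync(dst, 1, src, 0, bytes, 0);  // GPU 0 -> GPU 1
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("uni-directional P2P bandwidth: %.1f GB/s\n",
           (double)bytes * iters / (ms / 1e3) / 1e9);
    return 0;
}
```

Bi-directional bandwidth is typically measured the same way, with copies issued simultaneously in both directions on separate streams.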
They identify four new types of NUMA effects in GPU communication networks:
Three are triggered by NVLink's topology, connectivity, and routing
One is caused by a PCIe chipset design issue
These NUMA effects indicate that choosing the right GPU combination can have a significant impact on GPU communication efficiency and overall application performance in a multi-GPU node.
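Because of these effects, it is worth inspecting the topology before pairing GPUs. The sketch below uses the CUDA runtime's peer-to-peer queries to list connectivity between device pairs; it is an illustrative helper written for this summary, not code from the paper.

```cpp
// Minimal sketch: report P2P connectivity for every GPU pair on a node.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int access = 0, rank = 0;
            cudaDeviceCanAccessPeer(&access, i, j);    // is a direct P2P path available?
            cudaDeviceGetP2PAttribute(&rank, cudaDevP2PAttrPerformanceRank, i, j);
            printf("GPU %d -> GPU %d: P2P=%d, relative link rank=%d\n", i, j, access, rank);
        }
    }
    return 0;
}
```

On a live system, nvidia-smi topo -m prints a similar connectivity matrix showing which pairs share NVLink, a PCIe switch, or only the CPU root complex.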
The authors argue that this characterisation helps in developing more mature multi-GPU programming, execution, and performance models
Such performance models are crucial for GPU task allocation, scheduling, and migration in shared environments like AI clouds and HPC centres
They can also guide communication-oriented performance tuning
The characterisation can further aid in building reliable simulators for better application development and performance tuning
The paper provides a detailed evaluation of intra-node and inter-node collective communication (CL) performance on various platforms, including DGX-1, DGX-2, SLI-system, SummitDev, and Summit.
The authors use the NCCL library to measure the latency and bandwidth of different CL patterns, such as reduce, all-reduce, broadcast, reduce-scatter, and all-gather.
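For context, the snippet below shows the basic shape of a single-process NCCL all-reduce across all visible GPUs, using NCCL's public API. It is a minimal sketch rather than the paper's measurement harness: buffers are left uninitialised, error checking and timing are omitted, and the element count is arbitrary.

```cpp
// Minimal sketch: single-process all-reduce over all visible GPUs with NCCL.
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    const size_t count = 1 << 24;                 // number of floats per GPU (arbitrary)

    std::vector<ncclComm_t> comms(n);
    std::vector<float*> buf(n);
    std::vector<cudaStream_t> streams(n);

    ncclCommInitAll(comms.data(), n, nullptr);    // one communicator per local device
    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&buf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // In-place all-reduce: every GPU ends up with the element-wise sum.
    ncclGroupStart();
    for (int i = 0; i < n; ++i)
        ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < n; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
    for (int i = 0; i < n; ++i) ncclCommDestroy(comms[i]);
    return 0;
}
```

Sweeping the element count and the number of devices, and wrapping the grouped calls in CUDA event timers, yields the kind of latency and bandwidth curves the paper reports.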
Key findings and technical details:
NCCL ring topology: The authors explain how NCCL constructs ring networks among the communication participants for efficient broadcasting and reduction operations on different interconnects (PCIe, NVLink-V1, and NVLink-V2).
CL latency and bandwidth: The paper presents detailed results on how CL latency and bandwidth vary with the number of participating GPUs, message size, and interconnect technology (PCIe, NVLink, NVSwitch); a bandwidth-accounting sketch follows this list.
Interconnect comparison: The authors highlight the differences in CL performance between PCIe (decreasing bandwidth with more GPUs) and NVLink (increasing bandwidth with more GPUs) due to their respective network topologies.
NUMA effects: The paper identifies significant NUMA effects for certain CL patterns (reduce-scatter and all-gather) with odd numbers of GPUs on NVLink, attributing this to NCCL's implementation rather than the interconnect topology.
NVSwitch performance: The authors demonstrate the superior CL bandwidth of NVSwitch compared to PCIe, particularly for reduce, all-reduce, and broadcast operations.
Inter-node CL: The paper evaluates inter-node CL performance on SummitDev and Summit supercomputers, highlighting the impact of GPUDirect-RDMA on latency and bandwidth, as well as the improvements in GPUDirect technology from SummitDev to Summit.
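One detail worth spelling out is how collective bandwidth is usually reported. NCCL-style benchmarks distinguish algorithmic bandwidth (data size over time) from "bus bandwidth", which scales all-reduce traffic by 2(n-1)/n to reflect what each link actually carries in a ring; the small sketch below applies that convention to example numbers chosen purely for illustration.

```cpp
// Minimal sketch: algorithmic vs. "bus" bandwidth for a ring all-reduce,
// following the accounting convention documented for NCCL's benchmarks.
#include <cstdio>

int main() {
    const double bytes   = 256e6;    // message size per GPU (example value)
    const double seconds = 0.004;    // measured all-reduce time (example value)
    const int    n       = 8;        // number of participating GPUs

    double algbw = bytes / seconds;              // plain data size / time
    double busbw = algbw * 2.0 * (n - 1) / n;    // each link carries 2(n-1)/n of the data
    printf("algbw = %.1f GB/s, busbw = %.1f GB/s\n", algbw / 1e9, busbw / 1e9);
    return 0;
}
```

Under this accounting, a ring that keeps every NVLink busy shows aggregate bandwidth growing with GPU count, whereas GPUs contending for a shared PCIe hierarchy divide a roughly fixed budget, which matches the contrast described above.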
Based on these challenges, the authors propose three research directions:
Developing novel multi-GPU programming models that are adaptive, portable, tractable, and capable of addressing the aforementioned complexities. They suggest reconsidering the role of inter-GPU communication when designing new parallel models and algorithms, given the advancements in GPU interconnect technologies.
Developing practical multi-GPU performance models for performance prediction, optimisation, and analytics in multi-GPU application development and tuning. These models are also important for GPU task allocation, scheduling, and migration in shared environments like cloud and HPC centres.
Developing new communication patterns and libraries that better match the underlying interconnect and deliver high performance. The authors provide examples, such as efficiently distributing and exchanging data among dual-subnetwork interconnect topologies in Summit, and exploring new communication patterns like 2D-Torus for better efficiency compared to NCCL's ring pattern.
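To make the last point concrete: a flat ring all-reduce over n GPUs takes roughly 2(n-1) communication steps, while a 2D-torus style scheme on a p x q grid reduces along rows and then columns in roughly 2(p-1) + 2(q-1) steps. The toy calculation below, written for this summary rather than taken from the paper, shows why topology-aware patterns are attractive at scale.

```cpp
// Toy comparison of communication steps: flat ring vs. 2D-torus style all-reduce.
#include <cstdio>

int steps_ring(int n)         { return 2 * (n - 1); }               // reduce-scatter + all-gather
int steps_torus(int p, int q) { return 2 * (p - 1) + 2 * (q - 1); } // rows first, then columns

int main() {
    const int p = 4, q = 4;   // 16 GPUs arranged as a 4 x 4 grid (example)
    printf("flat ring: %d steps\n", steps_ring(p * q));   // 30 steps
    printf("2D-torus:  %d steps\n", steps_torus(p, q));   // 12 steps
    return 0;
}
```

Fewer, wider steps lower latency and exercise more links concurrently, which is the intuition behind the topology-aware collectives the authors advocate.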
The authors plan to pursue these research directions in their future work, leveraging their past experience in GPU analytic modelling and performance optimisation.
Overall, the paper provides a comprehensive evaluation of modern GPU interconnect technologies and offers valuable insights into the challenges and opportunities for high-performance computing in multi-GPU systems.