# Evaluating Modern GPU Interconnect

This highly cited <mark style="color:blue;">**2019**</mark> paper presents a thorough evaluation of the latest GPU interconnect technologies.

The authors recognise the increasing importance of multi-GPU computing in various domains such as deep learning, big data, and large-scale simulations.

However, they highlight the lack of understanding about how modern GPUs are interconnected and the impact of these interconnects on multi-GPU application performance.

{% embed url="https://arxiv.org/abs/1903.04611" %}
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
{% endembed %}

#### <mark style="color:green;">The paper evaluates five types of GPU interconnects</mark>

1. <mark style="color:blue;">**PCIe**</mark> (Peripheral Component Interconnect Express)
2. <mark style="color:blue;">**NVLink-V1**</mark> (NVIDIA NVLink Version 1)
3. <mark style="color:blue;">**NVLink-V2**</mark> (NVIDIA NVLink Version 2)
4. <mark style="color:blue;">**NV-SLI**</mark> (NVLink-based Scalable Link Interface)
5. <mark style="color:blue;">**NVSwitch**</mark> (NVIDIA NVSwitch)

#### <mark style="color:green;">The evaluation is conducted on six high-end servers and HPC platforms</mark>

1. NVIDIA P100-DGX-1
2. V100-DGX-1
3. DGX-2
4. OLCF's SummitDev supercomputer
5. OLCF's Summit supercomputer
6. An SLI-linked system with two NVIDIA Turing RTX-2080 GPUs

#### <mark style="color:green;">The authors measure various performance metrics and characteristics of these interconnects, including</mark>

1. Raw startup <mark style="color:blue;">**latency**</mark>
2. Sustainable uni-directional and bi-directional <mark style="color:blue;">**bandwidth**</mark>
3. <mark style="color:blue;">**Network topology**</mark>
4. <mark style="color:blue;">**Communication efficiency**</mark>
5. <mark style="color:blue;">**Routing**</mark>
6. <mark style="color:blue;">**NUMA**</mark> (Non-Uniform Memory Access) effects
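Startup latency and sustainable bandwidth are commonly combined into a simple alpha-beta transfer-cost model. The sketch below is illustrative only; the timings are made up and are not measurements from the paper:

```python
def transfer_time(msg_bytes: float, alpha: float, bandwidth: float) -> float:
    """Alpha-beta model: startup latency plus serialisation time.

    alpha     -- startup latency in seconds
    bandwidth -- sustainable bandwidth in bytes/second
    """
    return alpha + msg_bytes / bandwidth


def fit_alpha_beta(s1, t1, s2, t2):
    """Recover (alpha, bandwidth) from two timed transfers of sizes s1 < s2."""
    inv_bw = (t2 - t1) / (s2 - s1)   # seconds per byte (slope)
    alpha = t1 - s1 * inv_bw         # extrapolate to a zero-byte message
    return alpha, 1.0 / inv_bw


# Illustrative timings: a 1 MB message in 30 us, a 100 MB message in 2.01 ms
# recovers alpha = 10 us and bandwidth = 50 GB/s.
alpha, bw = fit_alpha_beta(1e6, 3e-5, 1e8, 2.01e-3)
```

Fitting across many message sizes, as the paper's microbenchmarks effectively do, smooths out measurement noise; two points are used here only to keep the sketch minimal.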

#### <mark style="color:green;">They consider two communication patterns</mark>

1. Peer-to-Peer (P2P): Direct communication between two GPUs
2. Collective (CL): Communication involving multiple GPUs

#### <mark style="color:green;">Based on their empirical evaluation, the authors make several key observations</mark>

1. They identify four new types of GPU communication network NUMA effects:
   * Three are triggered by NVLink's topology, connectivity, and routing
   * One is caused by a PCIe chipset design issue
2. These NUMA effects indicate that choosing the right GPU combination can have a significant impact on GPU communication efficiency and overall application performance in a multi-GPU node.
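The practical upshot of these NUMA effects can be sketched as a toy pair-selection routine over a link-bandwidth matrix. The topology and numbers below are hypothetical, not the DGX-1's actual NVLink wiring:

```python
# Hypothetical per-link bandwidths (GB/s) between four GPUs; 0 means no
# direct NVLink, so traffic falls back to a slower shared PCIe path.
# These figures are illustrative, not measurements from the paper.
PCIE_GBS = 10.0
LINK_GBS = [
    [0, 50, 25, 0],
    [50, 0, 0, 25],
    [25, 0, 0, 50],
    [0, 25, 50, 0],
]


def pair_bandwidth(a: int, b: int) -> float:
    """Direct link bandwidth if the pair is NVLink-connected, else PCIe fallback."""
    return LINK_GBS[a][b] if LINK_GBS[a][b] > 0 else PCIE_GBS


def best_pair(gpus):
    """Pick the GPU combination with the fastest direct link."""
    pairs = [(a, b) for a in gpus for b in gpus if a < b]
    return max(pairs, key=lambda p: pair_bandwidth(*p))
```

On this toy matrix, co-scheduling GPUs 0 and 1 (50 GB/s) rather than 0 and 3 (PCIe fallback) yields a 5x difference in direct bandwidth, which is the kind of gap the authors' NUMA observations imply a scheduler should exploit.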

### <mark style="color:purple;">The authors suggest that their evaluation results can be used to:</mark>

#### <mark style="color:green;">Build practical multi-GPU performance models</mark>

* These models are crucial for GPU task allocation, scheduling, and migration in shared environments like AI clouds and HPC centres
* They can also guide communication-oriented performance tuning

#### <mark style="color:green;">Gain deeper knowledge about the latest GPU interconnects</mark>

* This understanding can help in developing more mature multi-GPU programming, execution, and performance models
* It can also aid in building reliable simulators for better application development and performance tuning

### <mark style="color:purple;">Performance Details</mark>

The paper provides a detailed evaluation of intra-node and inter-node <mark style="color:blue;">**collective communication (CL)**</mark> performance on various platforms, including DGX-1, DGX-2, the SLI system, SummitDev, and Summit.

The authors use the <mark style="color:blue;">**NCCL library**</mark> to *<mark style="color:yellow;">measure the latency and bandwidth</mark>* of different CL patterns, such as reduce, all-reduce, broadcast, reduce-scatter, and all-gather.

Key findings and technical details include:

<mark style="color:green;">**NCCL ring topology:**</mark> The authors explain how NCCL constructs ring networks among the communication participants for efficient broadcasting and reduction operations on different interconnects (PCIe, NVLink-V1, and NVLink-V2).
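The ring algorithm the authors describe proceeds as a reduce-scatter phase followed by an all-gather phase; the minimal simulation below illustrates that schedule (it is a sketch of the textbook ring all-reduce, not NCCL code):

```python
def ring_allreduce(buffers):
    """Simulate ring all-reduce over n equal-length buffers (one per 'GPU').

    Phase 1 (reduce-scatter): after n-1 steps, rank r holds the fully
    reduced chunk (r+1) % n. Phase 2 (all-gather): n-1 more steps
    circulate the reduced chunks so every rank ends with the full sum,
    for 2*(n-1) communication steps in total.
    """
    n = len(buffers)
    chunks = [list(b) for b in buffers]  # chunks[rank][chunk_index]
    # Reduce-scatter: at step s, rank r sends chunk (r - s) % n to its
    # right neighbour, which accumulates it.
    for step in range(n - 1):
        for r in range(n):
            c = (r - step) % n
            chunks[(r + 1) % n][c] += chunks[r][c]
    # All-gather: at step s, rank r forwards reduced chunk (r + 1 - s) % n.
    for step in range(n - 1):
        for r in range(n):
            c = (r + 1 - step) % n
            chunks[(r + 1) % n][c] = chunks[r][c]
    return chunks
```

Each rank only ever talks to its ring neighbours, which is why the schedule maps well onto the NVLink ring networks described above: every step uses a distinct physical link.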

<mark style="color:green;">**CL latency and bandwidth:**</mark> The paper presents detailed results on how CL latency and bandwidth vary with the number of participating GPUs, message size, and interconnect technology (PCIe, NVLink, NVSwitch).

<mark style="color:green;">**Interconnect comparison:**</mark> The authors highlight the differences in CL performance between PCIe (decreasing bandwidth with more GPUs) and NVLink (increasing bandwidth with more GPUs) due to their respective network topologies.

<mark style="color:green;">**NUMA effects:**</mark> The paper identifies significant NUMA effects for certain CL patterns (reduce-scatter and all-gather) with odd numbers of GPUs on NVLink, attributing this to NCCL's implementation rather than the interconnect topology.

<mark style="color:green;">**NVSwitch performance:**</mark> The authors demonstrate the superior CL bandwidth of NVSwitch compared to PCIe, particularly for reduce, all-reduce, and broadcast operations.

<mark style="color:green;">**Inter-node CL:**</mark> The paper evaluates inter-node CL performance on SummitDev and Summit supercomputers, highlighting the impact of GPUDirect-RDMA on latency and bandwidth, as well as the improvements in GPUDirect technology from SummitDev to Summit.

### <mark style="color:purple;">Future Research</mark>

Based on the challenges identified in their evaluation, the authors propose three research directions:

1. Developing novel multi-GPU programming models that are adaptive, portable, tractable, and capable of addressing the aforementioned complexities. They suggest reconsidering the role of <mark style="color:yellow;">**inter-GPU communication when designing new parallel models and algorithms**</mark>, given the advancements in GPU interconnect technologies.
2. Developing <mark style="color:yellow;">**practical multi-GPU performance models**</mark> for performance prediction, optimisation, and analytics in multi-GPU application development and tuning. These models are also important for GPU task allocation, scheduling, and migration in shared environments like cloud and HPC centres.
3. Developing <mark style="color:yellow;">**new communication patterns and libraries that better match the underlying interconnect and deliver high performance**</mark>. The authors provide examples, such as efficiently distributing and exchanging data among dual-subnetwork interconnect topologies in Summit, and exploring new communication patterns like <mark style="color:blue;">**2D-Torus**</mark> for better efficiency compared to NCCL's ring pattern.

The authors plan to pursue these research directions in their future work, leveraging their past experience in GPU analytic modelling and performance optimisation.

Overall, the paper provides a comprehensive evaluation of modern GPU interconnect technologies and offers valuable insights into the challenges and opportunities for high-performance computing in multi-GPU systems.
