# NVIDIA GPUDirect

<mark style="color:blue;">**NVIDIA GPUDirect**</mark> is a family of technologies that enables direct communication and data transfer between GPUs and other devices like network adapters, storage drives, and video I/O devices.&#x20;

It is designed to reduce latency, increase bandwidth, and decrease CPU overhead in high-performance computing (HPC), data analytics, and AI workloads.&#x20;

### <mark style="color:green;">GPUDirect History and Architecture</mark>

GPUDirect was first introduced in 2010 with GPUDirect 1.0, which let network adapters share pinned system memory with the GPU, using <mark style="color:blue;">**direct memory access (DMA)**</mark> to eliminate a redundant copy *<mark style="color:yellow;">**within the same system**</mark>*.

As it evolved, it introduced support for peer-to-peer (P2P) communication between GPUs across [<mark style="color:blue;">**PCIe**</mark>](https://training.continuumlabs.ai/infrastructure/networking-and-connectivity/pcie-peripheral-component-interconnect-express).

It then moved to enable direct data transfer between GPUs and third-party devices like network adapters, eliminating the need for intermediate copies in system memory.
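The P2P capability described above is exposed through the CUDA runtime. The sketch below, assuming a system with at least two P2P-capable GPUs, enables mutual peer access and performs a direct device-to-device copy over PCIe (or NVLink), with no staging buffer in host memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: enable peer-to-peer (P2P) access between GPUs 0 and 1
// and copy a buffer directly between them, bypassing host memory.
// Error handling is omitted for brevity.
int main() {
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    if (!canAccess01 || !canAccess10) {
        printf("GPUs 0 and 1 cannot access each other's memory directly\n");
        return 1;
    }

    const size_t bytes = 1 << 20;  // 1 MiB test buffer
    float *src = nullptr, *dst = nullptr;

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);   // let GPU 0 map GPU 1's memory
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);   // and vice versa
    cudaMalloc(&dst, bytes);

    // Direct device-to-device copy: a DMA engine moves the data GPU-to-GPU.
    cudaMemcpyPeer(dst, 1, src, 0, bytes);
    cudaDeviceSynchronize();

    printf("P2P copy complete\n");
    return 0;
}
```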

The most recent iteration extended *<mark style="color:yellow;">direct data transfer capabilities to storage devices</mark>*, enabling a direct path between local/remote storage and GPU memory.

### <mark style="color:green;">**How GPUDirect Works**</mark>

GPUDirect technologies leverage <mark style="color:blue;">**DMA (Direct Memory Access)**</mark> engines in devices like <mark style="color:blue;">**NICs**</mark>, storage controllers, and GPUs to move data directly to/from GPU memory.

NICs, or <mark style="color:blue;">**Network Interface Cards**</mark>, are hardware components that connect a computer or other device to a network, providing the physical interface for transmitting and receiving data.

In the context of GPUDirect, DMA-capable NICs can transfer data directly to and from GPU memory, bypassing the CPU and reducing both latency and CPU overhead.

This capability is particularly important in high-performance computing environments where maximising data transfer speeds and minimising latency are critical.

GPUDirect exposes GPU memory addresses to the <mark style="color:blue;">**PCI Express (PCIe) address space**</mark>, allowing devices to access GPU memory directly <mark style="color:yellow;">**without involving the CPU**</mark>.

By eliminating intermediate data copies and reducing CPU involvement, GPUDirect reduces latency, increases bandwidth, and frees up CPU resources.
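Whether a given GPU exposes its memory to other PCIe devices in this way can be queried at runtime. A minimal sketch, assuming a recent CUDA toolkit (11.3 or later, which introduced the `cudaDevAttrGPUDirectRDMASupported` attribute):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: check whether each GPU exposes its memory for GPUDirect RDMA,
// i.e. whether third-party devices can DMA into it over PCIe.
// Requires CUDA 11.3+ for cudaDevAttrGPUDirectRDMASupported.
int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        int rdma = 0;
        cudaDeviceGetAttribute(&rdma, cudaDevAttrGPUDirectRDMASupported, dev);
        printf("GPU %d: GPUDirect RDMA %s\n",
               dev, rdma ? "supported" : "not supported");
    }
    return 0;
}
```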

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FGuRNCZKZSyTCOSHOCX71%2Fimage.png?alt=media&#x26;token=14250464-a5d6-4620-9764-3f5fb1eefe75" alt=""><figcaption><p><strong>GPUDirect Storage</strong> enables a direct data path between local or remote storage, such as NVMe or NVMe over Fabric (NVMe-oF), and GPU memory. It avoids extra copies through a bounce buffer in the CPU’s memory, enabling a direct memory access (DMA) engine near the NIC or storage to move data on a direct path into or out of GPU memory — all without burdening the CPU</p></figcaption></figure>

### <mark style="color:green;">Integration with NVIDIA Quantum InfiniBand</mark>

NVIDIA Quantum InfiniBand is a high-performance, low-latency interconnect designed for AI and HPC workloads.

<mark style="color:blue;">**GPUDirect RDMA**</mark> is a key technology that enables efficient data transfer between GPUs across InfiniBand networks.

With GPUDirect RDMA, data can be directly transferred between GPU memory of different nodes without involving the CPU or system memory.

This direct data path significantly reduces latency and increases bandwidth, enabling scalable multi-GPU and multi-node performance. &#x20;
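From an application's perspective, GPUDirect RDMA is typically reached through the InfiniBand verbs API: with the `nvidia-peermem` kernel module loaded, `ibv_reg_mr()` accepts a `cudaMalloc`'d device pointer just like a host buffer, so the NIC can DMA straight to and from GPU memory. A sketch under those assumptions (error handling omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

// Sketch: register GPU memory with an InfiniBand NIC for GPUDirect RDMA.
// Assumes the nvidia-peermem (formerly nv_peer_mem) module is loaded and
// at least one RDMA-capable device is present.
int main() {
    const size_t bytes = 1 << 20;
    void *gpu_buf = nullptr;
    cudaMalloc(&gpu_buf, bytes);           // device memory, not host memory

    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    // The same call used for host buffers works on the GPU pointer:
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    printf("registered GPU memory, lkey=0x%x rkey=0x%x\n", mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    cudaFree(gpu_buf);
    return 0;
}
```

The registered memory region's `rkey` can then be exchanged with a remote node, whose NIC writes into this GPU's memory directly, without touching either host's CPU or system RAM.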

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FarsTDuDtXN4OdalcUNXp%2Fimage.png?alt=media&#x26;token=bdb89bd5-1476-409c-b073-dc690f2201f0" alt=""><figcaption><p>Direct Communication between NVIDIA GPUs</p></figcaption></figure>

### <mark style="color:green;">**Integration with Other NVIDIA Systems**</mark>

GPUDirect technologies work with NVIDIA's accelerated computing platforms, including [<mark style="color:blue;">**DGX systems**</mark>](https://training.continuumlabs.ai/infrastructure/servers-and-chips/nvidia-dgx-h-100-system) and HGX servers.

GPUDirect Storage enables fast data transfer between storage devices (local NVMe or remote storage over NVMe-oF) and GPU memory in these systems.

It leverages the high-bandwidth, low-latency PCIe topology in NVIDIA's systems to optimise data paths and maximise performance.

GPUDirect technologies also integrate with NVIDIA's software stack, including [<mark style="color:blue;">**CUDA**</mark>](#user-content-fn-1)[^1], [<mark style="color:blue;">**cuFile API**</mark>](#user-content-fn-2)[^2], and [<mark style="color:blue;">**RAPIDS**</mark>](#user-content-fn-3)[^3], enabling developers to take advantage of direct data paths in their applications.
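The cuFile API is the developer-facing entry point to GPUDirect Storage. The sketch below, assuming a GDS-enabled system with `libcufile` installed (the file path is a placeholder), reads a file directly into GPU memory with no bounce buffer in host RAM:

```cuda
#include <cstdio>
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

// Sketch of the cuFile API (GPUDirect Storage): read a file straight into
// GPU memory. "/data/sample.bin" is a hypothetical path; error handling
// is omitted for brevity.
int main() {
    const size_t bytes = 1 << 20;
    void *gpu_buf = nullptr;

    cuFileDriverOpen();                    // initialise the GDS driver
    cudaMalloc(&gpu_buf, bytes);
    cuFileBufRegister(gpu_buf, bytes, 0);  // pin the GPU buffer for DMA

    int fd = open("/data/sample.bin", O_RDONLY | O_DIRECT);
    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);

    // DMA directly from storage into GPU memory, bypassing host RAM.
    ssize_t n = cuFileRead(handle, gpu_buf, bytes,
                           /*file_offset=*/0, /*buf_offset=*/0);
    printf("read %zd bytes into GPU memory\n", n);

    cuFileHandleDeregister(handle);
    cuFileBufDeregister(gpu_buf);
    close(fd);
    cudaFree(gpu_buf);
    cuFileDriverClose();
    return 0;
}
```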

### <mark style="color:green;">Summary</mark>

NVIDIA GPUDirect is a suite of technologies that optimise data movement and access for GPUs, reducing latency, increasing bandwidth, and offloading CPU overhead.

It is a critical component in NVIDIA's accelerated computing stack, enabling high-performance AI, HPC, and data analytics workloads.&#x20;

GPUDirect RDMA, in particular, <mark style="color:yellow;">**works closely with NVIDIA Quantum InfiniBand**</mark> to provide fast, direct data transfer between GPUs across network nodes, enabling scalable multi-GPU and multi-node performance.&#x20;

As GPU computing power continues to grow, GPUDirect technologies play an increasingly important role in relieving I/O bottlenecks and enabling efficient data movement in GPU-accelerated systems.

[^1]: CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model developed by NVIDIA for general computing on graphical processing units (GPUs). With CUDA, developers can dramatically speed up computing applications by harnessing the power of GPUs for parallelisable tasks.

[^2]: The cuFile API is part of NVIDIA's GPUDirect Storage technology. It is a file-access API that provides direct data transfer between GPU memory and storage, bypassing the CPU to reduce latency and improve data transfer rates. This enables applications to achieve higher performance and efficiency in data-intensive operations.

[^3]: RAPIDS is an open-source data science software library built on CUDA for executing end-to-end data science and analytics pipelines entirely on GPUs. It significantly accelerates data preparation, machine learning algorithms, and mathematical computations, enabling scientists and researchers to process huge datasets much faster than with traditional CPU-based computing.
