NVIDIA Magnum IO GPUDirect Storage (GDS)

NVIDIA Magnum IO GPUDirect Storage (GDS) is a technology designed to accelerate data transfers between GPU memory and remote or local storage by avoiding CPU bottlenecks.

Here are the key technical details:

Data Path

  • GDS creates a direct data path between local NVMe or remote storage and GPU memory.

  • This is enabled by a direct memory access (DMA) engine near the network adapter or storage that moves data into or out of GPU memory, bypassing the bounce buffer in CPU system memory.

  • Traditional reads and writes to GPU memory use POSIX APIs that stage data in system memory as an intermediate step, which can cause I/O bottlenecks.


POSIX (Portable Operating System Interface) is a family of standards specified by the IEEE for maintaining compatibility between operating systems.

POSIX defines the application programming interface (API), along with command line shells and utility interfaces, for software compatibility with variants of Unix and other operating systems.

Traditional Reads and Writes Using POSIX APIs

In the context of GPU computing, POSIX APIs are used to handle file I/O operations, such as reading from and writing to files. These APIs manage the movement of data between the storage (like a disk) and the application through the system's memory.

Here’s how it typically works:

  1. Reading Data:

    • Data Transfer: Data is first read from the storage into the system memory using POSIX read calls like read().

    • Intermediate Buffering: The data resides temporarily in what is often referred to as a "bounce buffer" in the system RAM.

    • Transfer to GPU Memory: From there, another operation is needed to transfer the data to the GPU memory for processing, typically managed through the GPU’s DMA (Direct Memory Access) capabilities or with additional API calls.

  2. Writing Data:

    • Data Collection: Data generated or processed by the GPU that needs to be saved is first transferred back to the system memory.

    • Intermediate Buffering: It is stored temporarily in the system RAM.

    • Writing Out: Finally, POSIX write calls like write() are used to move the data from the system RAM to the permanent storage.
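The read steps above can be sketched in a few lines of Python. This is only an illustration of the staging pattern, not GPU code: the `device_buf` bytearray stands in for GPU memory, and `staged_read` and `CHUNK` are hypothetical names chosen for this sketch. In a real application the second copy would be a host-to-device transfer issued through the GPU runtime.

```python
CHUNK = 1 << 20  # 1 MiB bounce buffer; the size is arbitrary for this sketch


def staged_read(path, device_buf):
    """Copy a file into device_buf through a host bounce buffer.

    device_buf is a plain bytearray standing in for GPU memory (assumption).
    Returns the number of bytes that passed through the bounce buffer.
    """
    bounce = bytearray(CHUNK)          # intermediate buffer in system RAM
    view = memoryview(bounce)
    staged = 0
    # buffering=0 gives a raw file object, so readinto() maps to read() calls
    with open(path, "rb", buffering=0) as f:
        while True:
            n = f.readinto(bounce)     # copy 1: storage -> system RAM
            if not n:
                break
            # copy 2: system RAM -> "GPU" memory
            device_buf[staged:staged + n] = view[:n]
            staged += n
    return staged
```

Every byte is written twice, once into `bounce` and once into `device_buf`, and the CPU drives both copies. The write path is the same pattern in reverse.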

Issues with POSIX APIs in GPU Computing

While POSIX APIs are widely supported and used for handling file operations, they pose several challenges and inefficiencies when used in GPU computing environments, particularly for high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML) workloads:

  • Double Data Movement: Data must move from storage to system memory and then to GPU memory, which adds overhead and latency.

  • CPU Bottlenecks: The CPU must manage both reads and writes, which not only consumes CPU cycles that could otherwise be used for computation but also limits the speed at which data can be moved into and out of GPU memory.

  • Increased Latency and Reduced Throughput: The additional steps required for data to travel through system memory introduce delays and reduce the overall throughput of data processing applications, especially those requiring rapid access to large volumes of data.
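A rough way to see the cost of double data movement is a toy timing model. The sketch below assumes the two hops of the staged path run one after the other with no overlap, and that a direct path is limited only by the slower link. Both are simplifications, and the function names and the numbers in the usage note are hypothetical, not measurements.

```python
def staged_seconds(size_gb, storage_gbps, host_to_gpu_gbps):
    """Two serialized hops: storage -> system RAM, then system RAM -> GPU.

    Rates are in GB/s. Assumes no pipelining between the hops.
    """
    return size_gb / storage_gbps + size_gb / host_to_gpu_gbps


def direct_seconds(size_gb, storage_gbps, host_to_gpu_gbps):
    """One hop, limited by the slower of the two links (toy assumption)."""
    return size_gb / min(storage_gbps, host_to_gpu_gbps)
```

With made-up values of 10 GB of data, 5 GB/s from storage, and 20 GB/s from host to GPU, `staged_seconds(10, 5, 20)` gives 2.5 seconds and `direct_seconds(10, 5, 20)` gives 2.0 seconds. Real systems overlap the hops to some degree, so the model overstates the gap for well-pipelined code and ignores the CPU time spent on the copies.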

Alternatives to Improve Efficiency

To mitigate these issues and enhance performance, technologies like NVIDIA’s GPUDirect Storage (GDS) are used. GDS lets the data path bypass the CPU and the bounce buffer in system memory, providing a direct data path between storage and GPU memory:

  • Direct Memory Access (DMA): GDS leverages DMA to move data directly between GPU memory and storage, reducing latency and freeing up CPU resources for other tasks.

  • Eliminating Bounce Buffers: By removing the need for intermediate bounce buffers in system memory, GDS reduces the complexity of data handling and improves data transfer rates.

This approach significantly accelerates applications by minimizing data movement and reducing the load on the CPU, which is crucial for performance-critical applications such as deep learning inference and large-scale simulations.

Components and Integration

  • GDS is exposed within CUDA via the cuFile API.

  • The cuFile API is integrated into the CUDA Toolkit (version 11.4 and later) or delivered via a separate package containing a user-level library (libcufile) and kernel module (nvidia-fs).

  • The user-level library is integrated into the CUDA Toolkit runtime, and the kernel module is installed with the NVIDIA driver.

  • NVIDIA Mellanox OFED (MLNX_OFED) is required and must be installed prior to GDS installation.
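Since the kernel module is one of the required components, a quick sanity check on a Linux host is to look for it in the loaded module list. The sketch below assumes the nvidia-fs module appears as `nvidia_fs` in `/proc/modules` (the kernel reports module names with underscores). The helper names are hypothetical, and a loaded module does not by itself prove that GDS is fully configured.

```python
def module_loaded(name, proc_modules_text):
    """Return True if `name` is the first field of any /proc/modules line."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def nvidia_fs_loaded(path="/proc/modules"):
    """Check the running kernel for the nvidia_fs module (assumed name)."""
    try:
        with open(path) as f:
            return module_loaded("nvidia_fs", f.read())
    except OSError:
        # No readable module list (for example, not a Linux host)
        return False
```

Calling `nvidia_fs_loaded()` returns a boolean. Matching on the first field avoids false positives from lines where `nvidia_fs` only appears as a dependency of another module.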

Supported Technologies

  • GDS supports RDMA over InfiniBand and Ethernet RoCE.

  • It supports distributed file systems such as NFS, DDN EXAScaler, WekaIO, and IBM Spectrum Scale.

  • GDS supports storage protocols via NVMe and NVMe-oF.

  • It provides a compatibility mode for non-GDS ready platforms.

  • GDS is enabled on NVIDIA DGX Base OS and supports Ubuntu and RHEL operating systems.
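The compatibility mode listed above is controlled through the cuFile configuration file. The sketch below assumes that file is JSON with `//` line comments and that the switch is the `allow_compat_mode` key under `properties`. Both are assumptions to check against the installed `cufile.json`, and the helper names are hypothetical.

```python
import json


def strip_line_comments(text):
    """Remove // comments that sit outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif text.startswith("//", i):
            # Skip to the end of the line; the newline itself is kept
            while i < len(text) and text[i] != "\n":
                i += 1
            continue
        else:
            out.append(ch)
        i += 1
    return "".join(out)


def compat_mode_allowed(config_text, default=False):
    """Read properties.allow_compat_mode (assumed key) from config text."""
    config = json.loads(strip_line_comments(config_text))
    return bool(config.get("properties", {}).get("allow_compat_mode", default))
```

Passing the file contents to `compat_mode_allowed` returns whether the fallback is permitted. The comment stripper tracks string state so that a `//` inside a quoted path is left alone.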

Integration with Libraries, APIs, and Frameworks

  • GDS can be used with multiple libraries, APIs, and frameworks, including DALI (Data Loading Library), RAPIDS cuDF, PyTorch, and MXNet.

Performance Benefits

  • Higher Bandwidth: GDS achieves up to 2X more bandwidth available to the GPU compared to a standard CPU-to-GPU path.

  • Lower Latency: GDS avoids extra copies in host system memory and uses dynamic routing that optimizes the path, buffers, and mechanisms used, resulting in lower latency.

  • Reduced CPU Utilization: The use of DMA engines near storage is less invasive to CPU load and doesn't interfere with GPU load. At larger data sizes, the ratio of bandwidth to fractional CPU utilization is much higher with GDS.

Benchmarking Results

  • GDSIO Benchmark: Up to 1.5X improvement in bandwidth available to the GPU and up to 2.8X improvement in CPU utilization compared to traditional data paths via the CPU bounce buffer.

  • DeepCAM Benchmark: When optimized with GDS and NVIDIA DALI, DeepCAM (a deep learning model for climate simulations) can achieve up to a 6.6X speedup compared to out-of-the-box NumPy.

In summary, NVIDIA Magnum IO GPUDirect Storage is an evolutionary technology that enables direct data transfer between GPU memory and storage, bypassing the CPU. This results in higher bandwidth, lower latency, and reduced CPU utilization, leading to improved performance for GPU-accelerated workflows in HPC, AI, and data analytics.


Copyright Continuum Labs - 2023