NVIDIA H100 NVL

For large language model inference workloads

The NVIDIA H100 NVL is a high-end variant of the H100 GPU designed specifically for large language model (LLM) inference workloads.

It differs from the standard H100 PCIe and H100 SXM versions in several key aspects:

Dual-GPU configuration

The H100 NVL consists of two PCIe-based H100 GPUs connected via high-speed NVLink bridges.

This dual-GPU setup allows for increased memory capacity and bandwidth compared to a single H100 GPU.
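
As a rough illustration of how the two GPUs are used together, the sketch below shards a large model across both devices with Hugging Face Transformers; it assumes PyTorch, Transformers and Accelerate are installed, both GPUs are visible, and the model name is only a placeholder.

```python
# Minimal sketch (one of several possible approaches): shard a large model
# across both GPUs of an H100 NVL pair with Hugging Face Transformers.
# Assumes PyTorch, transformers and accelerate are installed; the model name
# is a placeholder, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assert torch.cuda.device_count() >= 2, "expects both GPUs of the NVL pair to be visible"

model_name = "meta-llama/Llama-2-70b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # splits the layers across cuda:0 and cuda:1
)

inputs = tokenizer("The H100 NVL is designed for", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Serving frameworks such as vLLM or TensorRT-LLM expose similar tensor-parallel settings; the layer-wise split above is simply the easiest way to show the idea.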

Increased memory capacity

Each GPU in the H100 NVL has 94GB of HBM3 memory, for a total of 188GB across the two GPUs.

This is a significant increase from the 80GB of memory found in the standard H100 PCIe and SXM versions. The extra memory is crucial for handling large language models with billions of parameters.
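
A back-of-envelope check, counting weights only and ignoring the KV cache and activation overhead, shows why the extra headroom matters.

```python
# Back-of-envelope sketch: approximate weight footprint in FP16 (2 bytes per
# parameter). Real deployments also need room for the KV cache, activations
# and framework overhead, so these figures are lower bounds.
def weight_memory_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    return num_params_billion * 1e9 * bytes_per_param / 1e9

for params_b in (13, 70, 175):
    gb = weight_memory_gb(params_b)
    print(f"{params_b:>4}B params ~ {gb:>4.0f} GB of FP16 weights | "
          f"fits in 80 GB: {gb < 80} | fits in 188 GB: {gb < 188}")
```

A 70-billion-parameter model at roughly 140GB of FP16 weights fits comfortably in the NVL pair's 188GB, but not on a single 80GB H100.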

Higher memory bandwidth

The H100 NVL offers a combined memory bandwidth of 7.8TB/s (3.9TB/s per GPU), surpassing the 2TB/s of the H100 PCIe and 3.35TB/s of the H100 SXM.

This higher bandwidth allows for faster data transfer between the GPUs and memory, enhancing performance in memory-intensive LLM workloads.
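
A rough way to see why this matters: in the decode phase, each generated token effectively requires streaming the model weights from memory, so memory bandwidth sets a ceiling on single-stream token rate. The sketch below computes that ceiling; the numbers are theoretical upper bounds, not measured throughput.

```python
# Rough, bandwidth-bound sketch: upper bound on single-stream decode speed,
# assuming every generated token streams all model weights from HBM once
# (ignores KV cache traffic, batching and overlap).
def max_tokens_per_second(model_gb: float, bandwidth_tb_s: float) -> float:
    return (bandwidth_tb_s * 1e12) / (model_gb * 1e9)

model_gb = 140  # e.g. a 70B-parameter model held in FP16
for name, bw_tb_s in [("H100 PCIe", 2.0), ("H100 SXM", 3.35), ("H100 NVL, per GPU", 3.9)]:
    print(f"{name:17s}: ~{max_tokens_per_second(model_gb, bw_tb_s):.0f} tokens/s ceiling")
```

When the model is split across both NVL GPUs, each one only streams its share of the weights, which is why the combined 7.8TB/s figure is the more relevant one for the pair.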

NVLink interconnect

The NVLink bridges connecting the two GPUs in the H100 NVL provide a high-speed, low-latency communication channel.

This allows the GPUs to efficiently share data and work together on LLM inference tasks.

Although the H100 NVL's NVLink bandwidth (600GB/s) is lower than that of the H100 SXM (900GB/s), it still offers a significant improvement over PCIe-based communication.
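
A small check like the one below confirms that the two GPUs can reach each other directly and gives a feel for device-to-device transfer speed; it assumes a CUDA build of PyTorch with both GPUs visible, and the measured rate depends on whether the copy actually travels over the NVLink bridges.

```python
# Quick sketch: check peer-to-peer access between the two GPUs and time a
# ~1 GB device-to-device copy. Assumes a CUDA-enabled PyTorch build with
# both GPUs of the NVL pair visible.
import time
import torch

assert torch.cuda.device_count() >= 2, "expects both GPUs of the NVL pair to be visible"
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))

x = torch.randn(256, 1024, 1024, device="cuda:0")  # ~1 GB of FP32 data
torch.cuda.synchronize("cuda:0")
t0 = time.perf_counter()
y = x.to("cuda:1")
for dev in ("cuda:0", "cuda:1"):
    torch.cuda.synchronize(dev)
elapsed = time.perf_counter() - t0
gb = x.numel() * x.element_size() / 1e9
print(f"copied {gb:.2f} GB in {elapsed * 1e3:.1f} ms (~{gb / elapsed:.0f} GB/s)")
```

On the command line, `nvidia-smi topo -m` shows whether the two boards are in fact joined by NVLink bridges.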

TDP and cooling

The H100 NVL has a higher TDP range (350-400W per GPU) compared to the standard H100 PCIe (300-350W).

This higher power budget allows for increased performance, but also requires more robust cooling solutions. The dual-slot air-cooled design of the H100 NVL addresses this need.
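
For day-to-day monitoring, a simple query such as the sketch below reports live power draw against the configured limit for each board; it assumes the NVIDIA driver is installed and `nvidia-smi` is on the PATH.

```python
# Minimal sketch: report per-GPU power draw, power limit and temperature via
# nvidia-smi. Assumes the NVIDIA driver and nvidia-smi are installed.
import subprocess

result = subprocess.run(
    [
        "nvidia-smi",
        "--query-gpu=index,name,power.draw,power.limit,temperature.gpu",
        "--format=csv,noheader",
    ],
    capture_output=True,
    text=True,
    check=True,
)
for line in result.stdout.strip().splitlines():
    print(line)
```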

In terms of raw performance, the H100 NVL essentially doubles the compute capability of a single H100 GPU, delivering a combined 134 teraFLOPS of FP32 and 7,916 teraFLOPS of FP8 Tensor Core performance.

This makes it an exceptionally powerful solution for LLM inference workloads.
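
For a rough sense of scale, the compute-bound sketch below estimates prefill time using the common approximation of about 2 FLOPs per parameter per token; it assumes ideal utilisation of the quoted peak, which real workloads never reach, so treat the result as a lower bound on latency.

```python
# Rough, compute-bound sketch of prefill time, using the common ~2 FLOPs per
# parameter per token approximation and assuming ideal utilisation of peak
# throughput (real workloads achieve only a fraction of this).
def prefill_seconds(params_billion: float, prompt_tokens: int, peak_tflops: float) -> float:
    flops = 2.0 * params_billion * 1e9 * prompt_tokens
    return flops / (peak_tflops * 1e12)

params_b, prompt = 70, 4096
ms = prefill_seconds(params_b, prompt, 7916.0) * 1e3  # FP8 Tensor Core figure for the pair
print(f"~{ms:.0f} ms to prefill {prompt} tokens of a {params_b}B model at peak FP8")
```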

The H100 NVL is aimed at organisations and users who require the highest level of performance for LLM inference and are willing to invest in specialised hardware to achieve it.

While the H100 SXM remains the top choice for the most demanding HPC and AI training workloads, the H100 NVL fills a specific niche for LLM inference, offering a balance of high memory capacity, bandwidth, and compute performance in a PCIe form factor.
