NVIDIA GB200 NVL72
The NVIDIA DGX GB200 NVL72 is a powerful rack-scale system built for demanding AI and high-performance computing (HPC) workloads, with a price tag of about US$3 million.
At the heart of the DGX GB200 NVL72 are 18 compute nodes, each housing two Grace-Blackwell Superchips (GB200).
The GB200 Superchip is a marvel of engineering, combining a 72-core Grace CPU with two high-end Blackwell GPUs using NVIDIA's ultra-fast 900 GBps NVLink-C2C interconnect.
This tight integration allows for seamless communication between the CPU and GPUs, minimising latency and maximising performance.
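As a rough illustration of what that interconnect offers, the sketch below compares the time to stream a block of data between the Grace CPU's memory and the GPUs over NVLink-C2C against a conventional PCIe Gen5 x16 link (assumed here at roughly 64 GBps). The bandwidth figures are treated as idealised, usable one-way rates and protocol overheads are ignored, so this is a back-of-envelope comparison rather than a benchmark; the payload size is an arbitrary example.

```python
# Back-of-envelope comparison: moving data between Grace CPU memory and the
# GPUs over NVLink-C2C vs. a PCIe Gen5 x16 link. Bandwidth figures are treated
# as idealised aggregates; protocol overhead and latency are ignored.

NVLINK_C2C_GBPS = 900      # GB/s, NVLink-C2C figure quoted for the GB200 Superchip
PCIE_GEN5_X16_GBPS = 64    # GB/s, rough figure assumed for PCIe Gen5 x16

def transfer_seconds(gigabytes: float, bandwidth_gbps: float) -> float:
    """Idealised transfer time for a payload at a given bandwidth."""
    return gigabytes / bandwidth_gbps

payload_gb = 100           # arbitrary example payload size

print(f"NVLink-C2C:    {transfer_seconds(payload_gb, NVLINK_C2C_GBPS):.2f} s")
print(f"PCIe Gen5 x16: {transfer_seconds(payload_gb, PCIE_GEN5_X16_GBPS):.2f} s")
```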
The DGX GB200 NVL72 represents a significant advancement in AI and HPC computing, offering unprecedented performance and scalability in a single rack-scale system.
However, its high power consumption and cooling requirements may pose challenges for some data centres, potentially limiting its adoption to facilities capable of handling such high-density deployments.
The DGX GB200 NVL72 employs nine NVLink switch appliances, strategically placed in the middle of the rack.
Each switch appliance contains two NVIDIA NVLink 7.2T switch chips, providing a total of 144 NVLink ports at 100 GBps each.
This configuration allows each of the 72 GPUs in the rack to have 1.8 TBps (18 links) of bidirectional bandwidth, enabling lightning-fast data transfer and synchronisation between GPUs.
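The port arithmetic can be checked directly from these figures. The sketch below multiplies out the numbers quoted above (nine switch trays with 144 ports each, and 72 GPUs with 18 links of 100 GBps each) to confirm that the switch side and the GPU side of the NVLink fabric balance.

```python
# Sanity check: do the NVLink switch ports match the GPU-side links?

SWITCH_TRAYS = 9            # NVLink switch appliances in the rack
PORTS_PER_TRAY = 144        # 2 x NVLink 7.2T switch chips per tray
GPUS = 72
LINKS_PER_GPU = 18
LINK_GBPS = 100             # bidirectional bandwidth per NVLink link, GB/s

switch_ports = SWITCH_TRAYS * PORTS_PER_TRAY      # 1296 ports
gpu_links = GPUS * LINKS_PER_GPU                  # 1296 links
per_gpu_tbps = LINKS_PER_GPU * LINK_GBPS / 1000   # 1.8 TB/s per GPU

print(f"Switch-side ports:        {switch_ports}")
print(f"GPU-side links:           {gpu_links}")
print(f"Per-GPU NVLink bandwidth: {per_gpu_tbps:.1f} TB/s")
assert switch_ports == gpu_links
```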
The NVLink switches and compute nodes are connected via a blind mate backplane with more than 2 miles (3.2 km) of copper cabling, chosen over optical connections to reduce power consumption by 20 kW.
Each GB200 Superchip is equipped with an impressive 864 GB of memory, consisting of 480 GB LPDDR5x for the CPU and 384 GB HBM3e for the GPUs.
This memory capacity, coupled with the architecture of the Blackwell GPUs, enables each Superchip to deliver an astonishing 40 petaFLOPS of sparse FP4 performance.
When all 18 compute nodes work together, the entire DGX GB200 NVL72 rack can achieve a staggering 1.44 exaFLOPS of super-low-precision floating-point performance, making it an ideal platform for AI and HPC workloads.
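The rack-level figure follows directly from the per-Superchip specifications. The sketch below aggregates memory capacity and sparse FP4 throughput across the 36 Superchips (18 compute nodes with two Superchips each) to reproduce the 1.44 exaFLOPS number quoted above.

```python
# Aggregate the per-Superchip figures quoted above across the full rack.

NODES = 18
SUPERCHIPS_PER_NODE = 2
LPDDR5X_GB = 480            # CPU memory per Superchip
HBM3E_GB = 384              # GPU memory per Superchip (two Blackwell GPUs)
SPARSE_FP4_PFLOPS = 40      # sparse FP4 throughput per Superchip

superchips = NODES * SUPERCHIPS_PER_NODE                     # 36
rack_memory_tb = superchips * (LPDDR5X_GB + HBM3E_GB) / 1000
rack_fp4_exaflops = superchips * SPARSE_FP4_PFLOPS / 1000

print(f"Superchips per rack:   {superchips}")
print(f"Total rack memory:     {rack_memory_tb:.1f} TB")
print(f"Sparse FP4 throughput: {rack_fp4_exaflops:.2f} exaFLOPS")
```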
The flagship system stacks all of this into a single 120 kW rack.
The DGX GB200 NVL72 weighs 1.36 metric tons (3,000 lbs) and draws 120 kW, a power load that not all data centres will be able to handle.
As many facilities can only support racks of up to 60 kW, a future half-rack system seems a possibility.
Some statistics:
It's a rack-scale solution that connects 36 Grace CPUs and 72 Blackwell GPUs.
Liquid-cooled design with a 72-GPU NVLink domain acting as a single massive GPU.
Delivers 30x faster real-time inference for trillion-parameter LLMs compared to the NVIDIA H100 Tensor Core GPU.
Enables 4x faster training for large language models at scale compared to the H100.
Provides 2x the energy efficiency of H100 air-cooled infrastructure.
Speeds up key database queries by 18x compared to CPUs, delivering 5x better total cost of ownership.
In terms of power consumption, the DGX GB200 NVL72 rack consumes 120 kW.
Each compute node is estimated to consume between 5.4 kW and 5.7 kW, accounting for its two GB200 Superchips and supporting components.
The rack is equipped with six power shelves, three at the top and three at the bottom, to supply the necessary 120 kW of power.
The power shelves are likely using 415V, 60A with some level of redundancy built into the design.
The decision to use copper cabling instead of optical connections was made to reduce the power draw by an additional 20 kW, as the retimers and transceivers required for optics would have added to the already substantial power consumption.
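These estimates can be cross-checked with a simple power budget. The sketch below sums the estimated per-node draw, compares it against the quoted 120 kW rack total, and shows the average load per power shelf if the six shelves share it evenly; the allowance left over for the NVLink switches and peripherals is an inference from these figures rather than a published specification.

```python
# Rough power budget for the rack, using the estimates quoted above.
# The headroom for switches and peripherals is inferred, not published.

NODES = 18
NODE_KW_LOW, NODE_KW_HIGH = 5.4, 5.7   # estimated draw per compute node
RACK_KW = 120                           # quoted total rack power
POWER_SHELVES = 6

compute_low = NODES * NODE_KW_LOW       # ~97 kW
compute_high = NODES * NODE_KW_HIGH     # ~103 kW
headroom_low = RACK_KW - compute_high   # left for NVLink switches, NICs, fans
headroom_high = RACK_KW - compute_low

print(f"Compute nodes:                    {compute_low:.0f}-{compute_high:.0f} kW")
print(f"Left for switches and peripherals: {headroom_low:.0f}-{headroom_high:.0f} kW")
print(f"Average load per power shelf:      {RACK_KW / POWER_SHELVES:.0f} kW")
```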
Powering and cooling the DGX GB200 NVL72 is no small feat, given its impressive performance.
A hyperscale-style DC bus bar runs down the back of the rack, efficiently distributing power to all components.
To keep the system running at optimal temperatures, the compute nodes and NVLink switches are liquid-cooled, with coolant entering the rack at 25°C and exiting 20 degrees warmer. Low-power peripherals, such as NICs and storage, are cooled using conventional 40mm fans.
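The quoted 20°C temperature rise allows a rough estimate of the coolant flow needed to carry away the rack's heat, using the standard relation Q = ṁ · c_p · ΔT. The sketch below assumes a water-like coolant and treats the full 120 kW as liquid-cooled, neither of which is specified by NVIDIA, so the result is an order-of-magnitude estimate only.

```python
# Order-of-magnitude coolant flow estimate: Q = m_dot * c_p * dT.
# Assumes a water-like coolant and that the full 120 kW is removed by liquid;
# in practice the 40mm fans handle part of the load.

HEAT_LOAD_W = 120_000        # quoted rack power, treated as the heat load
CP_J_PER_KG_K = 4186         # specific heat of water
DELTA_T_K = 20               # coolant enters at 25 C and leaves ~20 C warmer
DENSITY_KG_PER_L = 1.0       # approximate density of water

mass_flow = HEAT_LOAD_W / (CP_J_PER_KG_K * DELTA_T_K)   # kg/s
litres_per_minute = mass_flow / DENSITY_KG_PER_L * 60

print(f"Required coolant flow: {mass_flow:.2f} kg/s (~{litres_per_minute:.0f} L/min)")
```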
The DGX GB200 NVL72 is designed to scale, allowing organisations to expand their AI and HPC capabilities as needed.
Eight DGX GB200 NVL72 racks can be networked together to form a DGX SuperPOD, housing an impressive 576 GPUs for tackling even larger training workloads.
If more compute capacity is required, additional SuperPODs can be added to the system, providing virtually limitless scalability.
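Because the per-rack figures scale linearly, the SuperPOD totals are straightforward to derive. The sketch below multiplies the rack-level numbers by eight to give the GPU count, aggregate sparse FP4 throughput, and approximate compute-rack power of a single DGX SuperPOD; networking and storage racks are not included.

```python
# Scale the per-rack figures to an eight-rack DGX SuperPOD.

RACKS_PER_SUPERPOD = 8
GPUS_PER_RACK = 72
RACK_FP4_EXAFLOPS = 1.44     # sparse FP4, from the rack-level figure above
RACK_KW = 120

gpus = RACKS_PER_SUPERPOD * GPUS_PER_RACK
fp4 = RACKS_PER_SUPERPOD * RACK_FP4_EXAFLOPS
power_kw = RACKS_PER_SUPERPOD * RACK_KW   # compute racks only

print(f"GPUs per SuperPOD:            {gpus}")
print(f"Aggregate sparse FP4:         {fp4:.2f} exaFLOPS")
print(f"Compute rack power (approx.): {power_kw} kW")
```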
Networking and storage are key components of the DGX GB200 NVL72.
Each compute node features four InfiniBand NICs (QSFP-DD) for high-speed, low-latency communication within the compute network.
Additionally, a BlueField-3 DPU is included in each node to handle storage network communications efficiently.
For local storage, each node is equipped with four small form-factor NVMe storage caddies, providing fast access to data.
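Summed across the 18 compute nodes, these per-node counts give a sense of the rack's I/O footprint. The tallies below simply multiply out the figures quoted above; drive capacities and NIC speeds are not specified in the source, so none are assumed.

```python
# Per-rack tallies of networking and storage components, from the per-node
# counts quoted above. Drive capacities and NIC speeds are not assumed.

NODES = 18
IB_NICS_PER_NODE = 4         # InfiniBand NICs (QSFP-DD) for the compute fabric
DPUS_PER_NODE = 1            # BlueField-3 DPU for the storage network
NVME_CADDIES_PER_NODE = 4    # local NVMe storage caddies

print(f"InfiniBand NICs per rack:  {NODES * IB_NICS_PER_NODE}")
print(f"BlueField-3 DPUs per rack: {NODES * DPUS_PER_NODE}")
print(f"NVMe caddies per rack:     {NODES * NVME_CADDIES_PER_NODE}")
```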
Major organisations across various sectors are expected to adopt Blackwell, including Amazon Web Services, Dell Technologies, Google, Meta, Microsoft, OpenAI, Oracle, Tesla, and xAI.
Cloud service providers like AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure will offer Blackwell-powered instances, while server makers such as Cisco, Dell, Hewlett Packard Enterprise, Lenovo, and Supermicro are expected to deliver servers based on Blackwell products.
Software makers in engineering simulation, such as Ansys, Cadence, and Synopsys, will also leverage Blackwell-based processors to accelerate their software.