NVIDIA DGX H100 System

An absolute beast

DGX refers to a family of purpose-built AI servers.

DGX systems simplify the adoption and deployment of AI infrastructure by providing an integrated hardware and software platform.

They come with NVIDIA Base Command, a software suite that includes cluster management, job scheduling, and monitoring tools. This allows organisations to quickly set up and manage their AI infrastructure without the complexity of building and integrating individual components.

Architecture

The NVIDIA DGX H100 is built around eight powerful NVIDIA H100 Tensor Core GPUs, enabling it to deliver unparalleled performance for AI training, inference, and high-performance computing (HPC) workloads.

NVIDIA H100 Tensor Core GPUs

At the heart of the DGX H100 system are the eight H100 GPUs, each providing 80GB of high-bandwidth GPU memory, totalling 640GB across the system.

The H100 GPUs are based on the NVIDIA Hopper architecture, which introduces significant advancements over the previous generation.

With fourth-generation Tensor Cores, the eight H100 GPUs together deliver up to 32 petaFLOPS of FP8 performance, enabling breakthrough speed and efficiency for AI workloads.
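As a quick sanity check, the system totals above can be related back to per-GPU values. The sketch below is plain arithmetic on the figures quoted in this section; the constant and function names are illustrative only, and the per-GPU FP8 figure is simply the system total divided evenly by eight.

```python
# Figures quoted in the text above.
GPU_COUNT = 8
MEMORY_PER_GPU_GB = 80
SYSTEM_FP8_PETAFLOPS = 32


def total_gpu_memory_gb(gpu_count: int, memory_per_gpu_gb: int) -> int:
    """Aggregate GPU memory across the system, in GB."""
    return gpu_count * memory_per_gpu_gb


def fp8_per_gpu_petaflops(system_petaflops: float, gpu_count: int) -> float:
    """Even share of the system FP8 figure attributed to each GPU."""
    return system_petaflops / gpu_count


if __name__ == "__main__":
    print(total_gpu_memory_gb(GPU_COUNT, MEMORY_PER_GPU_GB), "GB of GPU memory")
    print(fp8_per_gpu_petaflops(SYSTEM_FP8_PETAFLOPS, GPU_COUNT), "petaFLOPS FP8 per GPU")
```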

Connecting the GPUs - NVLink and NVSwitch Interconnects

The eight H100 GPUs in the DGX H100 system are connected using NVIDIA's high-speed NVLink and NVSwitch interconnect technologies.

NVLink provides high-bandwidth, low-latency communication, allowing the GPUs to work together efficiently on large-scale tasks.

The system also features four NVIDIA NVSwitch interconnects, enabling flexible and scalable GPU-to-GPU communication, optimising performance, and supporting larger models and datasets.

High-Performance Networking

The DGX H100 system includes an impressive networking infrastructure, with eight single-port NVIDIA ConnectX-7 VPI adapters offering up to 400Gb/s of InfiniBand or Ethernet connectivity per port, and two dual-port NVIDIA ConnectX-7 VPI adapters for additional high-speed networking capabilities.

This networking setup allows DGX H100 systems to be interconnected to form larger, scalable AI clusters, such as NVIDIA DGX SuperPOD, to tackle the most demanding AI and HPC challenges.
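To put the per-port figure in context, the peak aggregate bandwidth of the eight single-port adapters can be worked out directly. This is a minimal sketch using only the numbers quoted above; the names are illustrative, the result is a theoretical line rate rather than measured throughput, and the two dual-port adapters are left out because the text does not give their per-port speed.

```python
# Figures quoted in the text above.
SINGLE_PORT_ADAPTERS = 8
MAX_GBITS_PER_PORT = 400  # Gb/s, upper bound per port


def aggregate_gbits_per_s(ports: int, gbits_per_port: float) -> float:
    """Peak aggregate line rate in gigabits per second."""
    return ports * gbits_per_port


def gbits_to_gbytes(gbits: float) -> float:
    """Convert gigabits to gigabytes (8 bits per byte)."""
    return gbits / 8


if __name__ == "__main__":
    total_gbits = aggregate_gbits_per_s(SINGLE_PORT_ADAPTERS, MAX_GBITS_PER_PORT)
    print(f"Peak aggregate: {total_gbits:.0f} Gb/s "
          f"({gbits_to_gbytes(total_gbits):.0f} GB/s)")
```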

CPU and System Memory

Complementing the GPU and networking capabilities, the DGX H100 system is equipped with dual Intel Xeon Platinum 8480C processors, server-grade CPUs with 56 cores each, providing a total of 112 CPU cores.

The high core count allows for efficient parallel processing and multitasking, enabling the system to handle complex computational tasks alongside the GPUs.

The Xeon Platinum 8480C processors support Intel Deep Learning Boost (Intel DL Boost) technology, which includes Vector Neural Network Instructions (VNNI) that accelerate AI inference performance.

CPU Frequency

The Intel Xeon Platinum 8480C processors have a base frequency of 2.00 GHz and a maximum boost frequency of 3.80 GHz.
The base frequency refers to the default clock speed at which the CPU cores operate under normal conditions.
The maximum boost frequency is the highest clock speed that the CPU cores can reach when performing demanding tasks or when thermal conditions allow.
The high boost frequency of 3.80 GHz enables the CPUs to deliver strong single-thread performance, which is beneficial for certain workloads that rely on single-core speed.

System Memory

The DGX H100 system is equipped with 2TB of system memory, which is a substantial amount for handling large datasets and memory-intensive workloads.
The ample system memory allows for efficient data caching, reducing the need for frequent data transfers between storage and memory.
With 2TB of memory, the DGX H100 system can accommodate large AI models, datasets, and intermediate results during training and inference processes.
The high memory capacity also enables data sharing and collaboration between the CPUs and GPUs, minimising data transfer bottlenecks.
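One practical way to read these figures is as a per-GPU budget of host resources, for example when sizing data-loading workers or pinned host buffers. The sketch below divides the quoted totals evenly across the eight GPUs. The names are illustrative, and treating 2TB as 2,048 GB is an assumption made here for round numbers.

```python
# Figures quoted in the text above.
GPU_COUNT = 8
CPU_CORES = 112
GPU_MEMORY_GB = 640
SYSTEM_MEMORY_GB = 2048  # assumption: 2TB taken as 2,048 GB


def per_gpu_share(total: float, gpu_count: int) -> float:
    """Even split of a host resource across the GPUs."""
    return total / gpu_count


def host_to_gpu_memory_ratio(system_gb: float, gpu_gb: float) -> float:
    """How many times larger host memory is than total GPU memory."""
    return system_gb / gpu_gb


if __name__ == "__main__":
    print(per_gpu_share(CPU_CORES, GPU_COUNT), "CPU cores per GPU")
    print(per_gpu_share(SYSTEM_MEMORY_GB, GPU_COUNT), "GB of host memory per GPU")
    print(host_to_gpu_memory_ratio(SYSTEM_MEMORY_GB, GPU_MEMORY_GB), "x host-to-GPU memory")
```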

Storage and Management

The DGX H100 system includes two 1.92TB NVMe M.2 drives for the operating system and eight 3.84TB NVMe U.2 drives for internal storage.
The system also features a baseboard management controller (BMC) for remote management and monitoring.
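The raw capacity of each drive group follows directly from the drive counts and sizes above. This is a minimal sketch with illustrative names; it reports raw capacity only, since the text does not say how the drives are configured (for example, any RAID level), and usable capacity may be lower.

```python
# Drive counts and sizes quoted in the text above.
OS_DRIVES = 2
OS_DRIVE_TB = 1.92
DATA_DRIVES = 8
DATA_DRIVE_TB = 3.84


def raw_capacity_tb(drive_count: int, drive_tb: float) -> float:
    """Raw (unformatted, non-redundant) capacity of a drive group in TB."""
    return round(drive_count * drive_tb, 2)


if __name__ == "__main__":
    print("OS drives (raw):", raw_capacity_tb(OS_DRIVES, OS_DRIVE_TB), "TB")
    print("Internal storage (raw):", raw_capacity_tb(DATA_DRIVES, DATA_DRIVE_TB), "TB")
```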

Performance and Specifications

AI Performance

With eight H100 GPUs, the DGX H100 system achieves 32 petaFLOPS of FP8 performance, enabling faster training and inference of large-scale AI models.
The system's high-speed NVLink and NVSwitch interconnects ensure efficient communication and collaboration between the GPUs, maximising AI throughput.

Networking Performance

The DGX H100 system's high-performance networking capabilities, with up to 400Gb/s InfiniBand or Ethernet connectivity per port, enable fast data transfer and communication between systems.
This high-speed networking infrastructure allows DGX H100 systems to be scaled up to form larger AI clusters, such as NVIDIA DGX SuperPOD, for tackling the most demanding AI and HPC workloads.

System Specifications

The DGX H100 system has a maximum power usage of 10.2kW, so rack power delivery and cooling must be provisioned accordingly.
The system weighs 287.6lbs (130.45kg) and measures 14.0in (356mm) high, 19.0in (482.2mm) wide, and 35.3in (897.1mm) long.
The operating temperature range for the DGX H100 system is 5–30°C (41–86°F).
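For rack planning, the chassis height can be expressed in rack units and the temperature range checked in both scales. The sketch below assumes the standard rack unit of 1.75in; the function names are illustrative, and the calculation ignores any space needed for switches, PDUs, or cable management.

```python
import math

RACK_UNIT_IN = 1.75  # height of one rack unit (1U) in inches


def rack_units(height_in: float) -> int:
    """Whole rack units needed for a chassis of the given height."""
    return math.ceil(height_in / RACK_UNIT_IN)


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from Celsius to Fahrenheit."""
    return celsius * 9 / 5 + 32


if __name__ == "__main__":
    units = rack_units(14.0)
    print(f"One system: {units}U, five systems: {5 * units}U")
    print(f"Operating range: {celsius_to_fahrenheit(5):.0f} to "
          f"{celsius_to_fahrenheit(30):.0f} degrees F")
```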

Keeping a rack of 5 DGX systems cool

Power Density and Cooling Challenges

Power density in data centres has significantly increased over the past decade, with racks consuming up to 10 times more power than before.
Cooling such high-density racks with traditional air cooling methods is becoming increasingly challenging.
With 5 DGX H100 systems in a rack, each consuming up to 10.2 kW, the total power consumption could reach around 51 kW per rack. This high power density necessitates advanced cooling solutions like liquid cooling.
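The rack-level load is the per-system maximum multiplied by the number of systems. The sketch below takes the 10.2kW maximum from the System Specifications section as its input and also converts the result to BTU/h, a unit commonly used when sizing cooling plant. The names are illustrative, and the calculation assumes that essentially all electrical power ends up as heat and that nothing else in the rack draws power.

```python
SYSTEMS_PER_RACK = 5
MAX_POWER_PER_SYSTEM_KW = 10.2  # from the System Specifications section
BTU_PER_HOUR_PER_WATT = 3.412   # approximate conversion factor


def rack_power_kw(systems: int, power_per_system_kw: float) -> float:
    """Worst-case electrical load of the rack in kW."""
    return round(systems * power_per_system_kw, 1)


def heat_load_btu_per_hour(power_kw: float) -> float:
    """Heat to be removed, assuming all electrical power becomes heat."""
    return power_kw * 1000 * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    load_kw = rack_power_kw(SYSTEMS_PER_RACK, MAX_POWER_PER_SYSTEM_KW)
    print(f"Rack load: {load_kw} kW")
    print(f"Heat load: {heat_load_btu_per_hour(load_kw):,.0f} BTU/h")
```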

Liquid Cooling as a Solution

Liquid cooling is emerging as a promising solution to address the cooling challenges posed by high-density racks.
Various liquid cooling technologies, such as immersion cooling and cold plate cooling, can effectively remove heat from high-power components.
Liquid cooling allows for higher power densities compared to traditional air cooling methods, making it suitable for racks with multiple DGX H100 systems.
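The advantage of liquid over air can be illustrated with the basic heat-transfer relation Q = m × c × ΔT, which gives the coolant mass flow needed to carry away a given heat load. The sketch below is a rough comparison, not a design calculation: the fluid properties are approximate textbook values, the 10 K coolant temperature rise is an assumed figure, and the 51 kW example load is five systems at the 10.2kW maximum from the System Specifications section.

```python
# Approximate fluid properties (assumed textbook values).
WATER_CP = 4186.0    # J/(kg*K)
WATER_DENSITY = 1.0  # kg/L
AIR_CP = 1005.0      # J/(kg*K)
AIR_DENSITY = 1.2    # kg/m^3


def mass_flow_kg_per_s(heat_w: float, cp: float, delta_t_k: float) -> float:
    """Coolant mass flow needed to absorb heat_w with a rise of delta_t_k."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    return heat_w / (cp * delta_t_k)


def water_flow_l_per_min(heat_w: float, delta_t_k: float) -> float:
    """Water volume flow in litres per minute."""
    return mass_flow_kg_per_s(heat_w, WATER_CP, delta_t_k) / WATER_DENSITY * 60


def air_flow_m3_per_s(heat_w: float, delta_t_k: float) -> float:
    """Air volume flow in cubic metres per second."""
    return mass_flow_kg_per_s(heat_w, AIR_CP, delta_t_k) / AIR_DENSITY


if __name__ == "__main__":
    heat_w = 5 * 10.2 * 1000  # example load: five systems at 10.2 kW each
    delta_t = 10.0            # assumed coolant temperature rise in kelvin
    print(f"Water: {water_flow_l_per_min(heat_w, delta_t):.1f} L/min")
    print(f"Air:   {air_flow_m3_per_s(heat_w, delta_t):.2f} m^3/s")
```

Under these assumptions the same heat load needs roughly 73 L/min of water but roughly 4.2 cubic metres of air per second, which is why air cooling becomes impractical at these densities.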

Infrastructure Considerations

Implementing liquid cooling in existing data centres requires careful planning and infrastructure modifications.
Challenges include integrating liquid cooling with existing pipework, ensuring proper leak detection and containment, and monitoring the cooling system through the building management system (BMS).
Retrofitting liquid cooling into legacy data centres may involve operational challenges and additional costs.

Standardisation and Collaboration

Industry participants emphasise the need for standardisation and collaboration among industry players to drive the adoption of liquid cooling.
They suggest that having common standards and best practices for liquid cooling implementation would facilitate its deployment and ensure compatibility across different systems.
Collaboration with cooling solution providers, such as Stulz, is seen as crucial in developing tailored and efficient liquid cooling solutions for high-density racks.

Future-Proofing and Scalability

Data centres should be designed with future requirements in mind, considering the rapidly increasing power densities driven by AI and other advanced workloads.
Adopting liquid cooling technology not only addresses the immediate cooling needs of high-density racks but also future-proofs the data centre infrastructure for potential expansions and technology advancements.
Modular and scalable liquid cooling solutions are potential approaches to accommodate growing power densities and enable efficient cooling in both new and existing data centres.

Implementing liquid cooling for a rack with 5 DGX H100 systems appears to be a viable and necessary solution. The high power density of these systems requires advanced cooling methods to ensure optimal performance and reliability.

However, the decision to adopt liquid cooling should be made after careful evaluation of the existing data centre infrastructure, considering factors such as space constraints, piping layout, and compatibility with existing systems.

Engaging with liquid cooling solution providers and industry experts can help in designing a tailored and efficient liquid cooling system for your specific requirements.

Additionally, it's crucial to consider the long-term scalability and future-proofing aspects of the liquid cooling implementation. As power densities continue to rise, the chosen liquid cooling solution should be flexible enough to accommodate potential expansions and technological advancements in the future.
