NVIDIA DGX H100 System
An absolute beast
Last updated
Copyright Continuum Labs - 2023
DGX refers to a family of purpose-built AI servers.
DGX systems simplify the adoption and deployment of AI infrastructure by providing an integrated hardware and software platform.
They come with NVIDIA Base Command, a software suite that includes cluster management, job scheduling, and monitoring tools. This allows organisations to quickly set up and manage their AI infrastructure without the complexity of building and integrating individual components.
The NVIDIA DGX H100 is built around eight powerful NVIDIA H100 Tensor Core GPUs, enabling it to deliver unparalleled performance for AI training, inference, and high-performance computing (HPC) workloads.
NVIDIA H100 Tensor Core GPUs
At the heart of the DGX H100 system are the eight H100 GPUs, each providing 80GB of high-bandwidth GPU memory, totalling 640GB across the system.
The H100 GPUs are based on the NVIDIA Hopper architecture, which introduces significant advancements over the previous generation.
With fourth-generation Tensor Cores, the eight H100 GPUs together deliver 32 petaFLOPS of FP8 performance (with sparsity), enabling breakthrough speed and efficiency for AI workloads.
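The headline number is easy to sanity-check. A sketch of the arithmetic, assuming NVIDIA's published per-GPU figure of roughly 4 petaFLOPS (3,958 TFLOPS) for H100 SXM FP8 Tensor Core performance with sparsity:

```python
# How the 32 petaFLOPS FP8 figure decomposes across eight GPUs.
# The per-GPU value is taken from NVIDIA's H100 SXM spec sheet
# (FP8 Tensor Core, with sparsity).
PER_GPU_FP8_TFLOPS = 3_958
NUM_GPUS = 8

system_tflops = PER_GPU_FP8_TFLOPS * NUM_GPUS
system_pflops = system_tflops / 1_000

print(f"{system_pflops:.1f} petaFLOPS")  # ~31.7, marketed as 32 petaFLOPS
```

Without sparsity, the dense FP8 figure is half of this, which is why quoted performance numbers should always note whether sparsity is assumed.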
The eight H100 GPUs in the DGX H100 system are connected using NVIDIA's high-speed NVLink and NVSwitch interconnect technologies.
Fourth-generation NVLink provides 900GB/s of bidirectional bandwidth per GPU with low latency, allowing the GPUs to work together efficiently on large-scale tasks.
The system also features four NVIDIA NVSwitch interconnects, which provide all-to-all GPU-to-GPU communication and sustain performance as models and datasets grow.
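The aggregate fabric bandwidth follows directly from the per-GPU NVLink figure. A minimal sketch, assuming NVIDIA's published 900GB/s bidirectional bandwidth per H100 GPU:

```python
# Aggregate GPU-to-GPU bandwidth through the four NVSwitches.
# 900 GB/s is NVIDIA's published fourth-generation NVLink
# bidirectional bandwidth per H100 GPU.
NVLINK_GB_S_PER_GPU = 900
NUM_GPUS = 8

aggregate_tb_s = NVLINK_GB_S_PER_GPU * NUM_GPUS / 1_000
print(f"{aggregate_tb_s:.1f} TB/s")  # 7.2 TB/s across the NVSwitch fabric
```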
The DGX H100 system includes an impressive networking infrastructure, with eight single-port NVIDIA ConnectX-7 VPI adapters offering up to 400Gb/s of InfiniBand or Ethernet connectivity per port, and two dual-port NVIDIA ConnectX-7 VPI adapters for additional high-speed networking capabilities.
This networking setup allows DGX H100 systems to be interconnected to form larger, scalable AI clusters, such as NVIDIA DGX SuperPOD, to tackle the most demanding AI and HPC challenges.
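To put 400Gb/s per port in perspective, here is a back-of-the-envelope transfer-time calculation. The 1TB dataset size is a hypothetical example, and the figure ignores protocol overhead, so it is a theoretical line-rate bound rather than an expected real-world number:

```python
# Theoretical line-rate transfer time over a single 400Gb/s
# ConnectX-7 port (protocol overhead ignored -- an assumption).
PORT_GBIT_S = 400
port_gbyte_s = PORT_GBIT_S / 8       # 50 GB/s per port

dataset_gb = 1_000                   # hypothetical 1TB dataset
seconds = dataset_gb / port_gbyte_s
print(f"{seconds:.0f} s")            # 20 s at line rate
```

In practice, achievable throughput depends on the transport (RDMA vs. TCP), message sizes, and congestion, but the bound is useful when sizing multi-node training jobs.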
Complementing the GPU and networking capabilities, the DGX H100 system is equipped with dual Intel Xeon Platinum 8480C processors, providing a total of 112 CPU cores (56 per socket).
The high core count allows for efficient parallel processing and multitasking, enabling the system to handle complex computational tasks alongside the GPUs.
The Xeon Platinum 8480C processors support Intel Deep Learning Boost (Intel DL Boost) technology, which includes vectorized neural network instructions (VNNI) that accelerate AI inferencing performance.
The Intel Xeon Platinum 8480C processors have a base frequency of 2.00 GHz and a maximum boost frequency of 3.80 GHz.
The base frequency refers to the default clock speed at which the CPU cores operate under normal conditions.
The maximum boost frequency is the highest clock speed that the CPU cores can reach when performing demanding tasks or when thermal conditions allow.
The high boost frequency of 3.80 GHz enables the CPUs to deliver strong single-thread performance, which is beneficial for certain workloads that rely on single-core speed.
The DGX H100 system is equipped with 2TB of system memory, which is a substantial amount for handling large datasets and memory-intensive workloads.
The ample system memory allows for efficient data caching, reducing the need for frequent data transfers between storage and memory.
With 2TB of memory, the DGX H100 system can accommodate large AI models, datasets, and intermediate results during training and inference processes.
The high memory capacity also enables efficient data sharing between the CPUs and GPUs, minimising data transfer bottlenecks.
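A quick way to relate these capacities to model sizes is to estimate the weight-only footprint at different precisions. The 70-billion-parameter model below is an illustrative assumption, not a benchmark, and real training runs need additional memory for optimiser state, gradients, and activations:

```python
# Rough weight-only memory footprints at common precisions,
# checked against the DGX H100's 640GB of pooled GPU memory.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "fp8": 1}

def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Return the weight-only memory footprint in gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# A hypothetical 70-billion-parameter model:
fp16_gb = weight_footprint_gb(70e9, "fp16")
print(f"{fp16_gb:.0f} GB")  # 140 GB -- fits in 640GB of GPU memory
```

Optimiser state and activations typically multiply this baseline several times over, which is where the 2TB of system memory becomes useful for offloading and data staging.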
The DGX H100 system includes two 1.92TB NVMe M.2 drives for the operating system and eight 3.84TB NVMe U.2 drives for internal storage.
The system also features a baseboard management controller (BMC) for remote management and monitoring.
AI Performance
With eight H100 GPUs, the DGX H100 system achieves 32 petaFLOPS of FP8 performance, enabling faster training and inference of large-scale AI models.
The system's high-speed NVLink and NVSwitch interconnects ensure efficient communication and collaboration between the GPUs, maximising AI throughput.
Networking Performance
The DGX H100 system's high-performance networking capabilities, with up to 400Gb/s InfiniBand or Ethernet connectivity per port, enable fast data transfer and communication between systems.
This high-speed networking infrastructure allows DGX H100 systems to be scaled up to form larger AI clusters, such as NVIDIA DGX SuperPOD, for tackling the most demanding AI and HPC workloads.
System Specifications
The DGX H100 system draws up to 10.2kW at maximum load, so rack power and cooling must be provisioned accordingly.
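For facility planning, the maximum draw translates into daily energy as follows. Sustained peak draw is a worst-case assumption; real workloads rarely hold 10.2kW continuously:

```python
# Daily energy at the published 10.2kW maximum draw
# (worst-case assumption; actual draw varies with workload).
MAX_POWER_KW = 10.2
kwh_per_day = MAX_POWER_KW * 24

print(f"{kwh_per_day:.1f} kWh/day")  # 244.8 kWh/day at sustained maximum
```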
The system weighs 287.6lb (130.45kg) and measures 14.0in (356mm) high, 19.0in (482.2mm) wide, and 35.3in (897.1mm) deep.
The operating temperature range for the DGX H100 system is 5–30°C (41–86°F).