# NVIDIA DGX H100 System

<mark style="color:blue;">**DGX**</mark> refers to a family of purpose-built <mark style="color:blue;">**AI servers.**</mark> &#x20;

<mark style="color:blue;">**DGX systems**</mark> simplify the adoption and deployment of AI infrastructure by providing an <mark style="color:yellow;">**integrated hardware and software platform**</mark>.&#x20;


They come with <mark style="color:blue;">**NVIDIA Base Command**</mark>, a software suite that includes cluster management, job scheduling, and monitoring tools. This allows organisations to quickly set up and manage their AI infrastructure without the complexity of building and integrating individual components.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FxCtYTK8XntzLeAlBJIj7%2Fimage.png?alt=media&#x26;token=f5647cd5-f230-4ab6-b991-58e1ff50f9bb" alt=""><figcaption></figcaption></figure>

### <mark style="color:purple;">Architecture</mark>

The <mark style="color:blue;">**NVIDIA DGX H100**</mark> is built around <mark style="color:yellow;">**eight**</mark> powerful [<mark style="color:blue;">**NVIDIA H100 Tensor Core GPUs**</mark>](https://training.continuumlabs.ai/infrastructure/servers-and-chips/the-nvidia-h100-gpu), enabling it to deliver unparalleled performance for AI training, inference, and high-performance computing (HPC) workloads.

{% embed url="https://www.youtube.com/watch?v=a_tXcmEeGxo" %}

<mark style="color:green;">**NVIDIA H100 Tensor Core GPUs**</mark>

At the heart of the <mark style="color:blue;">**DGX H100**</mark> system are the <mark style="color:yellow;">**eight**</mark> H100 GPUs, each providing <mark style="color:yellow;">**80GB**</mark> of high-bandwidth GPU memory, totalling <mark style="color:yellow;">**640GB**</mark> across the system.&#x20;

The H100 GPUs are based on the NVIDIA <mark style="color:blue;">**Hopper architecture**</mark>, which introduces significant advancements over the previous generation. &#x20;

With fourth-generation Tensor Cores, the H100 GPUs achieve an astonishing <mark style="color:yellow;">**32 petaFLOPS of FP8 performance**</mark>, enabling breakthrough speed and efficiency for AI workloads.
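As a rough sanity check, the 32 petaFLOPS figure follows from the per-GPU FP8 rating. The per-GPU number below (~3,958 TFLOPS, FP8 with sparsity on the H100 SXM) is taken from NVIDIA's H100 datasheet and should be treated as an assumption of this sketch:

```python
# Back-of-the-envelope check of the DGX H100's aggregate FP8 throughput.
# The per-GPU figure (~3,958 TFLOPS FP8 with sparsity) is NVIDIA's H100 SXM
# datasheet number, not a measured result.

GPUS_PER_SYSTEM = 8
FP8_TFLOPS_PER_GPU = 3_958  # H100 SXM, FP8 with sparsity (assumed)

aggregate_pflops = GPUS_PER_SYSTEM * FP8_TFLOPS_PER_GPU / 1_000
print(f"Aggregate FP8: ~{aggregate_pflops:.1f} petaFLOPS")  # ~31.7, marketed as 32
```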

#### <mark style="color:green;">Connecting the GPUs - NVLink and NVSwitch Interconnects</mark>

The <mark style="color:yellow;">**eight**</mark> H100 GPUs in the <mark style="color:blue;">**DGX H100**</mark> system are connected using NVIDIA's high-speed [<mark style="color:blue;">**NVLink**</mark>](https://training.continuumlabs.ai/infrastructure/servers-and-chips/nvlink-switch) and <mark style="color:blue;">**NVSwitch**</mark> interconnect technologies.

NVLink provides high-bandwidth, low-latency communication, allowing the GPUs to work together efficiently on large-scale tasks.&#x20;

The system also features <mark style="color:yellow;">**four**</mark> NVIDIA [<mark style="color:blue;">**NVSwitch interconnects**</mark>](#nvlink-and-nvswitch-interconnects), enabling flexible and scalable GPU-to-GPU communication, optimising performance, and supporting larger models and datasets.
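To see why interconnect bandwidth matters, consider a rough ring all-reduce estimate for gradient synchronisation. The 900 GB/s per-GPU figure is NVIDIA's fourth-generation NVLink bandwidth; the model size and the ideal-bandwidth assumption are purely illustrative:

```python
# Rough lower bound on all-reduce time for gradient synchronisation across
# 8 GPUs, assuming ideal use of NVLink bandwidth (no protocol overhead).

NUM_GPUS = 8
NVLINK_BYTES_PER_S = 900e9     # ~900 GB/s per GPU, 4th-gen NVLink (assumed)
model_params = 70e9            # illustrative 70B-parameter model
bytes_per_grad = 2             # FP16/BF16 gradients

grad_bytes = model_params * bytes_per_grad
# Ring all-reduce: each GPU sends/receives 2*(n-1)/n of the buffer.
traffic_per_gpu = 2 * (NUM_GPUS - 1) / NUM_GPUS * grad_bytes
seconds = traffic_per_gpu / NVLINK_BYTES_PER_S
print(f"Ideal all-reduce time: ~{seconds*1000:.0f} ms per step")
```

Real collectives fall short of this ideal, but the estimate shows why multi-hundred-GB/s links are needed to keep gradient exchange from dominating step time.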

#### <mark style="color:green;">**High-Performance Networking**</mark>

The DGX H100 system includes an impressive networking infrastructure, with <mark style="color:yellow;">**eight**</mark> single-port <mark style="color:blue;">**NVIDIA ConnectX-7 VPI adapters**</mark> offering up to <mark style="color:yellow;">**400Gb/s**</mark> of <mark style="color:blue;">**InfiniBand**</mark> or <mark style="color:blue;">**Ethernet**</mark> connectivity per port, and <mark style="color:yellow;">**two**</mark> dual-port <mark style="color:blue;">**NVIDIA ConnectX-7 VPI adapters**</mark> for additional high-speed networking capabilities.&#x20;

This networking setup allows DGX H100 systems to be interconnected to form larger, scalable AI clusters, such as <mark style="color:blue;">**NVIDIA DGX SuperPOD**</mark>, to tackle the most demanding AI and HPC challenges.
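The per-node fabric bandwidth from the eight 400Gb/s ports can be tallied directly (compute-fabric ports only; the two dual-port adapters are left out of this count):

```python
# Aggregate compute-fabric bandwidth of one DGX H100 node.
PORTS = 8
GBITS_PER_PORT = 400

total_gbits = PORTS * GBITS_PER_PORT        # 3,200 Gb/s
total_gbytes = total_gbits / 8              # 400 GB/s
print(f"{total_gbits} Gb/s ≈ {total_gbytes:.0f} GB/s per node")
```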

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FkSu4bRBXeUt0FpJjohhY%2Fimage.png?alt=media&#x26;token=110a5acb-ff4d-4f97-ba87-843c6e999157" alt="" width="563"><figcaption></figcaption></figure>

#### <mark style="color:green;">CPU and System Memory</mark>

Complementing the GPU and networking capabilities, the <mark style="color:blue;">**DGX H100**</mark> system is equipped with dual Intel Xeon Platinum 8480C processors (<mark style="color:yellow;">**56**</mark> cores each), providing a total of <mark style="color:yellow;">**112**</mark> CPU cores.

The high core count allows for *<mark style="color:yellow;">**efficient parallel processing and multitasking**</mark>*, enabling the system to handle complex computational tasks alongside the GPUs.

The Xeon Platinum 8480C processors support Intel Deep Learning Boost (Intel DL Boost) technology, which includes vectorized neural network instructions (VNNI) that accelerate AI inferencing performance.
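The kind of operation VNNI accelerates is the int8 dot product with a wide accumulator. This NumPy sketch only illustrates the arithmetic pattern; it does not invoke VNNI instructions itself:

```python
import numpy as np

# Int8 multiply with int32 accumulation - the pattern VNNI fuses into a
# single instruction on supporting Xeon CPUs. Plain NumPy here, purely to
# show the arithmetic, not the intrinsics.
a = np.array([120, -100, 64, 3], dtype=np.int8)
b = np.array([2, 2, -1, 50], dtype=np.int8)

# Widen to int32 before multiplying so products don't overflow int8.
acc = np.dot(a.astype(np.int32), b.astype(np.int32))
print(acc)  # 126
```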

#### <mark style="color:green;">**CPU Frequency**</mark>

* The Intel Xeon Platinum 8480C processors have a <mark style="color:blue;">**base frequency**</mark> of <mark style="color:yellow;">2.00</mark> GHz and a maximum boost frequency of 3.80 GHz.
* The base frequency refers to the <mark style="color:blue;">**default clock speed**</mark> at which the CPU cores operate under normal conditions.
* The maximum boost frequency is the highest clock speed that the CPU cores can reach when performing demanding tasks or when thermal conditions allow.
* The high boost frequency of 3.80 GHz enables the CPUs to deliver strong single-thread performance, which is beneficial for certain workloads that rely on single-core speed.

#### <mark style="color:green;">System Memory</mark>

* The DGX H100 system is equipped with <mark style="color:yellow;">**2TB**</mark> of system memory, which is a substantial amount for handling large datasets and memory-intensive workloads.
* The ample system memory allows for <mark style="color:yellow;">**efficient data caching**</mark>, reducing the need for frequent data transfers between storage and memory.
* With 2TB of memory, the DGX H100 system can accommodate large AI models, datasets, and intermediate results during training and inference processes.
* The high memory capacity also enables data sharing and collaboration between the CPUs and GPUs, minimising data transfer bottlenecks.
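A quick sizing exercise shows how the 2TB of host memory relates to the 640GB GPU memory pool when training a large model. The model size and the Adam-style optimizer-state accounting below are illustrative assumptions, and activations are ignored:

```python
# Rough memory accounting for training an (illustrative) 30B-parameter
# model with mixed precision and Adam-style optimizer states.
params = 30e9
weights_bf16 = params * 2                   # BF16 weights
grads_bf16 = params * 2                     # BF16 gradients
optimizer_fp32 = params * 4 * 3             # FP32 master weights + 2 moments

total_gb = (weights_bf16 + grads_bf16 + optimizer_fp32) / 1e9
gpu_memory_gb = 8 * 80                      # 640 GB across the system
host_memory_gb = 2048                       # 2 TB system memory

print(f"Training state: ~{total_gb:.0f} GB "
      f"(GPU pool: {gpu_memory_gb} GB, host: {host_memory_gb} GB)")
```

Under these assumptions the training state alone approaches the GPU memory pool, which is why the large host memory is useful for staging data, offloading, and checkpointing.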

#### <mark style="color:green;">Storage and Management</mark>

* The DGX H100 system includes two 1.92TB NVMe M.2 drives for the operating system and eight 3.84TB NVMe U.2 drives for internal storage.
* The system also features a baseboard management controller (BMC) for remote management and monitoring.
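Total raw capacity from the drives listed above:

```python
# Raw storage capacity of a DGX H100, from the drive counts/sizes above.
os_tb = 2 * 1.92          # two NVMe M.2 OS drives
data_tb = 8 * 3.84        # eight NVMe U.2 data drives
print(f"OS: {os_tb:.2f} TB, data: {data_tb:.2f} TB, "
      f"total: {os_tb + data_tb:.2f} TB raw")
```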

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FYDYJ1LNLvb9CLJ8W7uNs%2Fimage.png?alt=media&#x26;token=a7019caa-78cb-4d05-b170-9411de5f7ff0" alt=""><figcaption><p>Microsoft Azure NVIDIA DGX H100 Installation</p></figcaption></figure>

### <mark style="color:green;">Performance and Specifications</mark>

<mark style="color:blue;">**AI Performance**</mark>

* With eight H100 GPUs, the DGX H100 system achieves 32 petaFLOPS of FP8 performance, enabling faster training and inference of large-scale AI models.
* The system's high-speed NVLink and NVSwitch interconnects ensure efficient communication and collaboration between the GPUs, maximising AI throughput.

<mark style="color:blue;">**Networking Performance**</mark>

* The DGX H100 system's high-performance networking capabilities, with up to 400Gb/s InfiniBand or Ethernet connectivity per port, enable fast data transfer and communication between systems.
* This high-speed networking infrastructure allows DGX H100 systems to be scaled up to form larger AI clusters, such as NVIDIA DGX SuperPOD, for tackling the most demanding AI and HPC workloads.

<mark style="color:blue;">**System Specifications**</mark>

* The DGX H100 system has a maximum power draw of 10.2kW, so facility power and cooling must be provisioned accordingly.
* The system weighs 287.6lbs (130.45kgs) and has dimensions of 14.0in (356mm) height, 19.0in (482.2mm) width, and 35.3in (897.1mm) length.
* The operating temperature range for the DGX H100 system is 5–30°C (41–86°F).

<details>

<summary><mark style="color:green;"><strong>Keeping a rack of 5 DGX systems cool</strong></mark></summary>

<mark style="color:blue;">Power Density and Cooling Challenges</mark>

* Power density in data centres has increased significantly over the past decade, with racks consuming up to 10 times more power than before.
* Cooling such high-density racks with traditional air cooling is becoming increasingly challenging.
* With 5 DGX H100 systems in a rack, each drawing up to 10.2 kW, total power consumption can exceed 50 kW per rack. This power density necessitates advanced cooling solutions like liquid cooling.
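Using the 10.2kW per-system maximum from the specification above, the rack's power and heat load can be sized roughly as follows (the 10% overhead factor for switches, PDUs, and fans is an illustrative assumption):

```python
# Rack power and heat load for five DGX H100 systems, using the 10.2 kW
# per-system maximum from the spec. The 10% overhead for top-of-rack
# switches, PDUs, and fans is an assumed figure.
systems = 5
kw_per_system = 10.2
overhead = 1.10

rack_kw = systems * kw_per_system * overhead
btu_per_hr = rack_kw * 3412.14          # 1 kW ≈ 3,412 BTU/hr of heat to remove
print(f"Rack load: ~{rack_kw:.1f} kW (~{btu_per_hr:,.0f} BTU/hr)")
```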

<mark style="color:blue;">Liquid Cooling as a Solution</mark>

* Liquid cooling is emerging as a promising solution to the cooling challenges posed by high-density racks.
* Various liquid cooling technologies, such as immersion cooling and cold plate cooling, can effectively remove heat from high-power components.
* Liquid cooling supports higher power densities than traditional air cooling, making it suitable for racks with multiple DGX H100 systems.

<mark style="color:blue;">Infrastructure Considerations</mark>

* Implementing liquid cooling in existing data centres requires careful planning and infrastructure modifications.
* Challenges include integrating liquid cooling with existing pipework, ensuring proper leak detection and containment, and monitoring the cooling system through the building management system (BMS).
* Retrofitting liquid cooling into legacy data centres may involve operational challenges and additional costs.

<mark style="color:blue;">Standardization and Collaboration</mark>

* Standardization and collaboration among industry players are needed to drive the adoption of liquid cooling.
* Common standards and best practices for liquid cooling implementation would ease deployment and ensure compatibility across different systems.
* Collaboration with cooling solution providers, such as Stulz, is crucial for developing tailored and efficient liquid cooling solutions for high-density racks.

<mark style="color:blue;">Future-Proofing and Scalability</mark>

* Data centres should be designed with future requirements in mind, given the rapidly increasing power densities driven by AI and other advanced workloads.
* Adopting liquid cooling technology not only addresses the immediate cooling needs of high-density racks but also future-proofs the data centre infrastructure for potential expansions and technology advancements.
* Modular and scalable liquid cooling solutions are discussed as potential approaches to accommodate growing power densities and enable efficient cooling in both new and existing data centres.

Implementing liquid cooling for a rack of 5 DGX H100 systems appears to be a viable and necessary solution: the high power density of these systems requires advanced cooling methods to ensure optimal performance and reliability.

However, the decision to adopt liquid cooling should be made after careful evaluation of the existing data centre infrastructure, considering factors such as space constraints, piping layout, and compatibility with existing systems.&#x20;

Engaging with liquid cooling solution providers and industry experts can help in designing a tailored and efficient liquid cooling system for your specific requirements.

Additionally, it's crucial to consider the long-term scalability and future-proofing aspects of the liquid cooling implementation. As power densities continue to rise, the chosen liquid cooling solution should be flexible enough to accommodate potential expansions and technological advancements in the future.

</details>

