NVIDIA DGX-2
At the time of its 2018 release, traditional data centre architectures were increasingly unable to cope with the demands of modern AI workloads, which required immense computational power and high-speed interconnects to train increasingly complex models.
This challenge necessitated a paradigm shift towards more scalable and integrated systems.
NVIDIA's response to this challenge was the DGX-2, a system designed to offer unprecedented levels of compute performance and interconnect bandwidth, enabling the training of models that were previously untrainable due to hardware limitations.
The DGX-2 stood as a major leap forward: at launch, NVIDIA billed it as "the world's most powerful AI system for the most complex AI challenges."
The system carried a price tag of US$399,000.
The Evolution from DGX-1 to DGX-2
The DGX-2 expanded dramatically on the DGX-1's foundation.
Instead of eight GPUs, it packed 16 GPUs and replaced the DGX-1's point-to-point NVLink topology with NVIDIA's more scalable NVSwitch technology.
This change allowed the DGX-2 to tackle deep learning and other demanding AI and HPC workloads up to 10 times faster than the DGX-1.
The system was a behemoth, both in terms of size and capability.
It weighed in at 163.3 kg (360 lbs) and took up 10 rack units, compared to the 3 rack units of the DGX-1.
It required up to 10kW of power, a figure that rose with the introduction of the DGX-2H model, which demanded up to 12kW.
A Closer Look at the DGX-2
Here’s what made the DGX-2 stand out:
GPUs: The DGX-2 featured 16 NVIDIA Tesla V100 GPUs, double the GPU count of the DGX-1, delivering unprecedented computational power (see the topology sketch after this list).
Memory and Storage: It came with 1.5 TB of system RAM and 30 TB of high-performance NVMe storage, expandable to 60 TB.
Networking: The server was equipped with high-bandwidth network interfaces, including dual 10/25/40/50/100 GbE options and up to 8 x 100 Gb/sec InfiniBand connectivity.
CPU: At its core, the DGX-2 had two 24-core Intel Xeon Platinum 8168 processors, providing robust support for the GPUs.
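To make that topology tangible, here is a minimal Python sketch, assuming PyTorch with CUDA support, that enumerates the installed GPUs and checks peer-to-peer reachability. On a DGX-2 one would expect it to report 16 V100s, all mutually peer-accessible through the NVSwitch fabric:

```python
# Minimal sketch: survey the GPUs in a DGX-2-class machine.
# Assumes PyTorch built with CUDA support.
import torch

def survey_gpus():
    n = torch.cuda.device_count()  # 16 on a DGX-2
    for i in range(n):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 2**30:.0f} GiB")
    # With NVSwitch, every GPU pair should be directly peer-accessible.
    for i in range(n):
        for j in range(n):
            if i != j and not torch.cuda.can_device_access_peer(i, j):
                print(f"GPU {i} cannot reach GPU {j} peer-to-peer")

if __name__ == "__main__":
    survey_gpus()
```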
Performance and Impact
The DGX-2’s performance was groundbreaking, delivering 2 petaFLOPS of processing power.
This level of performance meant that the DGX-2 could match the output of 300 dual-socket Xeon servers, which would cost around $2.7 million and occupy significantly more space.
Thus, despite its high upfront cost, the DGX-2 presented a cost-effective solution for intensive AI and HPC workloads.
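The headline 2 petaFLOPS figure is simple arithmetic over the GPUs' tensor cores. A quick sanity check, using NVIDIA's quoted peak of roughly 125 TFLOPS of FP16 tensor-core throughput per V100 (the comparison table below uses a more conservative 120 TFLOPS per GPU):

```python
# Sanity check of the DGX-2's 2 petaFLOPS claim.
tflops_per_v100 = 125  # NVIDIA's peak tensor-core rating per Tesla V100
num_gpus = 16
print(f"{tflops_per_v100 * num_gpus / 1000:.0f} petaFLOPS")  # -> 2 petaFLOPS
```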
Legacy and Conclusion
Though newer systems have since superseded it, the DGX-2 represented a pinnacle of AI-focused servers at the time.
It addressed the needs of the most complex AI tasks by dramatically reducing the time and infrastructure required to train deep learning models. NVIDIA not only sold a server but delivered a comprehensive ecosystem that supported the most advanced AI research and applications.
NVIDIA NVSwitch: Revolutionising AI Network Fabric
The introduction of the NVIDIA NVSwitch represented a leap in networking technology, akin to the evolution from dial-up to broadband.
NVSwitch enables a level of model parallelism previously unattainable, providing 2.4 TB/s of bisection bandwidth, a 24-fold increase over prior generations.
This high-performance interconnect fabric lets every GPU communicate with every other GPU at full NVLink speed, making it possible to train complex models across all 16 GPUs efficiently.
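As an illustration of the traffic pattern NVSwitch accelerates, here is a hedged sketch of a single-node all-reduce over NCCL using PyTorch's torch.distributed; the launcher command and payload size are illustrative assumptions, not NVIDIA-prescribed settings:

```python
# Sketch of an all-reduce across all GPUs in one node, the collective
# pattern whose bandwidth NVSwitch multiplies. Assumes PyTorch built
# with NCCL. Launch with, e.g.:
#   torchrun --nproc_per_node=16 allreduce_sketch.py
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 1 GiB of FP16 values per GPU: a plausible gradient payload.
    payload = torch.ones(512 * 1024 * 1024, dtype=torch.float16,
                         device="cuda")
    dist.all_reduce(payload, op=dist.ReduceOp.SUM)  # routed over NVSwitch
    torch.cuda.synchronize()

    if dist.get_rank() == 0:
        # Each element is now the sum over all ranks (16.0 on a DGX-2).
        print("all-reduce result:", payload[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```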
A comparison between the DGX-2 and the DGX-1

| Component | DGX-2 | DGX-1 |
| --- | --- | --- |
| CPUs | 2 x Intel Xeon Platinum 8168 | 2 x Intel Xeon E5-2600 v4 |
| GPUs | 16 x NVIDIA Tesla V100, 32 GB HBM2 each | 8 x NVIDIA Tesla V100, 16 GB HBM2 each |
| System Memory | Up to 1.5 TB DDR4 | Up to 0.5 TB DDR4 |
| GPU Memory | 512 GB HBM2 (16 x 32 GB) | 128 GB HBM2 (8 x 16 GB) |
| Storage | 30 TB NVMe, expandable up to 60 TB | 4 x 1.92 TB NVMe |
| Networking | 8 x InfiniBand or 8 x 100 GbE | 4 x InfiniBand + 2 x 10 GbE |
| Power | 10 kW | 3.5 kW |
| Weight | 360 lbs (163.3 kg) | 134 lbs (60.8 kg) |
| GPU Throughput | Tensor: 1920 TFLOPS, FP16: 480 TFLOPS, FP32: 240 TFLOPS, FP64: 120 TFLOPS | Tensor: 960 TFLOPS, FP16: 240 TFLOPS, FP32: 120 TFLOPS, FP64: 60 TFLOPS |
| Cost | $399,000 | $149,000 |
System Specifications

| Component | Specification |
| --- | --- |
| GPUs | 16x NVIDIA® Tesla® V100 |
| GPU Memory | 512 GB total |
| Performance | 2 petaFLOPS |
| NVIDIA CUDA® Cores | 81,920 |
| NVIDIA Tensor Cores | 10,240 |
| NVSwitches | 12 |
| Maximum Power Usage | 10 kW |
| CPU | Dual Intel Xeon Platinum 8168, 2.7 GHz, 24 cores |
| System Memory | 1.5 TB |
| Network | 8x 100 Gb/sec InfiniBand/100 GigE, dual 10/25/40/50/100 GbE |
| Storage | OS: 2x 960 GB NVMe SSDs; internal: 30 TB (8x 3.84 TB) NVMe SSDs |
| Software | Ubuntu Linux OS, Red Hat Enterprise Linux OS |
| System Weight | 360 lbs (163.29 kg) |
| Packaged System Weight | 400 lbs (181.44 kg) |
| System Dimensions | Height: 17.3 in; Width: 19.0 in; Length: 31.3 in (no bezel), 32.8 in (with bezel) |
| Operating Temperature Range | 5°C to 35°C (41°F to 95°F) |