TCO of NVIDIA GPUs and falling barriers to entry
This section summarises the performance gains and total cost of ownership (TCO) improvements offered by NVIDIA's new Blackwell GPU architecture, specifically the B100, B200, and GB200 models, compared to the previous-generation Hopper GPUs (H100 and H200).
The key points and performance measurement methods are as follows:
Specification improvements
The B100, B200, and GB200 offer increased FLOPS and memory bandwidth compared to the H100 and H200.
The GB200 delivers up to 2,500 TFLOPS of FP16/BF16 compute, a 153% improvement over the H100 and H200.
Memory bandwidth increases from 3.4 TB/s (H100) and 4.8 TB/s (H200) to 8.0 TB/s in the Blackwell family.
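These deltas can be sanity-checked directly from the spec sheets. A minimal sketch in Python, assuming the commonly quoted ~989 dense FP16 TFLOPS for the H100/H200 (every number here is a public spec value, not a measurement):

```python
# Recompute the quoted generational gains from spec-sheet figures.
# Assumption: ~989 TFLOPS dense FP16 for H100/H200 (no sparsity).
specs = {
    "H100":  {"fp16_tflops": 989,  "mem_bw_tbps": 3.4},
    "H200":  {"fp16_tflops": 989,  "mem_bw_tbps": 4.8},
    "GB200": {"fp16_tflops": 2500, "mem_bw_tbps": 8.0},  # per GPU
}

base = specs["H100"]
for name, s in specs.items():
    flops_gain = s["fp16_tflops"] / base["fp16_tflops"] - 1
    bw_gain = s["mem_bw_tbps"] / base["mem_bw_tbps"] - 1
    print(f"{name}: FP16 {flops_gain:+.0%}, bandwidth {bw_gain:+.0%}")
# GB200: FP16 +153%, bandwidth +135% (both relative to the H100)
```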
Silicon area and power consumption
The Blackwell GPUs have roughly double the silicon area (~1,600 mm² with 208B transistors) of Hopper (~800 mm² with 80B transistors).
When normalised for silicon area, the gains are less impressive: the B200 delivers only a ~14% FP16 FLOPS improvement per mm² of silicon.
The GB200's ~47% improvement in FP16 TFLOPS per watt over the H100 is helpful, but not enough on its own to reach the claimed 30x inference performance without further quantization.
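A short sketch makes the normalisation concrete. The die areas, board powers, and the B200's ~2,250 dense FP16 TFLOPS below are approximate public figures and should be treated as assumptions:

```python
# Normalise raw FP16 throughput by die area and by board power.
# Assumed figures: H100 ~800 mm^2 at 700 W; B200 ~1,600 mm^2 (dual die)
# at ~2,250 dense FP16 TFLOPS; GB200 drives the same silicon harder,
# ~2,500 TFLOPS at ~1,200 W per GPU.
h100  = {"tflops": 989,  "area_mm2": 800,  "watts": 700}
b200  = {"tflops": 2250, "area_mm2": 1600, "watts": 1000}
gb200 = {"tflops": 2500, "area_mm2": 1600, "watts": 1200}

per_area = (b200["tflops"] / b200["area_mm2"]) / (h100["tflops"] / h100["area_mm2"])
per_watt = (gb200["tflops"] / gb200["watts"]) / (h100["tflops"] / h100["watts"])
print(f"B200 FP16 per mm^2 vs H100:  {per_area - 1:+.0%}")  # ~+14%
print(f"GB200 FP16 per watt vs H100: {per_watt - 1:+.0%}")  # ~+47%
```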
Quantization and number formats
Quantization across a range of number formats (FP16, BF16, FP8, FP6, FP4) is central to the higher 'headline' performance figures.
The majority of the claimed 30x inference performance gains come from quantization and architectural improvements, not just raw specifications.
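The format effect is easy to quantify: peak tensor throughput roughly doubles each time the format width halves, on hardware that supports the narrower type. An illustrative sketch (real speedups depend on whether the model tolerates the quantization):

```python
def headline_tflops(fp16_tflops: float, bits: int) -> float:
    """Scale a dense FP16 figure to a narrower format's peak rate,
    assuming throughput doubles each time the width halves."""
    return fp16_tflops * (16 / bits)

gb200_fp16 = 2500  # dense FP16 TFLOPS per GPU, from the specs above
for bits in (16, 8, 4):
    print(f"FP{bits}: {headline_tflops(gb200_fp16, bits):,.0f} TFLOPS")
# FP16: 2,500 / FP8: 5,000 / FP4: 10,000 -- a 4x headline multiple
# from the number format alone, before any architectural gains.
```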
NVL72 and 72-way parallelism stacking
The GB200 NVL72 enables a non-blocking all-to-all network among 72 GPUs, expanding the set of possible parallelism configurations.
The article compares the performance of the H200 and GB200 NVL72 using different parallelism schemes and quantization levels.
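To see how much the configuration space grows, consider a toy enumeration of (tensor, pipeline, data) parallel splits whose product fills the NVLink domain. This ignores real-world constraints (tensor parallelism must divide the attention heads, pipeline parallelism must divide the layers), so treat it as an upper bound:

```python
def parallelism_configs(n_gpus: int):
    """Yield (tp, pp, dp) triples with tp * pp * dp == n_gpus."""
    for tp in range(1, n_gpus + 1):
        if n_gpus % tp:
            continue
        rest = n_gpus // tp
        for pp in range(1, rest + 1):
            if rest % pp == 0:
                yield tp, pp, rest // pp

print(len(list(parallelism_configs(8))))   # 10 splits on an 8-GPU HGX node
print(len(list(parallelism_configs(72))))  # 60 splits on a GB200 NVL72
```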
Benchmark scenario and marketing claims
NVIDIA's claimed 30x performance gain is based on a specific benchmark scenario that favours the GB200 NVL72 by using FP4 quantization and imposing constraints that limit the performance of the H200 and B200.
When using the same quantization level (FP8) and a more balanced scenario, the performance gain is closer to 18x.
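Put as arithmetic, the gap between the two numbers isolates how much of the headline multiple comes from the benchmark setup rather than the hardware:

```python
claimed = 30.0        # GB200 NVL72 at FP4 vs a constrained H200 baseline
like_for_like = 18.0  # both systems at FP8, more balanced scenario
setup_factor = claimed / like_for_like
print(f"From quantization + scenario choice: {setup_factor:.2f}x")
# ~1.67x of the 30x headline is attributable to the benchmark setup.
```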
Profitability and TCO
The article goes on to analyse these performance gains in terms of their impact on the profitability of inference systems and on overall TCO improvements.
Fragmentation of data centres
Ease of operation
GPU clouds are significantly easier to operate than general-purpose clouds from a software perspective.
They require fewer advanced services, such as database management, block storage, and multi-tenancy security guarantees. This lower complexity reduces the barrier to entry for new providers.
Homogeneity of workloads
GPU clouds need less flexibility in terms of compute, storage, RAM, and networking compared to standard clouds.
The NVIDIA H100 GPU is considered the optimal choice for most modern use cases, including large language model (LLM) training and high-volume LLM/diffusion inference. This homogeneity simplifies infrastructure choices for both end users and cloud providers.
Total Cost of Ownership (TCO) difference
The TCO equation for CPU servers and GPU servers in a colocation environment differs significantly.
For CPU servers, hosting costs (roughly $200-$300 per month) are comparable to capital costs ($300-$400 per month). In contrast, for GPU servers, the capital costs (~$7,000 per month) far outweigh the hosting costs (~$2,000 per month).
This difference is primarily due to the very large capital cost of NVIDIA GPUs and servers.
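A minimal sketch of the split, using the monthly figures above (the midpoints of the quoted CPU ranges are an assumption):

```python
def tco_split(capital_pm: float, hosting_pm: float) -> str:
    """Summarise a monthly colocation TCO as a capital/hosting split."""
    total = capital_pm + hosting_pm
    return (f"total ${total:,.0f}/mo, capital {capital_pm / total:.0%}, "
            f"hosting {hosting_pm / total:.0%}")

print("CPU server:", tco_split(capital_pm=350, hosting_pm=250))
print("GPU server:", tco_split(capital_pm=7000, hosting_pm=2000))
# CPU server: roughly a 60/40 split -- hosting efficiency matters a lot.
# GPU server: ~78% capital -- even free hosting would cut TCO by only ~22%.
```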
Hyperscale cloud providers' advantages
Companies like Google, Amazon, and Microsoft have an advantage in optimising hosting costs through better datacentre design and operation.
They achieve lower Power Usage Effectiveness (PUE) ratios, indicating more efficient power usage. However, this advantage is less significant for GPU servers, as the capital costs dominate the TCO equation.
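The point can be quantified with a rough sensitivity check. Assuming, for simplicity, that hosting cost scales linearly with PUE, and comparing an illustrative colo-grade PUE of 1.5 to a hyperscaler-grade 1.1 (both values are assumptions):

```python
def tco_saving(capital_pm: float, hosting_pm: float,
               pue_before: float = 1.5, pue_after: float = 1.1) -> float:
    """Fraction of total monthly TCO saved by a PUE improvement,
    assuming hosting cost scales linearly with PUE."""
    hosting_after = hosting_pm * pue_after / pue_before
    return 1 - (capital_pm + hosting_after) / (capital_pm + hosting_pm)

print(f"CPU server TCO saving: {tco_saving(350, 250):.1%}")    # ~11%
print(f"GPU server TCO saving: {tco_saving(7000, 2000):.1%}")  # ~6%
```

The same facility-level improvement that cuts a CPU server's TCO by around 11% moves a GPU server's by only about 6%, which is why hyperscaler datacentre expertise buys less of a moat here.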
Third-party cloud economics
Even with relatively poor datacentre operation and high-interest debt, a colocation provider can achieve an all-in cost of about $1.525 per GPU-hour for an NVIDIA HGX H100 server.
In comparison, the most favourable GPU cloud deals are around $2 per hour per H100, with some users paying over $3 per hour. This difference in pricing allows for significant returns for cloud providers.
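The quoted figure is roughly reproducible from the monthly costs above. A back-of-the-envelope sketch (the exact $1.525 presumably reflects specific financing terms; 730 hours per month and the 8-GPU HGX baseboard are the assumptions here):

```python
HOURS_PER_MONTH = 730       # 8,760 hours / 12 months
GPUS_PER_SERVER = 8         # HGX H100 baseboard

monthly_cost = 7000 + 2000  # capital + hosting, $/month, from above
cost_per_gpu_hour = monthly_cost / (HOURS_PER_MONTH * GPUS_PER_SERVER)
print(f"All-in cost: ${cost_per_gpu_hour:.3f}/GPU-hour")  # ~$1.54

for rent in (2.0, 3.0):     # observed market rates per H100-hour
    print(f"At ${rent:.2f}/hr rental: {rent / cost_per_gpu_hour - 1:.0%} gross markup")
# ~30% markup at $2/hr and ~95% at $3/hr, before utilisation losses.
```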
In summary, the economics of GPU clouds differ markedly from those of traditional CPU-based clouds.
Because capital costs dominate GPU server TCO, the hyperscalers' traditional edge in datacentre efficiency counts for less, making it easier for new entrants to compete with established cloud providers.
The homogeneity of workloads and the lower complexity of the software stack lower the barriers to entry further. As a result, the market has seen a surge in pure-play GPU cloud providers, with the potential for significant returns even where their hosting costs are higher than the hyperscalers'.