Next-generation networking in AI environments
In the evolving landscape of artificial intelligence (AI) and machine learning (ML), the demand for high-performance, efficient, and secure networking solutions has never been greater.
As AI workloads become increasingly complex and data-intensive, traditional networking architectures struggle to keep pace, leading to suboptimal performance and reduced efficiency. NVIDIA, a leader in accelerated computing, has developed the Spectrum-X platform to address these challenges head-on.
The Spectrum-X platform is a networking solution designed specifically for AI workloads in multi-tenant cloud environments.
By combining cutting-edge technologies like RoCE Adaptive Routing, Packet Reordering, and Advanced Congestion Control with the BlueField-3 SuperNIC and Spectrum-4 switch, Spectrum-X enables organisations to harness the full potential of their AI infrastructure.
This document aims to provide a comprehensive overview of the NVIDIA Spectrum-X platform, its key components, and the innovative features that set it apart from traditional networking solutions.
We will explore the challenges faced by AI workloads in multi-tenant cloud environments and demonstrate how Spectrum-X addresses these issues to deliver unparalleled performance, efficiency, and security.
Whether you are a data scientist, an IT professional, or a business decision-maker, understanding the capabilities of the NVIDIA Spectrum-X platform is essential for optimising your AI infrastructure and staying ahead of the curve in this rapidly evolving field.
Key Characteristics of Traditional Cloud Networks and Networks for AI
| Characteristics | Traditional Ethernet-Based Clouds | AI Computing Ethernet Networking |
| --- | --- | --- |
| Application Coupling | Loosely coupled applications | Distributed, tightly coupled processing |
| Bandwidth and Utilisation | Low-bandwidth TCP flows, low utilisation | High-bandwidth RoCE flows, high utilisation |
| Tolerance to Jitter | High jitter tolerance | Low jitter tolerance |
| Traffic Type | Heterogeneous traffic, statistical multi-pathing | Bursty network capacity, elephant flows |
This table contrasts traditional cloud networks, which typically support a variety of applications with less intensive bandwidth and jitter requirements, with AI computing networks, which need high bandwidth and low latency to efficiently process tightly coupled, data-intensive tasks.
Key Features of General-Purpose CPU Systems and GPU-Accelerated Systems
| Features | General-Purpose CPU Systems | GPU-Accelerated Systems |
| --- | --- | --- |
| Processor Type | General-purpose processor handles a wide range of tasks | Specialised processor designed for parallel computation |
| Core Configuration | Usually ships with two CPUs with a few dozen cores in total | Systems with four to eight GPUs, each with tens of thousands of cores |
| Scaling | Scale-out to a few dozen nodes per workload | Workloads operate at data centre scale, up to tens of thousands of GPUs |
| Network I/O Focus | CPU-centric network I/O | GPU-centric network I/O |
This table highlights the differences in architecture and scale between traditional CPU-based systems and modern GPU-accelerated systems.
GPU systems feature highly specialised processors capable of handling massive parallel computations across many cores and nodes, contrasting with the more versatile but less parallel nature of CPU systems.
Traditional Ethernet Setups
In a conventional Ethernet environment, network efficiency varies significantly based on the physical location of nodes within a data centre.
Impact of Workload Placement: If nodes involved in a particular job are located within the same rack, they tend to perform better due to reduced network latency and higher bandwidth utilisation.
Conversely, distributing the same workload across nodes in different racks increases latency and often results in bandwidth underutilisation due to longer routing paths and potential network congestion.
Efficiency Calculation: Network efficiency in this context is typically measured as the percentage of peak bandwidth achieved during job execution. Higher internal rack communications tend to utilise closer to peak bandwidth capacities.
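To make the efficiency measure concrete, here is a minimal sketch of the percentage-of-peak calculation; the bandwidth figures are made-up examples, not benchmark results:

```python
# Illustrative only: network efficiency as the fraction of peak bandwidth
# achieved during a job. The figures below are invented, not measurements.

def network_efficiency(achieved_gbps: float, peak_gbps: float) -> float:
    """Return efficiency as a percentage of peak link bandwidth."""
    return 100.0 * achieved_gbps / peak_gbps

# Hypothetical numbers for a 400 Gb/s fabric:
same_rack = network_efficiency(achieved_gbps=380.0, peak_gbps=400.0)   # 95.0%
cross_rack = network_efficiency(achieved_gbps=240.0, peak_gbps=400.0)  # 60.0%
print(f"same rack: {same_rack:.1f}%, cross rack: {cross_rack:.1f}%")
```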
Spectrum-X Network Platform
NVIDIA Spectrum-X is an Ethernet networking platform optimised for AI.
It achieves this through the tight coupling of NVIDIA Spectrum-4, an Ethernet switch, and NVIDIA BlueField-3 SuperNIC, a network accelerator.
This solution relies on a network-aware congestion algorithm that utilises real-time telemetry data streamed from network switches to manage and prevent network congestion.
The platform was developed to provide consistently high performance across distributed workloads, regardless of node placement.
Unlike traditional Ethernet setups, Spectrum-X mitigates the variability in performance tied to node location, reportedly improving overall network efficiency by up to 60% compared with standard setups.
Spectrum-X's telemetry gathers comprehensive, high-frequency data that is leveraged to enhance data transmission and optimise network efficiency.
This high-frequency sampling is essential for revealing the bursty nature of AI networks and effectively managing congestion at the data centre level.
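As a rough illustration of why sampling frequency matters, the toy calculation below (with invented numbers) shows how a one-second average can make a heavily bursting link look almost idle, while millisecond-level samples expose the bursts:

```python
# Illustrative only: why coarse sampling hides AI microbursts.
# A link alternates between short 400 Gb/s bursts and idle gaps.
samples_ms = [400 if i % 10 < 2 else 0 for i in range(1000)]  # 1 ms samples, Gb/s

coarse_avg = sum(samples_ms) / len(samples_ms)  # looks like a calm 80 Gb/s link
peak_1ms = max(samples_ms)                      # fine-grained view: 400 Gb/s bursts

print(f"1-second average: {coarse_avg:.0f} Gb/s")  # 80 Gb/s
print(f"1-ms peak:        {peak_1ms} Gb/s")        # 400 Gb/s
```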
Role of BlueField-3 SuperNIC
The BlueField-3 SuperNIC is a cornerstone technology within the Spectrum-X platform, designed specifically to enhance the performance and efficiency of AI and hyperscale workloads.
It is a programmable network accelerator that allows users to implement customised congestion control algorithms.
It integrates an advanced Datapath Accelerator (DPA), which provides a dedicated compute engine optimised for I/O-intensive, low-latency packet processing.
The BlueField-3 SuperNIC enables secure, zero-trust VPC (Virtual Private Cloud) networking tailored for the AI compute plane.
The BlueField-3 SuperNIC supports 400Gb/s Ethernet speeds with RDMA over Converged Ethernet (RoCE), aligning with the standards for high-performance networking. Data transfer operations are handled efficiently at the network card level, offloading processing tasks from the CPU.
GPU-to-GPU Communication
In training AI models, especially those based on complex neural networks, GPUs need to exchange intermediate data frequently.
RoCE facilitates direct GPU-to-GPU communications, which is essential for parallel processing tasks where multiple GPUs across several nodes work together to process data simultaneously.
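As an application-level illustration, the sketch below uses PyTorch's NCCL backend for a multi-GPU all-reduce; on clusters whose NICs support RoCE, NCCL can carry the inter-node transfers over GPUDirect RDMA without any change to the application code. The script name and tensor contents are assumptions for the example:

```python
# Minimal multi-GPU all-reduce with PyTorch + NCCL. When the cluster NICs
# support RoCE, NCCL can use GPUDirect RDMA for inter-node transfers;
# the application code stays the same. Launch with torchrun, e.g.:
#   torchrun --nproc_per_node=4 allreduce_demo.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL handles the transport
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Each rank contributes a gradient-like tensor; all_reduce sums them
    # across every GPU in the job (a typical data-parallel training step).
    grad = torch.ones(1024, device="cuda") * rank
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if rank == 0:
        print("reduced value per element:", grad[0].item())
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```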
Energy Efficiency
The BlueField-3 SuperNIC is designed with energy efficiency in mind, featuring a sub-75-watt, half-height, PCIe form factor. It is compatible with most enterprise-class servers and facilitates effective scaling to match the number of GPUs in a system.
What are RoCE Adaptive Routing (AR) and Packet Reordering?
RoCE Adaptive Routing (AR) and Packet Reordering are techniques used by NVIDIA's Spectrum-X platform to optimise network performance and efficiency for AI workloads.
They address the limitations of traditional IP routing techniques, such as Equal-Cost Multipath (ECMP), which can lead to network congestion and inefficient load balancing, especially when dealing with the "elephant flows" common in AI training.
Elephant flows are high-bandwidth, long-duration data flows that often occur between the same pairs of GPU nodes during AI training. These flows can saturate the entire network bandwidth and persist for extended periods.
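The toy sketch below, with made-up addresses, illustrates the ECMP limitation: because the path is chosen once per flow from a hash of the 5-tuple, two elephant flows can land on the same uplink for their entire lifetime while other uplinks sit idle:

```python
# Illustrative only: static ECMP picks one path per flow by hashing the
# 5-tuple, so long-lived elephant flows can collide on the same uplink
# while other uplinks sit idle. Addresses and ports are invented.
import hashlib

NUM_PATHS = 4

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    five_tuple = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/{proto}".encode()
    return int(hashlib.md5(five_tuple).hexdigest(), 16) % NUM_PATHS

flows = [
    ("10.0.0.1", "10.0.1.1", 4791, 4791),  # GPU node A -> B (RoCE v2 UDP port)
    ("10.0.0.2", "10.0.1.2", 4791, 4791),  # GPU node C -> D
    ("10.0.0.3", "10.0.1.3", 4791, 4791),  # GPU node E -> F
]
for f in flows:
    print(f, "-> path", ecmp_path(*f))
# Every packet of a flow follows the same path for the flow's lifetime;
# if two elephant flows hash to one path, that link saturates.
```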
RoCE Adaptive Routing and Packet Reordering are a smart way to manage network traffic for AI workloads, making sure data gets where it needs to go quickly and efficiently. It's like having a really good traffic controller for your network.
Think of your network as a bunch of roads connecting different parts of a city (in this case, the city is your AI system with lots of GPUs). Some roads are bigger than others, and sometimes there's a lot of traffic that needs to get from one part of the city to another.
Now, imagine there are a few big trucks (elephant flows) that take up a lot of space on the roads. These trucks carry important stuff for AI, like data for training models. They need to get to their destination fast, but they can cause traffic jams if they're not managed well.
The old way of managing traffic is like having a rule that says "always split up the trucks evenly across all the roads." But this doesn't always work well, because sometimes the trucks still end up causing traffic jams on some roads while others are empty.
RoCE Adaptive Routing is like having a smart traffic controller that looks at the whole city and decides, for each truck, which road it should take based on how busy the roads are. It might send some parts of a truck down one road, and other parts down another road, to keep things moving smoothly.
But now the trucks might arrive at their destination with their parts all mixed up! That's where Packet Reordering comes in. It's like having a really efficient team at the destination that can take all the mixed-up parts and put the trucks back together again super fast.
So, with RoCE Adaptive Routing and Packet Reordering, your AI system can handle big data flows more efficiently, making sure everything gets where it needs to go quickly and smoothly. This helps your AI workloads run faster and better, without getting stuck in network traffic jams.
In summary, RoCE Adaptive Routing and Packet Reordering, enabled by the integration of the Spectrum-4 switch and BlueField-3 SuperNIC, deliver high network performance and efficiency in AI workloads.
By dynamically routing packets on a per-packet basis and reordering them at the receiving end, available network paths are used efficiently, congestion is minimised, and consistent performance is ensured, accelerating Ethernet-based AI workloads.
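As a conceptual model only (the real mechanism runs in Spectrum-4 and BlueField-3 hardware), the sketch below shows the two halves of the technique with toy queues: per-packet selection of the least-loaded path on send, and sequence-number reassembly on receive:

```python
# Conceptual model only: per-packet path selection plus receive-side
# reordering. This is a mental model with toy queues, not the hardware
# implementation.
import heapq
import random

NUM_PATHS = 4
path_load = [0] * NUM_PATHS          # pretend per-path queue depth

def send(seq: int) -> tuple[int, int]:
    """Spray each packet onto the least-loaded path (adaptive routing)."""
    path = min(range(NUM_PATHS), key=lambda p: path_load[p])
    path_load[path] += 1
    return (seq, path)

def deliver(packets):
    """Packets arrive out of order; reassemble by sequence number."""
    heap = []
    for seq, path in packets:
        path_load[path] -= 1
        heapq.heappush(heap, seq)
    return [heapq.heappop(heap) for _ in range(len(heap))]

sent = [send(seq) for seq in range(8)]
random.shuffle(sent)                  # different paths => out-of-order arrival
print(deliver(sent))                  # [0, 1, 2, 3, 4, 5, 6, 7]
```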
Advanced Congestion Control
Congestion occurs when a network becomes overwhelmed with data, causing slowdowns that hinder the performance of AI training and inference tasks.
Ethernet networks are inherently prone to congestion, and managing it is particularly challenging in AI environments.
Advanced Congestion Control is therefore a critical component in creating efficient networks for AI tasks.
Key Points
Traditional networks using TCP/IP employ flow control and sliding window techniques to stop the sender from overwhelming the receiver with too much data. However, these approaches aren't ideal for AI workloads.
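For reference, here is a deliberately simplified toy model of the sliding-window idea, in which the sender may only keep a fixed window of unacknowledged bytes in flight; all numbers are arbitrary:

```python
# Toy model of TCP-style sliding-window flow control: the sender may only
# have `window` unacknowledged bytes in flight, so the receiver's advertised
# window paces the sender. Numbers are arbitrary.
def sliding_window_send(total_bytes: int, window: int, segment: int) -> int:
    next_to_send, last_acked = 0, 0
    while last_acked < total_bytes:
        # Send while the amount in flight stays within the window.
        while next_to_send < total_bytes and next_to_send - last_acked < window:
            next_to_send += min(segment, total_bytes - next_to_send)
        # Simplification: the receiver acknowledges everything sent so far.
        last_acked = next_to_send
    return last_acked

print(sliding_window_send(total_bytes=1_000_000, window=65_535, segment=1_460))
```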
AI networks use RoCE (RDMA over Converged Ethernet) for GPU-to-GPU communication.
RoCE requires networks with low latency and high reliability. As a result, these networks need advanced congestion control methods to effectively handle network traffic when congestion happens.
In addition, AI clouds are often shared by multiple users simultaneously, in what is known as a multi-tenant environment. If one user's job causes congestion, it can create a domino effect, increasing delays and decreasing available bandwidth for other AI tasks.
AI model training also has a unique, bursty traffic pattern because of collective operations, in which many GPU nodes work together to distribute the workload. This burstiness makes standard congestion control methods less effective.
DCQCN (Data Center Quantized Congestion Notification) does not work...
DCQCN (Data Center Quantized Congestion Notification) is a technique used in many cloud environments to proactively detect and respond to network congestion.
It uses ECN (Explicit Congestion Notification) marking to warn sending devices about potential congestion before data packets are lost. However, DCQCN might not be sufficient for generative AI clouds, where traffic patterns are extremely bursty.
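The following simplified sketch captures the reactive feedback loop that DCQCN builds on: the switch marks packets once a queue passes a threshold, and the sender cuts its rate when notified, then recovers gradually. The constants are illustrative, not real DCQCN parameters:

```python
# Simplified sketch of the ECN feedback loop behind DCQCN: the switch marks
# packets once its queue passes a threshold, and the sender reduces its rate
# when marks arrive, then recovers gradually. Constants are invented.
ECN_THRESHOLD = 100      # queue depth (packets) that triggers marking
rate = 400.0             # sender rate, Gb/s

def switch_marks(queue_depth: int) -> bool:
    return queue_depth > ECN_THRESHOLD

def sender_update(rate: float, cnp_received: bool) -> float:
    if cnp_received:                 # CNP = congestion notification packet
        return rate * 0.5            # multiplicative decrease
    return min(rate + 10.0, 400.0)   # additive recovery toward line rate

for queue_depth in [20, 150, 180, 90, 40]:
    rate = sender_update(rate, switch_marks(queue_depth))
    print(f"queue={queue_depth:4d}  rate={rate:6.1f} Gb/s")
```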
How does Spectrum-X deal with it?
NVIDIA's Spectrum-X platform offers a solution for advanced congestion control, made possible by combining the BlueField-3 SuperNIC and Spectrum-4 switch.
Telemetry data is critical...
Spectrum-X's telemetry technology collects comprehensive, high-frequency data about the network's performance and health.
Then the network-aware congestion algorithm uses this real-time telemetry data from the network switches to manage and prevent congestion.
The Spectrum-4 switch's in-band telemetry capabilities keep the sender's BlueField-3 SuperNIC informed about the current network usage status, sending prompt alerts when congestion starts to build up. The SuperNIC then adjusts transmission rates accordingly to stop further congestion from occurring.
BlueField-3 SuperNICs run the congestion control algorithm, handling millions of congestion control events per second with microsecond reaction times and making accurate rate adjustment decisions.
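By way of contrast with the reactive ECN loop sketched earlier, here is a conceptual sketch of telemetry-driven control, in which the sender adjusts its rate from streamed queue and link utilisation before queues overflow. The thresholds and update rules are invented for illustration; the production algorithm runs in BlueField-3 hardware at microsecond timescales:

```python
# Conceptual sketch of telemetry-driven congestion control: the switch
# streams fine-grained queue telemetry in-band, and the sending NIC adjusts
# its rate *before* queues overflow. Thresholds and rules are invented.
from dataclasses import dataclass

@dataclass
class Telemetry:
    queue_utilisation: float   # 0.0 .. 1.0, reported by the switch
    link_utilisation: float    # 0.0 .. 1.0

LINE_RATE = 400.0  # Gb/s

def adjust_rate(rate: float, t: Telemetry) -> float:
    if t.queue_utilisation > 0.7:        # congestion building: back off early
        return rate * (1.0 - t.queue_utilisation / 2)
    if t.link_utilisation < 0.9:         # headroom available: ramp back up
        return min(rate * 1.05, LINE_RATE)
    return rate

rate = 300.0
for t in [Telemetry(0.2, 0.6), Telemetry(0.8, 0.95), Telemetry(0.3, 0.7)]:
    rate = adjust_rate(rate, t)
    print(f"queue={t.queue_utilisation:.1f} -> rate={rate:6.1f} Gb/s")
```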
You can make your own network algorithms!
The BlueField-3 SuperNIC emphasises full programmability, enabling users to create and implement custom congestion control algorithms tailored to their specific AI workloads and data centre network layouts.
This is achievable through the SuperNIC's advanced Datapath Accelerator (DPA) and the DOCA (Data Center Infrastructure-on-a-Chip Architecture) programming model.
Secure Networking
Multi-tenant cloud environments, where multiple users share the same physical infrastructure, require strict isolation of tenant traffic to ensure data privacy and prevent unauthorised access.
Traditionally, general-purpose clouds use various network technologies like virtual private clouds (VPCs) to achieve this isolation.
However, AI clouds introduce additional complexity due to their dedicated AI compute networks, which demand high-throughput and low-latency connectivity for GPU servers.
Networking solutions that rely solely on CPUs are not sufficient for the high-performance connectivity needed in AI compute networks.
Moreover, many AI cloud environments offer bare-metal as-a-service (BMaaS), making it impractical to deploy tenant networking software directly on the compute nodes.
To address this issue, bare-metal cloud environments often use EVPN (Ethernet VPN) and VXLAN (Virtual Extensible LAN) on network switches to establish tenant isolation. While this provides a solution for AI compute networks, it lacks advanced features like access-lists and security groups, and it doesn't scale well when expanding to tens of thousands of GPUs.
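To show the isolation primitive involved, the toy sketch below encapsulates frames with a 24-bit VXLAN Network Identifier (VNI) following the RFC 7348 header layout, so that a tenant's endpoint only decapsulates frames on its own segment; everything beyond the header arithmetic is simplified:

```python
# Toy illustration of VXLAN tenant isolation: each tenant's traffic is
# encapsulated with a 24-bit VNI, and the egress endpoint only decapsulates
# frames whose VNI matches its own segment. Header layout follows RFC 7348;
# the rest of the encapsulation (outer UDP/IP) is omitted for brevity.
import struct

VXLAN_FLAGS = 0x08          # "VNI present" flag per RFC 7348

def encapsulate(vni: int, inner_frame: bytes) -> bytes:
    header = struct.pack("!BBHI", VXLAN_FLAGS, 0, 0, vni << 8)
    return header + inner_frame

def decapsulate(tenant_vni: int, packet: bytes) -> bytes | None:
    flags, _, _, vni_field = struct.unpack("!BBHI", packet[:8])
    if flags & VXLAN_FLAGS and (vni_field >> 8) == tenant_vni:
        return packet[8:]   # frame belongs to this tenant's segment
    return None             # other tenants' traffic stays invisible

pkt = encapsulate(vni=1001, inner_frame=b"tenant-A gradient sync")
print(decapsulate(1001, pkt))   # b'tenant-A gradient sync'
print(decapsulate(2002, pkt))   # None: isolated from tenant B
```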
This is where NVIDIA's BlueField-3 SuperNIC comes in. It empowers cloud architects to implement secure, zero-trust VPC networking tailored specifically for AI compute planes.
The BlueField-3 SuperNIC leverages accelerated switching and packet processing (ASAP2) technology, enabling a combination of software-defined and hardware-accelerated network connectivity.
The NVIDIA ASAP2 technology stack offers a range of network acceleration capabilities and full programmability through the DOCA FLOW SDK, delivering significantly faster performance compared to non-accelerated network environments.
Out of the box, the BlueField-3 SuperNIC provides two paths for creating secure, multi-tenant, and high-performance AI compute network environments:
- OVN/OVS-based SDN acceleration solution
- EVPN (Ethernet VPN)-based network solution
While both SDN and EVPN VXLAN create multi-tenant networks, they differ in their approach.
SDN centralises control and abstracts network resources, while EVPN VXLAN distributes control using a BGP-based control plane coupled with MAC learning.
The BlueField-3 SuperNIC offloads and accelerates both SDN and EVPN-based solutions, with the software stack running exclusively on the SuperNIC.
One of the key security features of the BlueField-3 SuperNIC is its inline encryption acceleration, which operates at speeds of up to 400Gb/s. This acceleration engine is compatible with other inline accelerations, allowing AI cloud builders to encrypt all East-West communications within the AI compute network.
By encrypting traffic between servers in the same data centre, the BlueField-3 SuperNIC adds an extra layer of protection against cyber threats and enhances the overall security posture of the AI platform. Developers can use the DOCA IPsec software library's API to enable BlueField-accelerated flow encryption and decryption.
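As a purely software illustration of the concept (the DOCA IPsec API itself is not shown here), the sketch below applies AES-GCM, the authenticated cipher family typically used by IPsec ESP, to an East-West payload using Python's `cryptography` package; keys, addresses, and payloads are invented:

```python
# Software illustration of the idea behind inline East-West encryption:
# authenticated encryption applied to traffic between two servers. On
# BlueField-3 this work is offloaded to hardware via the DOCA IPsec
# library; this snippet only demonstrates the concept in software.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # shared via key exchange (e.g. IKE)
aead = AESGCM(key)

def encrypt_eastwest(payload: bytes, header: bytes) -> tuple[bytes, bytes]:
    nonce = os.urandom(12)                 # must be unique per packet
    return nonce, aead.encrypt(nonce, payload, header)

def decrypt_eastwest(nonce: bytes, ciphertext: bytes, header: bytes) -> bytes:
    return aead.decrypt(nonce, ciphertext, header)

hdr = b"10.0.0.1->10.0.1.9"                # authenticated but not encrypted
nonce, ct = encrypt_eastwest(b"activations shard 7", hdr)
print(decrypt_eastwest(nonce, ct, hdr))    # b'activations shard 7'
```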
The BlueField-3 SuperNIC is particularly well suited for securing and accelerating VPC networking in bare-metal, multi-tenant AI clouds.
Its integrated compute subsystem within the network I/O path provides a secure foundation for deploying tenant networking solutions and enforcing fine-grained network policies. This further strengthens the security of the AI cloud platform as a whole.
Conclusion
The NVIDIA Spectrum-X platform, with its advanced networking technologies like RoCE Adaptive Routing, Packet Reordering, and Advanced Congestion Control, provides a powerful solution for the unique challenges faced by AI workloads in multi-tenant cloud environments. By leveraging the capabilities of the BlueField-3 SuperNIC and Spectrum-4 switch, Spectrum-X enables high-performance, efficient, and secure networking for AI applications, ensuring optimal GPU utilisation and faster time-to-insight.