Next-generation networking in AI environments

In the evolving landscape of artificial intelligence (AI) and machine learning (ML), the demand for high-performance, efficient, and secure networking solutions has never been greater.

As AI workloads become increasingly complex and data-intensive, traditional networking architectures struggle to keep pace, leading to suboptimal performance and reduced efficiency. NVIDIA, a leader in accelerated computing, has developed the Spectrum-X platform to address these challenges head-on.

The Spectrum-X platform is a networking solution designed specifically for AI workloads in multi-tenant cloud environments.

By combining cutting-edge technologies like RoCE Adaptive Routing, Packet Reordering, and Advanced Congestion Control with the BlueField-3 SuperNIC and Spectrum-4 switch, Spectrum-X enables organisations to harness the full potential of their AI infrastructure.

This document aims to provide a comprehensive overview of the NVIDIA Spectrum-X platform, its key components, and the innovative features that set it apart from traditional networking solutions.

We will explore the challenges faced by AI workloads in multi-tenant cloud environments and demonstrate how Spectrum-X addresses these issues to deliver unparalleled performance, efficiency, and security.

Whether you are a data scientist, IT professional, or business decision-maker, understanding the capabilities of the NVIDIA Spectrum-X platform is essential if you want to optimise your AI infrastructure and stay ahead of the curve in this rapidly evolving field.

Key Characteristics of Traditional Cloud Networks and Networks for AI

| Characteristic | Traditional Ethernet-based Clouds | AI Computing Ethernet Networking |
| --- | --- | --- |
| Application coupling | Loosely coupled applications | Distributed, tightly coupled processing |
| Bandwidth and utilisation | Low-bandwidth TCP flows, low utilisation | High-bandwidth RoCE flows, high utilisation |
| Tolerance to jitter | High jitter tolerance | Low jitter tolerance |
| Traffic type | Heterogeneous traffic, statistical multi-pathing | Bursty network capacity, elephant flows |

This table contrasts traditional cloud networking, which typically supports a variety of applications with less intensive bandwidth and jitter requirements, with AI computing networks, which need high bandwidth and low latency to efficiently process tightly coupled, data-intensive tasks.

Key Features of General-Purpose CPU Systems and GPU-Accelerated Systems

| Feature | General-Purpose CPU Systems | GPU-Accelerated Systems |
| --- | --- | --- |
| Processor type | General-purpose processor handles a wide range of tasks | Specialised processor designed for parallel computation |
| Core configuration | Usually ships with two CPUs, a few dozen cores in total | Four to eight GPUs per system, each with tens of thousands of cores |
| Scaling | Scale-out to a few dozen nodes per workload | Workloads operate at data-centre scale, up to tens of thousands of GPUs |
| Network I/O focus | CPU-centric network I/O | GPU-centric network I/O |

This table highlights the differences in architecture and scale between traditional CPU-based systems and modern GPU-accelerated systems.

GPU systems feature highly specialised processors capable of handling massive parallel computations across many cores and nodes, contrasting with the more versatile but less parallel nature of CPU systems.

Traditional Ethernet Setups

In a conventional Ethernet environment, network efficiency varies significantly based on the physical location of nodes within a data centre.

Impact of Workload Placement: If nodes involved in a particular job are located within the same rack, they tend to perform better due to reduced network latency and higher bandwidth utilisation.

Conversely, distributing the same workload across nodes in different racks increases latency and often results in bandwidth underutilisation due to longer routing paths and potential network congestion.

Efficiency Calculation: Network efficiency in this context is typically measured as the percentage of peak bandwidth achieved during job execution. Jobs whose communication stays largely within a rack tend to achieve close to peak bandwidth.
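
To make this measure concrete, here is a minimal sketch in Python; the bandwidth figures are hypothetical and serve only to show how placement shows up in the metric.

```python
# Network efficiency as a fraction of peak bandwidth.
# The bandwidth figures below are hypothetical, for illustration only.

def network_efficiency(achieved_gbps: float, peak_gbps: float) -> float:
    """Return achieved bandwidth as a percentage of peak."""
    return 100.0 * achieved_gbps / peak_gbps

PEAK_GBPS = 400.0  # e.g. a 400 Gb/s RoCE link

# Hypothetical measurements for the same job in two placements.
intra_rack = network_efficiency(achieved_gbps=380.0, peak_gbps=PEAK_GBPS)
cross_rack = network_efficiency(achieved_gbps=260.0, peak_gbps=PEAK_GBPS)

print(f"intra-rack: {intra_rack:.0f}% of peak")   # 95% of peak
print(f"cross-rack: {cross_rack:.0f}% of peak")   # 65% of peak
```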

Spectrum-X Network Platform

NVIDIA Spectrum-X is an Ethernet networking platform optimised for AI.

It achieves this through the tight coupling of NVIDIA Spectrum-4, an Ethernet switch, and NVIDIA BlueField-3 SuperNIC, a network accelerator.

This solution relies on a network-aware congestion algorithm that utilises real-time telemetry data streamed from network switches to manage and prevent network congestion.

The platform was developed to provide consistently high performance across distributed workloads, regardless of node placement.

Unlike traditional Ethernet setups, Spectrum-X mitigates the variability in performance tied to node location, effectively enhancing overall network efficiency—reportedly outperforming standard setups by up to 60%.

Spectrum-X's telemetry gathers comprehensive, high-frequency data that is leveraged to enhance data transmission and optimise network efficiency.

This high-frequency sampling is essential for revealing the bursty nature of AI networks and effectively managing congestion at the data centre level.
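
As a rough illustration of why sampling frequency matters, the sketch below runs a synthetic utilisation trace through coarse and fine sampling; the intervals and burst pattern are assumptions, not Spectrum-X specifics.

```python
# Toy illustration of why high-frequency telemetry matters for bursty
# AI traffic: coarse sampling averages away microbursts that fine
# sampling catches. The traffic trace is synthetic.

# Link utilisation sampled every 10 microseconds: mostly idle,
# with short bursts at line rate.
trace = [0.05] * 100
for start in (20, 55, 80):          # three microbursts
    for t in range(start, start + 5):
        trace[t] = 1.0              # 100% utilisation during the burst

fine = max(trace)                                   # 10 us sampling
coarse = max(sum(trace[i:i + 50]) / 50              # 500 us averages
             for i in range(0, len(trace), 50))

print(f"peak seen at 10 us sampling:  {fine:.0%}")    # 100%
print(f"peak seen at 500 us sampling: {coarse:.0%}")  # ~24%
```

The coarse samples suggest a nearly idle link even though it saturates repeatedly, which is exactly the blind spot high-frequency telemetry removes.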

Role of BlueField-3 SuperNIC

The BlueField-3 SuperNIC is a cornerstone technology within the Spectrum-X platform, designed specifically to enhance the performance and efficiency of AI and hyperscale workloads.

It is a programmable network accelerator that allows users to implement customised congestion control algorithms.

It integrates an advanced Datapath Accelerator (DPA), which provides a dedicated compute engine optimised for I/O-intensive and low-code packet processing.

The BlueField-3 SuperNIC enables secure, zero-trust VPC (Virtual Private Cloud) networking tailored for the AI compute plane.

The BlueField-3 SuperNIC supports 400Gb/s Ethernet speeds via RDMA over Converged Ethernet (RoCE), ensuring that data transfer operations are handled efficiently at the network card level and offloading processing tasks from the CPU.

GPU-to-GPU Communication

In training AI models, especially those based on complex neural networks, GPUs need to exchange intermediate data frequently.

RoCE facilitates direct GPU-to-GPU communications, which is essential for parallel processing tasks where multiple GPUs across several nodes work together to process data simultaneously.
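
For a back-of-envelope feel for the bandwidth these exchanges demand, the sketch below applies the standard ring all-reduce transfer formula; the payload size, GPU count, and link speed are illustrative assumptions.

```python
# Back-of-envelope estimate of ring all-reduce time over RoCE.
# In a standard ring all-reduce, each GPU sends and receives
# 2 * (N - 1) / N * S bytes for a payload of S bytes across N GPUs.
# All figures below are illustrative assumptions.

def ring_allreduce_seconds(payload_bytes: float, n_gpus: int,
                           link_gbps: float) -> float:
    bytes_on_wire = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    link_bytes_per_s = link_gbps * 1e9 / 8
    return bytes_on_wire / link_bytes_per_s

# 10 GB of gradients, 8 GPUs, one 400 Gb/s RoCE link per GPU.
t = ring_allreduce_seconds(10e9, n_gpus=8, link_gbps=400.0)
print(f"ideal transfer time: {t * 1e3:.1f} ms")  # ~350 ms
```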

Energy Efficiency

The BlueField-3 SuperNIC is designed with energy efficiency in mind, featuring a sub-75-watt, half-height PCIe form factor. It is compatible with most enterprise-class servers and scales effectively to match the number of GPUs in a system.

What is RoCE Adaptive Routing (AR) and Packet Reordering?

RoCE Adaptive Routing (AR) and Packet Reordering is a technique used by NVIDIA's Spectrum-X platform to optimise network performance and efficiency for AI workloads.

It addresses the limitations of traditional IP routing techniques, such as Equal Cost Multipath (ECMP), which can lead to network congestion and inefficient load balancing, especially when dealing with "elephant flows" common in AI training.

Elephant flows are high-bandwidth, long-duration data flows that often occur between the same pairs of GPU nodes during AI training. These flows can saturate the entire network bandwidth and persist for extended periods.

RoCE Adaptive Routing and Packet Reordering is a smart way to manage network traffic for AI workloads, making sure data gets where it needs to go quickly and efficiently. It's like having a really good traffic controller for your network.

Think of your network as a bunch of roads connecting different parts of a city (in this case, the city is your AI system with lots of GPUs). Some roads are bigger than others, and sometimes there's a lot of traffic that needs to get from one part of the city to another.

Now, imagine there are a few big trucks (elephant flows) that take up a lot of space on the roads. These trucks carry important stuff for AI, like data for training models. They need to get to their destination fast, but they can cause traffic jams if they're not managed well.

The old way of managing traffic is like having a rule that says "always split up the trucks evenly across all the roads." But this doesn't always work well, because sometimes the trucks still end up causing traffic jams on some roads while others are empty.

RoCE Adaptive Routing is like having a smart traffic controller that looks at the whole city and decides, for each truck, which road it should take based on how busy the roads are. It might send some parts of a truck down one road, and other parts down another road, to keep things moving smoothly.

But now the trucks might arrive at their destination with their parts all mixed up! That's where Packet Reordering comes in. It's like having a really efficient team at the destination that can take all the mixed-up parts and put the trucks back together again super fast.

So, with RoCE Adaptive Routing and Packet Reordering, your AI system can handle big data flows more efficiently, making sure everything gets where it needs to go quickly and smoothly. This helps your AI workloads run faster and better, without getting stuck in network traffic jams.

In summary, RoCE Adaptive Routing and Packet Reordering, enabled by the integration of the Spectrum-4 switch and BlueField-3 SuperNIC, deliver high network performance and efficiency in AI workloads.

By routing traffic dynamically on a per-packet basis and reordering packets at the receiving end, Spectrum-X makes full use of the available network paths, minimises congestion, and ensures consistent performance, accelerating Ethernet-based AI workloads.
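
To make the mechanism concrete, the following toy simulation contrasts per-flow ECMP hashing with per-packet adaptive routing plus receiver-side reordering; the link count, flow sizes, and load metric are assumptions and do not reflect the actual Spectrum-X algorithm.

```python
# Toy simulation: static ECMP hashing versus per-packet adaptive
# routing with receiver-side reordering. Purely illustrative.
import random

N_LINKS = 4
# Three elephant flows of 1000 packets each, between fixed node pairs.
flows = {f"flow-{i}": 1000 for i in range(3)}

# --- Static ECMP: every packet of a flow hashes to the same link,
# --- so elephant flows can collide while other links sit idle.
ecmp_load = [0] * N_LINKS
for flow_id, n_packets in flows.items():
    link = hash(flow_id) % N_LINKS          # per-flow hash
    ecmp_load[link] += n_packets

# --- Per-packet adaptive routing: each packet takes the least-loaded
# --- link and carries a sequence number for reordering on arrival.
adaptive_load = [0] * N_LINKS
received = []                                # (seq, payload) pairs
seq = 0
for flow_id, n_packets in flows.items():
    for _ in range(n_packets):
        link = min(range(N_LINKS), key=lambda l: adaptive_load[l])
        adaptive_load[link] += 1
        received.append((seq, f"{flow_id}-pkt"))
        seq += 1

random.shuffle(received)                     # packets arrive out of order
in_order = [p for _, p in sorted(received)]  # receiver reorders by seq

print("ECMP link load:    ", ecmp_load)      # often badly skewed
print("Adaptive link load:", adaptive_load)  # evenly balanced
print("first packets after reorder:", in_order[:3])
```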

Advanced Congestion Control

Congestion occurs when the network becomes overwhelmed with data, causing slowdowns that hinder AI training and inference. Ethernet networks are inherently prone to congestion, and managing it is particularly challenging in AI environments.

Advanced congestion control is therefore a critical component of efficient networks for AI workloads.

Key Points

Traditional networks using TCP/IP employ flow control and sliding window techniques to stop the sender from overwhelming the receiver with too much data. However, these approaches aren't ideal for AI workloads.

AI networks use RoCE (RDMA over Converged Ethernet) for GPU-to-GPU communication.

RoCE requires networks with low latency and high reliability. As a result, these networks need advanced congestion control methods to effectively handle network traffic when congestion happens.

AI clouds are also often used by multiple users simultaneously, in what is known as a multi-tenant environment. If one user's job causes congestion, it can create a domino effect, increasing delays and reducing the bandwidth available to other AI tasks.

Another reality is that AI model training has a unique, bursty traffic pattern because of collective operations, in which many GPU nodes work together to distribute the workload. This burstiness makes standard congestion control methods less effective.

DCQCN (Data Center Quantized Congestion Notification) does not work...

DCQCN (Data Center Quantized Congestion Notification) is a technique used in many cloud environments to proactively detect and respond to network congestion.

It uses Explicit Congestion Notification (ECN) marking to warn sending devices about potential congestion before data packets are lost. However, DCQCN might not be sufficient for generative AI clouds, where traffic patterns are extremely bursty.
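
For intuition, here is a highly simplified sketch of a DCQCN-style sender reacting to congestion notifications; the gain and recovery step are illustrative assumptions, not the values used in real deployments.

```python
# Highly simplified sketch of a DCQCN-style sender reaction to ECN
# congestion notifications. Constants are illustrative only.

LINE_RATE_GBPS = 400.0
G = 0.0625          # gain for the alpha estimate (assumed)

rate = LINE_RATE_GBPS
alpha = 1.0         # running estimate of how persistent congestion is

def on_congestion_notification():
    """Receiver saw ECN marks: cut rate multiplicatively."""
    global rate, alpha
    alpha = (1 - G) * alpha + G          # congestion seen this period
    rate = rate * (1 - alpha / 2)        # multiplicative decrease

def on_quiet_period():
    """No ECN marks for a while: decay alpha and recover rate."""
    global rate, alpha
    alpha = (1 - G) * alpha
    rate = min(LINE_RATE_GBPS, rate + 5.0)   # additive recovery (assumed)

for _ in range(3):
    on_congestion_notification()
print(f"rate after 3 notifications: {rate:.0f} Gb/s")   # 50 Gb/s
for _ in range(10):
    on_quiet_period()
print(f"rate after recovery: {rate:.0f} Gb/s")          # 100 Gb/s
```

Because the reaction is driven only by ECN marks that appear after queues have already built up, a loop like this can lag behind the microsecond-scale bursts typical of AI training traffic.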

How does Spectrum-X deal with it?

NVIDIA's Spectrum-X platform offers a solution for advanced congestion control, made possible by combining the BlueField-3 SuperNIC and Spectrum-4 switch.

Telemetry data is critical...

Spectrum-X's telemetry technology collects comprehensive, high-frequency data about the network's performance and health.

The network-aware congestion control algorithm then uses this real-time telemetry from the network switches to manage and prevent congestion.

The Spectrum-4 switch's in-band telemetry capabilities keep the sender's BlueField-3 SuperNIC informed about the current network usage status, sending prompt alerts when congestion starts to build up. The SuperNIC then adjusts transmission rates accordingly to stop further congestion from occurring.

BlueField-3 SuperNICs run the congestion control algorithm, handling millions of congestion control events per second with microsecond reaction times and making accurate rate adjustment decisions.
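
A minimal sketch of the idea, assuming a simple proportional control law on queue-occupancy telemetry (the target, gain, and samples are made up, not NVIDIA's algorithm):

```python
# Conceptual sketch of telemetry-driven congestion control: the
# switch streams queue occupancy, and the sending NIC adjusts its
# transmission rate before packets are dropped. Illustrative only.

LINE_RATE_GBPS = 400.0
TARGET_QUEUE = 0.3      # keep switch queues ~30% full (assumed)

def adjust_rate(current_rate: float, queue_occupancy: float) -> float:
    """Proportional controller on in-band telemetry readings."""
    error = TARGET_QUEUE - queue_occupancy
    new_rate = current_rate * (1.0 + 0.5 * error)
    return max(1.0, min(LINE_RATE_GBPS, new_rate))

rate = LINE_RATE_GBPS
# Telemetry samples: queues fill as congestion builds, then drain.
for occupancy in [0.1, 0.4, 0.7, 0.9, 0.5, 0.2]:
    rate = adjust_rate(rate, occupancy)
    print(f"queue {occupancy:.0%} -> send at {rate:.0f} Gb/s")
```

The key property is that the sender throttles while queues are merely filling, rather than waiting for drops or end-to-end timeouts.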

You can make your own network algorithms!

The BlueField-3 SuperNIC emphasises full programmability, enabling users to create and implement custom congestion control algorithms tailored to their specific AI workloads and data centre network layouts.

This is achievable through the SuperNIC's advanced Datapath Accelerator (DPA) and the DOCA (Data Center Infrastructure-on-a-Chip Architecture) programmable model.

Secure Networking

Multi-tenant cloud environments, where multiple users share the same physical infrastructure, require strict isolation of tenant traffic to ensure data privacy and prevent unauthorised access.

Traditionally, general-purpose clouds use various network technologies like virtual private clouds (VPCs) to achieve this isolation.

However, AI clouds introduce additional complexity due to their dedicated AI compute networks, which demand high-throughput and low-latency connectivity for GPU servers.

Networking solutions that rely solely on CPUs are not sufficient for the high-performance connectivity needed in AI compute networks.

Moreover, many AI cloud environments offer bare-metal as-a-service (BMaaS), making it impractical to deploy tenant networking software directly on the compute nodes.

To address this issue, bare-metal cloud environments often use EVPN (Ethernet VPN) and VXLAN (Virtual Extensible LAN) on network switches to establish tenant isolation. While this provides a solution for AI compute networks, it lacks advanced features like access-lists and security groups, and it doesn't scale well when expanding to tens of thousands of GPUs.

This is where NVIDIA's BlueField-3 SuperNIC comes in. It empowers cloud architects to implement secure, zero-trust VPC networking tailored specifically for AI compute planes.

The BlueField-3 SuperNIC leverages accelerated switching and packet processing (ASAP2) technology, enabling a combination of software-defined and hardware-accelerated network connectivity.

The NVIDIA ASAP2 technology stack offers a range of network acceleration capabilities and full programmability through the DOCA FLOW SDK, delivering significantly faster performance compared to non-accelerated network environments.

Out of the box, the BlueField-3 SuperNIC provides two paths for creating secure, multi-tenant, and high-performance AI compute network environments:

  1. An SDN acceleration solution

  2. An EVPN (Ethernet VPN)-based network solution

While both SDN and EVPN VXLAN create multi-tenant networks, they differ in their approach.

SDN centralises control and abstracts network resources, while EVPN VXLAN distributes control using a BGP-based control plane coupled with MAC learning.

The BlueField-3 SuperNIC offloads and accelerates both SDN and EVPN-based solutions, with the software stack running exclusively on the SuperNIC.
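
For a concrete sense of how VXLAN keeps tenants apart, the snippet below packs the 8-byte VXLAN header defined in RFC 7348 with a distinct 24-bit VXLAN Network Identifier (VNI) per tenant; the tenant VNIs are hypothetical.

```python
# Minimal illustration of VXLAN tenant isolation: each tenant's
# frames are encapsulated with its own 24-bit VNI, so overlapping
# tenant networks never mix. Header layout per RFC 7348.
import struct

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header: flags (I bit set), reserved, 24-bit VNI."""
    assert 0 <= vni < 2**24
    return struct.pack("!II", 0x08000000, vni << 8)

tenant_a = vxlan_header(vni=10001)   # hypothetical tenant VNIs
tenant_b = vxlan_header(vni=10002)

print(tenant_a.hex())   # 0800000000271100
print(tenant_b.hex())   # 0800000000271200
```

The 24-bit VNI allows roughly 16 million distinct tenant segments on one underlay, which is why VXLAN rather than plain VLANs (12-bit IDs) is used at this scale.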

One of the key security features of the BlueField-3 SuperNIC is its inline encryption acceleration, which operates at speeds of up to 400Gb/s. This acceleration engine is compatible with other inline accelerations, allowing AI cloud builders to encrypt all East-West communications within the AI compute network.

By encrypting traffic between servers in the same data centre, the BlueField-3 SuperNIC adds an extra layer of protection against cyber threats and enhances the overall security posture of the AI platform. Developers can use the DOCA IPsec software library's API to enable BlueField-accelerated flow encryption and decryption.

The BlueField-3 SuperNIC is particularly well-suited for securing and accelerating VPC networking in bare-metal, multi-tenant AI clouds.

Its integrated compute subsystem within the network I/O path provides a secure foundation for deploying tenant networking solutions and enforcing fine-grained network policies. This further strengthens the security of the AI cloud platform as a whole.

Conclusion

The NVIDIA Spectrum-X platform, with its advanced networking technologies like RoCE Adaptive Routing, Packet Reordering, and Advanced Congestion Control, provides a powerful solution to the unique challenges faced by AI workloads in multi-tenant cloud environments.

By leveraging the capabilities of the BlueField-3 SuperNIC and Spectrum-4 switch, Spectrum-X enables high-performance, efficient, and secure networking for AI applications, ensuring optimal GPU utilisation and faster time-to-insight.


Figures:
  • NVIDIA Spectrum-X Networking Platform
  • BlueField-3 SuperNIC with GPUDirect RoCE enables direct GPU-to-GPU communication
  • Spectrum-4 detects congestion spots in real time; BlueField-3 adjusts the transmission rate