InfiniBand versus Ethernet

Networking Technologies

InfiniBand and Ethernet are both networking technologies used for data communication, but they have different origins, architectures, and target applications.

Definitions

InfiniBand

InfiniBand is a high-performance, low-latency interconnect standard designed for connecting servers, storage systems, and other data centre components. It was developed specifically for high-performance computing (HPC) and other data-intensive applications.

Ethernet

Ethernet is a widely-used, general-purpose networking technology that connects devices in local area networks (LANs) and wide area networks (WANs). It was initially designed for office environments and has evolved to support a wide range of applications and speeds.

Technology Standards

Both InfiniBand and Ethernet are technology standards, not specific products. They define the rules and specifications for communication between devices in a network.

Various vendors develop and manufacture products (such as network adapters, switches, and cables) that adhere to these standards.

The Components of InfiniBand

Channel Adapters: These are like the on-ramps and off-ramps of the superhighway. They help computers and devices get on and off the InfiniBand network. There are two types:

  • Host Channel Adapters (HCAs): used by servers or storage devices to connect to the InfiniBand network.

  • Target Channel Adapters (TCAs): used by specialised devices, usually for storage.

Switches: These are like the traffic lights of the superhighway. They make sure data goes where it needs to go quickly and efficiently.

Routers: If you want to connect multiple InfiniBand networks together (like connecting multiple superhighways), you use routers. They help data move between the different networks.

Cables and Connectors: These are like the roads of the superhighway. They physically connect everything together.

The smallest complete InfiniBand network is called a "subnet," and you can connect multiple subnets together using routers to create a huge InfiniBand network. It's like connecting multiple cities with superhighways.

What makes InfiniBand special is that it's really fast (low-latency), can move a lot of data quickly (high-bandwidth), and is easy to manage (low-management cost).

It's perfect for connecting a lot of computers together (clustering), moving data between computers (communications), storing data (storage), and managing everything (management) - all in one network.

So, in a nutshell, InfiniBand is a super-fast, efficient way for computers and devices to talk to each other, making it easier to build big, powerful computer systems.

Key Differences

Performance

It is argued that InfiniBand offers lower latency and higher throughput than Ethernet, making it more suitable for performance-critical applications like HPC and AI workloads.

This is why InfiniBand has historically been the go-to networking solution for HPC and AI workloads - low latency, high bandwidth, and deterministic performance characteristics.

Nonetheless, the Ultra Ethernet Consortium, led by companies like Broadcom, Cisco, and Intel, is pushing for the adoption of Ethernet in AI networking. They argue that modern Ethernet can offer similar, if not better, performance compared to InfiniBand at a lower cost.

It is true that Ethernet has made strides with technologies like RDMA over Converged Ethernet (RoCE), but it still lacks the same level of performance as InfiniBand. Studies have shown that to achieve comparable performance, Ethernet needs to operate at speeds 1.3 times faster than InfiniBand.

So while Ethernet has caught up with InfiniBand in terms of raw bandwidth, with both technologies offering 400 Gbps speeds, InfiniBand still maintains an edge in terms of latency and deterministic performance.
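
As a back-of-envelope illustration of that 1.3x figure, assuming 400 Gbps links as cited above:

```python
# Back-of-envelope check of the 1.3x figure cited above (illustrative only).
INFINIBAND_GBPS = 400      # top-end link speed mentioned in the text
ETHERNET_RATIO = 1.3       # speed ratio reported by the cited studies

required_ethernet_gbps = INFINIBAND_GBPS * ETHERNET_RATIO
print(f"To match {INFINIBAND_GBPS} Gbps InfiniBand, Ethernet needs "
      f"~{required_ethernet_gbps:.0f} Gbps of raw bandwidth.")   # ~520 Gbps
```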

Remote Direct Memory Access (RDMA)

As highlighted, InfiniBand natively supports RDMA, which is one of the reasons it has historically been preferred for HPC and AI workloads.

Ethernet, on the other hand, has traditionally relied on TCP/IP for data transport, which involves more overhead and higher latency.

However, more recent Ethernet standards, such as RoCE (RDMA over Converged Ethernet), have added support for RDMA over Ethernet networks, allowing Ethernet to achieve lower latency and higher throughput than traditional TCP/IP-based Ethernet.
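
To make the TCP overhead concrete, here is a minimal round-trip microbenchmark using only the Python standard library. It exercises the kernel socket path that RDMA verbs bypass; the host, port, message size, and round count are arbitrary assumptions, and loopback numbers are illustrative rather than a fabric measurement.

```python
# Minimal TCP round-trip microbenchmark over loopback. Every message here
# crosses the kernel socket stack in both directions -- the overhead that
# RDMA avoids by writing directly into the remote application's memory.
import socket
import threading
import time

HOST, PORT, MSG, ROUNDS = "127.0.0.1", 9999, b"x" * 64, 1000

def echo_server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind((HOST, PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            while data := conn.recv(len(MSG)):   # b"" means client closed
                conn.sendall(data)

threading.Thread(target=echo_server, daemon=True).start()
time.sleep(0.2)                                  # let the server start

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))
    cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    start = time.perf_counter()
    for _ in range(ROUNDS):
        cli.sendall(MSG)
        cli.recv(len(MSG))       # small messages arrive whole on loopback
    rtt_us = (time.perf_counter() - start) / ROUNDS * 1e6
    print(f"mean TCP round trip on loopback: {rtt_us:.1f} us")
```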

Reliability

In the context of networking, reliability refers to a network's ability to consistently deliver data without errors or loss. Two key aspects of reliability are fabric behavior and flow control.

Fabric behavior describes the overall structure and performance of a network.

In a reliable fabric, data is delivered consistently and without loss, even in the presence of network congestion or device failures.

For example, InfiniBand provides a lossless fabric, ensuring that no data is lost during transmission, regardless of network conditions. This is achieved through end-to-end flow control, which prevents data loss by signalling the sender to slow down or stop sending data when the receiver is unable to process incoming data at the same rate.

On the other hand, Ethernet, in its basic form, is a best-effort delivery system, meaning that it will attempt to deliver data but does not guarantee successful delivery.

However, recent Ethernet standards, such as priority flow control (PFC), have added support for lossless behavior, allowing Ethernet to provide more reliable data delivery, similar to InfiniBand.

InfiniBand provides a lossless fabric with built-in flow control and congestion management mechanisms. It guarantees reliable data delivery and maintains a consistent level of performance, even under heavy load.
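
To see how this style of backpressure prevents loss, here is a toy, self-contained simulation of credit-based flow control, the link-level scheme InfiniBand uses. The buffer size, drain rate, and step count are arbitrary assumptions.

```python
# Toy model of credit-based, lossless flow control (simplified to one
# sender and one receiver with fixed-size packets). The sender transmits
# only while it holds credits, so the receiver's buffer can never
# overflow and nothing is ever dropped.
from collections import deque

BUFFER_SLOTS = 4                  # receiver buffer capacity (assumption)

credits = BUFFER_SLOTS            # one credit per free buffer slot
rx_buffer = deque()
sent = drained = 0

for step in range(20):
    # Sender: transmit only if a credit is available.
    if credits > 0:
        credits -= 1
        rx_buffer.append(f"pkt-{sent}")
        sent += 1
    else:
        print(f"step {step:2d}: sender paused - lossless backpressure")

    # Receiver: drain one packet every other step (a slow consumer),
    # returning a credit for each freed buffer slot.
    if step % 2 == 1 and rx_buffer:
        rx_buffer.popleft()
        drained += 1
        credits += 1

print(f"sent={sent}, drained={drained}, dropped=0 (by construction)")
```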

The work to make Ethernet the leading network fabric for the modern data centre

The goal is to make Ethernet the leading network fabric for modern data centres and high-performance computing by providing features that are currently more prevalent in other technologies like InfiniBand.

The IEEE 802.1 working group is collaborating with industry partners, including chip vendors (Broadcom, Intel), system vendors (Dell, HP, Huawei), and data centre network operators, to gather requirements and develop these standards.

Congestion Isolation (802.1Qcz)

  • Problem: In current data centre networks, when congestion occurs, priority flow control (PFC) is used to prevent packet loss. However, PFC can lead to head-of-line blocking and congestion spreading.

  • Solution: Congestion isolation moves flows that are causing congestion into separate queues, allowing smaller flows to pass through without being blocked. This technique works in conjunction with higher-layer end-to-end congestion control mechanisms like ECN and TCP. A toy queue model follows this list.

  • Benefits: Congestion isolation helps to eliminate head-of-line blocking, makes more intelligent use of switch memory, and reduces the need for PFC.
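
The sketch below models that idea: once a flow's backlog crosses a threshold, it is diverted to its own queue so short flows queued behind it keep moving. The threshold, flow names, and traffic pattern are arbitrary assumptions, not values from the standard.

```python
# Toy model of congestion isolation: a persistently congesting "elephant"
# flow is moved to a dedicated queue so it no longer head-of-line blocks
# the short "mouse" flow. (Drain-side bookkeeping is omitted for brevity.)
from collections import Counter, deque

ISOLATE_AT = 3                    # backlog threshold (assumption)
shared, slow_lane = deque(), deque()
backlog, isolated = Counter(), set()

def enqueue(flow, seq):
    backlog[flow] += 1
    if flow in isolated or backlog[flow] > ISOLATE_AT:
        isolated.add(flow)        # divert the congesting flow
        slow_lane.append((flow, seq))
    else:
        shared.append((flow, seq))

# An elephant flow floods the switch, then a mouse flow sends two packets.
for i in range(8):
    enqueue("elephant", i)
enqueue("mouse", 0)
enqueue("mouse", 1)

# The shared queue drains first: the mouse waits behind only the few
# elephant packets admitted before isolation kicked in, not all eight.
for queue, label in ((shared, "shared  "), (slow_lane, "isolated")):
    while queue:
        print(label, *queue.popleft())
```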

PFC Enhancements (802.1Qdt)

  • Problem: PFC, introduced over a decade ago, has some configuration complexities and compatibility issues with MACsec encryption.

  • Solutions:

    a. Automatic calculation of headroom: Headroom is the extra buffer space needed to absorb in-flight data when a sender is told to stop transmitting. The enhancement proposes using LLDP to measure delays and automatically calculate the optimal headroom, reducing memory waste and configuration complexity. A simplified sizing sketch follows this list.

    b. Protection of PFC frames with MACsec: The enhancement specifies a new shim layer to allow the encryption of PFC frames, enabling compatibility between older and newer implementations of MACsec.

  • Benefits: These enhancements simplify PFC deployment and enable its use in encrypted networks, such as when running RDMA between data centres.
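
Below is the simplified headroom-sizing model referred to above. The real 802.1Qdt mechanism derives the delay inputs from LLDP measurements; the cable length, propagation delay, and MTU here are illustrative assumptions, not values from the standard.

```python
# Simplified PFC headroom sizing. After a pause frame is sent, data keeps
# arriving for roughly one link round trip, plus up to one maximum-size
# frame already being serialised at each end of the link.
LINK_GBPS = 100                 # link speed
CABLE_M = 100                   # cable length (assumption)
PROP_NS_PER_M = 5               # ~5 ns/m in copper or fibre (assumption)
MTU_BYTES = 9216                # jumbo frames (assumption)

rtt_s = 2 * CABLE_M * PROP_NS_PER_M * 1e-9
in_flight = LINK_GBPS * 1e9 / 8 * rtt_s
headroom = in_flight + 2 * MTU_BYTES

print(f"link round trip: {rtt_s * 1e9:.0f} ns")
print(f"headroom per priority: ~{headroom / 1024:.1f} KiB")   # ~30 KiB here
```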

Source Flow Control (802.1Qdw)

  • Problem: PFC can cause head-of-line blocking and congestion spreading throughout the network.

  • Solution: Source flow control detects congestion in the network and sends a message back to the source node to pause or control the flow. It includes a proxy mode where the top-of-rack switch can intercept the message and convert it to a PFC message, allowing gradual deployment without requiring immediate upgrades to all servers and NICs.

  • Benefits: Source flow control provides the benefits of PFC (simple flow control) but at the source, reducing latency and avoiding head-of-line blocking. The proxy mode simplifies deployment, and the signalling message carries rich flow information for advanced flow control techniques.

Definitions and Acronyms

  1. PFC: Priority Flow Control

  2. ECN: Explicit Congestion Notification

  3. TCP: Transmission Control Protocol

  4. MACsec: Media Access Control Security

  5. RDMA: Remote Direct Memory Access

  6. LLDP: Link Layer Discovery Protocol

  7. NIC: Network Interface Card

  8. IEEE: Institute of Electrical and Electronics Engineers

Technical Terms

  1. Head-of-line blocking: A phenomenon where a packet at the head of a queue blocks the transmission of subsequent packets, even if those packets are destined for a different, uncongested output port. This can lead to increased latency and reduced throughput.

  2. Congestion spreading: When congestion in one part of the network causes congestion in other parts of the network due to the propagation of backpressure or flow control signals. This can lead to a cascade effect, degrading overall network performance.

  3. Priority Flow Control (PFC): A link-level flow control mechanism defined in IEEE 802.1Qbb that allows the receiver to pause traffic on a per-priority basis. When a receiver's buffer is full, it sends a PFC frame to the sender, instructing it to pause transmission for a specific priority.

  4. Explicit Congestion Notification (ECN): A mechanism defined in RFC 3168 that allows end-to-end congestion notification without dropping packets. ECN-capable routers and switches mark packets when congestion is imminent, and the receiver echoes this information back to the sender, which can then reduce its transmission rate.

  5. MACsec (Media Access Control Security): An IEEE 802.1AE standard that provides hop-by-hop data confidentiality, integrity, and origin authentication for Ethernet frames. MACsec encrypts and authenticates the entire Ethernet frame, including the header and payload.

  6. Headroom: The extra buffer space reserved in a switch or router to accommodate in-flight packets when a sender is instructed to stop transmitting due to congestion. Adequate headroom is necessary to prevent packet loss during the time it takes for the sender to receive and act upon the pause frame.

  7. Link Layer Discovery Protocol (LLDP): A vendor-neutral link layer protocol defined in IEEE 802.1AB that allows network devices to advertise their identity, capabilities, and neighbours on a local area network. LLDP can be used to automate network provisioning, troubleshooting, and management.

  8. Remote Direct Memory Access (RDMA): A technology that enables direct memory access from the memory of one computer into that of another without involving either computer's operating system. RDMA offers high-throughput, low-latency networking, which is crucial for high-performance computing and storage applications.

  9. Shim layer: In networking, a shim is a layer of abstraction that sits between two other layers to provide compatibility or additional functionality. In the context of PFC enhancements, a new shim layer is proposed to enable the encryption of PFC frames using MACsec, ensuring compatibility between older and newer implementations.

These technical terms and acronyms are essential for understanding the proposed enhancements to Ethernet for high-performance data centres.

By addressing issues like head-of-line blocking, congestion spreading, and providing features like congestion isolation, PFC enhancements, and source flow control, these standards aim to improve the performance, efficiency, and deployment of Ethernet in modern data centre environments.

Scalability

Scalability refers to a network's ability to grow and accommodate increasing amounts of data and devices without compromising performance or reliability.

InfiniBand is designed to scale exceptionally well, thanks to its switched fabric architecture, allowing for efficient scaling of AI clusters. It supports a large number of nodes (up to 48,000) and enables the creation of high-performance, low-latency interconnects between GPUs, which is essential for distributed AI training and inference.

Features like the Subnet Manager (SM) and forwarding path calculation also add value.

Understanding the Subnet Manager in InfiniBand Networks

The Subnet Manager (SM) is essential in managing the operational aspects of an InfiniBand network. Its primary responsibilities include:

  • Discovering Network Topology: Identifying the structure and layout of the network.

  • Assigning Local Identifiers (LIDs): Every port connected to the network is assigned a unique LID for identification and routing purposes.

  • Calculating and Programming Switch Forwarding Tables: Ensuring data packets are correctly routed through the network.

  • Programming Partition Key (PKey) Tables: These are used at Host Channel Adapters (HCAs) and switches for partitioning and secure communication.

  • Programming QoS Tables: This includes setting up Service Level to Virtual Lane mapping tables and Virtual Lane arbitration tables to manage quality of service across the network.

  • Monitoring Changes in the Fabric: Keeping track of any alterations within the network's structure or operations.

In InfiniBand networks, there is typically more than one SM, but only one acts as the Master SM at any given time, with others in standby mode. If the Master SM fails, one of the Standby SMs automatically takes over.
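
To ground the forwarding-table responsibility, the sketch below does in miniature what an SM does after discovery: assign LIDs, then compute each switch's LID-to-egress-port map. The two-switch, two-HCA fabric is a made-up example; production SMs such as OpenSM use far richer routing engines.

```python
# Miniature Subnet Manager: assign LIDs, then compute each switch's
# LID -> egress-port forwarding table by breadth-first search.
from collections import deque

# Topology as adjacency: node -> {port_number: neighbour}
fabric = {
    "sw1":   {1: "hca_a", 2: "sw2"},
    "sw2":   {1: "sw1",   2: "hca_b"},
    "hca_a": {1: "sw1"},
    "hca_b": {1: "sw2"},
}

# Step 1: assign a unique LID to every port-bearing node.
lids = {node: lid for lid, node in enumerate(sorted(fabric), start=1)}

# Step 2: for each switch, find which egress port reaches each LID.
def forwarding_table(switch):
    table = {}
    for port, first_hop in fabric[switch].items():
        seen, queue = {switch, first_hop}, deque([first_hop])
        while queue:                      # everything reachable via this port
            node = queue.popleft()
            table.setdefault(lids[node], port)
            for nbr in fabric[node].values():
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
    return table

for sw in ("sw1", "sw2"):
    print(sw, "->", forwarding_table(sw))   # e.g. sw1 -> {1: 1, 4: 2, 2: 2}
```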

These features allow InfiniBand to support tens of thousands of nodes without the complex configuration and loop-prevention protocols that Ethernet requires.

The argument is that traditional Ethernet has limited scalability due to its reliance on broadcast-and-flood forwarding and the need for complex protocols like spanning tree to prevent network loops. This can lead to performance degradation and reduced efficiency as the network grows.

Some argue, however, that recent advancements in Ethernet have addressed these limitations and improved its scalability.

Technologies like VXLAN (Virtual Extensible LAN) and SDN (software-defined networking) have been introduced to tackle these challenges.

VXLAN allows Ethernet networks to scale to millions of nodes by encapsulating Ethernet frames within UDP packets, while SDN separates the network control plane from the data plane, enabling more flexible and scalable network configuration and management.

Nonetheless, some networking experts remain adamant that the packet-based nature of Ethernet can lead to congestion and performance degradation as cluster size grows.

Understanding VXLAN: Virtual eXtensible Local Area Network

Overview

Virtual Extensible Local Area Network (VXLAN) is a network virtualisation technology that addresses the scalability problems associated with large cloud computing deployments.

It allows for the creation of a large number of virtualised Layer 2 networks, often referred to as "overlay networks", across a Layer 3 network infrastructure.

Originally defined by the Internet Engineering Task Force (IETF) in RFC 7348, VXLAN plays a crucial role in the construction of scalable and secure multitenant cloud environments.

How VXLAN Works

VXLAN operates by encapsulating a traditional Layer 2 Ethernet frame into a User Datagram Protocol (UDP) packet.

This encapsulation extends the Layer 2 network over a Layer 3 network by segmenting the traffic through a VXLAN Network Identifier (VNI), akin to how VLAN uses VLAN IDs.

Each VXLAN segment is isolated from the others, ensuring that data traffic within one segment remains private from another, much like apartments in a building.

The main components involved in VXLAN architecture include:

VXLAN Tunnel Endpoint (VTEP): Devices that perform the encapsulation and decapsulation of Ethernet frames into and out of VXLAN packets. VTEPs are identified by their IP addresses and are usually implemented within hypervisors or physical switches.

VXLAN Header: Added to the Ethernet frame during encapsulation, it includes a 24-bit VNI, significantly increasing the number of potential VLANs from the traditional 4096 to over 16 million.
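
The encapsulation itself is simple enough to show directly. The sketch below packs the 8-byte VXLAN header defined in RFC 7348 using only the standard library; in a real packet this header sits inside a UDP datagram to port 4789, followed by the inner Ethernet frame. The VNI value is an arbitrary example.

```python
# Packing the 8-byte VXLAN header from RFC 7348: one flag byte (0x08
# marks a valid VNI), three reserved bytes, a 24-bit VNI, one reserved
# byte.
import struct

def vxlan_header(vni: int) -> bytes:
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit field")
    return struct.pack("!II", 0x08 << 24, vni << 8)

print(vxlan_header(vni=5001).hex())         # 0800000000138900
print(f"{2**24:,} VNIs versus {2**12:,} traditional VLAN IDs")
```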

Key Advantages of VXLAN

  • Scalability: By extending the address space for segment IDs, VXLAN can support up to 16 million virtual networks, compared to the 4096 VLANs supported by standard Ethernet.

  • Flexibility: VXLAN can be used over any IP network, including across data centres and various geographical locations, without being restricted by the underlying network topology.

  • Improved Security: Each VXLAN segment is isolated, enhancing security by segregating different organisational units or tenants within the same physical infrastructure.

Problems VXLAN Solves

  • Network Segmentation: In environments where multiple tenants must coexist on the same physical network, such as in cloud data centres, VXLAN allows for effective isolation and segmentation.

  • Geographical Independence: VXLAN makes it possible to stretch Layer 2 networks across geographically dispersed data centres, facilitating disaster recovery and load balancing.

  • Overcomes VLAN Limitations: Traditional VLAN IDs are limited to 4096, insufficient for large-scale cloud environments. VXLAN addresses this by allowing for up to 16 million network identifiers.

Applications of VXLAN

VXLAN is particularly useful in the following scenarios:

  • Data Centres: Enables cloud service providers to create a scalable and segmented network architecture.

  • Multi-tenant Environments: Helps in isolating traffic in scenarios where multiple users or customers share the same physical infrastructure.

  • Over Data Centre Interconnects (DCI): Useful for extending Layer 2 services across different data centres.

VXLAN and EVPN

Ethernet VPN (EVPN) often pairs with VXLAN to provide a robust control plane that manages VXLAN overlay networks in large-scale deployments. EVPN enhances VXLAN by providing dynamic learning of network endpoints and multicast optimisation, which are not available in VXLAN deployments that use static provisioning.

Conclusion

VXLAN technology represents a significant advancement in network virtualisation, offering scalability, flexibility, and security for modern data centre and cloud environments.

As networks continue to grow in size and complexity, VXLAN provides an effective solution for managing vast amounts of traffic while maintaining strict isolation and segmentation of data.

The consensus seems to be that despite these advancements, InfiniBand still maintains an advantage in terms of scalability, particularly in large-scale, high-performance computing environments where low latency and high bandwidth are critical.

Compatibility and Ecosystem

One of the challenges in replacing InfiniBand with Ethernet is the familiarity and expertise that HPC professionals have with InfiniBand. The InfiniBand ecosystem is well-established and purpose-built for HPC and AI workloads.

The following tools and libraries are central to that ecosystem:

  1. Message Passing Interface (MPI): This is a standardised and portable message-passing system designed to function on a variety of parallel computing architectures. MPI libraries like Open MPI are often optimised for InfiniBand to enhance communication speeds and efficiency in cluster environments.

  2. NVIDIA Collective Communications Library (NCCL): Optimised for multi-GPU and multi-node communication, NCCL leverages InfiniBand's high throughput and low latency characteristics to accelerate training in deep learning environments.

  3. RDMA (Remote Direct Memory Access) libraries: These allow direct memory access from the memory of one computer into that of another without involving either one's operating system. This enables high-throughput, low-latency networking, which is critical for performance in large-scale computing environments.

  4. GPUDirect: This suite of technologies from NVIDIA provides various methods for direct device-to-device communication via PCIe and InfiniBand interconnects, enhancing data transfer speeds and reducing latency.

  5. Intel’s Performance Scaled Messaging 2 (PSM2): This protocol is designed to exploit the features of high-performance networks like InfiniBand in large-scale HPC clusters, providing reliable transport and high-bandwidth capabilities.

These tools and libraries are critical for developers working in HPC and AI, as they provide necessary functionalities that harness the full potential of InfiniBand's network capabilities.
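
To illustrate the first item, the sketch below uses mpi4py to run an allreduce, the collective operation at the heart of distributed training. The application code is fabric-agnostic: on an InfiniBand cluster, an MPI library such as Open MPI carries it over RDMA transparently. The script filename in the run command is hypothetical.

```python
# Minimal MPI collective: each rank contributes a value and every rank
# receives the global sum. On an InfiniBand cluster this communication
# rides over RDMA without any change to the code.
# Run with e.g.: mpirun -n 4 python allreduce_demo.py  (hypothetical name)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_value = rank + 1                       # each rank's contribution
total = comm.allreduce(local_value, op=MPI.SUM)

print(f"rank {rank}: global sum = {total}")  # identical on every rank
```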

That said, Ethernet has a much larger installed base and a wider ecosystem of compatible devices and software due to its long history and widespread adoption - though not in AI and HPC.

Ethernet's main advantage lies in its flexibility and ubiquity. Most data centres and cloud providers already use Ethernet extensively, and having a single networking technology across the entire infrastructure could simplify management and reduce costs. So, in theory, the 'switching cost' of moving to Ethernet should be modest if it offers the same performance as InfiniBand.

In summary, while InfiniBand offers superior performance and reliability for AI and HPC workloads, Ethernet's widespread adoption, extensive ecosystem, and lower costs make it the primary choice for most general-purpose networking applications.

One View on why Ethernet will win

In mid-2023, Ram Velaga of Broadcom argued that Ethernet is the technology of choice for GPU clusters. He argued that Ethernet is the only networking technology needed, even for building large-scale GPU clusters with 1,000 or more GPUs.

Why?

  1. Ethernet has proven its ability to adapt and meet new requirements over multiple decades.

  2. Recent requirements for clustered GPUs include low latencies, controlled tail latency, and the ability to build large topologies without causing idle time on GPUs.

  3. Ethernet has evolved to provide capabilities such as losslessness and congestion management, making it suitable for large-scale GPU clusters.

  4. Ethernet is ubiquitous, with 600 million ports shipped and sold annually, leading to rapid innovation and economies of scale that are not available with alternative technologies like InfiniBand.

  5. The scale of Ethernet and the presence of multiple players in the market drive innovation and provide economic advantages.

  6. Ethernet offers both the technical capabilities and the economics needed to build large-scale AI/ML clusters.

In summary, the argument is that Ethernet's adaptability, recent advancements, ubiquity, and economic advantages make it the best choice for building large-scale GPU clusters, and that alternative technologies like InfiniBand are unnecessary.

The Future

While Ethernet is making inroads, InfiniBand's entrenched position and purpose-built design make it a formidable incumbent.

The ultimate outcome will depend on factors such as the rate of Ethernet's performance improvements, the willingness of HPC professionals to adopt new technologies, and the strategic decisions made by major vendors and cloud providers.

But the decision may not be up to the professionals; there are numerous reasons why you might consider Ethernet for a new network build. InfiniBand, despite its many advantages for high-performance applications, faces several significant shortcomings that influence its broader adoption:

  1. High Cost: InfiniBand's network components, like cards and switches, are substantially more expensive than Ethernet equivalents, making it less economically viable for many sectors.

  2. Elevated O&M Expenses: Operating and maintaining an InfiniBand network requires specialised skills due to its unique infrastructure, which can lead to higher operational costs and challenges in finding qualified personnel.

  3. Vendor Lock-in: The use of proprietary protocols in InfiniBand equipment restricts interoperability with other technologies and can lead to dependency on specific vendors.

  4. Long Lead Times: Delays in the availability of InfiniBand components can pose risks to project timelines and scalability.

  5. Slow Upgrade Cycle: Dependence on vendor-specific upgrade cycles can slow down network improvements, affecting overall network performance and adaptability.

The counter view

Despite these advancements in Ethernet technology and the 'shortcomings' of InfiniBand listed above, there is a range of technical and practical reasons why InfiniBand remains the preferred choice for high-performance AI workloads.

  1. In the eyes of HPC technicians, Ethernet has not yet proven its performance credentials.

  2. While Ethernet has made improvements in terms of bandwidth and switch capacity, it still faces challenges when it comes to supporting the massive scale required by AI workloads.

  3. InfiniBand has a well-established ecosystem in the high-performance computing (HPC) and AI domains. Many AI frameworks, libraries, and tools are optimised for InfiniBand.

  4. Transitioning to Ethernet would require significant effort in terms of software adaptation and optimisation. Existing AI pipelines and workflows would need to be modified to work efficiently with Ethernet, which could be a time-consuming and costly process.

  5. InfiniBand's lossless fabric, built-in flow control, and congestion management guarantee reliable data delivery and consistent performance, even under heavy load. Despite Ethernet's advances in this area, many experts remain unconvinced.

  6. While Ethernet is generally considered more cost-effective than InfiniBand, the cost difference becomes less significant when considering the total cost of ownership (TCO) for AI infrastructures. The higher performance and efficiency of InfiniBand can lead to better resource utilisation and reduced overall costs.

In conclusion, while Ethernet has made significant advancements, InfiniBand remains the preferred choice for AI computing due to its superior performance, scalability, ecosystem compatibility, reliability, and QoS guarantees.

The debate will continue

In conclusion, InfiniBand and Ethernet are both powerful networking technologies with their own strengths and weaknesses.

While InfiniBand has been the dominant choice for AI and HPC workloads due to its superior performance and reliability, Ethernet is rapidly closing the gap with recent advancements in speed, features, and scalability.

As the demand for high-performance networking in AI and HPC continues to grow, the competition between these two technologies will likely intensify.

The ultimate winner will depend on a complex interplay of factors, including technological advancements, industry adoption, and strategic decisions by key players.

Regardless of the outcome, one thing is certain: the future of high-performance networking will be shaped by the ongoing battle between InfiniBand and Ethernet, with significant implications for the growth and development of AI and HPC applications in the years to come.
