
InfiniBand versus Ethernet

Networking Technologies

InfiniBand and Ethernet are both networking technologies used for data communication, but they have different origins, architectures, and target applications.

Definitions

InfiniBand

InfiniBand is a high-performance, low-latency interconnect standard designed for connecting servers, storage systems, and other data centre components. It was developed specifically for high-performance computing (HPC) and other data-intensive applications.

NVIDIA Mellanox LinkX InfiniBand DAC Cables

Ethernet

Ethernet is a widely-used, general-purpose networking technology that connects devices in local area networks (LANs) and wide area networks (WANs). It was initially designed for office environments and has evolved to support a wide range of applications and speeds.

We all know what an Ethernet cable looks like!

Technology Standards

Both InfiniBand and Ethernet are technology standards, not specific products. They define the rules and specifications for communication between devices in a network.

Various vendors develop and manufacture products (such as network adapters, switches, and cables) that adhere to these standards.

The Components of InfiniBand

Channel Adapters: These are like the on-ramps and off-ramps of the superhighway. They help computers and devices get on and off the InfiniBand network. There are two types:

  • Host Channel Adapters (HCAs): These are used by things like servers or storage devices to connect to the InfiniBand network.

  • Target Channel Adapters (TCAs): These are used by special devices, usually for storage.

Switches: These are like the traffic lights of the superhighway. They make sure data goes where it needs to go quickly and efficiently.

Routers: If you want to connect multiple InfiniBand networks together (like connecting multiple superhighways), you use routers. They help data move between the different networks.

Cables and Connectors: These are like the roads of the superhighway. They physically connect everything together.

The smallest complete InfiniBand network is called a "subnet," and you can connect multiple subnets together using routers to create a huge InfiniBand network. It's like connecting multiple cities with superhighways.

What makes InfiniBand special is that it's really fast (low-latency), can move a lot of data quickly (high-bandwidth), and is easy to manage (low-management cost).

It's perfect for connecting a lot of computers together (clustering), moving data between computers (communications), storing data (storage), and managing everything (management) - all in one network.

So, in a nutshell, InfiniBand is a super-fast, efficient way for computers and devices to talk to each other, making it easier to build big, powerful computer systems.

Key Differences

Performance

It is argued InfiniBand offers lower latency and higher throughput compared to Ethernet, making it more suitable for performance-critical applications like HPC and AI workloads.

This is why InfiniBand has historically been the go-to networking solution for HPC and AI workloads - low latency, high bandwidth, and deterministic performance characteristics.

One of the key reasons for this performance is its use of RDMA (Remote Direct Memory Access) instead of TCP, making it suited for large, performance-critical workloads.

Nonetheless, the Ultra Ethernet Consortium, led by companies like Broadcom, Cisco, and Intel, is pushing for the adoption of Ethernet in AI networking. They argue that modern Ethernet can offer similar, if not better, performance compared to InfiniBand at a lower cost.

It is true that Ethernet has made strides with technologies like RDMA over Converged Ethernet (RoCE), but it still does not match InfiniBand's performance. Studies have shown that, to achieve comparable performance, Ethernet needs to operate at speeds roughly 1.3 times faster than InfiniBand.

So while Ethernet has caught up with InfiniBand in terms of raw bandwidth, with both technologies offering 400 Gbps speeds, InfiniBand still maintains an edge in terms of latency and deterministic performance.
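As a rough, back-of-the-envelope illustration of that 1.3x figure (the exact ratio depends on the study and the workload), the Ethernet line rate needed to match a given InfiniBand link can be computed directly. A minimal sketch, with the ratio taken as an assumption from the text:

```python
# Back-of-the-envelope sketch: Ethernet line rate needed to match InfiniBand
# under the ~1.3x rule of thumb quoted above. The ratio is an assumption
# taken from the article, not a measured constant.

INFINIBAND_GBPS = 400          # e.g. an NDR InfiniBand link
ETHERNET_PENALTY = 1.3         # the "1.3 times faster" figure cited above

required_ethernet_gbps = INFINIBAND_GBPS * ETHERNET_PENALTY
print(f"To match a {INFINIBAND_GBPS} Gb/s InfiniBand link, Ethernet would "
      f"need roughly {required_ethernet_gbps:.0f} Gb/s of raw bandwidth.")
```

In other words, a 400 Gb/s InfiniBand link would, under this rule of thumb, call for roughly 520 Gb/s of Ethernet bandwidth.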


Remote Direct Memory Access (RDMA)

As highlighted, InfiniBand natively supports RDMA, which is one of the reasons it has historically been preferred for HPC and AI workloads.

Ethernet, on the other hand, has traditionally relied on TCP/IP for data transport, which involves more overhead and higher latency.

However, more recent Ethernet standards, such as RoCE (RDMA over Converged Ethernet), have added support for RDMA over Ethernet networks, allowing Ethernet to achieve lower latency and higher throughput than traditional TCP/IP-based Ethernet.
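To make the overhead argument concrete, the sketch below times a simple request/response round trip over a loopback TCP connection using only the Python standard library. Every hop goes through the kernel socket stack and involves user/kernel copies; RoCE and native InfiniBand RDMA avoid exactly this path by letting the NIC place data directly into application memory. The absolute numbers are machine-dependent and purely illustrative.

```python
import socket
import threading
import time

# Minimal loopback TCP ping-pong. Each round trip crosses the kernel TCP/IP
# stack in both directions, which is the per-message overhead that
# RDMA-based transports (InfiniBand verbs, RoCE) are designed to bypass.

def echo_server(server_sock: socket.socket) -> None:
    conn, _ = server_sock.accept()
    with conn:
        while True:
            data = conn.recv(64)
            if not data:
                break
            conn.sendall(data)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))        # pick any free port on loopback
server.listen(1)
threading.Thread(target=echo_server, args=(server,), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

rounds = 10_000
start = time.perf_counter()
for _ in range(rounds):
    client.sendall(b"ping")
    client.recv(64)
elapsed = time.perf_counter() - start
print(f"average TCP loopback round trip: {elapsed / rounds * 1e6:.1f} microseconds")
client.close()
```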

Reliability

In the context of networking, reliability refers to a network's ability to consistently deliver data without errors or loss. Two key aspects of reliability are fabric behavior and flow control.

Fabric behavior describes the overall structure and performance of a network.

In a reliable fabric, data is delivered consistently and without loss, even in the presence of network congestion or device failures.

For example, InfiniBand provides a lossless fabric, ensuring that no data is lost during transmission, regardless of network conditions. This is achieved through end-to-end flow control, which prevents data loss by signalling the sender to slow down or stop sending data when the receiver is unable to process incoming data at the same rate.

On the other hand, Ethernet, in its basic form, is a best-effort delivery system, meaning that it will attempt to deliver data but does not guarantee successful delivery.

However, recent Ethernet standards, such as priority flow control (PFC), have added support for lossless behavior, allowing Ethernet to provide more reliable data delivery, similar to InfiniBand.

InfiniBand provides a lossless fabric with built-in flow control and congestion management mechanisms. It guarantees reliable data delivery and maintains a consistent level of performance, even under heavy load.
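The lossless, receiver-paced behaviour described above can be illustrated with a toy credit-style model (a simplification of InfiniBand's actual link-level, credit-based scheme, for illustration only): the sender may only transmit while it holds credits, and the receiver returns a credit each time it frees a buffer slot, so the buffer can never overflow and nothing is ever dropped.

```python
from collections import deque

# Toy model of receiver-driven (credit-based) flow control: the sender may
# only transmit while it holds credits, so the receiver buffer never
# overflows and no packet is dropped. A simplification of InfiniBand's
# link-level scheme, not a faithful implementation.

BUFFER_SLOTS = 4                 # receiver buffer capacity = initial credits
credits = BUFFER_SLOTS
rx_buffer = deque()
sent = delivered = 0

for tick in range(20):
    # Sender: one packet per tick, but only if a credit is available.
    if credits > 0:
        rx_buffer.append(sent)
        sent += 1
        credits -= 1
    # Receiver: drains more slowly (every other tick), returning a credit
    # for each buffer slot it frees.
    if tick % 2 == 0 and rx_buffer:
        rx_buffer.popleft()
        delivered += 1
        credits += 1

print(f"sent={sent}, delivered={delivered}, dropped=0, still buffered={len(rx_buffer)}")
```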

The work to make Ethernet the leading network fabric for the modern data centre

The goal is to make Ethernet the leading network fabric for modern data centres and high-performance computing by providing features that are currently more prevalent in other technologies like InfiniBand.

The IEEE 802.1 working group is collaborating with industry partners, including chip vendors (Broadcom, Intel), system vendors (Dell, HP, Huawei), and data centre network operators, to gather requirements and develop these standards.

Congestion Isolation (802.1Qcz)

  • Problem: In current data centre networks, when congestion occurs, priority flow control (PFC) is used to prevent packet loss. However, PFC can lead to head-of-line blocking and congestion spreading.

  • Solution: Congestion isolation moves flows that are causing congestion into separate queues, allowing smaller flows to pass through without being blocked. This technique works in conjunction with higher-layer end-to-end congestion control mechanisms like ECN and TCP (a toy illustration follows this list).

  • Benefits: Congestion isolation helps to eliminate head-of-line blocking, makes more intelligent use of switch memory, and reduces the need for PFC.
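Here is a highly simplified sketch of the idea referenced above. It models only the queuing concept, not the 802.1Qcz signalling itself, and the flow names are made-up examples: once a flow is flagged as congesting, its packets are diverted into a separate queue so the remaining "victim" flows are no longer stuck behind it.

```python
from collections import deque

# Toy illustration of congestion isolation: packets from a flow flagged as
# "congesting" are diverted into their own queue, so well-behaved flows are
# not blocked behind it (no head-of-line blocking). Conceptual model only.

normal_queue = deque()
congested_queue = deque()
congesting_flows = {"flow-elephant"}          # flagged by congestion detection

arrivals = [("flow-elephant", i) for i in range(6)] + \
           [("flow-mouse", i) for i in range(3)]

for flow, seq in arrivals:
    target = congested_queue if flow in congesting_flows else normal_queue
    target.append((flow, seq))

# The scheduler can drain the normal queue immediately; the elephant flow
# waits in its own queue while end-to-end mechanisms (ECN/TCP) throttle it.
while normal_queue:
    print("forwarding", normal_queue.popleft())
print(f"{len(congested_queue)} packets parked in the congested-flow queue")
```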

PFC Enhancements (802.1Qdt)

  • Problem: PFC, introduced over a decade ago, has some configuration complexities and compatibility issues with MACsec encryption.

  • Solutions:

    a. Automatic calculation of headroom: Headroom is the extra buffer space needed to absorb in-flight data when a sender is told to stop transmitting. The enhancement proposes using LLDP to measure delays and automatically calculate the optimal headroom, reducing memory waste and configuration complexity (a worked headroom sketch follows this list).

    b. Protection of PFC frames with MACsec: The enhancement specifies a new shim layer to allow the encryption of PFC frames, enabling compatibility between older and newer implementations of MACsec.

  • Benefits: These enhancements simplify PFC deployment and enable its use in encrypted networks, such as when running RDMA between data centres.
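The headroom idea lends itself to a simple worked calculation. The sketch below uses a deliberately simplified formula (round-trip cable delay plus a response allowance, converted to bytes at line rate, plus one maximum-size frame in flight each way); real devices, and the 802.1Qdt auto-calculation, account for additional per-hop delays measured via LLDP, and all the numbers below are illustrative assumptions.

```python
# Simplified PFC headroom estimate: how much buffer must remain free when a
# pause frame is sent, so that bytes already "in flight" are not dropped.
# Both the formula and the figures below are illustrative assumptions.

LINK_GBPS = 100                    # link speed in Gb/s
CABLE_M = 100                      # cable length in metres
PROPAGATION_NS_PER_M = 5           # ~5 ns per metre in fibre/copper
RESPONSE_NS = 1_000                # allowance for sender/receiver reaction time
MTU_BYTES = 9_216                  # jumbo frame size

bytes_per_ns = LINK_GBPS / 8       # 100 Gb/s = 12.5 bytes per nanosecond

round_trip_ns = 2 * CABLE_M * PROPAGATION_NS_PER_M + RESPONSE_NS
in_flight_bytes = round_trip_ns * bytes_per_ns
# One maximum-size frame may already be departing in each direction.
headroom_bytes = in_flight_bytes + 2 * MTU_BYTES

print(f"round-trip + response time: {round_trip_ns} ns")
print(f"required headroom: {headroom_bytes / 1024:.1f} KiB per lossless priority")
```

Misjudging this value in either direction wastes switch memory or risks drops, which is why automating the calculation is attractive.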

Source Flow Control (802.1Qdw)

  • Problem: PFC can cause head-of-line blocking and congestion spreading throughout the network.

  • Solution: Source flow control detects congestion in the network and sends a message back to the source node to pause or control the flow. It includes a proxy mode where the top-of-rack switch can intercept the message and convert it to a PFC message, allowing gradual deployment without requiring immediate upgrades to all servers and NICs.

  • Benefits: Source flow control provides the benefits of PFC (simple flow control) but at the source, reducing latency and avoiding head-of-line blocking. The proxy mode simplifies deployment, and the signalling message carries rich flow information for advanced flow control techniques.

Definitions and Acronyms

  1. PFC: Priority Flow Control

  2. ECN: Explicit Congestion Notification

  3. TCP: Transmission Control Protocol

  4. MACsec: Media Access Control Security

  5. RDMA: Remote Direct Memory Access

  6. LLDP: Link Layer Discovery Protocol

  7. NIC: Network Interface Card

  8. IEEE: Institute of Electrical and Electronics Engineers

Technical Terms

  1. Head-of-line blocking: A phenomenon where a packet at the head of a queue blocks the transmission of subsequent packets, even if those packets are destined for a different, uncongested output port. This can lead to increased latency and reduced throughput.

  2. Congestion spreading: When congestion in one part of the network causes congestion in other parts of the network due to the propagation of backpressure or flow control signals. This can lead to a cascade effect, degrading overall network performance.

  3. Priority Flow Control (PFC): A link-level flow control mechanism defined in IEEE 802.1Qbb that allows the receiver to pause traffic on a per-priority basis. When a receiver's buffer is full, it sends a PFC frame to the sender, instructing it to pause transmission for a specific priority.

  4. Explicit Congestion Notification (ECN): A mechanism defined in RFC 3168 that allows end-to-end congestion notification without dropping packets. ECN-capable routers and switches mark packets when congestion is imminent, and the receiver echoes this information back to the sender, which can then reduce its transmission rate (a minimal marking/echo sketch follows this list).

  5. MACsec (Media Access Control Security): An IEEE 802.1AE standard that provides hop-by-hop data confidentiality, integrity, and origin authentication for Ethernet frames. MACsec encrypts and authenticates the entire Ethernet frame, including the header and payload.

  6. Headroom: The extra buffer space reserved in a switch or router to accommodate in-flight packets when a sender is instructed to stop transmitting due to congestion. Adequate headroom is necessary to prevent packet loss during the time it takes for the sender to receive and act upon the pause frame.

  7. Link Layer Discovery Protocol (LLDP): A vendor-neutral link layer protocol defined in IEEE 802.1AB that allows network devices to advertise their identity, capabilities, and neighbours on a local area network. LLDP can be used to automate network provisioning, troubleshooting, and management.

  8. Remote Direct Memory Access (RDMA): A technology that enables direct memory access from the memory of one computer into that of another without involving either computer's operating system. RDMA offers high-throughput, low-latency networking, which is crucial for high-performance computing and storage applications.

  9. Shim layer: In networking, a shim is a layer of abstraction that sits between two other layers to provide compatibility or additional functionality. In the context of PFC enhancements, a new shim layer is proposed to enable the encryption of PFC frames using MACsec, ensuring compatibility between older and newer implementations.
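To illustrate the ECN behaviour described in item 4 above, here is a minimal simulation: a switch marks packets instead of dropping them once its queue passes a threshold, the receiver echoes the mark, and the sender backs off. The thresholds, drain rate and halve/probe policy are illustrative assumptions, not RFC 3168 specifics.

```python
# Minimal sketch of the ECN loop: mark instead of drop when the queue is
# filling, echo the mark, and let the sender reduce its rate. All constants
# and the rate-adjustment policy are illustrative assumptions.

MARK_THRESHOLD = 20        # queue depth (packets) at which marking begins
DRAIN_PER_TICK = 24        # fixed link drain rate (packets per tick)

queue_depth = 0
send_rate = 32.0           # sender's current rate (packets per tick)

for tick in range(12):
    queue_depth += int(send_rate)                   # packets arriving this tick
    queue_depth -= min(queue_depth, DRAIN_PER_TICK) # link drains the queue

    congestion_experienced = queue_depth > MARK_THRESHOLD
    if congestion_experienced:
        send_rate = max(1.0, send_rate / 2)         # multiplicative decrease
    else:
        send_rate += 2                              # gently probe for bandwidth

    print(f"tick {tick:2d}: queue={queue_depth:3d}  "
          f"rate={send_rate:5.1f}  marked={congestion_experienced}")
```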

These technical terms and acronyms are essential for understanding the proposed enhancements to Ethernet for high-performance data centres.

By addressing issues like head-of-line blocking, congestion spreading, and providing features like congestion isolation, PFC enhancements, and source flow control, these standards aim to improve the performance, efficiency, and deployment of Ethernet in modern data centre environments.

Scalability

Scalability refers to a network's ability to grow and accommodate increasing amounts of data and devices without compromising performance or reliability.

InfiniBand is designed to scale exceptionally well, thanks to its switched fabric architecture, allowing for efficient scaling of AI clusters. It supports a large number of nodes (up to 48,000) and enables the creation of high-performance, low-latency interconnects between GPUs, which is essential for distributed AI training and inference.

Features like the Subnet Manager (SM) and centralised forwarding-path calculation also contribute to this scalability.

Understanding the Subnet Manager in InfiniBand Networks

The Subnet Manager (SM) is essential in managing the operational aspects of an InfiniBand network. Its primary responsibilities include:

  • Discovering Network Topology: Identifying the structure and layout of the network.

  • Assigning Local Identifiers (LIDs): Every port connected to the network is assigned a unique LID for identification and routing purposes.

  • Calculating and Programming Switch Forwarding Tables: Ensuring data packets are correctly routed through the network.

  • Programming Partition Key (PKey) Tables: These are used at Host Channel Adapters (HCAs) and switches for partitioning and secure communication.

  • Programming QoS Tables: This includes setting up Service Level to Virtual Lane mapping tables and Virtual Lane arbitration tables to manage quality of service across the network.

  • Monitoring Changes in the Fabric: Keeping track of any alterations within the network's structure or operations.

In InfiniBand networks, there is typically more than one SM, but only one acts as the Master SM at any given time, with others in standby mode. If the Master SM fails, one of the Standby SMs automatically takes over.
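To illustrate the first three SM responsibilities listed above (discovery, LID assignment, forwarding-table calculation), the sketch below runs a breadth-first search over a tiny, hypothetical topology. It is a conceptual model only; a real SM performs these tasks by exchanging subnet management packets with the devices in the fabric.

```python
from collections import deque

# Toy model of three Subnet Manager tasks: discover the topology, assign a
# local identifier (LID) to every node, and compute per-switch forwarding
# tables along shortest paths. The topology below is hypothetical.

topology = {                      # adjacency list: node -> connected nodes
    "switch-1": ["switch-2", "host-a", "host-b"],
    "switch-2": ["switch-1", "host-c"],
    "host-a":   ["switch-1"],
    "host-b":   ["switch-1"],
    "host-c":   ["switch-2"],
}

# 1. "Discovery" + 2. LID assignment (here: simple sequential integers).
lids = {node: lid for lid, node in enumerate(sorted(topology), start=1)}

# 3. Forwarding tables: for each switch, which neighbour leads to each LID?
def next_hops(switch: str) -> dict:
    table, visited = {}, {switch}
    queue = deque((nbr, nbr) for nbr in topology[switch])
    while queue:
        node, first_hop = queue.popleft()     # BFS => shortest paths first
        if node in visited:
            continue
        visited.add(node)
        table[lids[node]] = first_hop
        queue.extend((nbr, first_hop) for nbr in topology[node])
    return table

for switch in ("switch-1", "switch-2"):
    print(switch, "forwarding table:", next_hops(switch))
```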

These features allow InfiniBand to support tens of thousands of nodes without the complex configuration and loop-prevention protocols (such as spanning tree) that Ethernet requires at a similar scale.

The argument is that traditional Ethernet has limited scalability due to its reliance on large Layer 2 broadcast domains and the need for complex protocols like spanning tree to prevent network loops. This can lead to performance degradation and reduced efficiency as the network grows.

However, some argue that recent advancements in Ethernet have addressed these limitations and improved its scalability.

Technologies like VXLAN (Virtual Extensible LAN) and SDN (software-defined networking) have been introduced to tackle these challenges.

VXLAN allows Ethernet networks to scale to millions of nodes by encapsulating Ethernet frames within UDP packets, while SDN separates the network control plane from the data plane, enabling more flexible and scalable network configuration and management.

Nonetheless, some networking experts remain adamant that the packet-based nature of Ethernet can lead to congestion and performance degradation as cluster sizes grow.

Understanding VXLAN: Virtual eXtensible Local Area Network

Overview

Virtual Extensible Local Area Network (VXLAN) is a network virtualisation technology that addresses the scalability problems associated with large cloud computing deployments.

It allows for the creation of a large number of virtualised Layer 2 networks, often referred to as "overlay networks", across a Layer 3 network infrastructure.

Originally defined by the Internet Engineering Task Force (IETF) in RFC 7348, VXLAN plays a crucial role in the construction of scalable and secure multitenant cloud environments.

How VXLAN Works

VXLAN operates by encapsulating a traditional Layer 2 Ethernet frame into a User Datagram Protocol (UDP) packet.

This encapsulation extends the Layer 2 network over a Layer 3 network by segmenting the traffic through a VXLAN Network Identifier (VNI), akin to how VLAN uses VLAN IDs.

Each VXLAN segment is isolated from the others, ensuring that data traffic within one segment remains private from another, much like apartments in a building.

The main components involved in VXLAN architecture include:

VXLAN Tunnel Endpoint (VTEP): Devices that perform the encapsulation and decapsulation of Ethernet frames into and out of VXLAN packets. VTEPs are identified by their IP addresses and are usually implemented within hypervisors or physical switches.

VXLAN Header: Added to the Ethernet frame during encapsulation, it includes a 24-bit VNI, significantly increasing the number of potential network segments from the traditional 4096 VLANs to over 16 million.
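To make the encapsulation concrete, the sketch below builds a VXLAN packet body by hand: an 8-byte VXLAN header (flags plus the 24-bit VNI, which is where the 2^24, roughly 16.7 million, segment limit comes from, versus 2^12 = 4096 VLAN IDs) prepended to an inner Ethernet frame, ready to be carried as the payload of a UDP datagram to port 4789. The MAC addresses and VNI are made-up example values.

```python
import struct

# Hand-rolled VXLAN encapsulation (RFC 7348 header layout) for illustration.
# The inner frame, MAC addresses and VNI below are made-up example values.

VXLAN_UDP_PORT = 4789             # IANA-assigned destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """8-byte VXLAN header: flags byte (I bit set) + 24-bit VNI, rest reserved."""
    assert 0 <= vni < 2**24        # only 24 bits available -> ~16.7M segments
    flags = 0x08 << 24             # 0x08 = "VNI present" flag in the top byte
    return struct.pack("!II", flags, vni << 8)

def encapsulate(inner_ethernet_frame: bytes, vni: int) -> bytes:
    """VXLAN payload to be sent inside a UDP datagram to port 4789."""
    return vxlan_header(vni) + inner_ethernet_frame

# A dummy inner Ethernet frame: dst MAC, src MAC, EtherType (IPv4), payload.
inner = bytes.fromhex("0a0000000001") + bytes.fromhex("0a0000000002") \
        + struct.pack("!H", 0x0800) + b"inner IPv4 packet bytes..."

packet = encapsulate(inner, vni=5001)
print(f"VXLAN header: {packet[:8].hex()}  (VNI=5001, UDP dst port {VXLAN_UDP_PORT})")
print(f"VLAN IDs available: {2**12}   VXLAN VNIs available: {2**24}")
```

In a real deployment the VTEP wraps this payload in outer UDP, IP and Ethernet headers, which is what lets the Layer 2 segment ride over any Layer 3 network.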

Key Advantages of VXLAN

  • Scalability: By extending the address space for segment IDs, VXLAN can support up to 16 million virtual networks, compared to the 4096 VLANs supported by standard Ethernet.

  • Flexibility: VXLAN can be used over any IP network, including across data centres and various geographical locations, without being restricted by the underlying network topology.

  • Improved Security: Each VXLAN segment is isolated, enhancing security by segregating different organizational units or tenants within the same physical infrastructure.

Problems VXLAN Solves

  • Network Segmentation: In environments where multiple tenants must coexist on the same physical network, such as in cloud data centres, VXLAN allows for effective isolation and segmentation.

  • Geographical Independence: VXLAN makes it possible to stretch Layer 2 networks across geographically dispersed data centres, facilitating disaster recovery and load balancing.

  • Overcomes VLAN Limitations: Traditional VLAN IDs are limited to 4096, insufficient for large-scale cloud environments. VXLAN addresses this by allowing for up to 16 million network identifiers.

Applications of VXLAN

VXLAN is particularly useful in the following scenarios:

  • Data Centres: Enables cloud service providers to create a scalable and segmented network architecture.

  • Multi-tenant Environments: Helps in isolating traffic in scenarios where multiple users or customers share the same physical infrastructure.

  • Over Data Centre Interconnects (DCI): Useful for extending Layer 2 services across different data centres.

VXLAN and EVPN

Ethernet VPN (EVPN) often pairs with VXLAN to provide a robust control plane that manages VXLAN overlay networks in large-scale deployments. EVPN enhances VXLAN by providing dynamic learning of network endpoints and multicast optimization, which are not available in VXLAN deployments that use static provisioning.

Conclusion

VXLAN technology represents a significant advancement in network virtualization, offering scalability, flexibility, and security for modern data centre and cloud environments.

As networks continue to grow in size and complexity, VXLAN provides an effective solution for managing vast amounts of traffic while maintaining strict isolation and segmentation of data.

The consensus seems to be that, despite these advancements, InfiniBand still maintains an advantage in terms of scalability, particularly in large-scale, high-performance computing environments where low latency and high bandwidth are critical.

Compatibility and Ecosystem

One of the challenges in replacing InfiniBand with Ethernet is the familiarity and expertise that HPC professionals have with InfiniBand. The InfiniBand ecosystem is well-established and purpose-built for HPC and AI workloads.

The following tools and libraries are central to that ecosystem:

  1. Message Passing Interface (MPI): This is a standardized and portable message-passing system designed to function on a variety of parallel computing architectures. MPI libraries like Open MPI are often optimised for InfiniBand to enhance communication speeds and efficiency in cluster environments.

  2. NVIDIA Collective Communications Library (NCCL): Optimised for multi-GPU and multi-node communication, NCCL leverages InfiniBand's high throughput and low latency characteristics to accelerate training in deep learning environments.

  3. RDMA (Remote Direct Memory Access) libraries: These allow direct memory access from the memory of one computer into that of another without involving either one's operating system. This enables high-throughput, low-latency networking, which is critical for performance in large-scale computing environments.

  4. GPUDirect: This suite of technologies from NVIDIA provides various methods for direct device-to-device communication via PCIe and InfiniBand interconnects, enhancing data transfer speeds and reducing latency.

  5. Intel’s Performance Scaled Messaging 2 (PSM2): This protocol is designed to exploit the features of Intel's high-performance fabrics (originally True Scale InfiniBand, later Omni-Path) in large-scale HPC clusters, providing reliable transport and high-bandwidth capabilities.

These tools and libraries are critical for developers working in HPC and AI, as they provide necessary functionalities that harness the full potential of InfiniBand's network capabilities.
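As a concrete example of how this ecosystem is consumed, the snippet below performs an allreduce with mpi4py. The application code never mentions InfiniBand: the MPI library underneath (for example Open MPI with its UCX transport) detects the fabric and uses RDMA verbs when an InfiniBand HCA is present, falling back to TCP over Ethernet otherwise. This is a minimal sketch that assumes mpi4py, NumPy and an MPI implementation are installed.

```python
from mpi4py import MPI            # requires an MPI implementation (e.g. Open MPI)
import numpy as np

# Minimal allreduce: each rank contributes a gradient-like buffer and every
# rank receives the element-wise sum. Whether the bytes travel over
# InfiniBand RDMA, RoCE or plain TCP is decided by the MPI library, not here.

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.full(4, float(rank))            # this rank's contribution
summed = np.empty_like(local)
comm.Allreduce(local, summed, op=MPI.SUM)  # sum across all ranks

if rank == 0:
    print(f"{size} ranks, allreduced buffer: {summed}")
```

It would be launched with something like "mpirun -np 4 python allreduce_demo.py" (the script name is hypothetical); the same code runs unchanged on an Ethernet-only cluster, which is exactly why the transition argument centres on performance rather than APIs.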

That said, Ethernet has a much larger installed base and a wider ecosystem of compatible devices and software, thanks to its long history and widespread adoption - although not in AI and HPC specifically.

Ethernet's main advantage lies in its flexibility and ubiquity. Most data centres and cloud providers already use Ethernet extensively, and having a single networking technology across the entire infrastructure could simplify management and reduce costs. So, in theory, the 'switching cost' of moving to Ethernet should be modest if it offers the same performance as InfiniBand.

In summary, while InfiniBand offers superior performance and reliability for AI and HPC workloads, Ethernet's widespread adoption, extensive ecosystem, and lower costs make it the primary choice for most general-purpose networking applications.

One View on why Ethernet will win

In mid-2023, Ram Velaga of Broadcom argued that Ethernet is the technology of choice for GPU clusters. In his view, Ethernet is the only networking technology needed, even for building large-scale clusters with 1,000 or more GPUs.

Why?

  1. Ethernet has proven its ability to adapt and meet new requirements over multiple decades.

  2. Recent requirements for clustered GPUs include low latencies, controlled tail latency, and the ability to build large topologies without causing idle time on GPUs.

  3. Ethernet has evolved to provide capabilities such as losslessness and congestion management, making it suitable for large-scale GPU clusters.

  4. Ethernet is ubiquitous, with 600 million ports shipped and sold annually, leading to rapid innovation and economies of scale that are not available with alternative technologies like InfiniBand.

  5. The scale of Ethernet and the presence of multiple players in the market drive innovation and provide economic advantages.

  6. Ethernet offers both the technical capabilities and the economics needed to build large-scale AI/ML clusters.

In summary, it is argued Ethernet's adaptability, recent advancements, ubiquity, and economic advantages make it the best choice for building large-scale GPU clusters, and that alternative technologies like InfiniBand are not necessary.

The Future

While Ethernet is making inroads, InfiniBand's entrenched position and purpose-built design make it a formidable incumbent.

The ultimate outcome will depend on factors such as the rate of Ethernet's performance improvements, the willingness of HPC professionals to adopt new technologies, and the strategic decisions made by major vendors and cloud providers.

But the decision may not be up to the professionals: there are numerous reasons why you would consider Ethernet for a new network build. InfiniBand, despite its many advantages for high-performance applications, faces several significant shortcomings that influence its broader adoption:

  1. High Cost: InfiniBand's network components, like cards and switches, are substantially more expensive than Ethernet equivalents, making it less economically viable for many sectors.

  2. Elevated O&M Expenses: Operating and maintaining an InfiniBand network requires specialised skills due to its unique infrastructure, which can lead to higher operational costs and challenges in finding qualified personnel.

  3. Vendor Lock-in: The use of proprietary protocols in InfiniBand equipment restricts interoperability with other technologies and can lead to dependency on specific vendors.

  4. Long Lead Times: Delays in the availability of InfiniBand components can pose risks to project timelines and scalability.

  5. Slow Upgrade Cycle: Dependence on vendor-specific upgrade cycles can slow down network improvements, affecting overall network performance and adaptability.

The counter view

Despite the advancements in Ethernet technology and the 'shortcomings' of InfiniBand listed above, there remains a range of technical and practical reasons why InfiniBand is the preferred choice for high-performance AI workloads.

  1. In the eyes of HPC practitioners, Ethernet has not yet proven its performance credentials.

  2. While Ethernet has made improvements in terms of bandwidth and switch capacity, it still faces challenges when it comes to supporting the massive scale required by AI workloads.

  3. InfiniBand has a well-established ecosystem in the high-performance computing (HPC) and AI domains. Many AI frameworks, libraries, and tools are optimised for InfiniBand.

  4. Transitioning to Ethernet would require significant effort in terms of software adaptation and optimisation. Existing AI pipelines and workflows would need to be modified to work efficiently with Ethernet, which could be a time-consuming and costly process.

  5. InfiniBand provides a lossless fabric with built-in flow control and congestion management mechanisms. It guarantees reliable data delivery and maintains a consistent level of performance, even under heavy load. Despite Ethernet's advancement in this area, the experts are still not convinced.

  6. While Ethernet is generally considered more cost-effective than InfiniBand, the cost difference becomes less significant when considering the total cost of ownership (TCO) for AI infrastructures. The higher performance and efficiency of InfiniBand can lead to better resource utilisation and reduced overall costs (a toy comparison is sketched below).
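Here is a deliberately over-simplified sketch of that TCO argument. Every number is a made-up placeholder, chosen only to show the structure of the reasoning: if a more expensive fabric keeps expensive GPUs busier, the cost per useful GPU-hour can still come out lower.

```python
# Toy total-cost-of-ownership comparison. Every figure below is a made-up
# placeholder to illustrate the shape of the argument, not a real price or
# a measured utilisation number.

def cost_per_useful_gpu_hour(fabric_capex: float, gpu_capex: float,
                             annual_opex: float, utilisation: float,
                             years: int = 4) -> float:
    """Total spend divided by the GPU-hours actually spent computing."""
    total_cost = fabric_capex + gpu_capex + annual_opex * years
    useful_hours = 24 * 365 * years * utilisation
    return total_cost / useful_hours

# Hypothetical per-node figures (USD): pricier fabric, higher utilisation
# versus cheaper fabric, more GPU idle time.
infiniband = cost_per_useful_gpu_hour(fabric_capex=15_000, gpu_capex=250_000,
                                      annual_opex=20_000, utilisation=0.90)
ethernet   = cost_per_useful_gpu_hour(fabric_capex=8_000,  gpu_capex=250_000,
                                      annual_opex=18_000, utilisation=0.80)

print(f"InfiniBand-style node: ${infiniband:.2f} per useful GPU-hour")
print(f"Ethernet-style node:   ${ethernet:.2f} per useful GPU-hour")
```

With these placeholder inputs the dearer fabric wins on cost per useful GPU-hour; with different assumptions the conclusion flips, which is exactly why the TCO debate is unresolved.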

In conclusion, while Ethernet has made significant advancements, InfiniBand remains the preferred choice for AI computing due to its superior performance, scalability, ecosystem compatibility, reliability, and QoS guarantees.

The debate will continue

In conclusion, InfiniBand and Ethernet are both powerful networking technologies with their own strengths and weaknesses.

While InfiniBand has been the dominant choice for AI and HPC workloads due to its superior performance and reliability, Ethernet is rapidly closing the gap with recent advancements in speed, features, and scalability.

As the demand for high-performance networking in AI and HPC continues to grow, the competition between these two technologies will likely intensify.

The ultimate winner will depend on a complex interplay of factors, including technological advancements, industry adoption, and strategic decisions by key players.

Regardless of the outcome, one thing is certain: the future of high-performance networking will be shaped by the ongoing battle between InfiniBand and Ethernet, with significant implications for the growth and development of AI and HPC applications in the years to come.


Copyright Continuum Labs - 2023