# Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

<mark style="color:blue;">**NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)**</mark> is a technology designed to improve the performance of collective operations in <mark style="color:blue;">**Message Passing Interface (MPI)**</mark> and machine learning applications.&#x20;

It was invented to address the challenges associated with the increasing demand for efficient collective communication in large-scale, high-performance computing (HPC) and AI systems.

{% file src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FtDT21xENMIT7FebeIxPG%2FSHARP%20Hardware%20Architecture.pdf?alt=media&token=64e0c2ee-ea6b-4219-9ede-e8c4969fe786" %}
Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
{% endfile %}

### <mark style="color:purple;">History and Problem</mark>

As high-performance computing systems grew larger and more complex, the traditional approach of relying on CPUs and GPUs to <mark style="color:blue;">**handle collective operations**</mark> became a bottleneck.&#x20;

Collective operations, such as data aggregation and reduction, involve gathering and processing data from multiple nodes, which can be time-consuming and resource-intensive when performed solely by CPUs and GPUs. This led to increased latency, reduced bandwidth utilisation, and suboptimal performance scaling.
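To make that cost concrete, here is a minimal, purely illustrative Python sketch of an endpoint-side allreduce: one root host gathers every node's vector, performs the element-wise sum on its own CPU, and broadcasts the result back. The function name and in-memory "network" are hypothetical, not any real MPI API, but the pattern is exactly the host-bound work described above:

```python
def host_based_allreduce(node_values):
    """Naive endpoint-side allreduce: one root gathers every node's
    vector, sums it on the host CPU, and broadcasts the result back.
    Every vector crosses the network twice, and all reduction work
    lands on the hosts -- the pattern SHARP is designed to offload."""
    gathered = list(node_values)                  # gather: root receives all vectors
    total = [sum(col) for col in zip(*gathered)]  # reduce: element-wise sum on root CPU
    return [list(total) for _ in node_values]     # broadcast: every node gets the result

# Four nodes, each contributing a three-element vector.
vectors = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
results = host_based_allreduce(vectors)
# Every node now holds the element-wise sum of all four vectors.
```

Note that both the gather traffic and the reduction arithmetic scale with the number of nodes, which is why this approach stops scaling at large system sizes.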

NVIDIA recognised the need for a more efficient solution and *<mark style="color:yellow;">**developed SHARP to offload collective operations from CPUs and GPUs to the network itself**</mark>*.&#x20;

By leveraging the capabilities of <mark style="color:blue;">**NVIDIA's smart switches and adapters**</mark>, SHARP aims to optimise collective operations and improve overall system performance.

### <mark style="color:purple;">Architecture</mark>

SHARP's architecture is based on the concept of <mark style="color:blue;">**aggregation trees**</mark>, which are logical structures overlaid on the physical network topology.&#x20;

These trees consist of <mark style="color:blue;">**aggregation nodes (ANs)**</mark> that perform data reduction and aggregation operations as data traverses up the tree towards the root.

#### <mark style="color:green;">**Key components of SHARP's architecture include**</mark>

<mark style="color:purple;">**Aggregation Nodes (ANs):**</mark> Logical entities that perform data reduction and aggregation operations. ANs can be implemented in switches or end-nodes.

<mark style="color:purple;">**Aggregation Trees:**</mark> Logical tree structures that define the data reduction pattern. Multiple trees can be built over the same physical topology to support parallel jobs.

<mark style="color:purple;">**Aggregation Manager (AM):**</mark> A central entity responsible for managing and coordinating SHARP operations, including resource allocation and monitoring.

<mark style="color:purple;">**SHARP-enabled Switches and Adapters:**</mark> NVIDIA's hardware devices, such as Switch-IB 2, Quantum, and ConnectX adapters, that support SHARP operations and offload collective communication from CPUs and GPUs.
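As a rough mental model of how these components fit together, the following Python sketch simulates an aggregation tree in which switch-level ANs sum the partial results arriving from their children before forwarding a single value toward the root. The class and method names are illustrative only and do not correspond to NVIDIA's implementation or APIs:

```python
class AggregationNode:
    """Toy model of a SHARP aggregation node (AN). An interior node sums
    the partial results from its children and forwards one value upward;
    a leaf represents an end-node contributing its local data."""

    def __init__(self, children=None, leaf_value=None):
        self.children = children or []
        self.leaf_value = leaf_value  # set only on end-node leaves

    def reduce_up(self):
        if not self.children:  # leaf: an end-node's local contribution
            return self.leaf_value
        # Each AN forwards a single aggregated value, so the volume of
        # data travelling toward the root shrinks at every tree level.
        return sum(child.reduce_up() for child in self.children)

# Eight end-nodes under two leaf-level ANs and one root AN.
leaves = [AggregationNode(leaf_value=v) for v in range(1, 9)]
root = AggregationNode(children=[
    AggregationNode(children=leaves[:4]),
    AggregationNode(children=leaves[4:]),
])
total = root.reduce_up()  # the fully reduced result emerges at the root
```

In the real system the Aggregation Manager would carve such a tree out of the physical switch topology for each job; here the tree shape is simply hard-coded for illustration.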

### <mark style="color:purple;">Benefits and Problem Solving</mark>

SHARP addresses the challenges of efficient collective communication in several ways:

<mark style="color:purple;">**Reduced Data Movement:**</mark> By performing data reduction and aggregation <mark style="color:yellow;">**within the network**</mark>, SHARP reduces the amount of data that needs to be transmitted between endpoints. This decreases network congestion and improves bandwidth utilisation.

<mark style="color:purple;">**Lower Latency:**</mark> SHARP's in-network aggregation <mark style="color:yellow;">**minimises the number of hops**</mark> data needs to travel, reducing the overall latency of collective operations.

<mark style="color:purple;">**Offloading from CPUs and GPUs:**</mark> By offloading collective operations to the network, SHARP frees up valuable CPU and GPU resources for computation tasks, improving overall system efficiency.

<mark style="color:purple;">**Scalability:**</mark> SHARP's aggregation trees can be built over various network topologies, enabling efficient scaling of collective operations as system sizes grow.

<mark style="color:purple;">**Support for Parallel Jobs:**</mark> With the ability to create multiple aggregation trees over the same topology, SHARP allows multiple parallel jobs to benefit from in-network aggregation simultaneously.
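The latency and scalability benefits above can be illustrated with a back-of-envelope comparison: an endpoint-based ring allreduce needs a number of sequential communication steps that grows linearly with the node count, while a SHARP-style in-network tree needs only one pass up and one pass down, so its step count grows with tree depth. The switch radix of 40 below is an assumed example value, not a statement about any specific NVIDIA switch:

```python
import math

def ring_allreduce_steps(n_nodes):
    """Endpoint ring allreduce: 2*(n-1) sequential steps
    (reduce-scatter followed by allgather)."""
    return 2 * (n_nodes - 1)

def in_network_steps(n_nodes, radix=40):
    """SHARP-style aggregation tree: one pass up and one pass down,
    so latency grows with tree depth ~ log_radix(n), not with n.
    The radix of 40 is an illustrative assumption."""
    depth = max(1, math.ceil(math.log(n_nodes, radix)))
    return 2 * depth

# Sequential steps at a few system sizes (illustrative only).
comparison = {n: (ring_allreduce_steps(n), in_network_steps(n))
              for n in (64, 1024, 16384)}
```

At 16,384 nodes the ring needs tens of thousands of sequential steps while the tree needs a handful, which is the essence of the "lower latency" and "scalability" claims above; this sketch ignores per-step bandwidth, which also matters in practice.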

SHARP has undergone multiple generations of development, with the latest at the time of writing being SHARP Rev 3.0.0.

Each generation has introduced enhancements and expanded capabilities, such as support for both low-latency and streaming aggregation operations, increased number of aggregation trees, and compatibility with the latest NVIDIA hardware.

In summary, NVIDIA SHARP is a technology that tackles the challenges of efficient collective communication in large-scale HPC and AI systems.&#x20;

By offloading collective operations to the network and leveraging the capabilities of NVIDIA's smart switches and adapters, SHARP significantly reduces latency, improves bandwidth utilisation, and enhances overall system performance.&#x20;

It represents a significant step forward in enabling scalable and efficient collective communication in modern computing environments.
