# Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

<mark style="color:blue;">**NVIDIA Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)**</mark> is a technology designed to improve the performance of collective operations in <mark style="color:blue;">**Message Passing Interface (MPI)**</mark> and machine learning applications.&#x20;

It was invented to address the challenges associated with the increasing demand for efficient collective communication in large-scale, high-performance computing (HPC) and AI systems.

{% file src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FtDT21xENMIT7FebeIxPG%2FSHARP%20Hardware%20Architecture.pdf?alt=media&token=64e0c2ee-ea6b-4219-9ede-e8c4969fe786" %}
Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction
{% endfile %}

### <mark style="color:purple;">History and Problem</mark>

As high-performance computing systems grew larger and more complex, the traditional approach of relying on CPUs and GPUs to <mark style="color:blue;">**handle collective operations**</mark> became a bottleneck.&#x20;

Collective operations, such as data aggregation and reduction, involve gathering and processing data from multiple nodes, which can be time-consuming and resource-intensive when performed solely by CPUs and GPUs. This led to increased latency, reduced bandwidth utilisation, and suboptimal performance scaling.
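To make that cost concrete, here is a minimal, purely illustrative Python sketch of an endpoint-side allreduce: one root host gathers every node's vector, performs the element-wise sum on its own CPU, and broadcasts the result back. The function name and in-memory "network" are hypothetical, not any real MPI API, but the pattern is exactly the host-bound work described above:

```python
def host_based_allreduce(node_values):
    """Naive endpoint-side allreduce: one root gathers every node's
    vector, sums it on the host CPU, and broadcasts the result back.
    Every vector crosses the network twice, and all reduction work
    lands on the hosts -- the pattern SHARP is designed to offload."""
    gathered = list(node_values)                  # gather: root receives all vectors
    total = [sum(col) for col in zip(*gathered)]  # reduce: element-wise sum on root CPU
    return [list(total) for _ in node_values]     # broadcast: every node gets the result

# Four nodes, each contributing a three-element vector.
vectors = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
results = host_based_allreduce(vectors)
# Every node now holds the element-wise sum of all four vectors.
```

Note that both the gather traffic and the reduction arithmetic scale with the number of nodes, which is why this approach stops scaling at large system sizes.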

NVIDIA recognised the need for a more efficient solution and *<mark style="color:yellow;">**developed SHARP to offload collective operations from CPUs and GPUs to the network itself**</mark>*.&#x20;

By leveraging the capabilities of <mark style="color:blue;">**NVIDIA's smart switches and adapters**</mark>, SHARP aims to optimise collective operations and improve overall system performance.

### <mark style="color:purple;">Architecture</mark>

SHARP's architecture is based on the concept of <mark style="color:blue;">**aggregation trees**</mark>, which are logical structures overlaid on the physical network topology.&#x20;

These trees consist of <mark style="color:blue;">**aggregation nodes (ANs)**</mark> that perform data reduction and aggregation operations as data traverses up the tree towards the root.

#### <mark style="color:green;">**Key components of SHARP's architecture include**</mark>

<mark style="color:purple;">**Aggregation Nodes (ANs):**</mark> Logical entities that perform data reduction and aggregation operations. ANs can be implemented in switches or end-nodes.

<mark style="color:purple;">**Aggregation Trees:**</mark> Logical tree structures that define the data reduction pattern. Multiple trees can be built over the same physical topology to support parallel jobs.

<mark style="color:purple;">**Aggregation Manager (AM):**</mark> A central entity responsible for managing and coordinating SHARP operations, including resource allocation and monitoring.

<mark style="color:purple;">**SHARP-enabled Switches and Adapters:**</mark> NVIDIA's hardware devices, such as Switch-IB 2, Quantum, and ConnectX adapters, that support SHARP operations and offload collective communication from CPUs and GPUs.
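As a rough mental model of how these components fit together, the following Python sketch simulates an aggregation tree in which switch-level ANs sum the partial results arriving from their children before forwarding a single value toward the root. The class and method names are illustrative only and do not correspond to NVIDIA's implementation or APIs:

```python
class AggregationNode:
    """Toy model of a SHARP aggregation node (AN). An interior node sums
    the partial results from its children and forwards one value upward;
    a leaf represents an end-node contributing its local data."""

    def __init__(self, children=None, leaf_value=None):
        self.children = children or []
        self.leaf_value = leaf_value  # set only on end-node leaves

    def reduce_up(self):
        if not self.children:  # leaf: an end-node's local contribution
            return self.leaf_value
        # Each AN forwards a single aggregated value, so the volume of
        # data travelling toward the root shrinks at every tree level.
        return sum(child.reduce_up() for child in self.children)

# Eight end-nodes under two leaf-level ANs and one root AN.
leaves = [AggregationNode(leaf_value=v) for v in range(1, 9)]
root = AggregationNode(children=[
    AggregationNode(children=leaves[:4]),
    AggregationNode(children=leaves[4:]),
])
total = root.reduce_up()  # the fully reduced result emerges at the root
```

In the real system the Aggregation Manager would carve such a tree out of the physical switch topology for each job; here the tree shape is simply hard-coded for illustration.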

### <mark style="color:purple;">Benefits and Problem Solving</mark>

SHARP addresses the challenges of efficient collective communication in several ways:

<mark style="color:purple;">**Reduced Data Movement:**</mark> By performing data reduction and aggregation <mark style="color:yellow;">**within the network**</mark>, SHARP reduces the amount of data that needs to be transmitted between endpoints. This decreases network congestion and improves bandwidth utilisation.

<mark style="color:purple;">**Lower Latency:**</mark> SHARP's in-network aggregation <mark style="color:yellow;">**minimises the number of hops**</mark> data needs to travel, reducing the overall latency of collective operations.

<mark style="color:purple;">**Offloading from CPUs and GPUs:**</mark> By offloading collective operations to the network, SHARP frees up valuable CPU and GPU resources for computation tasks, improving overall system efficiency.

<mark style="color:purple;">**Scalability:**</mark> SHARP's aggregation trees can be built over various network topologies, enabling efficient scaling of collective operations as system sizes grow.

<mark style="color:purple;">**Support for Parallel Jobs:**</mark> With the ability to create multiple aggregation trees over the same topology, SHARP allows multiple parallel jobs to benefit from in-network aggregation simultaneously.
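The latency and scalability benefits above can be illustrated with a back-of-envelope comparison: an endpoint-based ring allreduce needs a number of sequential communication steps that grows linearly with the node count, while a SHARP-style in-network tree needs only one pass up and one pass down, so its step count grows with tree depth. The switch radix of 40 below is an assumed example value, not a statement about any specific NVIDIA switch:

```python
import math

def ring_allreduce_steps(n_nodes):
    """Endpoint ring allreduce: 2*(n-1) sequential steps
    (reduce-scatter followed by allgather)."""
    return 2 * (n_nodes - 1)

def in_network_steps(n_nodes, radix=40):
    """SHARP-style aggregation tree: one pass up and one pass down,
    so latency grows with tree depth ~ log_radix(n), not with n.
    The radix of 40 is an illustrative assumption."""
    depth = max(1, math.ceil(math.log(n_nodes, radix)))
    return 2 * depth

# Sequential steps at a few system sizes (illustrative only).
comparison = {n: (ring_allreduce_steps(n), in_network_steps(n))
              for n in (64, 1024, 16384)}
```

At 16,384 nodes the ring needs tens of thousands of sequential steps while the tree needs a handful, which is the essence of the "lower latency" and "scalability" claims above; this sketch ignores per-step bandwidth, which also matters in practice.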

SHARP has undergone multiple generations of development, with the latest at the time of writing being SHARP Rev 3.0.0.

Each generation has introduced enhancements and expanded capabilities, such as support for both low-latency and streaming aggregation operations, increased number of aggregation trees, and compatibility with the latest NVIDIA hardware.

In summary, NVIDIA SHARP is a technology that tackles the challenges of efficient collective communication in large-scale HPC and AI systems.&#x20;

By offloading collective operations to the network and leveraging the capabilities of NVIDIA's smart switches and adapters, SHARP significantly reduces latency, improves bandwidth utilisation, and enhances overall system performance.&#x20;

It represents a significant step forward in enabling scalable and efficient collective communication in modern computing environments.
