# NVIDIA Grace CPU Superchip

The <mark style="color:blue;">**NVIDIA Grace CPU Superchip**</mark> marks a significant advance in data centre CPUs.&#x20;

It's designed specifically for the intensive demands of modern cloud, enterprise, and high-performance computing (HPC) environments, as well as other computationally intensive tasks.&#x20;

The Grace CPU takes a fresh architectural approach, delivering superior performance per watt and redefining cost-effectiveness and operational efficiency in data centres.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2F3Nbeg7RsGXISk6uihfVz%2Fimage.png?alt=media&#x26;token=1203969b-2e07-4be3-9088-f832f1a2e2dd" alt=""><figcaption><p>NVIDIA Grace CPU Superchip</p></figcaption></figure>

### <mark style="color:green;">Arm Architecture</mark>

The Arm architecture in the NVIDIA Grace CPU Superchip, specifically the Neoverse V2 cores, incorporates several advanced features to meet the high-performance and efficiency demands of data centre CPUs.&#x20;

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FfTBeXnHxdcLNFis5PSlV%2Fimage.png?alt=media&#x26;token=780752f5-6966-43a1-9db0-6451e9fb8794" alt=""><figcaption><p>NVIDIA Grace Arm Neoverse V2 Core is the highest performing Arm Neoverse core with support for SVE2 to accelerate key applications</p></figcaption></figure>

### <mark style="color:green;">**Armv9.0-A Architecture**</mark>

Armv9.0-A extends the Armv8-A architecture, building on the features introduced up to Armv8.5-A.&#x20;

{% embed url="https://developer.arm.com/Architectures/A-Profile%20Architecture" %}

The Grace CPU supports application binaries built for Armv8 through Armv8.5-A, ensuring backward compatibility with CPUs like Ampere Altra, AWS Graviton2, and AWS Graviton3.

### <mark style="color:blue;">SIMD Vectorisation with SVE2 and NEON</mark>

* <mark style="color:blue;">**SIMD (Single Instruction Multiple Data)**</mark> is a technique that allows a single instruction to perform the same operation on multiple data elements simultaneously, improving performance for certain types of workloads.
* The Grace CPU supports <mark style="color:yellow;">**two SIMD instruction sets**</mark>: <mark style="color:blue;">**SVE2 (Scalable Vector Extension version 2)**</mark> and <mark style="color:blue;">**NEON (Advanced SIMD)**</mark>.
* <mark style="color:blue;">**SVE2**</mark> is a newer and more advanced SIMD extension that allows for variable-length vector operations, enabling better performance and flexibility compared to fixed-length SIMD architectures.
* <mark style="color:blue;">**NEON**</mark> is a well-established SIMD extension that has been widely used in Arm-based processors for multimedia and signal processing applications.
* By supporting both SVE2 and NEON, the Grace CPU allows more software to take advantage of SIMD optimisations, resulting in improved performance for suitable workloads.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FwVXh3kTbi6Oinklgve3H%2Fimage.png?alt=media&#x26;token=38024ef8-1384-4dba-883c-7d050339ad47" alt=""><figcaption><p>Scalar vs. SIMD Operations</p></figcaption></figure>

#### <mark style="color:blue;">Atomic Operations</mark>

Atomic operations are indivisible operations that ensure data consistency in multi-threaded or multi-processor environments.

The <mark style="color:blue;">**Large System Extension (LSE)**</mark> in the Grace CPU provides hardware support for low-cost atomic operations.

LSE improves system throughput by optimising common synchronisation primitives like [locks](#user-content-fn-1)[^1] and mutexes[^2], which are used for coordinating access to shared resources between CPUs.

With LSE, the Grace CPU can efficiently handle CPU-to-CPU communication and synchronisation, leading to better overall system performance in multi-processor setups.

<details>

<summary><mark style="color:green;">Understanding LSE</mark></summary>

Large System Extensions (LSE) are enhancements to the Arm architecture that improve the performance of atomic operations in systems with many processors, particularly useful in multi-core environments like servers running Arm Neoverse processors.

In multi-core systems, where multiple processors or threads may simultaneously access shared data, maintaining data integrity during read-modify-write cycles is crucial. Traditional approaches often used load exclusive and store exclusive instructions, which can become inefficient in systems with a high number of processors due to the increased complexity and contention.

LSE introduces new atomic instructions with the Armv8.1-A architecture, simplifying these operations by allowing them to be performed as single, indivisible operations. This significantly reduces the complexity of programming for concurrency and improves performance and scalability by minimising the overhead associated with coordinating access to shared data.

#### Key Features of LSE

1. **Atomic Instructions**: Includes operations like Compare and Swap (CAS, CASP), Swap (SWP), and atomic memory operations (LD\<op>, ST\<op>), which support direct atomic modifications on memory.
2. **Simplified Coding**: Reduces the need for complex lock-based programming by ensuring that atomic operations are handled as single, indivisible operations that are easier to write and less error-prone.
3. **Improved Performance**: Especially beneficial in systems with high core counts, as it minimises the overhead and latency associated with managing access to shared resources.
4. **Support in Newer Architectures**: Further enhancements and support for LSE were introduced in subsequent versions like Armv8.2-A and Armv8.4-A.

#### Practical Impact and Adoption

1. **Server Environments**: LSE is particularly relevant in server environments such as AWS's Graviton processors, where high-performance, efficient multi-core processing is critical. For example, AWS Graviton2 and Graviton3 instances benefit significantly from LSE, delivering improved performance over previous generations.
2. **Software Compatibility**: Software that relies on traditional lock-based mechanisms may not immediately benefit from LSE unless it is recompiled or adapted to use the new atomic instructions.
3. **Developer Tools**: Knowing whether tools and compilers (such as GCC) support LSE is important for developers aiming to optimise applications for Arm Neoverse platforms.

</details>

#### <mark style="color:blue;">Additional Armv9 Features</mark>

* **Cryptographic acceleration:** Enhances the performance of cryptographic algorithms.
* **Scalable profiling extension:** Provides tools for detailed performance analysis.
* **Virtualization extensions:** Improve the efficiency and security of virtualised environments.
* **Full memory encryption and secure boot:** Enhance the security of data and the integrity of the boot process.

#### <mark style="color:blue;">Application Performance</mark>

The Grace CPU Superchip is optimised for a range of high-performance computing and data centre applications.&#x20;

It excels in environments where rapid access to large amounts of data is necessary, and its high memory bandwidth supports complex computational tasks efficiently. This makes it particularly well-suited for scientific simulations, large-scale data analytics, and machine learning workloads.

In summary, the Arm architecture in the NVIDIA Grace CPU Superchip provides a robust foundation for building and running high-performance, energy-efficient applications in modern data centres.

Its support for advanced SIMD operations, atomic instructions, and high-speed interconnects, along with comprehensive backward compatibility and security features, positions it as a powerful solution for the most demanding computational tasks.

### <mark style="color:green;">Software Ecosystem</mark>

The NVIDIA Grace CPU benefits from a rich and mature software ecosystem, which is an important reason for its adoption and usability across various domains.

The extensive software support ensures that users can seamlessly transition to the Grace CPU platform without the need for significant modifications to their existing software stack.

Compatibility with major Linux distributions is a key advantage, as Linux is the predominant operating system in data centres, high-performance computing (HPC), and cloud environments.

This compatibility allows users to leverage the vast collection of software packages, libraries, and tools available in these distributions, making it easier to deploy and manage applications on the Grace CPU.

The Grace CPU ecosystem also includes a wide range of development tools, such as compilers, libraries, profilers, and system administration utilities. These tools are essential for developers to build, optimise, and debug their applications effectively.

The importance of this extensive software ecosystem cannot be overstated. It enables users to leverage their existing skills, knowledge, and codebase, reducing the learning curve and time-to-deployment when adopting the Grace CPU.

The ecosystem also fosters collaboration and innovation, as developers can build upon existing tools and libraries to create new applications and solutions.

### <mark style="color:green;">Programmability and Toolchain Support</mark>

Programming the NVIDIA Grace CPU is straightforward and flexible, thanks to the comprehensive toolchain support.

Developers can choose from a variety of programming languages and paradigms based on their preferences and the requirements of their applications.

For applications built using interpreted or Just-in-Time (JIT) compiled languages and runtimes, such as Python, Java, PHP, and Node.js, the Grace CPU provides seamless compatibility.

These applications can run on the Grace CPU without any modifications, as the interpreters and runtimes for these languages are readily available on Arm-based systems.

Compiled applications, written in languages such as C, C++, and Fortran, can also be easily ported to the Grace CPU.

Existing application binaries compiled for Armv8 or later architectures can run on the Grace CPU without the need for recompilation.&#x20;

However, to take full advantage of the Grace CPU's capabilities and maximise performance, *<mark style="color:yellow;">developers can recompile their applications using compilers that support the Armv9 Instruction Set Architecture (ISA)</mark>* and optimise for the Neoverse V2 microarchitecture (for example, recent GCC and LLVM releases accept `-mcpu=neoverse-v2`).

The Grace CPU is supported by a wide range of compilers, including:

1. <mark style="color:blue;">**GCC (GNU Compiler Collection):**</mark> A popular open-source compiler suite that supports multiple languages and architectures, including Arm.
2. <mark style="color:blue;">**LLVM:**</mark> A modular and extensible compiler framework that provides a collection of tools and libraries for building compilers and related tools.
3. <mark style="color:blue;">**NVHPC (NVIDIA HPC Compilers):**</mark> NVIDIA's suite of compilers optimised for NVIDIA hardware, enabling high-performance computing on the Grace CPU.
4. <mark style="color:blue;">**Arm Compiler for Linux:**</mark> Arm's proprietary compiler suite, specifically designed for Arm-based systems, offering advanced optimisations and performance tuning.
5. <mark style="color:blue;">**HPE Cray Compilers:**</mark> A set of compilers optimised for HPC workloads, with support for the Grace CPU.

<mark style="color:blue;">**NVIDIA's Nsight**</mark> family of performance analysis tools is particularly noteworthy for developers working with the Grace CPU.&#x20;

Nsight Systems and Nsight Compute provide deep insights into application behaviour, allowing developers to identify performance bottlenecks, visualise GPU and CPU utilisation, and optimise resource usage.&#x20;

These tools seamlessly integrate with the NVIDIA software ecosystem, supporting CUDA, OpenMP, and other parallel programming models.

The extensive programmability and toolchain support for the NVIDIA Grace CPU empowers developers to create high-performance, scalable, and efficient applications across various domains.&#x20;

By leveraging the available compilers, libraries, and tools, developers can unlock the full potential of the Grace CPU and accelerate their application development and optimisation processes.

### <mark style="color:green;">**NVLink-C2C (Chip-to-Chip)**</mark>

The <mark style="color:blue;">**NVIDIA Grace CPU Superchip**</mark> employs the [<mark style="color:blue;">**NVLink Chip-to-Chip (C2C)**</mark>](https://training.continuumlabs.ai/infrastructure/servers-and-chips/nvlink-switch) interconnect to manage and accelerate data communication between processing units. It tackles the data bottlenecks and inefficient memory access patterns common in traditional multi-socket server architectures.&#x20;

This technology provides high-bandwidth <mark style="color:yellow;">(900 GB/s)</mark> connections between chips, enabling fast data transfer and reducing latency in multi-chip configurations, which is essential for performance in high-compute environments.

#### <mark style="color:green;">Understanding NUMA (Non-Uniform Memory Access)</mark>

<mark style="color:blue;">**NUMA**</mark> is a <mark style="color:yellow;">**server architecture**</mark> used in <mark style="color:yellow;">**multi-processor systems**</mark> where the memory access time depends on the memory location relative to a processor.&#x20;

In NUMA, each processor (or group of processors) has its own local memory, and accessing memory across processors takes more time than accessing local memory. This architecture is designed to scale the performance of high-end servers by minimising the bottleneck of a single shared memory.

<details>

<summary><mark style="color:green;">How Does NUMA Work?</mark></summary>

<mark style="color:blue;">**Non-Uniform Memory Access (NUMA)**</mark> is a computer memory design that helps improve the efficiency and speed of your computer's processing power when dealing with multiple processors. Here’s a simplified breakdown of how NUMA works:

#### <mark style="color:green;">What is NUMA?</mark>

NUMA stands for Non-Uniform Memory Access. It's a setup in computers with multiple processors, where each processor has its own local memory that it accesses faster than non-local memory, or memory that belongs to another processor.

#### <mark style="color:green;">How Does NUMA Work?</mark>

Here’s a simplified step-by-step explanation:

1. <mark style="color:purple;">**Memory Partitioning**</mark><mark style="color:purple;">:</mark> The computer's memory is divided into portions, with each portion directly connected to a specific processor.
2. <mark style="color:purple;">**Processor Awareness**</mark><mark style="color:purple;">:</mark> Each processor knows which portion of memory is its own (local) and which portions are connected to other processors (remote).
3. **Accessing Memory**: Processors first check their local memory for data. If it’s not there, they check the remote memory. Accessing local memory is faster.
4. <mark style="color:purple;">**Managing Data Flow**</mark><mark style="color:purple;">:</mark> When data must be accessed from remote memory, it travels through a high-speed link that connects the processors, but it's still slower than accessing local memory.
5. <mark style="color:purple;">**Optimising Performance**</mark><mark style="color:purple;">:</mark> Techniques are used to ensure that processors access their local memory as much as possible, reducing the need for slower, remote memory access.
6. <mark style="color:purple;">**Scalability**</mark><mark style="color:purple;">:</mark> As more processors are added to the system, NUMA helps manage the memory among them efficiently, keeping the system fast.

#### <mark style="color:green;">Importance of NUMA</mark>

In environments where large-scale, complex computing tasks are common—such as scientific research, financial modelling, or data analysis—NUMA can significantly enhance performance by reducing the time processors spend waiting for data. This setup is crucial for high-performance computing (HPC), where time and speed are critical.

</details>

In a typical server with a [multi-socket configuration](#user-content-fn-3)[^3], each socket can have one or more dies[^4], and each die can represent multiple NUMA domains.&#x20;

#### <mark style="color:green;">How Sockets and Dies Relate to NUMA</mark>

NUMA, or Non-Uniform Memory Access, is a system design that optimises the use of memory by grouping cores and their nearest memory together into "NUMA nodes."&#x20;

Each NUMA node offers faster access to its own local memory than to the memory local to other nodes, enhancing performance for workloads that can utilise local memory effectively.

#### <mark style="color:green;">Challenges with Multi-Socket and Multi-Die Configurations</mark>

When you have a server with multiple sockets and possibly multiple dies within those sockets, the complexity increases:

<mark style="color:blue;">**NUMA Domains**</mark><mark style="color:blue;">:</mark> Each die might represent one or more NUMA domains, depending on its design and the memory connected to it. The more dies and sockets you have, the more NUMA domains there are.

<mark style="color:blue;">**Data Travel**</mark><mark style="color:blue;">:</mark> In multi-socket, multi-die environments, data might need to travel across different NUMA domains to be processed. For example, if a processor on one die needs data that is stored in the memory local to another die (or another socket altogether), this data must travel through the system's interconnects (like buses or fabric) to reach the requesting processor.

<mark style="color:blue;">**Latency Issues:**</mark> Each time data travels across these NUMA domains, it incurs latency. The farther the data has to travel—especially across sockets—the longer it takes. This is because accessing local memory is always faster than accessing remote memory associated with other dies or sockets.

<mark style="color:blue;">**Performance Impact**</mark><mark style="color:blue;">:</mark> Applications sensitive to memory access times can experience performance bottlenecks in such a setup. This is because the speed at which these applications run can be significantly affected by how quickly they can access the necessary data.

So while multi-socket and multi-die configurations provide more processing power and the ability to handle more tasks, they also introduce challenges in terms of memory access efficiency.&#x20;

Understanding and optimising the layout of sockets, dies, and NUMA domains is key to maximising the efficiency of such systems.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FBdrL7iAT82GvATvYFEBD%2Fimage.png?alt=media&#x26;token=327124bd-153b-47a6-9bff-5f50fcd8bd90" alt=""><figcaption><p>Comparison of the Grace CPU Superchip with NVLink-C2C compared to traditional server architecture</p></figcaption></figure>

### <mark style="color:green;">NVLink-C2C Interconnect: Alleviating NUMA Bottlenecks</mark>

The NVLink-C2C interconnect is a high-speed, direct connection technology developed by NVIDIA that provides a <mark style="color:yellow;">**substantial bandwidth of 900 GB/s between chips**</mark>.&#x20;

Here’s how NVLink-C2C addresses and alleviates NUMA bottlenecks:

<mark style="color:blue;">**High Bandwidth Communication**</mark><mark style="color:blue;">:</mark> By offering 900 GB/s, NVLink-C2C allows for *<mark style="color:yellow;">faster data transfer between the cores and memory across different NUMA nodes</mark>*. This high-speed data transfer capability is critical for workloads that require significant memory bandwidth and where data needs to be moved frequently and rapidly across processor nodes.

<mark style="color:blue;">**Simplified Memory Topology**</mark><mark style="color:blue;">:</mark> The Grace CPU Superchip uses a straightforward memory topology with *<mark style="color:yellow;">only two NUMA nodes</mark>*. This simplicity means that there are fewer "hops" for the data to make when moving from one processor or memory node to another, reducing latency and the complexity of memory access patterns.

<mark style="color:blue;">**Direct Chip-to-Chip Communication**</mark><mark style="color:blue;">:</mark> Unlike traditional interconnects that may require data to pass through multiple controllers or bridges, *<mark style="color:yellow;">NVLink-C2C provides a direct pathway between chips</mark>*. This setup not only speeds up data transfer but also minimises the latency typically associated with complex routing through different motherboard components.

<mark style="color:blue;">**Application Performance Improvement**</mark><mark style="color:blue;">:</mark> For application developers, this means easier optimisation for NUMA architectures, as the reduced number of NUMA nodes simplifies the logic for distributing and accessing data. Applications can perform more efficiently due to reduced waiting times for data and increased overall throughput.

### <mark style="color:green;">Scale Cores and Bandwidth with NVIDIA Scalable Coherency Fabric</mark>

The <mark style="color:blue;">**NVIDIA Scalable Coherency Fabric (SCF)**</mark> is an architectural component used to manage the movement of data across different parts of a computing system, particularly in high-performance CPUs like the NVIDIA Grace CPU Superchip.&#x20;

It plays a role in maintaining data coherence and performance scalability across an increasing number of cores and higher bandwidth demands.&#x20;

Here's a breakdown of how it functions and its importance:

### <mark style="color:green;">Functionality of Scalable Coherency Fabric</mark>

<mark style="color:blue;">**Mesh Network**</mark>

SCF uses a mesh network topology, which interconnects multiple components like CPU cores, memory, and I/O devices through a grid-like pattern. This setup facilitates efficient data transfer across various points without overloading any single connection path.

<mark style="color:blue;">**Distributed Cache Architecture**</mark>

The fabric incorporates a distributed cache system, particularly [<mark style="color:yellow;">distributed L3 cache</mark>](#user-content-fn-5)[^5], which is shared among the CPU cores. This cache stores frequently accessed data close to the processor cores to reduce latency and improve speed when accessing this data.

<mark style="color:blue;">**Cache Switch Nodes**</mark>

Within the SCF, Cache Switch Nodes play a pivotal role. They *<mark style="color:yellow;">act as routers within the mesh network</mark>*, directing data between the CPU cores, the cache memory, and the system's input/output operations. These nodes ensure that data flows efficiently across the system, managing both the routing and the coherence of data.

<mark style="color:blue;">**High Bandwidth**</mark>

SCF supports extremely high bisection bandwidth, with capabilities exceeding 3.2 terabytes per second. This high bandwidth is essential for handling the vast amounts of data processed in complex computations and maintaining system performance without bottlenecks.

The NVIDIA Scalable Coherency Fabric is a key architectural feature in NVIDIA's advanced CPU designs, providing a scalable, coherent, and efficient method to manage data flow and cache usage across an extensive array of cores and system components. This fabric ensures that as systems grow more complex, they continue to operate efficiently and coherently.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FCUJwM0slZ521WdJzpYhd%2Fimage.png?alt=media&#x26;token=f79f45cc-db8c-4290-be0b-05c6b92cadd8" alt=""><figcaption><p>NVIDIA Grace CPU and the NVIDIA Scalable Coherency Fabric, which join the Neoverse V2 cores, distributed cache and system IO in a high-bandwidth mesh interconnect</p></figcaption></figure>

### <mark style="color:green;">**LPDDR5X Memory**</mark>

<mark style="color:blue;">**Low-Power Double Data Rate 5X (LPDDR5X)**</mark> memory is an enhanced version of the LPDDR5 standard. Originally developed for mobile devices, it is now also used in servers and high-performance computing systems.&#x20;

LPDDR5X is engineered to meet the demands of applications requiring high bandwidth and low power consumption, such as large-scale artificial intelligence (AI) and high-performance computing (HPC) workloads.&#x20;

<details>

<summary><mark style="color:green;"><strong>How LPDDR5X Memory Works</strong></mark></summary>

<mark style="color:blue;">**Enhanced Data Rate**</mark>

LPDDR5X can transmit more data per clock cycle compared to its predecessors. This is achieved through more efficient data bus utilisation and higher memory clock speeds, leading to increased overall bandwidth. This means that more data can be processed faster, which is crucial for memory-intensive applications.

<mark style="color:blue;">**Low Power Consumption**</mark>

LPDDR5X implements several features to reduce power usage, including a more refined manufacturing process that allows for lower voltage operations and improved I/O signalling techniques that decrease power draw during data transmission. This low power consumption is essential for reducing overall system energy costs and improving energy efficiency, especially in data centres where power can be a significant portion of operational expenses.

<mark style="color:blue;">**Error Correction Code (ECC)**</mark>

ECC within LPDDR5X helps to ensure data integrity by correcting errors that occur during data transfer. This is particularly important in environments where data corruption can lead to significant losses or inaccuracies, such as in financial computing or scientific research.

</details>

LPDDR5X in the NVIDIA Grace CPU Superchip strikes an optimal balance between high performance (through increased bandwidth), lower power consumption, and cost efficiency.&#x20;

Its implementation supports the demands of next-generation computing applications, providing a robust foundation for advancements in AI and HPC fields.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FnyKhwDeOFxExovyLY2gZ%2Fimage.png?alt=media&#x26;token=c24e84f5-5b4b-450d-bdd7-ef29a53fd456" alt=""><figcaption><p>Samsung, the <a href="https://www.sammobile.com/news/samsung-profit-q1-2024-rises-931-percent-chip-sales-recover/">world's biggest memory chip manufacturer</a>, has unveiled its fastest LPDDR5X <a href="https://www.sammobile.com/tag/dram/">DRAM</a> chip. The new chip can attain data transfer speeds of up to 10.7Gbps, higher than the <a href="https://www.sammobile.com/news/samsung-lpddr5x-dram-announced-galaxy-s22/">6.4Gbps LPDDR5X chip launched in 2021</a> and the <a href="https://www.sammobile.com/news/samsung-launches-industry-fastest-8-5gbps-lpddr5x-dram/">8.5Gbps LPDDR5X DRAM chip unveiled in 2022</a>.</p></figcaption></figure>

### <mark style="color:green;">Memory Partitioning and Monitoring (MPAM)</mark>

The NVIDIA Grace CPU incorporates Arm's <mark style="color:blue;">**Memory Partitioning and Monitoring (MPAM)**</mark> technology, which is designed to enhance the control and monitoring of memory and cache resources in a multi-tenant environment.&#x20;

This feature is especially useful in data centres and for applications that require strong isolation between different tasks or jobs to prevent them from interfering with each other's performance.

<figure><img src="https://1839612753-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FpV8SlQaC976K9PPsjApL%2Fuploads%2FYxObMilfZwb49NeLPkyg%2Fimage.png?alt=media&#x26;token=71c64984-0e92-4ed3-9e51-014f297de038" alt=""><figcaption><p>Grace CPU memory, SCF cache, PCIe, NVLink, and NVLink- C2C can be partitioned for cloud native workloads</p></figcaption></figure>

### <mark style="color:green;">Detailed explanation of how MPAM works</mark>

### <mark style="color:blue;">Memory Partitioning</mark>

<mark style="color:green;">**Partitioning System Cache and Memory Resources**</mark>

* MPAM allows the system to divide its cache and memory resources into partitions. Each partition can be allocated to different jobs or applications running on the system. This ensures that each job has access to its designated resources without being affected by the resource demands of other jobs.

<mark style="color:green;">**Ensuring Performance Isolation**</mark>

* By partitioning the resources, MPAM ensures that the performance of one job does not suffer because another job is consuming an excessive amount of cache or memory resources. This is important in environments where multiple applications or users are sharing the same physical hardware, as it maintains stable and predictable performance.

### <mark style="color:blue;">Monitoring and Management</mark>

<mark style="color:green;">**SCF Cache Support for Partitioning**</mark>

* The NVIDIA-designed Scalable Coherency Fabric (SCF) Cache extends the capabilities of MPAM by supporting the partitioning of cache capacity. It allows for the allocation of specific portions of the cache to different jobs, further enhancing the ability to control and isolate system resources.

<mark style="color:green;">**Partitioning of I/O and Memory Bandwidth**</mark>

* In addition to cache capacity, MPAM and SCF together manage the partitioning of I/O bandwidth and memory bandwidth. This means that each job can have a designated amount of bandwidth, preventing scenarios where a bandwidth-heavy job could starve other jobs of the bandwidth they need to perform effectively.

### <mark style="color:blue;">Performance Monitoring Groups (PMGs)</mark>

<mark style="color:green;">**Monitoring Resource Utilisation**</mark>

* MPAM employs Performance Monitoring Groups (PMGs) to keep track of how resources are being used by different jobs. PMGs can monitor various metrics, such as cache storage usage and memory bandwidth utilisation. This monitoring is vital for system administrators to understand performance dynamics and to make informed decisions about resource allocation.

<mark style="color:green;">**Insights for Optimisation**</mark>

* The data collected by PMGs help in identifying bottlenecks or inefficiencies in resource usage. Administrators can use this information to optimise the system for better performance and resource utilisation, adjusting partitions and allocations based on the actual needs of different jobs.

### <mark style="color:blue;">Benefits</mark>

* **Improved System Efficiency:** By ensuring that resources are not monopolised by a single job, MPAM helps in maintaining high overall system efficiency.
* **Enhanced Security and Isolation:** Resource partitioning also enhances security and isolation between different tenants or jobs, which is critical in multi-user environments.
* **Flexibility and Scalability:** MPAM provides the flexibility to adjust resource allocations in response to changing workloads, making systems more adaptable and scalable.

The integration of Arm’s MPAM in the NVIDIA Grace CPU enables sophisticated management and monitoring of memory and cache resources, ensuring that the system can handle multiple concurrent jobs efficiently and securely.&#x20;

This technology is particularly beneficial in high-performance computing and data centre environments where resource isolation and performance predictability are crucial.

### <mark style="color:purple;">NVIDIA Grace CPU Superchip Specifications</mark>

| **Feature**                     | **Specification**                                |
| ------------------------------- | ------------------------------------------------ |
| **Core Count**                  | 144 Arm Neoverse V2 Cores with 4x128b SVE2       |
| **Cache**                       | L1: 64KB i-cache + 64KB d-cache per core         |
|                                 | L2: 1MB per core                                 |
|                                 | L3: 228MB total                                  |
| **Base Frequency**              | 3.1 GHz                                          |
| **All-Core SIMD Frequency**     | 3.0 GHz                                          |
| **Memory**                      | LPDDR5X: 240GB, 480GB, or 960GB options          |
| **Memory Bandwidth**            | Up to 768 GB/s for 960GB memory                  |
|                                 | Up to 1024 GB/s for 240GB, 480GB memory          |
| **NVLink-C2C Bandwidth**        | 900GB/s                                          |
| **PCIe Links**                  | Up to 8x PCIe Gen5 x16, with bifurcation options |
| **Module Thermal Design Power** | 500W TDP with memory                             |
| **Form Factor**                 | Superchip module                                 |
| **Thermal Solution**            | Air cooled or liquid cooled                      |
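A couple of useful figures follow directly from the table by simple arithmetic; the per-core numbers below are derived, not separately published specifications.

```python
# Derived figures from the specification table above.
cores = 144
l2_per_core_mb = 1        # 1MB L2 per core
peak_bw_gbps = 1024       # 240GB/480GB memory configurations

total_l2_mb = cores * l2_per_core_mb
bw_per_core = peak_bw_gbps / cores

print(total_l2_mb)             # 144 MB of L2 across the superchip
print(round(bw_per_core, 1))   # ~7.1 GB/s of peak bandwidth per core
```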

### <mark style="color:purple;">Overview and Impact</mark>

The Grace CPU Superchip is engineered to tackle the most demanding data centre and HPC environments, providing up to twice the performance per watt compared to current x86 platforms.

It simplifies data centre architecture by integrating components that traditionally sit in separate server units into a single module.&#x20;

This integration boosts power efficiency, increases compute density, and simplifies system design.

This CPU is particularly advantageous for applications requiring intensive computational power such as deep learning, scientific computation, and real-time data analytics.&#x20;

With its robust suite of technologies, including the NVLink-C2C interconnect and ECC memory, the Grace CPU Superchip sets a new standard for data centre CPUs, ensuring that enterprises can handle expansive workloads with greater efficiency and reliability.

The NVIDIA Grace CPU Superchip represents a significant leap forward in data centre processing, delivering unparalleled performance and efficiency that align with the needs of modern enterprises and research institutions.

[^1]: A **lock** is a general term for mechanisms that enforce limits on access to a resource in an environment where there are many threads or processes that might access the same resource. Locks help ensure that only one thread or process can access the resource at any given time.

[^2]: A **mutex** is a specific type of lock that allows only one thread to access a resource at a time. It's a mutual exclusion object that provides ownership along with locking capabilities, meaning that if a thread locks a mutex, only that same thread is allowed to unlock it. This property makes mutexes a safe and efficient tool for managing resource access among threads.

[^3]: A socket in a server is essentially a physical connector on the motherboard. Each socket hosts a separate CPU chip (processor). Multi-socket configurations, where a motherboard has more than one socket, allow the system to use multiple CPUs. This increases the system's processing power and its ability to handle more tasks simultaneously.

[^4]: A die refers to the piece of silicon inside the CPU chip that contains the actual circuits, including cores and caches. A single CPU may contain one or more dies. When a CPU has multiple dies, it's often referred to as a "multi-die" configuration.

[^5]: Distributed L3 cache refers to a cache memory shared among all cores of a processor. This setup allows multiple cores to access a larger, common cache pool efficiently, which improves data retrieval speeds and reduces latency by storing frequently accessed data close to the processor cores. This shared resource enhances the performance of the CPU by minimising the need to fetch data from slower, external memory sources.
