NVIDIA Grace CPU Superchip

The NVIDIA Grace CPU Superchip marks a significant advance in data centre CPUs.

It is designed for the intensive demands of modern cloud, enterprise, high-performance computing (HPC), and other computationally intensive workloads.

Its architecture prioritises performance per watt, improving both the cost-effectiveness and the operational efficiency of data centres.

Arm Architecture

The Arm architecture in the NVIDIA Grace CPU Superchip, specifically the Neoverse V2 cores, incorporates several advanced features to meet the high-performance and efficiency demands of data centre CPUs.

Armv9.0-A Architecture

Armv9.0-A extends the Armv8-A architecture, building on the features introduced up to Armv8.5-A.

The Grace CPU runs application binaries built for Armv8-A through Armv8.5-A, ensuring backward compatibility with earlier Arm server CPUs such as Ampere Altra, AWS Graviton2, and AWS Graviton3.

SIMD Vectorisation with SVE2 and NEON

  • SIMD (Single Instruction Multiple Data) is a technique that allows a single instruction to perform the same operation on multiple data elements simultaneously, improving performance for certain types of workloads.

  • The Grace CPU supports two SIMD instruction sets: SVE2 (Scalable Vector Extension version 2) and NEON (Advanced SIMD).

  • SVE2 is a newer and more advanced SIMD extension that allows for variable-length vector operations, enabling better performance and flexibility compared to fixed-length SIMD architectures.

  • NEON is a well-established SIMD extension that has been widely used in Arm-based processors for multimedia and signal processing applications.

  • By supporting both SVE2 and NEON, the Grace CPU allows more software to take advantage of SIMD optimisations, resulting in improved performance for suitable workloads.
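
To make the vector-length-agnostic model concrete, here is a minimal sketch (illustrative code, not taken from NVIDIA's documentation) of a SAXPY loop written with the Arm C Language Extensions for SVE. The same binary runs correctly whatever vector width the hardware implements, because the predicate produced by svwhilelt_b32_u64 masks off the tail elements:

    // build with an SVE-capable target, e.g. gcc -O3 -march=armv8-a+sve saxpy_sve.c
    #include <arm_sve.h>
    #include <stddef.h>

    // y[i] = a * x[i] + y[i], written without assuming a fixed vector length
    void saxpy_sve(float a, const float *x, float *y, size_t n)
    {
        for (size_t i = 0; i < n; i += svcntw()) {          // svcntw(): 32-bit lanes per vector
            svbool_t pg = svwhilelt_b32_u64(i, n);           // predicate covers the remaining elements
            svfloat32_t vx = svld1_f32(pg, &x[i]);           // predicated loads
            svfloat32_t vy = svld1_f32(pg, &y[i]);
            vy = svmla_n_f32_x(pg, vy, vx, a);               // vy += vx * a
            svst1_f32(pg, &y[i], vy);                        // predicated store
        }
    }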

Atomic Operations

Atomic operations are indivisible operations that ensure data consistency in multi-threaded or multi-processor environments.

The Large System Extensions (LSE) in the Grace CPU provide hardware support for low-cost atomic operations.

LSE improves system throughput by optimising common synchronisation primitives, such as locks and mutexes, which are used for coordinating access to shared resources between CPUs.

With LSE, the Grace CPU can efficiently handle CPU-to-CPU communication and synchronisation, leading to better overall system performance in multi-processor setups.

Understanding LSE

Large System Extensions (LSE) are enhancements to the Arm architecture that improve the performance of atomic operations in systems with many processors, particularly useful in multi-core environments like servers running Arm Neoverse processors.

In multi-core systems, where multiple processors or threads may simultaneously access shared data, maintaining data integrity during read-modify-write cycles is crucial. Traditional approaches often used load exclusive and store exclusive instructions, which can become inefficient in systems with a high number of processors due to the increased complexity and contention.

LSE introduces new atomic instructions with the Armv8.1-A architecture, simplifying these operations by allowing them to be performed as single, indivisible operations. This significantly reduces the complexity of programming for concurrency and improves performance and scalability by minimizing the overhead associated with coordinating access to shared data.
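
As a brief illustration (standard C, not Grace-specific code), compilers emit these LSE instructions automatically for ordinary C11 atomics once the target architecture allows it, for example when building with -march=armv8.1-a or a Neoverse CPU target:

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_long counter;

    // with LSE enabled this compiles to a single LDADD instruction;
    // without LSE it becomes a load-exclusive/store-exclusive retry loop
    long increment(void)
    {
        return atomic_fetch_add(&counter, 1);
    }

    // with LSE enabled this compiles to a CAS instruction
    bool try_claim(atomic_long *slot, long expected, long desired)
    {
        return atomic_compare_exchange_strong(slot, &expected, desired);
    }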

Key Features of LSE

  1. Atomic Instructions: Includes operations like Compare and Swap (CAS, CASP), Swap (SWP), and atomic memory operations (LD<op>, ST<op>), which support direct atomic modifications on memory.

  2. Simplified Coding: Reduces the need for complex lock-based programming by ensuring that atomic operations are handled as single, indivisible operations that are easier to write and less error-prone.

  3. Improved Performance: Especially beneficial in systems with high core counts, as it minimizes the overhead and latency associated with managing access to shared resources.

  4. Support in Newer Architectures: Further enhancements and support for LSE were introduced in subsequent versions like Armv8.2-A and Armv8.4-A.

Practical Impact and Adoption

  1. Server Environments: LSE is particularly relevant in server environments like AWS's Graviton processors, where high-performance and efficient multi-core processing are critical. For example, AWS Graviton2 and Graviton3 instances benefit significantly from LSE by providing improved performance metrics over previous generations.

  2. Software Compatibility: Software that uses traditional lock-based mechanisms may not benefit from LSE until it is recompiled or adapted to use the new atomic instructions, at which point it can see noticeable performance gains.

  3. Developer Tools: Understanding whether tools and compilers (like GCC) support LSE can be crucial for developers aiming to optimize applications for Arm Neoverse platforms.

Additional Armv9 Features

  • Cryptographic acceleration: Enhances the performance of cryptographic algorithms.

  • Scalable profiling extension: Provides tools for detailed performance analysis.

  • Virtualization extensions: Improve the efficiency and security of virtualised environments.

  • Full memory encryption and secure boot: Enhance the security of data and the integrity of the boot process.

Application Performance

The Grace CPU Superchip is optimised for a range of high-performance computing and data centre applications.

It excels in environments where rapid access to large amounts of data is necessary, and its high memory bandwidth supports complex computational tasks efficiently. This makes it particularly well-suited for scientific simulations, large-scale data analytics, and machine learning workloads.

In summary, the Arm architecture in the NVIDIA Grace CPU Superchip provides a robust foundation for building and running high-performance, energy-efficient applications in modern data centres.

Its support for advanced SIMD operations, atomic instructions, and high-speed interconnects, along with comprehensive backward compatibility and security features, positions it as a powerful solution for the most demanding computational tasks.

Software Ecosystem

The NVIDIA Grace CPU benefits from a rich and mature software ecosystem, which is an important reason for its adoption and usability across various domains.

The extensive software support ensures that users can seamlessly transition to the Grace CPU platform without the need for significant modifications to their existing software stack.

Compatibility with major Linux distributions is a key advantage, as Linux is the predominant operating system in data centres, high-performance computing (HPC), and cloud environments.

This compatibility allows users to leverage the vast collection of software packages, libraries, and tools available in these distributions, making it easier to deploy and manage applications on the Grace CPU.

The Grace CPU ecosystem also includes a wide range of development tools, such as compilers, libraries, profilers, and system administration utilities. These tools are essential for developers to build, optimise, and debug their applications effectively.

The importance of this extensive software ecosystem cannot be overstated. It enables users to leverage their existing skills, knowledge, and codebase, reducing the learning curve and time-to-deployment when adopting the Grace CPU.

The ecosystem also fosters collaboration and innovation, as developers can build upon existing tools and libraries to create new applications and solutions.

Programmability and Toolchain Support

Programming the NVIDIA Grace CPU is straightforward and flexible, thanks to the comprehensive toolchain support.

Developers can choose from a variety of programming languages and paradigms based on their preferences and the requirements of their applications.

For applications built on interpreted or Just-in-Time (JIT) compiled languages and runtimes such as Python, Java, PHP, and Node.js, the Grace CPU provides seamless compatibility.

These applications can run on the Grace CPU without any modifications, as the interpreters and runtimes for these languages are readily available on Arm-based systems.

Compiled applications, written in languages such as C, C++, and Fortran, can also be easily ported to the Grace CPU.

Existing application binaries compiled for Armv8 or later architectures can run on the Grace CPU without the need for recompilation.

However, to take full advantage of the Grace CPU's capabilities and maximise performance, developers can recompile their applications using compilers that support the Armv9 Instruction Set Architecture (ISA) and optimise for the Neoverse V2 microarchitecture.

The Grace CPU is supported by a wide range of compilers, including:

  1. GCC (GNU Compiler Collection): A popular open-source compiler suite that supports multiple languages and architectures, including Arm.

  2. LLVM: A modular and extensible compiler framework that provides a collection of tools and libraries for building compilers and related tools.

  3. NVHPC (NVIDIA HPC Compilers): NVIDIA's suite of compilers optimised for NVIDIA hardware, enabling high-performance computing on the Grace CPU.

  4. Arm Compiler for Linux: Arm's proprietary compiler suite, specifically designed for Arm-based systems, offering advanced optimisations and performance tuning.

  5. HPE Cray Compilers: A set of compilers optimized for HPC workloads, with support for the Grace CPU.
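
For example, with a sufficiently recent toolchain (GCC 13 and Clang 16 onwards recognise the core name), an existing C or C++ code base can simply be rebuilt with flags such as -O3 -mcpu=neoverse-v2, which enables the core's Armv9 features, including SVE2, and tunes code generation for the Neoverse V2 pipeline; -march=armv9-a is a more generic alternative that targets the architecture level rather than a specific core.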

NVIDIA's Nsight family of performance analysis tools is particularly noteworthy for developers working with the Grace CPU.

Nsight Systems and Nsight Compute provide deep insights into application behavior, allowing developers to identify performance bottlenecks, visualise GPU and CPU utilisation, and optimise resource usage.

These tools seamlessly integrate with the NVIDIA software ecosystem, supporting CUDA, OpenMP, and other parallel programming models.

The extensive programmability and toolchain support for the NVIDIA Grace CPU empowers developers to create high-performance, scalable, and efficient applications across various domains.

By leveraging the available compilers, libraries, and tools, developers can unlock the full potential of the Grace CPU and accelerate their application development and optimisation processes.

NVLink-C2C (Chip-to-Chip)

NVLink-C2C provides high-bandwidth (900 GB/s) connections between chips, enabling fast data transfer and reducing latency in multi-chip configurations; this direct chip-to-chip connectivity is essential for performance in compute-intensive environments.

Understanding NUMA (Non-Uniform Memory Access)

NUMA is a server architecture used in multi-processor systems where the memory access time depends on the memory location relative to a processor.

In NUMA, each processor (or group of processors) has its own local memory, and accessing memory across processors takes more time than accessing local memory. This architecture is designed to scale the performance of high-end servers by minimising the bottleneck of a single shared memory.

How Does NUMA Work?

Here's a simplified step-by-step explanation:

  1. Memory Partitioning: The computer's memory is divided into portions, with each portion directly connected to a specific processor.

  2. Processor Awareness: Each processor knows which portion of memory is its own (local) and which portions are connected to other processors (remote).

  3. Accessing Memory: Processors first check their local memory for data. If it’s not there, they check the remote memory. Accessing local memory is faster.

  4. Managing Data Flow: When data must be accessed from remote memory, it travels through a high-speed link that connects the processors, but it's still slower than accessing local memory.

  5. Optimising Performance: Techniques are used to ensure that processors access their local memory as much as possible, reducing the need for slower, remote memory access.

  6. Scalability: As more processors are added to the system, NUMA helps manage the memory among them efficiently, keeping the system fast.
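
On Linux, the placement described in the steps above can also be influenced explicitly from application code. The sketch below is a hypothetical example using the standard libnuma library (link with -lnuma): it pins the calling thread to one NUMA node and allocates a buffer from that node's local memory, so subsequent accesses stay local:

    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma: NUMA is not available on this system\n");
            return 1;
        }
        const int node = 0;                              // illustrative choice: node 0
        numa_run_on_node(node);                          // restrict this thread to node 0's CPUs
        const size_t size = 64UL * 1024 * 1024;
        char *buf = numa_alloc_onnode(size, node);       // memory backed by node 0
        if (!buf) return 1;
        memset(buf, 0, size);                            // touch pages so they are actually placed
        printf("64 MiB buffer allocated on NUMA node %d\n", node);
        numa_free(buf, size);
        return 0;
    }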

Importance of NUMA

In environments where large-scale, complex computing tasks are common—like in scientific research, financial modeling, or data analysis—NUMA can significantly enhance performance by reducing the time processors spend waiting for data. This setup is crucial for high-performance computing (HPC) where time and speed are critical.

In a typical server with multiple sockets, each socket can have one or more dies, and each die can represent multiple NUMA domains.

How Sockets and Dies Relate to NUMA

NUMA, or Non-Uniform Memory Access, is a system design that optimises the use of memory by grouping cores and their nearest memory together into "NUMA nodes."

Each NUMA node offers faster access to its own local memory than to the memory local to other nodes, enhancing performance for workloads that can utilise local memory effectively.

Challenges with Multi-Socket and Multi-Die Configurations

When you have a server with multiple sockets and possibly multiple dies within those sockets, the complexity increases:

NUMA Domains: Each die might represent one or more NUMA domains, depending on its design and the memory connected to it. The more dies and sockets you have, the more NUMA domains there are.

Data Travel: In multi-socket, multi-die environments, data might need to travel across different NUMA domains to be processed. For example, if a processor on one die needs data that is stored in the memory local to another die (or another socket altogether), this data must travel through the system's interconnects (like buses or fabric) to reach the requesting processor.

Latency Issues: Each time data travels across these NUMA domains, it incurs latency. The farther the data has to travel—especially across sockets—the longer it takes. This is because accessing local memory is always faster than accessing remote memory associated with other dies or sockets.

Performance Impact: Applications sensitive to memory access times can experience performance bottlenecks in such a setup. This is because the speed at which these applications run can be significantly affected by how quickly they can access the necessary data.

So while multi-socket and multi-die configurations provide more processing power and the ability to handle more tasks, they also introduce challenges in terms of memory access efficiency.

Understanding and optimising the layout of sockets, dies, and NUMA domains is key to maximising the efficiency of such systems.

NVLink-C2C Interconnect: Alleviating NUMA Bottlenecks

The NVLink-C2C interconnect is a high-speed, direct connection technology developed by NVIDIA that provides a substantial bandwidth of 900 GB/s between chips.

Here’s how NVLink-C2C addresses and alleviates NUMA bottlenecks:

High Bandwidth Communication: By offering 900 GB/s, NVLink-C2C allows for faster data transfer between the cores and memory across different NUMA nodes. This high-speed data transfer capability is critical for workloads that require significant memory bandwidth and where data needs to be moved frequently and rapidly across processor nodes.

Simplified Memory Topology: The Grace CPU Superchip uses a straightforward memory topology with only two NUMA nodes. This simplicity means that there are fewer "hops" for the data to make when moving from one processor or memory node to another, reducing latency and the complexity of memory access patterns.

Direct Chip-to-Chip Communication: Unlike traditional interconnects that may require data to pass through multiple controllers or bridges, NVLink-C2C provides a direct pathway between chips. This setup not only speeds up data transfer but also minimises the latency typically associated with complex routing through different motherboard components.

Application Performance Improvement: For application developers, this means easier optimisation for NUMA architectures, as the reduced number of NUMA nodes simplifies the logic for distributing and accessing data. Applications can perform more efficiently due to reduced waiting times for data and increased overall throughput.
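
One way to see the effect of this simplified topology is to inspect the NUMA distance matrix that the firmware reports to the operating system. The sketch below is an illustrative use of libnuma (not vendor code) that prints this matrix; on a two-node system such as the Grace CPU Superchip you would expect a small 2x2 matrix with a comparatively low remote distance, reflecting the NVLink-C2C link between the two chips:

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) return 1;
        int max = numa_max_node();                       // highest NUMA node number
        for (int i = 0; i <= max; i++) {
            for (int j = 0; j <= max; j++)
                printf("%4d", numa_distance(i, j));      // SLIT-style relative distances
            printf("\n");
        }
        return 0;
    }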

Scale Cores and Bandwidth with NVIDIA Scalable Coherency Fabric

The NVIDIA Scalable Coherency Fabric (SCF) is an architectural component used to manage the movement of data across different parts of a computing system, particularly in high-performance CPUs like the NVIDIA Grace CPU Superchip.

It plays a role in maintaining data coherence and performance scalability across an increasing number of cores and higher bandwidth demands.

Here's a breakdown of how it functions and its importance:

Functionality of Scalable Coherency Fabric

Mesh Network

SCF uses a mesh network topology, which interconnects multiple components like CPU cores, memory, and I/O devices through a grid-like pattern. This setup facilitates efficient data transfer across various points without overloading any single connection path.

Distributed Cache Architecture

The fabric incorporates a distributed cache system, in particular a large L3 cache that is shared among the CPU cores. This cache stores frequently accessed data close to the processor cores to reduce latency and improve speed when accessing this data.

Cache Switch Nodes

Within the SCF, Cache Switch Nodes play a pivotal role. They act as routers within the mesh network, directing data between the CPU cores, the cache memory, and the system's input/output operations. These nodes ensure that data flows efficiently across the system, managing both the routing and the coherence of data.

High Bandwidth

SCF supports extremely high bi-section bandwidth, with capabilities exceeding 3.2 terabytes per second. This high bandwidth is essential for handling the vast amounts of data processed in complex computations and maintaining system performance without bottlenecks.

The NVIDIA Scalable Coherency Fabric is a key architectural feature in NVIDIA's advanced CPU designs, providing a scalable, coherent, and efficient method to manage data flow and cache usage across an extensive array of cores and system components. This fabric ensures that as systems grow more complex, they continue to operate efficiently and coherently.

LPDDR5X Memory

Low-Power Double Data Rate 5X (LPDDR5X) is an advanced, low-power variant of DDR memory, which the Grace CPU uses in place of the standard DDR memory found in most servers and high-performance computing systems.

LPDDR5X is engineered to meet the demands of applications requiring high bandwidth and low power consumption, such as large-scale artificial intelligence (AI) and high-performance computing (HPC) workloads.

How LPDDR5X Memory Works

Enhanced Data Rate

LPDDR5X can transmit more data per clock cycle compared to its predecessors. This is achieved through more efficient data bus utilisation and higher memory clock speeds, leading to increased overall bandwidth. This means that more data can be processed faster, which is crucial for memory-intensive applications.
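
As a rough, illustrative calculation (the figures are generic LPDDR5X numbers, not a statement of the Grace memory layout): a device running at 8533 MT/s on a 64-bit channel moves 8533 MT/s × 8 bytes per transfer ≈ 68 GB/s, so an aggregate figure in the region of 1 TB/s, as quoted in the specifications below, implies on the order of sixteen such channels operating in parallel.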

Low Power Consumption

LPDDR5X implements several features to reduce power usage, including a more refined manufacturing process that allows lower-voltage operation and improved I/O signalling techniques that decrease power draw during data transmission. This low power consumption reduces overall system energy costs and improves energy efficiency, which matters especially in data centres, where power can be a significant portion of operational expenses.

Error Correction Code (ECC)

ECC within LPDDR5X helps to ensure data integrity by correcting errors that occur during data transfer. This is particularly important in environments where data corruption can lead to significant losses or inaccuracies, such as in financial computing or scientific research.

LPDDR5X in the NVIDIA Grace CPU Superchip strikes an optimal balance between high performance (through increased bandwidth), lower power consumption, and cost efficiency.

Its implementation supports the demands of next-generation computing applications, providing a robust foundation for advancements in AI and HPC fields.

Memory Partitioning and Monitoring (MPAM)

The NVIDIA Grace CPU incorporates Arm's Memory Partitioning and Monitoring (MPAM) technology, which is designed to enhance the control and monitoring of memory and cache resources in a multi-tenant environment.

This feature is especially useful in data centres and for applications that require strong isolation between different tasks or jobs to prevent them from interfering with each other's performance.

Detailed explanation of how MPAM works

Memory Partitioning

Partitioning System Cache and Memory Resources

  • MPAM allows the system to divide its cache and memory resources into partitions. Each partition can be allocated to different jobs or applications running on the system. This ensures that each job has access to its designated resources without being affected by the resource demands of other jobs.

Ensuring Performance Isolation

  • By partitioning the resources, MPAM ensures that the performance of one job does not suffer because another job is consuming an excessive amount of cache or memory resources. This is important in environments where multiple applications or users are sharing the same physical hardware, as it maintains stable and predictable performance.

Monitoring and Management

SCF Cache Support for Partitioning

  • The NVIDIA-designed Scalable Coherency Fabric (SCF) Cache extends the capabilities of MPAM by supporting the partitioning of cache capacity. It allows for the allocation of specific portions of the cache to different jobs, further enhancing the ability to control and isolate system resources.

Partitioning of I/O and Memory Bandwidth

  • In addition to cache capacity, MPAM and SCF together manage the partitioning of I/O bandwidth and memory bandwidth. This means that each job can have a designated amount of bandwidth, preventing scenarios where a bandwidth-heavy job could starve other jobs of the bandwidth they need to perform effectively.

Performance Monitoring Groups (PMGs)

Monitoring Resource Utilisation

  • MPAM employs Performance Monitoring Groups (PMGs) to keep track of how resources are being used by different jobs. PMGs can monitor various metrics, such as cache storage usage and memory bandwidth utilisation. This monitoring is vital for system administrators to understand performance dynamics and to make informed decisions about resource allocation.

Insights for Optimisation

  • The data collected by PMGs help in identifying bottlenecks or inefficiencies in resource usage. Administrators can use this information to optimise the system for better performance and resource utilisation, adjusting partitions and allocations based on the actual needs of different jobs.

Benefits

  • Improved System Efficiency: By ensuring that resources are not monopolised by a single job, MPAM helps in maintaining high overall system efficiency.

  • Enhanced Security and Isolation: Resource partitioning also enhances security and isolation between different tenants or jobs, which is critical in multi-user environments.

  • Flexibility and Scalability: MPAM provides the flexibility to adjust resource allocations in response to changing workloads, making systems more adaptable and scalable.

The integration of Arm’s MPAM in the NVIDIA Grace CPU enables sophisticated management and monitoring of memory and cache resources, ensuring that the system can handle multiple concurrent jobs efficiently and securely.

This technology is particularly beneficial in high-performance computing and data centre environments where resource isolation and performance predictability are crucial.

NVIDIA Grace CPU Superchip Specifications

  • Core Count: 144 Arm Neoverse V2 cores with 4x128b SVE2
  • Cache: L1 64KB i-cache + 64KB d-cache per core; L2 1MB per core; L3 228MB total
  • Base Frequency: 3.1 GHz
  • All-Core SIMD Frequency: 3.0 GHz
  • Memory: LPDDR5X, with 240GB, 480GB and 960GB options
  • Memory Bandwidth: Up to 768 GB/s (960GB memory); up to 1024 GB/s (240GB and 480GB memory)
  • NVLink-C2C Bandwidth: 900 GB/s
  • PCIe Links: Up to 8x PCIe Gen5 x16, with bifurcation options
  • Module Thermal Design Power: 500W TDP with memory
  • Form Factor: Superchip module
  • Thermal Solution: Air cooled or liquid cooled

Overview and Impact

The Grace CPU Superchip is engineered to tackle the most demanding data centre and HPC environments, providing up to twice the performance per watt compared to current x86 platforms.

It simplifies the architecture of data centres by integrating critical components which traditionally resided in multiple server units into a single chip.

This integration not only boosts power efficiency but also enhances the density and simplifies the system design.

This CPU is particularly advantageous for applications requiring intensive computational power such as deep learning, scientific computation, and real-time data analytics.

With its robust suite of technologies, including the NVLink-C2C interconnect and ECC memory, the Grace CPU Superchip sets a new standard for data centre CPUs, ensuring that enterprises can handle expansive workloads with greater efficiency and reliability.

The NVIDIA Grace CPU Superchip represents a significant leap forward in data centre processing, delivering unparalleled performance and efficiency that align with the needs of modern enterprises and research institutions.

The NVIDIA Grace CPU Superchip employs the NVLink-C2C interconnect technology to manage and enhance data communication between multiple processing units, tackling the data bottlenecks and inefficient memory access patterns commonly found in traditional multi-socket server architectures.

Figure captions:
  • NVIDIA Grace Arm Neoverse V2 Core: the highest-performing Arm Neoverse core, with support for SVE2 to accelerate key applications
  • Scalar vs. SIMD operations
  • Comparison of the Grace CPU Superchip with NVLink-C2C against a traditional server architecture
  • NVIDIA Grace CPU and the NVIDIA Scalable Coherency Fabric, which join the Neoverse V2 cores, distributed cache and system IO in a high-bandwidth mesh interconnect
  • Grace CPU memory, SCF cache, PCIe, NVLink, and NVLink-C2C can be partitioned for cloud-native workloads

Samsung, the world's biggest memory chip manufacturer, has unveiled its fastest LPDDR5X DRAM chip. The new chip can attain data transfer speeds of up to 10.7Gbps, higher than the 6.4Gbps LPDDR5X chip launched in 2021 and the 8.5Gbps LPDDR5X DRAM chip unveiled in 2022.