NVIDIA Grace CPU Superchip
The NVIDIA Grace CPU Superchip marks a significant advance in data centre CPUs.
It's designed specifically for the intensive demands of modern cloud, enterprise, high-performance computing (HPC), and various other computational-intensive tasks.
The Grace CPU's architecture prioritises performance per watt, improving both cost-effectiveness and operational efficiency in the data centre.
The Arm architecture in the NVIDIA Grace CPU Superchip, specifically the Neoverse V2 cores, incorporates several advanced features to meet the high-performance and efficiency demands of data centre CPUs.
The Neoverse V2 core implements the Armv9 architecture, which builds on and extends the Armv8-A line up to Armv8.5-A.
The Grace CPU supports application binaries built for Armv8 through Armv8.5-A, ensuring backward compatibility with CPUs like Ampere Altra, AWS Graviton2, and AWS Graviton3.
SIMD (Single Instruction Multiple Data) is a technique that allows a single instruction to perform the same operation on multiple data elements simultaneously, improving performance for certain types of workloads.
The Grace CPU supports two SIMD instruction sets: SVE2 (Scalable Vector Extension version 2) and NEON (Advanced SIMD).
SVE2 is a newer and more advanced SIMD extension that allows for variable-length vector operations, enabling better performance and flexibility compared to fixed-length SIMD architectures.
NEON is a well-established SIMD extension that has been widely used in Arm-based processors for multimedia and signal processing applications.
By supporting both SVE2 and NEON, the Grace CPU allows more software to take advantage of SIMD optimisations, resulting in improved performance for suitable workloads.
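To make the SIMD model concrete, here is a minimal sketch (a hypothetical AXPY kernel, not taken from NVIDIA's documentation) using the Arm C Language Extensions for SVE. Because SVE is vector-length agnostic, the same binary adapts to whatever vector width the hardware provides.

```c
#include <arm_sve.h>  /* ACLE SVE intrinsics; build with e.g. gcc -O3 -mcpu=neoverse-v2 */

/* y[i] += a * x[i] — vector-length agnostic: the loop strides by however
 * many 32-bit lanes the hardware's SVE vectors hold (svcntw()). */
void axpy(float a, const float *x, float *y, int n) {
    svfloat32_t va = svdup_n_f32(a);
    for (int i = 0; i < n; i += (int)svcntw()) {
        svbool_t pg = svwhilelt_b32_s32(i, n);     /* predicate masks off the tail */
        svfloat32_t vx = svld1_f32(pg, x + i);
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_f32_x(pg, vy, vx, va);          /* vy += vx * a */
        svst1_f32(pg, y + i, vy);
    }
}
```

The same source also compiles for fixed-width targets; the predicate handles any remainder iterations without a separate scalar tail loop.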
Atomic operations are indivisible operations that ensure data consistency in multi-threaded or multi-processor environments.
The Large System Extension (LSE) in the Grace CPU provides hardware support for low-cost atomic operations.
LSE improves system throughput by optimising common synchronisation primitives such as locks and mutexes, which coordinate access to shared resources between CPUs.
With LSE, the Grace CPU can efficiently handle CPU-to-CPU communication and synchronisation, leading to better overall system performance in multi-processor setups.
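As a small illustration (a generic C11 sketch, not NVIDIA-specific code), the atomic increment below compiles to a single LSE instruction such as LDADD when built for an LSE-capable target, rather than the older load-exclusive/store-exclusive retry loop.

```c
#include <stdatomic.h>

/* Shared counter updated concurrently by many threads. */
static atomic_long requests_served;

void record_request(void) {
    /* Built with e.g. gcc -O3 -mcpu=neoverse-v2 (or -march=armv8.1-a+lse),
     * this maps to a single atomic LDADD rather than an LDXR/STXR loop. */
    atomic_fetch_add_explicit(&requests_served, 1, memory_order_relaxed);
}
```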
Beyond SIMD and atomics, the architecture includes several further extensions:
Cryptographic acceleration: Enhances the performance of cryptographic algorithms.
Scalable profiling extension: Provides tools for detailed performance analysis.
Virtualisation extensions: Improve the efficiency and security of virtualised environments.
Full memory encryption and secure boot: Enhance the security of data and the integrity of the boot process.
The Grace CPU Superchip is optimised for a range of high-performance computing and data centre applications.
It excels in environments where rapid access to large amounts of data is necessary, and its high memory bandwidth supports complex computational tasks efficiently. This makes it particularly well-suited for scientific simulations, large-scale data analytics, and machine learning workloads.
In summary, the Arm architecture in the NVIDIA Grace CPU Superchip provides a robust foundation for building and running high-performance, energy-efficient applications in modern data centres.
Its support for advanced SIMD operations, atomic instructions, and high-speed interconnects, along with comprehensive backward compatibility and security features, positions it as a powerful solution for the most demanding computational tasks.
The NVIDIA Grace CPU benefits from a rich and mature software ecosystem, which is an important reason for its adoption and usability across various domains.
The extensive software support ensures that users can seamlessly transition to the Grace CPU platform without the need for significant modifications to their existing software stack.
Compatibility with major Linux distributions is a key advantage, as Linux is the predominant operating system in data centres, high-performance computing (HPC), and cloud environments.
This compatibility allows users to leverage the vast collection of software packages, libraries, and tools available in these distributions, making it easier to deploy and manage applications on the Grace CPU.
The Grace CPU ecosystem also includes a wide range of development tools, such as compilers, libraries, profilers, and system administration utilities. These tools are essential for developers to build, optimise, and debug their applications effectively.
The importance of this extensive software ecosystem cannot be overstated. It enables users to leverage their existing skills, knowledge, and codebase, reducing the learning curve and time-to-deployment when adopting the Grace CPU.
The ecosystem also fosters collaboration and innovation, as developers can build upon existing tools and libraries to create new applications and solutions.
Programming the NVIDIA Grace CPU is straightforward and flexible, thanks to the comprehensive toolchain support.
Developers can choose from a variety of programming languages and paradigms based on their preferences and the requirements of their applications.
For applications built using interpreted or Just-in-Time (JIT) compiled languages like Python, Java, PHP, and Node.js, the Grace CPU provides seamless compatibility.
These applications can run on the Grace CPU without any modifications, as the interpreters and runtimes for these languages are readily available on Arm-based systems.
Compiled applications, written in languages such as C, C++, and Fortran, can also be easily ported to the Grace CPU.
Existing application binaries compiled for Armv8 or later architectures can run on the Grace CPU without the need for recompilation.
However, to take full advantage of the Grace CPU's capabilities and maximise performance, developers can recompile their applications using compilers that support the Armv9 Instruction Set Architecture (ISA) and optimise for the Neoverse V2 microarchitecture.
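As a rough sketch of what that recompilation step looks like in practice (the file name and flags are illustrative, and exact option support depends on the compiler version), an existing C source file can simply be rebuilt with a Neoverse V2 tuning flag:

```c
/* saxpy.c — portable C that runs on any Armv8 server unchanged.
 *
 * Generic build (runs on Ampere Altra, Graviton2/3, and Grace):
 *     gcc -O3 -o saxpy saxpy.c
 * Grace-tuned build (assumes a GCC or LLVM recent enough to know Neoverse V2):
 *     gcc -O3 -mcpu=neoverse-v2 -o saxpy saxpy.c
 * The tuned build lets the compiler schedule for the V2 pipeline and
 * auto-vectorise with SVE2 where profitable.
 */
void saxpy(float a, const float *x, float *y, int n) {
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```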
The Grace CPU is supported by a wide range of compilers, including:
GCC (GNU Compiler Collection): A popular open-source compiler suite that supports multiple languages and architectures, including Arm.
LLVM: A modular and extensible compiler framework that provides a collection of tools and libraries for building compilers and related tools.
NVHPC (NVIDIA HPC Compilers): NVIDIA's suite of compilers optimised for NVIDIA hardware, enabling high-performance computing on the Grace CPU.
Arm Compiler for Linux: Arm's proprietary compiler suite, specifically designed for Arm-based systems, offering advanced optimisations and performance tuning.
HPE Cray Compilers: A set of compilers optimised for HPC workloads, with support for the Grace CPU.
NVIDIA's Nsight family of performance analysis tools is particularly noteworthy for developers working with the Grace CPU.
Nsight Systems and Nsight Compute provide deep insights into application behavior, allowing developers to identify performance bottlenecks, visualise GPU and CPU utilisation, and optimise resource usage.
These tools seamlessly integrate with the NVIDIA software ecosystem, supporting CUDA, OpenMP, and other parallel programming models.
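As a small example of how an application can make its phases visible to these tools, the sketch below (function and range names are illustrative) annotates a region with NVTX; when the program is run under Nsight Systems, the named range appears on the timeline.

```c
#include <nvToolsExt.h>   /* NVTX annotations; link with -lnvToolsExt */
#include <stdio.h>

/* Illustrative compute phase we want to see on the Nsight Systems timeline. */
static void relax(double *v, int n) {
    for (int i = 1; i < n - 1; ++i)
        v[i] = 0.5 * (v[i - 1] + v[i + 1]);
}

int main(void) {
    static double grid[1 << 20];
    nvtxRangePushA("relax_phase");   /* opens a named range on the timeline */
    relax(grid, 1 << 20);
    nvtxRangePop();                  /* closes it */
    puts("done");
    return 0;
}
```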
The extensive programmability and toolchain support for the NVIDIA Grace CPU empowers developers to create high-performance, scalable, and efficient applications across various domains.
By leveraging the available compilers, libraries, and tools, developers can unlock the full potential of the Grace CPU and accelerate their application development and optimisation processes.
The NVIDIA Grace CPU Superchip employs NVLink Chip-to-Chip (C2C) interconnect technology to manage data communication between its processing units, addressing the data bottlenecks and inefficient memory access patterns common in traditional multi-socket server architectures.
NVLink-C2C provides a high-bandwidth (900 GB/s) connection between the two chips, enabling fast data transfer and reducing latency in multi-chip configurations.
NUMA is a server architecture used in multi-processor systems where the memory access time depends on the memory location relative to a processor.
In NUMA, each processor (or group of processors) has its own local memory, and accessing memory across processors takes more time than accessing local memory. This architecture is designed to scale the performance of high-end servers by minimising the bottleneck of a single shared memory.
In a typical multi-socket server, each socket can hold one or more dies, and each die can represent multiple NUMA domains.
Cores and their nearest memory are grouped together into "NUMA nodes." Each NUMA node offers faster access to its own local memory than to the memory local to other nodes, enhancing performance for workloads that can keep their data local.
When you have a server with multiple sockets and possibly multiple dies within those sockets, the complexity increases:
NUMA Domains: Each die might represent one or more NUMA domains, depending on its design and the memory connected to it. The more dies and sockets you have, the more NUMA domains there are.
Data Travel: In multi-socket, multi-die environments, data might need to travel across different NUMA domains to be processed. For example, if a processor on one die needs data that is stored in the memory local to another die (or another socket altogether), this data must travel through the system's interconnects (like buses or fabric) to reach the requesting processor.
Latency Issues: Each time data travels across these NUMA domains, it incurs latency. The farther the data has to travel—especially across sockets—the longer it takes. This is because accessing local memory is always faster than accessing remote memory associated with other dies or sockets.
Performance Impact: Applications sensitive to memory access times can experience performance bottlenecks in such a setup. This is because the speed at which these applications run can be significantly affected by how quickly they can access the necessary data.
So while multi-socket and multi-die configurations provide more processing power and the ability to handle more tasks, they also introduce challenges in terms of memory access efficiency.
Understanding and optimising the layout of sockets, dies, and NUMA domains is key to maximising the efficiency of such systems.
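To make the local-versus-remote distinction tangible, the sketch below (a generic libnuma example, not specific to Grace; buffer sizes and node choices are illustrative) allocates one buffer on the calling thread's own NUMA node and one on the highest-numbered node, which on a two-node Grace Superchip would be the other die.

```c
#define _GNU_SOURCE
#include <numa.h>      /* libnuma; link with -lnuma */
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    size_t bytes = 64UL << 20;                       /* 64 MiB per buffer */
    int here  = numa_node_of_cpu(sched_getcpu());    /* node we run on    */
    int other = numa_max_node();                     /* highest node id   */

    void *local  = numa_alloc_onnode(bytes, here);
    void *remote = numa_alloc_onnode(bytes, other);

    /* First touch commits the pages on the chosen nodes; accesses to
     * 'remote' cross the interconnect, accesses to 'local' do not. */
    memset(local, 0, bytes);
    memset(remote, 0, bytes);

    printf("running on node %d, remote buffer on node %d\n", here, other);
    numa_free(local, bytes);
    numa_free(remote, bytes);
    return 0;
}
```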
The NVLink-C2C interconnect is a high-speed, direct connection technology developed by NVIDIA that provides a substantial bandwidth of 900 GB/s between chips.
Here’s how NVLink-C2C addresses and alleviates NUMA bottlenecks:
High Bandwidth Communication: By offering 900 GB/s, NVLink-C2C allows for faster data transfer between the cores and memory across different NUMA nodes. This high-speed data transfer capability is critical for workloads that require significant memory bandwidth and where data needs to be moved frequently and rapidly across processor nodes.
Simplified Memory Topology: The Grace CPU Superchip uses a straightforward memory topology with only two NUMA nodes. This simplicity means that there are fewer "hops" for the data to make when moving from one processor or memory node to another, reducing latency and the complexity of memory access patterns.
Direct Chip-to-Chip Communication: Unlike traditional interconnects that may require data to pass through multiple controllers or bridges, NVLink-C2C provides a direct pathway between chips. This setup not only speeds up data transfer but also minimises the latency typically associated with complex routing through different motherboard components.
Application Performance Improvement: For application developers, this means easier optimisation for NUMA architectures, as the reduced number of NUMA nodes simplifies the logic for distributing and accessing data. Applications can perform more efficiently due to reduced waiting times for data and increased overall throughput.
The NVIDIA Scalable Coherency Fabric (SCF) is an architectural component used to manage the movement of data across different parts of a computing system, particularly in high-performance CPUs like the NVIDIA Grace CPU Superchip.
It plays a role in maintaining data coherence and performance scalability across an increasing number of cores and higher bandwidth demands.
Here's a breakdown of how it functions and its importance:
Mesh Network
SCF uses a mesh network topology, which interconnects multiple components like CPU cores, memory, and I/O devices through a grid-like pattern. This setup facilitates efficient data transfer across various points without overloading any single connection path.
Distributed Cache Architecture
The fabric incorporates a distributed cache system, most notably the L3 cache, which is shared among the CPU cores. This cache stores frequently accessed data close to the processor cores to reduce latency and improve speed when accessing this data.
Cache Switch Nodes
Within the SCF, Cache Switch Nodes play a pivotal role. They act as routers within the mesh network, directing data between the CPU cores, the cache memory, and the system's input/output operations. These nodes ensure that data flows efficiently across the system, managing both the routing and the coherence of data.
High Bandwidth
SCF supports extremely high bi-section bandwidth, with capabilities exceeding 3.2 terabytes per second. This high bandwidth is essential for handling the vast amounts of data processed in complex computations and maintaining system performance without bottlenecks.
The NVIDIA Scalable Coherency Fabric is a key architectural feature in NVIDIA's advanced CPU designs, providing a scalable, coherent, and efficient method to manage data flow and cache usage across an extensive array of cores and system components. This fabric ensures that as systems grow more complex, they continue to operate efficiently and coherently.
Low-Power Double Data Rate 5X (LPDDR5X) memory is an advanced version of the standard DDR memory used primarily in servers and high-performance computing systems.
LPDDR5X is engineered to meet the demands of applications requiring high bandwidth and low power consumption, such as large-scale artificial intelligence (AI) and high-performance computing (HPC) workloads.
LPDDR5X in the NVIDIA Grace CPU Superchip strikes an optimal balance between high performance (through increased bandwidth), lower power consumption, and cost efficiency.
Its implementation supports the demands of next-generation computing applications, providing a robust foundation for advancements in AI and HPC fields.
The NVIDIA Grace CPU incorporates Arm's Memory Partitioning and Monitoring (MPAM) technology, which is designed to enhance the control and monitoring of memory and cache resources in a multi-tenant environment.
This feature is especially useful in data centres and for applications that require strong isolation between different tasks or jobs to prevent them from interfering with each other's performance.
Partitioning System Cache and Memory Resources
MPAM allows the system to divide its cache and memory resources into partitions. Each partition can be allocated to different jobs or applications running on the system. This ensures that each job has access to its designated resources without being affected by the resource demands of other jobs.
Ensuring Performance Isolation
By partitioning the resources, MPAM ensures that the performance of one job does not suffer because another job is consuming an excessive amount of cache or memory resources. This is important in environments where multiple applications or users are sharing the same physical hardware, as it maintains stable and predictable performance.
SCF Cache Support for Partitioning
The NVIDIA-designed Scalable Coherency Fabric (SCF) Cache extends the capabilities of MPAM by supporting the partitioning of cache capacity. It allows for the allocation of specific portions of the cache to different jobs, further enhancing the ability to control and isolate system resources.
Partitioning of I/O and Memory Bandwidth
In addition to cache capacity, MPAM and SCF together manage the partitioning of I/O bandwidth and memory bandwidth. This means that each job can have a designated amount of bandwidth, preventing scenarios where a bandwidth-heavy job could starve other jobs of the bandwidth they need to perform effectively.
Monitoring Resource Utilisation
MPAM employs Performance Monitor Groups (PMGs) to keep track of how resources are being used by different jobs. PMGs can monitor various metrics, such as cache storage usage and memory bandwidth utilisation. This monitoring is vital for system administrators to understand performance dynamics and to make informed decisions about resource allocation.
Insights for Optimisation
The data collected by PMGs help in identifying bottlenecks or inefficiencies in resource usage. Administrators can use this information to optimise the system for better performance and resource utilisation, adjusting partitions and allocations based on the actual needs of different jobs.
Improved System Efficiency: By ensuring that resources are not monopolised by a single job, MPAM helps in maintaining high overall system efficiency.
Enhanced Security and Isolation: Resource partitioning also enhances security and isolation between different tenants or jobs, which is critical in multi-user environments.
Flexibility and Scalability: MPAM provides the flexibility to adjust resource allocations in response to changing workloads, making systems more adaptable and scalable.
The integration of Arm’s MPAM in the NVIDIA Grace CPU enables sophisticated management and monitoring of memory and cache resources, ensuring that the system can handle multiple concurrent jobs efficiently and securely.
This technology is particularly beneficial in high-performance computing and data centre environments where resource isolation and performance predictability are crucial.
| Feature | Specification |
| --- | --- |
| Core Count | 144 Arm Neoverse V2 cores with 4x128b SVE2 |
| Cache | L1: 64KB i-cache + 64KB d-cache per core; L2: 1MB per core; L3: 228MB total |
| Base Frequency | 3.1 GHz |
| All-Core SIMD Frequency | 3.0 GHz |
| Memory | LPDDR5X: 240GB, 480GB, or 960GB options |
| Memory Bandwidth | Up to 1024 GB/s (240GB and 480GB options); up to 768 GB/s (960GB option) |
| NVLink-C2C Bandwidth | 900 GB/s |
| PCIe Links | Up to 8x PCIe Gen5 x16, with bifurcation options |
| Module Thermal Design Power | 500W TDP with memory |
| Form Factor | Superchip module |
| Thermal Solution | Air cooled or liquid cooled |
The Grace CPU Superchip is engineered to tackle the most demanding data centre and HPC environments, providing up to twice the performance per watt compared to current x86 platforms.
It simplifies the architecture of data centres by integrating critical components which traditionally resided in multiple server units into a single chip.
This integration not only boosts power efficiency but also enhances the density and simplifies the system design.
This CPU is particularly advantageous for applications requiring intensive computational power such as deep learning, scientific computation, and real-time data analytics.
With its robust suite of technologies, including NVLink-C2C and ECC-protected LPDDR5X memory, the Grace CPU Superchip sets a new standard for data centre CPUs, ensuring that enterprises can handle expansive workloads with greater efficiency and reliability.
The NVIDIA Grace CPU Superchip represents a significant leap forward in data centre processing, delivering unparalleled performance and efficiency that align with the needs of modern enterprises and research institutions.