NVMe (Non-Volatile Memory Express)
We provide a comprehensive overview of NVMe (Non-Volatile Memory Express) and NVMe 2.0, focusing on their key features, benefits, and use cases.
NVMe (Non-Volatile Memory Express) is a host controller interface and storage protocol designed specifically for solid-state drives (SSDs) and other non-volatile memory devices.
It was developed to fully leverage the benefits of flash-based storage and modern computer architectures, providing significant improvements in performance, efficiency, and scalability compared to older storage protocols such as SATA/AHCI and SAS.
NVMe 2.0 builds upon the success of NVMe, introducing significant enhancements and new features to address the growing complexity and diversity of storage systems.
The documentation aims to provide a clear understanding of NVMe and NVMe 2.0, their architectural improvements, and how they can be leveraged to create optimised storage solutions, particularly for GPU clusters and AI workloads.
Architecture
NVMe is designed from the ground up for SSDs and PCIe-based systems.
It uses a streamlined register interface, command set, and queue design that are optimised for the low latency and parallelism of flash storage. This allows NVMe to deliver high throughput and low latency with minimal overhead.
NVMe is primarily designed to work over PCI Express (PCIe), which provides high-speed, low-latency, direct access between the CPU/memory and storage devices. This eliminates the need for a separate storage controller, reducing latency and improving performance.
NVMe supports up to 64K I/O queues, each with up to 64K entries. This massive parallelism allows NVMe to scale with the increasing core counts of modern CPUs and handle large numbers of concurrent I/O requests efficiently.
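As a concrete sketch of the command format behind these queues, the following C struct mirrors the 64-byte submission queue entry layout defined in the NVMe base specification (field names are abbreviated here for illustration):

    #include <stdint.h>

    /* Illustrative layout of a 64-byte NVMe submission queue entry (SQE).
     * Field grouping follows the NVMe base specification; names are
     * abbreviated for this sketch. */
    struct nvme_sqe {
        uint8_t  opcode;      /* command opcode */
        uint8_t  flags;       /* fused operation, PRP/SGL selection */
        uint16_t command_id;  /* unique ID, echoed in the completion entry */
        uint32_t nsid;        /* namespace identifier */
        uint32_t cdw2, cdw3;  /* command-specific */
        uint64_t metadata;    /* metadata pointer */
        uint64_t prp1, prp2;  /* data pointers (PRP entries or SGL) */
        uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15; /* command-specific */
    }; /* total: 64 bytes */

Completion queue entries are 16 bytes and carry the same command_id, which is how the driver matches completions to the commands it submitted.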
While PCIe is the primary transport for NVMe, the protocol is designed to be transport-agnostic. NVMe can be used over other interconnects like Ethernet, InfiniBand, and Fibre Channel, enabling fast storage access over networks (known as NVMe-oF or NVMe over Fabrics).
Operating systems like Windows, Linux, and VMware have native NVMe drivers, making adoption straightforward. Many enterprise storage and virtualisation platforms also have built-in support for NVMe.
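On Linux, for instance, the native driver exposes an admin passthrough ioctl; a minimal sketch of issuing an Identify Controller command through it might look like this (a controller at /dev/nvme0 is assumed, and error handling is trimmed):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        unsigned char id[4096];             /* Identify data is 4 KiB */
        struct nvme_admin_cmd cmd;

        memset(&cmd, 0, sizeof(cmd));
        cmd.opcode   = 0x06;                /* Identify */
        cmd.addr     = (uint64_t)(uintptr_t)id;
        cmd.data_len = sizeof(id);
        cmd.cdw10    = 1;                   /* CNS=1: Identify Controller */

        int fd = open("/dev/nvme0", O_RDONLY);  /* admin char device */
        if (fd < 0 || ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd) != 0) {
            perror("identify");
            return 1;
        }
        /* Bytes 24..63 of the Identify Controller data hold the model number. */
        printf("model: %.40s\n", (char *)&id[24]);
        close(fd);
        return 0;
    }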
Key benefits of NVMe include:
Lower latency and higher throughput compared to SATA and SAS
Reduced CPU overhead and improved performance due to streamlined protocol
Scalability to handle massive amounts of data and large numbers of concurrent requests
Flexibility to work over various transports (PCIe, Ethernet, etc.)
Enabler for new applications and use cases that require fast, low-latency storage
In summary, NVMe is a modern, efficient, and high-performance storage protocol that unlocks the full potential of flash storage and modern computer architectures.
Its widespread adoption in enterprise, cloud, and consumer markets is driven by the ever-increasing demand for faster data access and more efficient storage solutions.
NVMe 2.0 is a significant evolution of the NVMe (Non-Volatile Memory Express) specification, designed to address the growing complexity and diversity of modern storage systems.
NVMe 2.0 introduces a major restructuring of the specifications, making them more modular and easier to develop and maintain.
The base specification now focuses on the core NVMe architecture and command set, while separate specifications are created for different command sets (e.g., NVM Command Set, Zoned Namespaces Command Set, Key Value Command Set) and transports (e.g., PCIe, RDMA, TCP).
This modular approach allows for faster development, easier innovation, and better maintainability of the specifications.
NVMe 2.0 introduces a new mechanism for supporting up to 64 I/O command sets, compared to the previous limit of 8.
Each namespace is associated with a specific command set, and a single NVMe subsystem can support namespaces with different command sets simultaneously.
This flexibility enables the development of specialised command sets for different use cases, such as the new Zoned Namespaces (ZNS) and Key Value (KV) command sets.
This modular separation of the base specification from the individual command set and transport specifications offers several benefits:
Flexibility: By separating command sets into distinct specifications, NVMe 2.0 allows for the development of specialised command sets tailored to specific use cases. This flexibility enables NVMe to adapt to the diverse needs of different applications and storage technologies.
Maintainability: Having separate specifications for each command set makes it easier to maintain and update them independently. This allows for faster innovation and evolution of individual command sets without impacting the core NVMe architecture.
Simplified Development: Developers working on a specific command set or transport can focus on the relevant specification without having to navigate through the entire NVMe specification. This simplifies the development process and reduces the likelihood of errors.
Example: Consider a developer working on a Zoned Namespaces (ZNS) implementation. With separate specifications, they can focus solely on the ZNS Command Set specification, which provides the necessary information for implementing ZNS functionality without having to worry about other aspects of the NVMe architecture.
The expanded support for multiple I/O command sets, with each namespace bound to the command set that suits it, offers several advantages:
Specialised Functionality: Different command sets can be designed to optimise for specific storage technologies or data access patterns. For example, the Zoned Namespaces (ZNS) command set is optimised for SSDs using NAND flash memory, while the Key Value (KV) command set is designed for unstructured data.
Efficient Resource Utilisation: By associating namespaces with specific command sets, NVMe 2.0 allows for more efficient utilisation of storage resources. Each namespace can be configured with the command set that best suits its requirements, enabling optimal performance and resource usage.
Flexibility in System Design: The ability to support multiple command sets within a single NVMe subsystem provides greater flexibility in system design. Storage architects can mix and match namespaces with different command sets to create heterogeneous storage solutions tailored to specific workloads.
Example: An NVMe subsystem in a data center could have some namespaces configured with the KV command set for handling unstructured data, while other namespaces use the traditional NVM command set for block-based storage. This allows the system to efficiently handle diverse data types and access patterns.
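At the protocol level, each command set is selected by a Command Set Identifier (CSI) that appears in Identify and namespace management commands; the values below are those assigned in the published specifications:

    /* Command Set Identifier (CSI) values from the NVMe 2.0 family of specs. */
    enum nvme_csi {
        NVME_CSI_NVM = 0x0,  /* traditional block-based NVM command set */
        NVME_CSI_KV  = 0x1,  /* Key Value command set */
        NVME_CSI_ZNS = 0x2,  /* Zoned Namespace command set */
    };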
Zoned Namespaces (ZNS) is a new command set introduced in NVMe 2.0 that is optimised specifically for solid-state drives (SSDs) using NAND flash memory.
It aims to address the unique characteristics and challenges associated with NAND flash, such as write amplification, over-provisioning, and the need for efficient mapping of logical addresses to physical locations.
ZNS organises data into zones, which are contiguous regions of logical block addresses (LBAs) that must be written sequentially.
Each zone has a fixed size and is associated with a specific range of LBAs.
Zones can be in different states, such as Empty, Implicitly Opened, Explicitly Opened, Closed, Full, Read Only, and Offline.
The host must write data to a zone sequentially, starting from the lowest LBA and progressing towards the highest LBA within the zone.
Once a zone is closed or marked as full, it cannot be written to again until it is reset.
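Expressed as a C enum, the zone states above carry the following encodings in the Zone State field of the ZNS zone descriptor:

    /* Zone State values reported in the ZNS zone descriptor. */
    enum zns_zone_state {
        ZNS_ZS_EMPTY           = 0x1,
        ZNS_ZS_IMPLICITLY_OPEN = 0x2,
        ZNS_ZS_EXPLICITLY_OPEN = 0x3,
        ZNS_ZS_CLOSED          = 0x4,
        ZNS_ZS_READ_ONLY       = 0xD,
        ZNS_ZS_FULL            = 0xE,
        ZNS_ZS_OFFLINE         = 0xF,
    };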
ZNS enforces a sequential write requirement within each zone, meaning that data must be written in a contiguous manner without skipping or overwriting LBAs.
This sequential write requirement aligns with the inherent characteristics of NAND flash memory, which is organised in pages and blocks.
Writing data sequentially minimises the need for garbage collection and reduces write amplification, as it avoids the need to constantly relocate and rewrite data.
Sequential writes also enable more efficient use of the NAND flash media, as it reduces the number of program/erase cycles required.
Write amplification occurs when the actual amount of data written to the NAND flash is greater than the amount of data requested by the host.
In conventional SSDs, write amplification is caused by factors such as garbage collection, wear leveling, and the need to maintain a mapping table between logical and physical addresses.
By enforcing sequential writes within zones, ZNS minimises write amplification, as it reduces the need for frequent garbage collection and data relocation.
This leads to improved write performance, reduced wear on the NAND flash, and increased SSD endurance.
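As a worked example with illustrative numbers: if the host writes 1 TB of data but garbage collection relocates an additional 3 TB internally, the write amplification factor is WAF = (1 TB + 3 TB) / 1 TB = 4, meaning the flash absorbs four times the host's write volume. A ZNS drive whose zones are filled sequentially and reset whole can keep this ratio close to the ideal value of 1.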
Over-provisioning refers to the practice of reserving a portion of the SSD's raw capacity for internal operations, such as garbage collection and wear leveling.
In conventional SSDs, a significant amount of over-provisioning is required to ensure efficient operation and maintain performance.
With ZNS, the sequential write requirement within zones reduces the need for extensive over-provisioning.
This allows for more usable capacity to be exposed to the host, as less space needs to be reserved for internal SSD operations.
In conventional SSDs, a mapping table is maintained to translate logical addresses (used by the host) to physical addresses (on the NAND flash).
As SSDs increase in capacity, the size of the mapping table grows, consuming more memory and computational resources.
ZNS simplifies the mapping table by leveraging the sequential write requirement within zones.
Instead of maintaining a mapping for each individual LBA, ZNS can use a more compact mapping at the zone level.
This reduces the memory footprint of the mapping table and improves the efficiency of address translation.
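To see the scale of the saving, assume (illustratively) a page-level mapping that keeps a 4-byte entry per 4 KiB logical block: a 16 TB drive then needs roughly 16 GB of mapping state, commonly held in DRAM. If the same drive instead tracks only a write pointer and state per 1 GiB zone, it has about 16,000 zones to track, and the table shrinks to a few hundred kilobytes.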
ZNS enables the host to have more control over data placement and management within the SSD.
The host can explicitly open and close zones, write data to specific zones, and track the state of each zone.
This host-managed approach allows for optimisations based on the specific workload and application requirements.
For example, the host can group related data into the same zone, enabling faster access and reducing the need for data movement.
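On Linux, a ZNS namespace appears as a zoned block device, so this explicit zone control can be exercised through the kernel's generic zoned-block ioctls. A minimal sketch follows; the device path and the 1 GiB zone size are assumptions for illustration:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDWR | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Zone boundaries are given in 512-byte sectors. The zone size
         * here (1 GiB = 2097152 sectors) is an assumption. */
        struct blk_zone_range zr = { .sector = 0, .nr_sectors = 2097152 };

        ioctl(fd, BLKOPENZONE, &zr);    /* explicitly open the zone */
        /* ... sequential writes at the zone's write pointer go here ... */
        ioctl(fd, BLKFINISHZONE, &zr);  /* transition the zone to Full */
        ioctl(fd, BLKRESETZONE, &zr);   /* reset: zone becomes Empty again */

        close(fd);
        return 0;
    }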
By aligning with the sequential write nature of NAND flash and reducing write amplification, ZNS enables improved write performance compared to conventional SSDs.
The simplified mapping table and reduced over-provisioning also contribute to faster address translation and more efficient use of NAND flash media.
ZNS can lead to increased SSD endurance, as it minimises unnecessary write operations and reduces wear on the NAND flash cells.
This is particularly beneficial for write-intensive workloads, such as log storage, video recording, and continuous data capture.
ZNS is designed to work seamlessly with the NVMe interface and command set.
It introduces new commands and data structures specific to zoned namespaces, such as the Zone Management Send, Zone Management Receive, and Zone Append commands.
These commands allow the host to discover and manage zones, retrieve zone information, and perform zone-specific operations.
ZNS leverages the existing NVMe infrastructure, including queues, interrupts, and transport mechanisms, ensuring compatibility and ease of integration.
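For instance, a Report Zones operation is carried by Zone Management Receive (opcode 7Ah) with a Zone Receive Action of zero. A hedged sketch using the Linux NVMe I/O passthrough interface (the namespace ID and buffer sizing are assumptions):

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    /* Sketch: report zones on namespace 1 starting at LBA 0. fd is an open
     * namespace block device (e.g. /dev/nvme0n1) whose nsid matches. The
     * buffer receives a report header followed by zone descriptors. */
    int report_zones(int fd, void *buf, uint32_t buf_len)
    {
        struct nvme_passthru_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));

        cmd.opcode   = 0x7A;             /* Zone Management Receive */
        cmd.nsid     = 1;
        cmd.addr     = (uint64_t)(uintptr_t)buf;
        cmd.data_len = buf_len;
        cmd.cdw10    = 0;                /* SLBA, lower 32 bits */
        cmd.cdw11    = 0;                /* SLBA, upper 32 bits */
        cmd.cdw12    = buf_len / 4 - 1;  /* NUMD: dwords to transfer, 0-based */
        cmd.cdw13    = 0;                /* Zone Receive Action 0 = Report Zones */

        return ioctl(fd, NVME_IOCTL_IO_CMD, &cmd);
    }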
ZNS is particularly well-suited for applications that generate large amounts of sequential write data, such as video surveillance, logging, and data analytics.
It can also benefit applications that require efficient storage and retrieval of large files, such as media streaming and backup systems.
ZNS enables higher storage density, improved performance, and reduced cost per gigabyte compared to conventional SSDs.
ZNS represents a significant advancement in SSD technology, addressing the specific characteristics and challenges of NAND flash memory.
By organising data into zones and enforcing sequential writes, ZNS reduces write amplification, over-provisioning, and the size of the mapping table. This leads to increased capacity, improved performance, and extended SSD endurance.
The host-managed approach of ZNS allows for optimisations based on specific workloads and application requirements. It enables more efficient use of NAND flash media and provides greater control over data placement and management.
As SSDs continue to evolve and increase in capacity, ZNS offers a scalable and efficient solution for managing and accessing data. It aligns with the inherent characteristics of NAND flash and leverages the benefits of the NVMe interface to deliver high-performance, cost-effective storage solutions.
Overall, ZNS enables more efficient use of NAND flash memory, empowering storage systems to meet the growing demands of data-intensive applications while improving performance, endurance, and capacity utilisation.
The Key Value Command Set introduces a new set of commands specifically designed for handling key-value pairs, which are commonly used in applications like databases and large-scale web services.
It defines data structures for representing key-value pairs, including formats for storing and accessing keys and values.
The command set includes operations such as Store, Retrieve, Delete, and Exist, which allow for efficient manipulation and querying of key-value pairs. It also defines additional status values and log pages specific to key-value operations, providing feedback and diagnostics to the host.
The KV command set in NVMe 2.0 is designed to efficiently handle unstructured data by allowing the host to access data using a key-value pair instead of logical block addresses.
This approach offers several benefits:
Simplified Data Access: With the KV command set, the host can directly access data using a unique key, eliminating the need to maintain a translation table that maps keys to logical block addresses. This simplifies the data access process and reduces overhead.
Reduced Metadata Overhead: By using key-value pairs, the KV command set eliminates the need for the host to manage and maintain a separate metadata structure. The key itself serves as the metadata, reducing the overall metadata overhead.
Efficient Unstructured Data Management: Unstructured data, such as documents, images, or videos, often have varying sizes and formats. The KV command set allows for efficient storage and retrieval of such data by using keys as identifiers, making it well-suited for object storage and NoSQL databases.
Example: Consider a large-scale object storage system storing user-generated content, such as photos and videos. With the KV command set, each object can be stored and retrieved using a unique key, such as a user ID or a timestamp. This allows for fast and efficient access to specific objects without the need for complex mapping tables.
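The sketch below illustrates what host-side access against a KV namespace could look like; kv_store and kv_retrieve are hypothetical wrappers, not a published library API, standing in for the KV command set's Store and Retrieve commands:

    #include <stdint.h>

    /* Hypothetical wrappers for the KV command set's Store and Retrieve
     * commands; not a published library API. A real implementation would
     * issue the commands through an NVMe passthrough interface. */
    int kv_store(int fd, const void *key, uint8_t key_len,
                 const void *val, uint32_t val_len);
    int kv_retrieve(int fd, const void *key, uint8_t key_len,
                    void *buf, uint32_t buf_len);

    void store_photo(int fd, const uint8_t *jpeg, uint32_t jpeg_len)
    {
        /* The key itself locates the object: the host keeps no
         * key-to-LBA mapping table. KV keys are short (up to 16 bytes
         * in the KV specification). */
        static const char key[] = "user42:photo:99";   /* 15-byte key, hypothetical */

        kv_store(fd, key, sizeof(key) - 1, jpeg, jpeg_len);

        uint8_t buf[4096];
        kv_retrieve(fd, key, sizeof(key) - 1, buf, sizeof(buf));
    }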
NVMe 2.0 introduces the concept of endurance groups, which allows for more granular control over the allocation of media resources.
An endurance group represents a portion of the non-volatile memory in an NVMe subsystem that can be managed as a unit. By configuring and managing endurance groups, storage administrators can optimise performance and endurance based on the specific requirements of different applications or data types.
Benefits of Endurance Group Management:
Resource Allocation: Endurance groups allow for the allocation of media resources to specific applications or data types. This enables administrators to prioritise critical workloads and ensure they have the necessary resources to meet performance and endurance requirements.
Wear Leveling: By managing endurance groups separately, administrators can implement targeted wear leveling strategies. This helps distribute the wear across the media, extending the overall lifespan of the storage devices.
Quality of Service (QoS): Endurance groups can be assigned different QoS parameters, such as performance limits or prioritisation levels. This allows for better control over the performance and resource allocation for different workloads.
Example: In a database environment, an administrator could create separate endurance groups for transaction logs and data files. The transaction log endurance group could be configured with higher performance and endurance requirements, while the data file endurance group could be optimised for capacity. This separation allows for optimal resource utilisation and ensures the critical transaction logs have the necessary performance and reliability.
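For monitoring, the base specification defines an Endurance Group Information log page (log identifier 09h) carrying per-group media wear statistics. A sketch of fetching it through the Linux admin passthrough interface, with the target group selected via the Log Specific Identifier field (group 1 assumed):

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/nvme_ioctl.h>

    /* Sketch: read the Endurance Group Information log page (LID 09h)
     * for endurance group 1. fd is an open NVMe admin device. */
    int get_endurance_log(int fd, void *buf, uint32_t len)
    {
        uint32_t numd = len / 4 - 1;   /* dwords to transfer, 0-based */
        struct nvme_admin_cmd cmd;
        memset(&cmd, 0, sizeof(cmd));

        cmd.opcode   = 0x02;           /* Get Log Page */
        cmd.addr     = (uint64_t)(uintptr_t)buf;
        cmd.data_len = len;
        cmd.cdw10    = 0x09 | ((numd & 0xFFFF) << 16);       /* LID | NUMDL */
        cmd.cdw11    = ((numd >> 16) & 0xFFFF) | (1u << 16); /* NUMDU | LSI=1 */

        return ioctl(fd, NVME_IOCTL_ADMIN_CMD, &cmd);
    }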
NVMe 2.0 adds support for rotational media, such as hard disk drives (HDDs), allowing them to be used in NVMe-based systems. This enhancement provides several benefits:
Unified Storage Architecture: With rotational media support, NVMe can serve as a common interface for both solid-state drives (SSDs) and HDDs. This enables a more unified storage architecture, simplifying system design and management.
Cost Optimisation: HDDs are generally less expensive than SSDs on a per-capacity basis. By supporting rotational media, NVMe 2.0 allows for the integration of cost-effective HDDs into NVMe-based systems, providing a balance between performance and cost.
Tiered Storage: NVMe 2.0's support for rotational media enables the implementation of tiered storage architectures. Critical or frequently accessed data can be stored on high-performance NVMe SSDs, while less performance-sensitive data can be stored on NVMe HDDs. This allows for optimal resource utilisation and cost efficiency.
Example: A large-scale data center could deploy NVMe-based storage systems that include both NVMe SSDs and NVMe HDDs. The SSDs could be used for hot data and caching, providing high-performance access to frequently accessed information. The HDDs could be used for cold data storage, offering cost-effective capacity for less frequently accessed data. This tiered approach maximises performance while minimising overall storage costs.
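As a toy illustration of the placement decision such a tiered system makes, the function below routes hot or latency-sensitive objects to the SSD tier; the access threshold and device paths are assumptions, not anything defined by NVMe:

    #include <stdbool.h>

    /* Hypothetical tiering policy: place hot objects on the NVMe SSD tier,
     * cold objects on the NVMe HDD tier. Threshold and paths are assumed. */
    const char *choose_tier(unsigned reads_last_7_days, bool latency_sensitive)
    {
        if (latency_sensitive || reads_last_7_days > 100)
            return "/dev/nvme0n1";   /* SSD tier: hot data and caching */
        return "/dev/nvme1n1";       /* HDD tier: cold, capacity-optimised */
    }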
NVMe 2.0 introduces several new features and enhancements that can be combined to create highly optimised storage solutions for GPU clusters and AI workloads.
Here are some key benefits and use cases:
AI workloads often involve large amounts of unstructured data, such as images, videos, and text documents.
The KV Command Set in NVMe 2.0 is designed to efficiently handle unstructured data by allowing the host to access data using key-value pairs instead of logical block addresses.
With the KV Command Set, AI applications can store and retrieve unstructured data using unique keys, such as object IDs or timestamps, without the need for complex mapping tables.
This simplified data access and reduced metadata overhead can significantly improve the performance and scalability of AI workloads that deal with unstructured data.
For example, in a large-scale image recognition system, the KV Command Set can be used to store and retrieve individual images using their unique identifiers, enabling fast and efficient access to specific images during training and inference.
Many AI workloads, such as training datasets and log files, involve large amounts of sequential write operations.
The Zoned Namespaces (ZNS) command set in NVMe 2.0 is optimised for sequential writes, organising data into zones that must be written sequentially.
By aligning the write patterns of AI workloads with the sequential write requirement of ZNS, write amplification can be minimised, reducing wear on the underlying storage media and improving write performance.
ZNS also enables more efficient use of storage capacity by reducing the need for over-provisioning, allowing more usable capacity for AI datasets.
In a GPU cluster environment, ZNS can be leveraged to store and manage large training datasets efficiently, ensuring optimal write performance and minimising the impact on the limited storage resources.
AI workloads often have varying performance and endurance requirements for different types of data, such as frequently accessed model parameters and less frequently accessed historical data.
NVMe 2.0 introduces the concept of endurance groups, allowing for granular control over the allocation of storage media resources.
By creating separate endurance groups for different types of AI data and assigning appropriate quality of service (QoS) parameters, storage administrators can optimise performance and endurance based on the specific requirements of each data type.
For example, in a GPU cluster running multiple AI workloads, endurance groups can be created to prioritise the allocation of high-performance NVMe SSDs to critical model parameters and training data, while less performance-sensitive data can be stored on lower-cost NVMe HDDs.
AI workloads often require large amounts of storage capacity for datasets, trained models, and intermediate results.
NVMe 2.0 adds support for rotational media, such as hard disk drives (HDDs), allowing them to be used alongside NVMe SSDs in a unified storage architecture.
By leveraging the cost-effectiveness of HDDs for storing less frequently accessed or archived AI data, while using high-performance NVMe SSDs for active datasets and model training, storage costs can be optimised without compromising performance.
In a GPU cluster environment, a tiered storage approach using NVMe SSDs and NVMe HDDs can provide a balance between performance and capacity, enabling efficient storage utilisation for AI workloads.
The increased flexibility and performance of NVMe 2.0 make it an ideal choice for GPU clusters and AI workloads.
The ability to support multiple command sets and optimise for specific use cases can significantly improve I/O performance and reduce latency, enabling faster data access and processing.
Features like the KV Command Set and ZNS can enable more efficient data access patterns, reducing overhead and increasing throughput.
The modular design of NVMe 2.0 specifications allows for easier integration and management of storage in large-scale AI environments, simplifying storage architectures and enabling seamless scalability.
By leveraging the new features and enhancements introduced in NVMe 2.0, GPU clusters and AI workloads can benefit from optimised storage solutions that deliver high performance, efficient resource utilisation, and cost-effectiveness.
As AI continues to evolve and demands for storage performance and capacity grow, NVMe 2.0 provides a solid foundation for building scalable and efficient storage solutions. Its flexibility, performance, and optimisations make it an ideal choice for GPU clusters and AI environments, enabling organisations to harness the full potential of their data and accelerate AI innovation.