NVMe (Non-Volatile Memory Express)


Introduction to NVMe

We provide a comprehensive overview of NVMe (Non-Volatile Memory Express) and NVMe 2.0, focusing on their key features, benefits, and use cases.

NVMe (Non-Volatile Memory Express) is a host controller interface and storage protocol designed specifically for solid-state drives (SSDs) and other non-volatile memory devices.

It was developed to fully leverage the benefits of flash-based storage and modern computer architectures, providing significant improvements in performance, efficiency, and scalability compared to older storage protocols such as AHCI and SATA.

NVMe 2.0 builds upon the success of NVMe, introducing significant enhancements and new features to address the growing complexity and diversity of storage systems.

The documentation aims to provide a clear understanding of NVMe and NVMe 2.0, their architectural improvements, and how they can be leveraged to create optimised storage solutions, particularly for GPU clusters and AI workloads.

Key points about NVMe

Architecture

NVMe is designed from the ground up for SSDs and PCIe-based systems.

It uses a streamlined register interface, command set, and queue design that are optimised for the low latency and parallelism of flash storage. This allows NVMe to deliver high throughput and low latency with minimal overhead.

PCIe Interface

NVMe is primarily designed to work over PCI Express (PCIe), which provides high-speed, low-latency, direct access between the CPU/memory and storage devices. This eliminates the need for a separate storage controller, reducing latency and improving performance.

Scalability

NVMe supports up to 64K I/O queues, each with up to 64K entries. This massive parallelism allows NVMe to scale with the increasing core counts of modern CPUs and handle large numbers of concurrent I/O requests efficiently.
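
To put these limits in perspective, the arithmetic below compares NVMe's theoretical command parallelism with AHCI, the host interface used by SATA drives (one queue of 32 commands, that interface's well-known limit; the NVMe figures come from the paragraph above):

```python
# Theoretical outstanding-command comparison: NVMe vs AHCI.
# NVMe limits are taken from the text above; AHCI's single queue of
# 32 commands is that interface's standard limit.
nvme_queues = 64 * 1024          # up to 64K I/O queues
nvme_queue_depth = 64 * 1024     # up to 64K entries per queue
ahci_queues, ahci_queue_depth = 1, 32

print(f"NVMe: up to {nvme_queues * nvme_queue_depth:,} outstanding commands")  # 4,294,967,296
print(f"AHCI: up to {ahci_queues * ahci_queue_depth:,} outstanding commands")  # 32
```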

Multiple Transports

While PCIe is the primary transport for NVMe, the protocol is designed to be transport-agnostic. NVMe can be used over other interconnects like Ethernet, InfiniBand, and Fibre Channel, enabling fast storage access over networks (known as NVMe-oF or NVMe over Fabrics).

Software Ecosystem

Operating systems like Windows, Linux, and VMware have native NVMe drivers, making adoption straightforward. Many enterprise storage and virtualisation platforms also have built-in support for NVMe.

Benefits of NVMe include

  1. Lower latency and higher throughput compared to AHCI and SATA

  2. Reduced CPU overhead and improved performance due to streamlined protocol

  3. Scalability to handle massive amounts of data and large numbers of concurrent requests

  4. Flexibility to work over various transports (PCIe, Ethernet, etc.)

  5. Enabler for new applications and use cases that require fast, low-latency storage

In summary, NVMe is a modern, efficient, and high-performance storage protocol that unlocks the full potential of flash storage and modern computer architectures.

Its widespread adoption in enterprise, cloud, and consumer markets is driven by the ever-increasing demand for faster data access and more efficient storage solutions.

NVMe 2.0

NVMe 2.0 is a significant evolution of the NVMe (Non-Volatile Memory Express) specification, designed to address the growing complexity and diversity of modern storage systems.

Key Differences and Improvements

Specification Refactoring

  • NVMe 2.0 introduces a major restructuring of the specifications, making them more modular and easier to develop and maintain.

  • The base specification now focuses on the core NVMe architecture and command set, while separate specifications are created for different command sets (e.g., NVM Command Set, Zoned Namespaces Command Set, Key Value Command Set) and transports (e.g., PCIe, RDMA, TCP).

  • This modular approach allows for faster development, easier innovation, and better maintainability of the specifications.

Multiple Command Sets

  • NVMe 2.0 introduces a new mechanism for supporting up to 64 I/O command sets, compared to the previous limit of 8.

  • Each namespace is associated with a specific command set, and a single NVMe subsystem can support namespaces with different command sets simultaneously.

  • This flexibility enables the development of specialised command sets for different use cases, such as the new Zoned Namespaces (ZNS) and Key Value (KV) command sets.

Command Sets and Separate Specifications

In NVMe 2.0, the base specification focuses on the core NVMe architecture and command set, while separate specifications are created for different command sets and transports. This modular approach offers several benefits:

Flexibility: By separating command sets into distinct specifications, NVMe 2.0 allows for the development of specialised command sets tailored to specific use cases. This flexibility enables NVMe to adapt to the diverse needs of different applications and storage technologies.

Maintainability: Having separate specifications for each command set makes it easier to maintain and update them independently. This allows for faster innovation and evolution of individual command sets without impacting the core NVMe architecture.

Simplified Development: Developers working on a specific command set or transport can focus on the relevant specification without having to navigate through the entire NVMe specification. This simplifies the development process and reduces the likelihood of errors.

Example: Consider a developer working on a Zoned Namespaces (ZNS) implementation. With separate specifications, they can focus solely on the ZNS Command Set specification, which provides the necessary information for implementing ZNS functionality without having to worry about other aspects of the NVMe architecture.

Multiple I/O Command Sets and Namespaces

NVMe 2.0 introduces support for up to 64 I/O command sets, a significant increase from the previous limit of 8.

Each namespace is associated with a specific command set, and a single NVMe subsystem can support namespaces with different command sets simultaneously. This enhancement offers several advantages:

Specialised Functionality: Different command sets can be designed to optimise for specific storage technologies or data access patterns. For example, the Zoned Namespaces (ZNS) command set is optimised for SSDs using NAND flash memory, while the Key Value (KV) command set is designed for unstructured data.

Efficient Resource Utilisation: By associating namespaces with specific command sets, NVMe 2.0 allows for more efficient utilisation of storage resources. Each namespace can be configured with the command set that best suits its requirements, enabling optimal performance and resource usage.

Flexibility in System Design: The ability to support multiple command sets within a single NVMe subsystem provides greater flexibility in system design. Storage architects can mix and match namespaces with different command sets to create heterogeneous storage solutions tailored to specific workloads.

Example: An NVMe subsystem in a data centre could have some namespaces configured with the KV command set for handling unstructured data, while other namespaces use the traditional NVM command set for block-based storage. This allows the system to efficiently handle diverse data types and access patterns.
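
To make the namespace/command-set relationship concrete, here is a minimal conceptual model in Python of a subsystem exposing namespaces bound to different command sets. The class and method names are invented for illustration and are not part of any real NVMe API:

```python
from dataclasses import dataclass, field

@dataclass
class Namespace:
    nsid: int
    command_set: str          # e.g. "NVM", "ZNS", or "KV"

@dataclass
class NVMeSubsystem:
    namespaces: dict[int, Namespace] = field(default_factory=dict)

    def create_namespace(self, nsid: int, command_set: str) -> Namespace:
        # Each namespace is bound to exactly one I/O command set.
        ns = Namespace(nsid, command_set)
        self.namespaces[nsid] = ns
        return ns

# One subsystem mixing block, zoned, and key-value namespaces.
subsys = NVMeSubsystem()
subsys.create_namespace(1, "NVM")   # traditional block-based storage
subsys.create_namespace(2, "KV")    # unstructured, key-value data
subsys.create_namespace(3, "ZNS")   # sequential-write zones
for ns in subsys.namespaces.values():
    print(f"namespace {ns.nsid}: {ns.command_set} command set")
```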

Zoned Namespaces (ZNS)

Zoned Namespaces (ZNS) is a new command set introduced in NVMe 2.0 that is optimised specifically for solid-state drives (SSDs) using NAND flash memory.

It aims to address the unique characteristics and challenges associated with NAND flash, such as write amplification, over-provisioning, and the need for efficient mapping of logical addresses to physical locations.

Data Organisation in Zones

  • ZNS organises data into zones, which are contiguous regions of logical block addresses (LBAs) that must be written sequentially.

  • Each zone has a fixed size and is associated with a specific range of LBAs.

  • Zones can be in different states, such as Empty, Implicitly Open, Explicitly Open, Closed, Full, and Read-Only.

  • The host must write data to a zone sequentially, starting from the lowest LBA and progressing towards the highest LBA within the zone.

  • Once a zone is closed or marked as full, it cannot be written to again until it is reset (a minimal model of this lifecycle follows below).
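
The zone lifecycle and write-pointer rule described above can be captured in a short simulation. This is a conceptual model only (zone size, state names, and methods are simplified for illustration), not an interface to a real ZNS device:

```python
class Zone:
    """Minimal model of a ZNS zone: sequential writes tracked by a write pointer."""

    def __init__(self, start_lba: int, size: int):
        self.start_lba = start_lba
        self.size = size
        self.write_pointer = start_lba       # next LBA that may be written
        self.state = "Empty"

    def write(self, lba: int, num_blocks: int) -> None:
        if self.state == "Full":
            raise IOError("zone is Full; reset it before writing again")
        if lba != self.write_pointer:
            # Non-sequential writes violate the ZNS contract.
            raise IOError(f"write must start at write pointer {self.write_pointer}")
        if lba + num_blocks > self.start_lba + self.size:
            raise IOError("write would exceed the zone's capacity")
        self.write_pointer += num_blocks
        self.state = ("Full" if self.write_pointer == self.start_lba + self.size
                      else "Implicitly Open")

    def reset(self) -> None:
        self.write_pointer = self.start_lba
        self.state = "Empty"

zone = Zone(start_lba=0, size=1024)
zone.write(0, 256)      # ok: starts at the write pointer
zone.write(256, 768)    # ok: fills the zone, state becomes "Full"
try:
    zone.write(0, 1)    # rejected until the zone is reset
except IOError as err:
    print(err)
zone.reset()            # zone is Empty and writable again
```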

Sequential Write Requirement

  • ZNS enforces a sequential write requirement within each zone, meaning that data must be written in a contiguous manner without skipping or overwriting LBAs.

  • This sequential write requirement aligns with the inherent characteristics of NAND flash memory, which is organised in pages and blocks.

  • Writing data sequentially minimises the need for garbage collection and reduces write amplification, as it avoids the need to constantly relocate and rewrite data.

  • Sequential writes also enable more efficient use of the NAND flash media, as it reduces the number of program/erase cycles required.

Reduced Write Amplification

  • Write amplification occurs when the actual amount of data written to the NAND flash is greater than the amount of data requested by the host.

  • In conventional SSDs, write amplification is caused by factors such as garbage collection, wear leveling, and the need to maintain a mapping table between logical and physical addresses.

  • By enforcing sequential writes within zones, ZNS minimises write amplification, as it reduces the need for frequent garbage collection and data relocation.

  • This leads to improved write performance, reduced wear on the NAND flash, and increased SSD endurance (see the worked example below).
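
Write amplification is often summarised as a write amplification factor (WAF): bytes physically written to NAND divided by bytes the host asked to write. The numbers below are invented purely to illustrate the direction of the effect:

```python
def waf(nand_gb_written: float, host_gb_written: float) -> float:
    """Write amplification factor = NAND writes / host writes (1.0 is ideal)."""
    return nand_gb_written / host_gb_written

host_gb = 100
# Conventional SSD: garbage collection relocates data, so the device may
# write several times what the host requested (illustrative figure).
print(f"conventional SSD WAF ~ {waf(320, host_gb):.1f}")  # ~3.2
# ZNS: sequential writes within zones avoid most relocation (illustrative).
print(f"ZNS SSD WAF          ~ {waf(110, host_gb):.1f}")  # ~1.1
```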

Reduced Over-Provisioning

  • Over-provisioning refers to the practice of reserving a portion of the SSD's raw capacity for internal operations, such as garbage collection and wear leveling.

  • In conventional SSDs, a significant amount of over-provisioning is required to ensure efficient operation and maintain performance.

  • With ZNS, the sequential write requirement within zones reduces the need for extensive over-provisioning.

  • This allows for more usable capacity to be exposed to the host, as less space needs to be reserved for internal SSD operations.

Simplified Mapping Table

  • In conventional SSDs, a mapping table is maintained to translate logical addresses (used by the host) to physical addresses (on the NAND flash).

  • As SSDs increase in capacity, the size of the mapping table grows, consuming more memory and computational resources.

  • ZNS simplifies the mapping table by leveraging the sequential write requirement within zones.

  • Instead of maintaining a mapping for each individual LBA, ZNS can use a more compact mapping at the zone level.

  • This reduces the memory footprint of the mapping table and improves the efficiency of address translation (the arithmetic below illustrates the scale of the saving).
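
The scale of the saving can be estimated with simple arithmetic. Assuming, purely for illustration, a 4 TiB drive, 4 KiB logical blocks, 4-byte map entries, and 1 GiB zones:

```python
TIB, GIB, KIB = 1024**4, 1024**3, 1024

capacity   = 4 * TIB      # assumed drive capacity
lba_size   = 4 * KIB      # assumed logical block size
zone_size  = 1 * GIB      # assumed zone size
entry_size = 4            # assumed bytes per mapping entry

per_lba_entries  = capacity // lba_size    # one entry per LBA
per_zone_entries = capacity // zone_size   # one entry per zone

print(f"per-LBA table:  {per_lba_entries * entry_size / GIB:.1f} GiB "
      f"({per_lba_entries:,} entries)")    # 4.0 GiB
print(f"per-zone table: {per_zone_entries * entry_size / KIB:.0f} KiB "
      f"({per_zone_entries:,} entries)")   # 16 KiB
```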

Host-Managed Zones

  • ZNS enables the host to have more control over data placement and management within the SSD.

  • The host can explicitly open and close zones, write data to specific zones, and track the state of each zone.

  • This host-managed approach allows for optimisations based on the specific workload and application requirements.

  • For example, the host can group related data into the same zone, enabling faster access and reducing the need for data movement.

Improved Performance and Endurance

  • By aligning with the sequential write nature of NAND flash and reducing write amplification, ZNS enables improved write performance compared to conventional SSDs.

  • The simplified mapping table and reduced over-provisioning also contribute to faster address translation and more efficient use of NAND flash media.

  • ZNS can lead to increased SSD endurance, as it minimises unnecessary write operations and reduces wear on the NAND flash cells.

  • This is particularly beneficial for write-intensive workloads, such as log storage, video recording, and continuous data capture.

Integration with NVMe

  • ZNS is designed to work seamlessly with the NVMe interface and command set.

  • It introduces new commands and data structures specific to zoned namespaces, such as the Zone Management Send and Zone Management Receive commands.

  • These commands allow the host to discover and manage zones, retrieve zone information, and perform zone-specific operations.

  • ZNS leverages the existing NVMe infrastructure, including queues, interrupts, and transport mechanisms, ensuring compatibility and ease of integration.

Use Cases and Applications

  • ZNS is particularly well-suited for applications that generate large amounts of sequential write data, such as video surveillance, logging, and data analytics.

  • It can also benefit applications that require efficient storage and retrieval of large files, such as media streaming and backup systems.

  • ZNS enables higher storage density, improved performance, and reduced cost per gigabyte compared to conventional SSDs.

ZNS represents a significant advancement in SSD technology, addressing the specific characteristics and challenges of NAND flash memory.

By organising data into zones and enforcing sequential writes, ZNS reduces write amplification, over-provisioning, and the size of the mapping table. This leads to increased capacity, improved performance, and extended SSD endurance.

The host-managed approach of ZNS allows for optimisations based on specific workloads and application requirements. It enables more efficient use of NAND flash media and provides greater control over data placement and management.

As SSDs continue to evolve and increase in capacity, ZNS offers a scalable and efficient solution for managing and accessing data. It aligns with the inherent characteristics of NAND flash and leverages the NVMe interface to deliver high-performance, cost-effective storage, helping systems meet the growing demands of data-intensive applications while improving performance, endurance, and capacity utilisation.

Key Value (KV) Command Set and Unstructured Data

The Key Value Command Set introduces a new set of commands specifically designed for handling key-value pairs, which are commonly used in applications like databases and large-scale web services.

It defines data structures for representing key-value pairs, including formats for storing and accessing keys and values.

The command set includes operations such as Store, Retrieve, Delete, and Exist, which allow for efficient manipulation and querying of key-value pairs. It also defines additional status values and log pages specific to key-value operations, providing feedback and diagnostics to the host.

The KV command set in NVMe 2.0 is designed to efficiently handle unstructured data by allowing the host to access data using a key-value pair instead of logical block addresses.

This approach offers several benefits:

Simplified Data Access: With the KV command set, the host can directly access data using a unique key, eliminating the need to maintain a translation table that maps keys to logical block addresses. This simplifies the data access process and reduces overhead.

Reduced Metadata Overhead: By using key-value pairs, the KV command set eliminates the need for the host to manage and maintain a separate metadata structure. The key itself serves as the metadata, reducing the overall metadata overhead.

Efficient Unstructured Data Management: Unstructured data, such as documents, images, or videos, often have varying sizes and formats. The KV command set allows for efficient storage and retrieval of such data by using keys as identifiers, making it well-suited for object storage and NoSQL databases.

Example: Consider a large-scale object storage system storing user-generated content, such as photos and videos. With the KV command set, each object can be stored and retrieved using a unique key, such as a user ID or a timestamp. This allows for fast and efficient access to specific objects without the need for complex mapping tables.
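
The semantics of these commands can be sketched as a simple in-memory model. The method names mirror the command names in the text (Store, Retrieve, Delete, Exist, List); everything else, including the key-size limit, is an assumption made for the example:

```python
class KVNamespace:
    """In-memory sketch of KV command semantics; not a real device interface."""

    MAX_KEY_BYTES = 16                      # assumed limit for this example

    def __init__(self):
        self._data: dict[bytes, bytes] = {}

    def store(self, key: bytes, value: bytes) -> None:
        if len(key) > self.MAX_KEY_BYTES:
            raise ValueError("key too large")
        self._data[key] = value             # the key itself is the metadata

    def retrieve(self, key: bytes) -> bytes:
        return self._data[key]              # KeyError models "key not found"

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

    def exist(self, key: bytes) -> bool:
        return key in self._data

    def list_keys(self) -> list[bytes]:
        return sorted(self._data)

# Objects are addressed directly by key: no LBA translation table needed.
ns = KVNamespace()
ns.store(b"user42/photo1", b"...jpeg bytes...")
print(ns.exist(b"user42/photo1"))           # True
print(ns.retrieve(b"user42/photo1"))        # b'...jpeg bytes...'
ns.delete(b"user42/photo1")
print(ns.exist(b"user42/photo1"))           # False
```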

NVMe Key Value Command Set

The NVMe Key Value Command Set is an extension to the NVM Express (NVMe) Base Specification that enables efficient handling of key-value pairs in NVMe-compliant storage devices.

It introduces new commands, data structures, and behaviours specifically tailored for key-value operations, while conforming to the overall NVMe architecture and conventions.

  1. NVMe Base Specification and Protocol:

    • The NVMe Base Specification defines a register-level interface for host software to communicate with non-volatile memory subsystems over various transports like PCIe, RDMA, and TCP.

    • It establishes a standardised way for hosts to issue commands, transfer data, and receive completion notifications from NVMe devices.

    • The Key Value Command Set builds upon this foundation, extending the protocol to support key-value operations.

  2. Key Value Command Set:

    • As summarised above, it defines data structures for representing key-value pairs and commands such as Store, Retrieve, Delete, and Exist for manipulating and querying them efficiently.

    • It also defines additional status values and log pages specific to key-value operations, providing feedback and diagnostics to the host.

  3. Relation to Other Specifications:

    • The Key Value Command Set is part of the broader NVMe family of specifications and interacts with other components of the NVMe ecosystem.

    • It extends the NVMe Base Specification by introducing key-value-specific commands and data structures while adhering to the overall NVMe protocol and conventions.

    • The command set is designed to work seamlessly with various NVMe transport specifications, such as PCIe, RDMA, and TCP, allowing key-value operations to be performed over different interconnects.

  4. Data Structures and Commands:

    • The Key Value Command Set defines new data structures specifically designed for managing key-value pairs.

    • These structures include formats for representing keys and values, specifying their sizes, and organising them within namespaces.

    • The command set introduces commands like Store, Retrieve, Delete, and Exist, which are tailored for key-value operations.

    • These commands allow the host to store, retrieve, delete, and check the existence of key-value pairs in the NVMe device.

    • The specification also defines how these commands interact with the stored data, including handling of data sizes, addressing, and atomic operations.

  5. Theory of Operation:

    • The Key Value Command Set defines the operational model for key-value pairs within an NVM subsystem.

    • It specifies how keys and values are sized, stored, and accessed within namespaces.

    • The theory of operation covers aspects such as namespace management, including sizing and utilisation metrics for key-value storage.

    • It also describes the atomic nature of certain operations, ensuring data integrity during concurrent accesses and modifications.

    • The specification provides guidelines on how the host should interact with the NVMe device to perform key-value operations efficiently.

  6. Command Descriptions:

    • The specification provides detailed descriptions of each command in the Key Value Command Set.

    • Commands like List, Delete, and Exist allow the host to manage and query the key-value pairs stored in the NVMe device.

    • The Store and Retrieve commands enable the host to store and retrieve key-value pairs, specifying the sizes of keys and values and handling data transfer between the host and the device.

    • Each command has specific operational details, including input parameters, output formats, and expected behaviours.

In practice, the Key Value Command Set enables applications to leverage the high-performance, low-latency characteristics of NVMe devices for efficient storage and retrieval of key-value pairs.

It allows databases, caches, and other key-value-based systems to take advantage of the parallelism and scalability offered by NVMe technology.

By providing a standardised interface and command set for key-value operations, the Key Value Command Set simplifies the integration of NVMe devices into key-value storage architectures. It enables developers to design and optimise their applications to fully utilise the capabilities of NVMe devices for key-value workloads.

The Key Value Command Set is particularly beneficial in scenarios where fast access to individual key-value pairs is critical, such as in real-time data processing, content delivery networks, and large-scale web services. It allows applications to store and retrieve data with minimal overhead, leveraging the low-latency and high-throughput characteristics of NVMe.

Moreover, the Key Value Command Set's adherence to the NVMe Base Specification ensures compatibility and interoperability with existing NVMe ecosystems. It allows key-value-based applications to seamlessly integrate with NVMe devices and take advantage of the performance benefits offered by NVMe technology.

Overall, the NVMe Key Value Command Set provides a comprehensive framework for implementing efficient key-value storage on NVMe devices. It empowers applications to harness the full potential of NVMe for key-value workloads, enabling high-performance, scalable, and cost-effective storage solutions.

Endurance Group Management

NVMe 2.0 introduces the concept of endurance groups, which allows for more granular control over the allocation of media resources.

An endurance group represents a portion of the non-volatile memory in an NVMe subsystem that can be managed as a unit. By configuring and managing endurance groups, storage administrators can optimise performance and endurance based on the specific requirements of different applications or data types.

Benefits of Endurance Group Management:

Resource Allocation: Endurance groups allow for the allocation of media resources to specific applications or data types. This enables administrators to prioritise critical workloads and ensure they have the necessary resources to meet performance and endurance requirements.

Wear Leveling: By managing endurance groups separately, administrators can implement targeted wear leveling strategies. This helps distribute the wear across the media, extending the overall lifespan of the storage devices.

Quality of Service (QoS): Endurance groups can be assigned different QoS parameters, such as performance limits or prioritisation levels. This allows for better control over the performance and resource allocation for different workloads.

Example: In a database environment, an administrator could create separate endurance groups for transaction logs and data files. The transaction log endurance group could be configured with higher performance and endurance requirements, while the data file endurance group could be optimised for capacity. This separation allows for optimal resource utilisation and ensures the critical transaction logs have the necessary performance and reliability.
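
The database example can be expressed as a small configuration model: endurance groups as named pools of media, each carrying its own QoS parameters, with namespaces assigned to a group. All names and figures below are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class EnduranceGroup:
    name: str
    capacity_gb: int
    max_write_mbps: int      # illustrative QoS parameter
    priority: str

groups = {
    "txn-logs":   EnduranceGroup("txn-logs", 512, max_write_mbps=4000,
                                 priority="high"),    # performance and endurance
    "data-files": EnduranceGroup("data-files", 8192, max_write_mbps=1000,
                                 priority="normal"),  # optimised for capacity
}

# Namespace-to-group assignment, mirroring the example above.
assignment = {1: "txn-logs", 2: "data-files"}
for nsid, group_name in assignment.items():
    g = groups[group_name]
    print(f"namespace {nsid} -> {g.name}: {g.capacity_gb} GB, "
          f"{g.max_write_mbps} MB/s, {g.priority} priority")
```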

Rotational Media Support

NVMe 2.0 adds support for rotational media, such as hard disk drives (HDDs), allowing them to be used in NVMe-based systems. This enhancement provides several benefits:

Unified Storage Architecture: With rotational media support, NVMe can serve as a common interface for both solid-state drives (SSDs) and HDDs. This enables a more unified storage architecture, simplifying system design and management.

Cost Optimisation: HDDs are generally less expensive than SSDs on a per-capacity basis. By supporting rotational media, NVMe 2.0 allows for the integration of cost-effective HDDs into NVMe-based systems, providing a balance between performance and cost.

Tiered Storage: NVMe 2.0's support for rotational media enables the implementation of tiered storage architectures. Critical or frequently accessed data can be stored on high-performance NVMe SSDs, while less performance-sensitive data can be stored on NVMe HDDs. This allows for optimal resource utilisation and cost efficiency.

Example: A large-scale data centre could deploy NVMe-based storage systems that include both NVMe SSDs and NVMe HDDs. The SSDs could be used for hot data and caching, providing high-performance access to frequently accessed information. The HDDs could be used for cold data storage, offering cost-effective capacity for less frequently accessed data. This tiered approach maximises performance while minimising overall storage costs.
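
A tiering policy of this kind reduces to a simple placement rule: keep recently accessed objects on the SSD tier and demote cold ones to the HDD tier. The seven-day threshold below is an assumption chosen for illustration:

```python
import time

COLD_AFTER_SECONDS = 7 * 24 * 3600   # assumed hot/cold threshold

def choose_tier(last_access_ts: float, now: float) -> str:
    """Route hot data to NVMe SSDs and cold data to NVMe HDDs."""
    return "nvme-hdd" if now - last_access_ts > COLD_AFTER_SECONDS else "nvme-ssd"

now = time.time()
objects = {
    "training-shard-001": now - 3600,            # accessed an hour ago -> hot
    "archived-run-2023":  now - 30 * 24 * 3600,  # a month old -> cold
}
for name, last_access in objects.items():
    print(f"{name}: {choose_tier(last_access, now)}")
```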

Benefits and Use Cases of NVMe 2.0 for GPU Clusters and AI Workloads

NVMe 2.0 introduces several new features and enhancements that can be combined to create highly optimised storage solutions for GPU clusters and AI workloads.

Here are some key benefits and use cases:

Key Value (KV) Command Set for Unstructured Data

  • AI workloads often involve large amounts of unstructured data, such as images, videos, and text documents.

  • The KV Command Set in NVMe 2.0 is designed to efficiently handle unstructured data by allowing the host to access data using key-value pairs instead of logical block addresses.

  • With the KV Command Set, AI applications can store and retrieve unstructured data using unique keys, such as object IDs or timestamps, without the need for complex mapping tables.

  • This simplified data access and reduced metadata overhead can significantly improve the performance and scalability of AI workloads that deal with unstructured data.

  • For example, in a large-scale image recognition system, the KV Command Set can be used to store and retrieve individual images using their unique identifiers, enabling fast and efficient access to specific images during training and inference.

Zoned Namespaces (ZNS) for Sequential Writes

  • Many AI workloads, such as training datasets and log files, involve large amounts of sequential write operations.

  • Zoned Namespaces (ZNS) in NVMe 2.0 are optimised for sequential writes by organising data into zones that must be written sequentially.

  • By aligning the write patterns of AI workloads with the sequential write requirement of ZNS, write amplification can be minimised, reducing wear on the underlying storage media and improving write performance.

  • ZNS also enables more efficient use of storage capacity by reducing the need for over-provisioning, allowing more usable capacity for AI datasets.

  • In a GPU cluster environment, ZNS can be leveraged to store and manage large training datasets efficiently, ensuring optimal write performance and minimising the impact on the limited storage resources.

Endurance Group Management for Resource Allocation

  • AI workloads often have varying performance and endurance requirements for different types of data, such as frequently accessed model parameters and less frequently accessed historical data.

  • NVMe 2.0 introduces the concept of endurance groups, allowing for granular control over the allocation of storage media resources.

  • By creating separate endurance groups for different types of AI data and assigning appropriate quality of service (QoS) parameters, storage administrators can optimise performance and endurance based on the specific requirements of each data type.

  • For example, in a GPU cluster running multiple AI workloads, endurance groups can be created to prioritise the allocation of high-performance NVMe SSDs to critical model parameters and training data, while less performance-sensitive data can be stored on lower-cost NVMe HDDs.

Rotational Media Support for Cost-Effective Capacity

  • AI workloads often require large amounts of storage capacity for datasets, trained models, and intermediate results.

  • NVMe 2.0 adds support for rotational media, such as hard disk drives (HDDs), allowing them to be used alongside NVMe SSDs in a unified storage architecture.

  • By leveraging the cost-effectiveness of HDDs for storing less frequently accessed or archived AI data, while using high-performance NVMe SSDs for active datasets and model training, storage costs can be optimised without compromising performance.

  • In a GPU cluster environment, a tiered storage approach using NVMe SSDs and NVMe HDDs can provide a balance between performance and capacity, enabling efficient storage utilisation for AI workloads.

Improved Performance and Scalability

  • The increased flexibility and performance of NVMe 2.0 make it an ideal choice for GPU clusters and AI workloads.

  • The ability to support multiple command sets and optimise for specific use cases can significantly improve I/O performance and reduce latency, enabling faster data access and processing.

  • Features like the KV Command Set and ZNS can enable more efficient data access patterns, reducing overhead and increasing throughput.

  • The modular design of NVMe 2.0 specifications allows for easier integration and management of storage in large-scale AI environments, simplifying storage architectures and enabling seamless scalability.

By leveraging the new features and enhancements introduced in NVMe 2.0, GPU clusters and AI workloads can benefit from optimised storage solutions that deliver high performance, efficient resource utilisation, and cost-effectiveness.

As AI continues to evolve and demands for storage performance and capacity grow, NVMe 2.0 provides a solid foundation for building scalable and efficient storage solutions. Its flexibility, performance, and optimisations make it an ideal choice for GPU clusters and AI environments, enabling organisations to harness the full potential of their data and accelerate AI innovation.
