# Storage

As artificial intelligence (AI) and machine learning (ML) applications become increasingly sophisticated, the storage infrastructure supporting these workloads must adapt to meet new challenges.

This article explores the unique storage requirements of AI workloads and discusses how storage solutions are evolving to address these needs.

### <mark style="color:purple;">The Challenges of AI Storage</mark>

#### <mark style="color:green;">Exponential Growth of Datasets</mark>

AI models, particularly large language models, require massive datasets for training. For example, the Common Crawl corpus used for training some language models has reached 13-15 petabytes. Storage solutions must be able to efficiently scale to accommodate these growing datasets.

#### <mark style="color:green;">High-Performance Data Access</mark>

AI workloads demand high-performance storage to keep GPUs fed with data and maintain high utilisation throughout the training and inference pipeline. Storage bottlenecks can significantly impact the performance of AI applications.

#### <mark style="color:green;">Varied I/O Patterns</mark>

AI workflows exhibit diverse I/O patterns across different stages, such as data ingestion, preparation, training, and inference. Storage solutions must be able to handle a mix of sequential and random I/O, as well as small and large file sizes.
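The cost of these differing access patterns can be seen with a small microbenchmark. The sketch below is illustrative (not taken from any specific tool): it times sequential versus random 4 KiB reads of the same scratch file. On a cached run the operating system's page cache masks much of the difference; on a real device with the cache bypassed, the gap is typically far larger.

```python
import os
import random
import time

def time_reads(path, offsets, block_size):
    """Read block_size bytes at each offset in order; return elapsed seconds."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(block_size)
    return time.perf_counter() - start

def compare_io_patterns(path="io_demo.bin", file_size=16 * 1024 * 1024,
                        block_size=4096):
    """Compare sequential vs random read times over the same file."""
    # Create a scratch file to read back.
    with open(path, "wb") as f:
        f.write(os.urandom(file_size))

    n_blocks = file_size // block_size
    sequential = [i * block_size for i in range(n_blocks)]
    shuffled = sequential[:]
    random.shuffle(shuffled)            # same blocks, random order

    seq_t = time_reads(path, sequential, block_size)
    rnd_t = time_reads(path, shuffled, block_size)
    os.remove(path)
    return seq_t, rnd_t

if __name__ == "__main__":
    seq_t, rnd_t = compare_io_patterns()
    print(f"sequential: {seq_t:.4f}s  random: {rnd_t:.4f}s")
```

Production benchmarks such as `fio` add direct I/O, queue depth, and parallelism controls on top of this basic idea.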

#### <mark style="color:green;">Power Efficiency</mark>

As AI deployments scale, storage can consume a significant portion of the overall power budget. Energy-efficient storage solutions are crucial for reducing operational costs and environmental impact.

#### <mark style="color:green;">Distributed Architectures</mark>

AI workloads are increasingly distributed across core data centres, near edge, and far edge locations. Storage solutions must be able to support AI workflows across various environments with different connectivity and resource constraints.

### <mark style="color:purple;">Evolving Storage Technologies for AI</mark>

#### <mark style="color:blue;">High-Capacity SSDs</mark>

Solid-state drives (SSDs) with increased capacities, such as QLC (Quad-Level Cell) SSDs offering up to 61TB, can help accommodate growing AI datasets more efficiently than traditional hard disk drives (HDDs).

#### <mark style="color:blue;">High-Performance SSDs</mark>

SSDs with fast read and write speeds can significantly accelerate various stages of the AI workflow. For example, TLC (Triple-Level Cell) SSDs like the Intel 5520 offer balanced read and write performance for data preparation tasks, while QLC SSDs like the Intel 5536 provide high sequential read performance for training workloads.

#### <mark style="color:blue;">Non-Volatile Memory (NVM) Technologies</mark>

NVM technologies, such as Intel Optane, offer lower latency compared to traditional SSDs and can help reduce bottlenecks in high-performance AI environments.

#### <mark style="color:blue;">Decentralised Storage Systems (DSS)</mark>

Decentralised storage systems leverage object storage and NVMe-over-Fabrics (NVMe-oF) technologies to provide high-performance, scalable storage for AI workloads. These systems often disaggregate storage from compute, allowing each to scale independently while keeping remote data access fast enough to avoid starving accelerators.
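As an illustration, an NVMe/TCP namespace exported by a disaggregated storage node can be attached with the standard `nvme-cli` tool. The address and NQN below are placeholders, not a real deployment:

```shell
# Discover NVMe-oF subsystems exposed by a storage node
# (hypothetical address; requires nvme-cli and an NVMe/TCP target).
nvme discover -t tcp -a 192.0.2.10 -s 4420

# Attach a remote namespace; it then appears as a local /dev/nvmeXnY
# block device that compute nodes can read as if it were local storage.
nvme connect -t tcp -a 192.0.2.10 -s 4420 \
     -n nqn.2024-01.io.example:ai-dataset-pool
```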

#### <mark style="color:blue;">Cloud Storage Acceleration Layer (CSAL)</mark>

Software layers like CSAL can optimise data storage and retrieval processes by intelligently managing data across different types of storage media. For example, CSAL can combine SLC and QLC drives to balance performance and capacity during data ingestion.

#### <mark style="color:blue;">Direct GPU-to-Storage Communication</mark>

Technologies like NVIDIA GPUDirect Storage (GDS) enable direct data transfer between storage and GPU memory, bypassing the CPU and reducing latency. Storage solutions that integrate with GDS can further optimise performance for AI workloads.

### <mark style="color:purple;">Optimising Storage for AI Workloads</mark>

#### <mark style="color:green;">Hardware Configuration and Tuning</mark>

Careful hardware selection and configuration, such as using servers with balanced CPU, memory, and storage resources, can significantly impact AI storage performance. Tuning parameters like queue depths, block sizes, and parallelism can further optimise performance for specific workloads.
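For example, an `fio` job along the following lines could approximate a training read phase: large sequential reads at high queue depth with the page cache bypassed. The device path, sizes, and durations are placeholders to be adjusted per system:

```ini
; Hypothetical fio job approximating a training-style read workload.
[global]
ioengine=libaio
direct=1          ; bypass the page cache to measure the device itself
time_based=1
runtime=60

[training-read]
rw=read           ; sequential reads; use randread for random patterns
bs=1m             ; block size
iodepth=32        ; queue depth per job
numjobs=4         ; parallel workers
filename=/mnt/nvme/dataset.bin
size=10g
```

Sweeping `bs`, `iodepth`, and `numjobs` against the measured throughput is a common way to find the settings that keep a given device saturated.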

#### <mark style="color:green;">Software Stack Optimisation</mark>

Optimising the storage software stack, including file systems, drivers, and middleware, is crucial for achieving high performance.

Techniques like minimising data copies, leveraging RDMA (Remote Direct Memory Access), and using lightweight protocols can help reduce overhead and improve efficiency.
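One copy-avoidance technique available even from Python is memory-mapping: the kernel pages file data directly into the process's address space, rather than `read()` copying it into a freshly allocated buffer. A minimal sketch comparing the two paths:

```python
import mmap
import os

def checksum_mmap(path):
    """Sum the bytes of a file via a memory mapping; a memoryview gives
    a zero-copy window over the mapped pages."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            view = memoryview(mm)
            total = sum(view)
            view.release()   # release before the mapping closes
            return total

def checksum_read(path):
    """Baseline: read() materialises a separate bytes copy in user space."""
    with open(path, "rb") as f:
        return sum(f.read())

if __name__ == "__main__":
    with open("demo.bin", "wb") as f:
        f.write(bytes(range(256)) * 1024)
    assert checksum_mmap("demo.bin") == checksum_read("demo.bin")
    os.remove("demo.bin")
```

The same principle, avoiding intermediate buffers between the device and the consumer, underlies RDMA and GPUDirect Storage at much larger scale.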

#### <mark style="color:green;">AI-Specific Benchmarking Tools</mark>

Traditional storage benchmarking tools may not accurately reflect the performance characteristics of AI workloads.

AI-specific benchmarking tools, such as benchmarks that wrap the PyTorch DataLoader, can provide more relevant insights into storage performance for AI applications.
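A minimal version of such a wrapper can be sketched in plain Python. It accepts any iterable of batches, a PyTorch `DataLoader` included, so the function name and defaults here are illustrative rather than any tool's actual API:

```python
import time

def benchmark_loader(loader, warmup=2):
    """Measure the batch-delivery rate of any iterable of batches
    (e.g. a PyTorch DataLoader). With no model compute in the loop,
    the result reflects what the storage and data pipeline alone can
    sustain. Returns batches per second."""
    it = iter(loader)
    for _ in range(warmup):       # prime caches and worker processes
        next(it, None)
    count = 0
    start = time.perf_counter()
    for _ in it:                  # drain the remaining batches
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")
```

Comparing this rate against the batches per second a GPU consumes during training indicates whether the storage pipeline or the compute is the bottleneck.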

### <mark style="color:purple;">Conclusion</mark>

As AI workloads continue to push the boundaries of storage performance and capacity, the storage landscape must evolve to meet these demands.

A combination of high-performance hardware, optimised software stacks, and AI-specific storage architectures will be essential for supporting the next generation of AI applications.

Close collaboration between storage vendors, AI framework developers, and hardware manufacturers will be crucial for delivering end-to-end optimised solutions that can keep pace with the rapid advancement of AI technologies.
