S-LoRA

This November 2023 paper presents a system called S-LoRA designed for scalable serving of many LoRA adapters derived from a single base language model.

The development of S-LoRA, a system capable of efficiently serving thousands of concurrent LoRA adapters, has significant commercial ramifications and opens up new possibilities for large language model deployment and customisation.

From a commercial perspective, S-LoRA enables the scalable serving of numerous task-specific fine-tuned models derived from a single base model.

This means that businesses can now offer highly personalised and tailored language model services to their customers without incurring the high costs and computational resources associated with serving multiple full-sized models.

By leveraging S-LoRA, companies can provide a wide range of specialised models for various domains, industries, or even individual customers, all while maintaining a single base model.

Commercial Application

The ability to serve a large number of LoRA adapters concurrently also opens up opportunities for new business models and revenue streams.

For example, a company could offer a subscription-based service where customers can access a vast library of pre-trained LoRA adapters for different tasks or domains.

Customers could then fine-tune these adapters further using their own data, creating highly customised models tailored to their specific needs.

This model of providing access to a diverse set of adapters could be particularly attractive to smaller businesses or start-ups that may not have the resources to train and maintain their own large language models from scratch.

Background and Motivation

  • The paper focuses on the "pretrain-then-finetune" paradigm commonly used in deploying large language models (LLMs).

  • Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that adapts a base model to multiple tasks by updating only low-rank additive matrices called adapter weights.

  • Serving numerous fine-tuned LoRA adapters at scale is challenging and unexplored.

Key Contributions of S-LoRA

Unified Paging

  • S-LoRA introduces a unified memory pool to manage dynamic adapter weights and KV cache tensors.

  • It uses a unified paging mechanism to reduce memory fragmentation and increase batch size.

Heterogeneous Batching

  • S-LoRA employs custom CUDA kernels to efficiently batch LoRA computations for adapters with varying ranks.

  • The kernels operate on non-contiguous memory and align with the memory pool design for efficient batched inference.

S-LoRA TP (Tensor Parallelism)

  • A novel tensor parallelism strategy is introduced to parallelize across multiple GPUs with minimal communication overhead.

  • It schedules communications on small intermediate tensors and fuses large ones with the base model's communications.

Technical Details

  • LoRA introduces low-rank additive matrices to each layer of the base model during fine-tuning.

  • For a pre-trained weight matrix W, LoRA updates it as W' = W + AB, where A and B are low-rank matrices with rank r << min(h, d).

  • The forward pass after applying LoRA becomes: h = xW' = x(W + AB) = xW + xAB (a minimal sketch follows this list).

  • LoRA is typically applied only to the query, key, value, and output projection matrices in the self-attention module.
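
To make the arithmetic above concrete, here is a minimal NumPy sketch of a single LoRA-augmented linear layer. The dimensions and scales are purely illustrative; the point is that computing xW + xAB separately gives the same result as merging the weights into W' = W + AB.

```python
import numpy as np

h, d, r = 1024, 1024, 16           # input dim, output dim, adapter rank (r << min(h, d))

W = np.random.randn(h, d) * 0.02   # frozen pre-trained weight
A = np.random.randn(h, r) * 0.02   # low-rank adapter factor A
B = np.random.randn(r, d) * 0.02   # low-rank adapter factor B

x = np.random.randn(2, h)          # a small batch of token activations

y_merged   = x @ (W + A @ B)       # h = xW' = x(W + AB)
y_unmerged = x @ W + (x @ A) @ B   # h = xW + xAB (what S-LoRA computes on the fly)

assert np.allclose(y_merged, y_unmerged)
```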

Serving Challenges

  • LLMs have high computational and memory demands due to their large parameter sizes.

  • The inference process involves iterative autoregressive decoding, which requires storing hidden states (the KV cache) and adds substantial memory overhead (a rough sizing example follows this list).

  • Serving requests with varying sequence lengths dynamically is challenging.
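
As a rough illustration of why the KV cache dominates serving memory, the back-of-the-envelope estimate below uses approximate Llama-7B-like dimensions in fp16; the numbers are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Approximate KV cache size: keys and values, per layer, per head, per token."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_value

# Illustrative Llama-7B-like configuration, 2048-token sequences, batch of 8, fp16
size = kv_cache_bytes(num_layers=32, num_heads=32, head_dim=128, seq_len=2048, batch_size=8)
print(f"{size / 1e9:.1f} GB")   # roughly 8.6 GB for the KV cache alone
```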

Evaluation and Results

  • S-LoRA is evaluated by serving Llama-7B/13B/30B/70B models.

  • It can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with small overhead.

  • Compared to Huggingface PEFT, S-LoRA improves throughput by up to 30×.

  • Compared to vLLM with naïve LoRA serving support, S-LoRA improves throughput by up to 4× and increases the number of served adapters by several orders of magnitude.

S-LoRA addresses the challenges of serving numerous LoRA adapters at scale by introducing efficient memory management (Unified Paging), optimized computation kernels for heterogeneous batching, and a novel tensor parallelism strategy for multi-GPU parallelization.

The system achieves significant improvements in throughput and the number of adapters served compared to existing libraries and serving systems.

Batching and Scheduling

Batching Strategy

  • S-LoRA's batching strategy aims to support online and high-throughput serving of many LoRA adapters simultaneously.

  • Instead of merging adapter weights into the base model (as suggested in the original LoRA paper), S-LoRA computes the additional LoRA term xAB on the fly.

  • This approach avoids weight duplication and enables batching of the more costly xW operation across different adapters.

  • S-LoRA batches the computation of the base model using GEMM and employs custom CUDA kernels to execute the additional xAB for all adapters separately (see the sketch after this list).

  • Custom CUDA kernels are implemented for efficient computation without padding, considering the heterogeneity of sequence lengths and adapter ranks.
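
The sketch below shows the idea behind this separation in plain NumPy, not S-LoRA's actual CUDA kernels: the base-model GEMM is shared by every request in the batch, while each request's small xAB correction is applied with its own adapter, even when adapter ranks differ.

```python
import numpy as np

h, d = 512, 512
W = np.random.randn(h, d) * 0.02                        # shared base-model weight

# Two hypothetical adapters with different ranks (heterogeneous batching)
adapters = {
    "adapter_0": (np.random.randn(h, 8) * 0.02,  np.random.randn(8, d) * 0.02),
    "adapter_1": (np.random.randn(h, 32) * 0.02, np.random.randn(32, d) * 0.02),
}

X = np.random.randn(4, h)                               # one activation per request
request_adapters = ["adapter_0", "adapter_1", "adapter_0", "adapter_1"]

# 1) A single batched GEMM against the shared base weight
Y = X @ W

# 2) Add each request's low-rank correction xAB using its own adapter
for i, name in enumerate(request_adapters):
    A, B = adapters[name]
    Y[i] += (X[i] @ A) @ B
```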

Adapter Clustering

  • To enhance batching efficiency, S-LoRA proposes adapter clustering, which prioritises batching requests that use the same adapter (a toy illustration follows this list).

  • By using fewer adapters in a running batch, more memory can be allocated to the KV cache, enabling larger batch sizes and potentially higher throughput.

  • However, adapter clustering involves trade-offs, such as potential impact on average latency or fairness among adapters.
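
A toy illustration of the clustering idea, not the paper's actual scheduler: reorder the pending queue so that requests sharing an adapter sit together, letting a batch built from the front of the queue touch as few distinct adapters as possible.

```python
from collections import defaultdict

def cluster_by_adapter(pending_requests):
    """Group pending (request_id, adapter_id) pairs by adapter, largest groups first."""
    groups = defaultdict(list)
    for request in pending_requests:
        groups[request[1]].append(request)
    ordered = sorted(groups.values(), key=len, reverse=True)
    return [request for group in ordered for request in group]

queue = [(0, "a"), (1, "b"), (2, "a"), (3, "c"), (4, "a"), (5, "b")]
print(cluster_by_adapter(queue))
# adapter "a" requests first, then "b", then "c"
```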

Admission Control

  • S-LoRA applies an admission control strategy to sustain good SLO attainment when traffic exceeds the serving system's capacity.

  • The serving system is characterised by a service level objective (SLO) specifying the desired latency for processing requests.

  • An early-abort strategy mimics admission control: the system estimates the set of most recent requests that can be served within the SLO and serves them in order of arrival time (a simplified sketch follows this list).
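
The snippet below is a highly simplified stand-in for that early-abort policy, assuming a fixed per-request service-time estimate: it finds the largest suffix of the queue (the most recent arrivals) that can all finish within the SLO when served in arrival order, and aborts everything older.

```python
def early_abort(arrivals, now, slo, est_service_time):
    """Admit the largest suffix of the queue whose requests all meet the SLO
    when served in arrival order; older requests are aborted early."""
    n = len(arrivals)
    admitted = []
    for k in range(n + 1):                       # try admitting the k most recent requests
        suffix = arrivals[n - k:]
        meets_slo = all(now + (j + 1) * est_service_time - arrival <= slo
                        for j, arrival in enumerate(suffix))
        if meets_slo:
            admitted = suffix                    # keep the largest k that still meets the SLO
    return admitted

# Arrival timestamps (seconds), a 1-second SLO, 0.2 s estimated service time per request
print(early_abort(arrivals=[0.0, 0.5, 1.0, 1.4], now=1.5, slo=1.0, est_service_time=0.2))
# -> [1.0, 1.4]: the two oldest requests are aborted because they would miss the SLO
```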

Memory Management

  • S-LoRA generalizes PagedAttention (introduced in vLLM) to Unified Paging, which supports dynamically loading LoRA adapters.

  • Unified Paging uses a unified memory pool to store the KV caches and adapter weights in a paged fashion (a toy sketch follows this list).

  • This approach reduces fragmentation and balances the dynamically changing size of the KV caches and adapter weights.

  • All LoRA adapters are stored in the main memory, and only the adapters needed for the currently running batch are fetched to the GPU memory during inference.

  • The maximum number of adapters that can be served is bounded by the main memory size.

  • S-LoRA adopts the iteration-level scheduling batching strategy from Orca, scheduling requests at the token level and incorporating new requests into the running batch if space is available.
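
A toy sketch of the unified-pool idea, not vLLM's or S-LoRA's actual implementation: a single pool of fixed-size pages is handed out to both KV-cache blocks and adapter weights, so either kind of tenant can grow into whatever pages happen to be free, which is what keeps fragmentation low.

```python
class UnifiedPagePool:
    """A single pool of fixed-size pages shared by KV-cache blocks and adapter weights.
    Purely illustrative: a real pool manages GPU memory pages, not Python lists."""

    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owner = {}                              # page index -> ("kv", seq_id) or ("adapter", name)

    def allocate(self, owner, num_pages):
        if len(self.free_pages) < num_pages:
            raise MemoryError("page pool exhausted")
        pages = [self.free_pages.pop() for _ in range(num_pages)]
        for page in pages:
            self.owner[page] = owner
        return pages

    def release(self, pages):
        for page in pages:
            del self.owner[page]
            self.free_pages.append(page)

pool = UnifiedPagePool(num_pages=16)
kv_pages = pool.allocate(("kv", "request_42"), 4)        # KV cache blocks for a running request
lora_pages = pool.allocate(("adapter", "adapter_7"), 2)  # weights of the adapter it uses
pool.release(kv_pages)                                   # request finishes; pages return to the pool
```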

Tensor Parallelism Strategy

  • S-LoRA introduces a new tensor parallelism strategy to efficiently decouple the base model and LoRA adapters.

  • The details of this strategy will be discussed in Section 6 of the paper.

S-LoRA employs a batching strategy that separates the computation of the base model and LoRA adapters, using custom CUDA kernels for efficient computation.

Adapter clustering and admission control techniques are applied to enhance batching efficiency and sustain good SLO attainment under high traffic.

The system also leverages a unified memory pool (Unified Paging) to manage KV caches and adapter weights dynamically, reducing fragmentation and enabling the serving of a large number of adapters bounded by the main memory size.

The figure in the paper illustrates the proposed tensor parallelism partition strategy for batched LoRA computation.

The upper box shows the base model's Megatron-LM partition strategy:

  • The first weight matrix (W1) is column-partitioned.

  • The second weight matrix (W2) is row-partitioned.

  • An all-reduce communication is required to accumulate the partial sum from distributed devices.

The lower box depicts the partitioning strategy for the added LoRA computation:

  • Matrices A1 and B1 for the adapter of the first weight matrix (W1) are column-partitioned.

  • An all-gather operation collects the intermediate results.

  • Matrices A2 and B2 for the adapter of the second weight (W2) are row-partitioned and column-partitioned, respectively.

  • An all-reduce operation sums up the intermediate results.

  • The result from the LoRA computation is added to that from the base model (add_2).

  • A single all-reduce operation accumulates the final results, fusing the all-gather operation for matmul_4 with the final all-reduce to optimize communication.

In the figure, different colours represent the various partition strategies, including column partition, row partition, partial sum, and replication, and the per-GPU shape of each tensor is annotated in grey. B is the number of tokens, h is the input dimension, N is the number of devices, d is the hidden size, and r is the adapter rank.

This strategy aims to align the partition strategies of inputs and outputs of the added LoRA computation with those of the base model, minimizing communication costs by avoiding unnecessary communications and fusing some communications.
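
To ground the base-model half of this picture, the NumPy sketch below simulates the Megatron-LM partition described above on N "devices": W1 is column-partitioned, W2 is row-partitioned, and the all-reduce that accumulates the partial sums is simulated with a plain sum. The LoRA half and the communication fusing are omitted; this only illustrates why the column/row split needs exactly one all-reduce.

```python
import numpy as np

N = 2                                    # number of simulated devices
Btok, h, d = 4, 8, 8                     # tokens, input dimension, hidden size

x  = np.random.randn(Btok, h)
W1 = np.random.randn(h, d)               # first weight matrix (column-partitioned)
W2 = np.random.randn(d, h)               # second weight matrix (row-partitioned)

W1_shards = np.split(W1, N, axis=1)      # each device holds a column slice of W1
W2_shards = np.split(W2, N, axis=0)      # ... and the matching row slice of W2

# Each device computes its partial result locally; no communication until the end.
partials = [(x @ W1_shards[i]) @ W2_shards[i] for i in range(N)]

y = sum(partials)                        # the all-reduce: accumulate partial sums across devices
assert np.allclose(y, x @ W1 @ W2)       # matches the unpartitioned computation
```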

Adaptation to Self-Attention Layer

  • The partition strategy can be easily adapted to the self-attention layer.

  • Similar to the Megatron-LM strategy, the head dimension of the self-attention layer is partitioned.

  • The query-key-value projection weight matrix is treated as W1, and the output projection weight matrix is treated as W2.

Communication and Memory Cost Analysis

  • Let N be the number of devices, B be the number of tokens, h be the hidden size, and r be the adapter rank.

  • The communication cost of the base model is one all-reduce, or (2(N-1)Bh) / N.

  • The communication cost of the added LoRA computation is three all-gather for query, key, and value projections, and one all-reduce for the output projection, totaling (5(N-1)Br) / N.

  • The additional communication cost introduced by LoRA is negligible compared to that of the base model because r << h (a numeric illustration follows this list).

  • This is achieved by carefully scheduling communications on the small intermediate tensors of LoRA computation and fusing communications with the base model.

  • In terms of memory usage, the strategy is optimal because all weight matrices are partitioned among all devices, and there is no replicated weight matrix.
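
Plugging illustrative numbers into the two cost expressions above (hypothetical values, not benchmarks from the paper) shows how small the LoRA term is:

```python
N, B, h, r = 4, 2048, 4096, 16          # devices, tokens, hidden size, adapter rank (illustrative)

base_cost = 2 * (N - 1) * B * h / N     # one all-reduce for the base model
lora_cost = 5 * (N - 1) * B * r / N     # three all-gathers plus one all-reduce for LoRA

print(lora_cost / base_cost)            # 5r / (2h) ≈ 0.01, i.e. about 1% extra communication
```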

The proposed tensor parallelism strategy for batched LoRA inference effectively distributes the computation and memory usage of the additional LoRA adapters across multiple GPUs.

By aligning the partition strategies with the base model and carefully scheduling and fusing communications, the additional overhead introduced by LoRA is minimized. This enables efficient multi-GPU inference of large transformer models with multiple LoRA adapters.

Evaluation

Setup

  • The experiments are conducted using the Llama model series (Llama-7B, Llama-13B, Llama-30B, Llama-70B) with various adapter configurations.

  • The hardware setups include a single NVIDIA A10G GPU (24 GB), a single A100 GPU (40 GB/80 GB), and multiple A100 GPUs (40 GB/80 GB).

  • The baselines for comparison include HuggingFace PEFT, vLLM-packed, and variants of S-LoRA.

  • Metrics used for evaluation include throughput, average request latency, average first token latency, SLO attainment, and user satisfaction.

End-to-End Results on Synthetic Workloads

  • Synthetic workload traces are generated using the Gamma process with various combinations of parameters (n, α, R, cv).

  • S-LoRA can serve up to 2,000 adapters simultaneously with minimal overhead for the added LoRA computation.

  • vLLM-packed can only serve fewer than 5 adapters due to GPU memory constraints and has lower throughput due to missed batching opportunities.

  • PEFT lacks advanced batching methods and memory management, resulting in significantly worse performance compared to S-LoRA.

  • S-LoRA achieves up to 4x higher throughput than vLLM-packed and up to 30x higher than PEFT while supporting a significantly larger number of adapters.

  • S-LoRA outperforms its variants (S-LoRA-bmm and S-LoRA-no-unify-mem) in terms of throughput and latency, demonstrating the effectiveness of the memory pool and custom kernels.

  • S-LoRA's throughput remains stable once the number of adapters reaches a certain threshold, indicating its scalability.

End-to-End Results on Real Workloads

  • Real-world serving traces are constructed by downsampling from the traces of LMSYS Chatbot Arena.

  • The results on real workloads show a similar pattern to the synthetic workloads, confirming S-LoRA's strong performance in real-world scenarios.

Reference: S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv.org)