S-LoRA
This November 2023 paper presents S-LoRA, a system designed for the scalable serving of many LoRA adapters derived from a single base language model.
The development of S-LoRA, a system capable of efficiently serving thousands of concurrent LoRA adapters, has significant commercial ramifications and opens up new possibilities for large language model deployment and customisation.
From a commercial perspective, S-LoRA enables the scalable serving of numerous task-specific fine-tuned models derived from a single base model.
This means that businesses can now offer highly personalised and tailored language model services to their customers without incurring the high costs and computational resources associated with serving multiple full-sized models.
By leveraging S-LoRA, companies can provide a wide range of specialised models for various domains, industries, or even individual customers, all while maintaining a single base model.
The ability to serve a large number of LoRA adapters concurrently also opens up opportunities for new business models and revenue streams.
For example, a company could offer a subscription-based service where customers can access a vast library of pre-trained LoRA adapters for different tasks or domains.
Customers could then fine-tune these adapters further using their own data, creating highly customised models tailored to their specific needs.
This model of providing access to a diverse set of adapters could be particularly attractive to smaller businesses or start-ups that may not have the resources to train and maintain their own large language models from scratch.
The paper focuses on the "pretrain-then-finetune" paradigm commonly used in deploying large language models (LLMs).
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning method that adapts a base model to multiple tasks by updating only low-rank additive matrices called adapter weights.
Serving numerous fine-tuned LoRA adapters at scale is challenging and was largely unexplored before this work.
S-LoRA introduces a unified memory pool to manage dynamic adapter weights and KV cache tensors.
It uses a unified paging mechanism to reduce memory fragmentation and increase batch size.
S-LoRA employs custom CUDA kernels to efficiently batch LoRA computations for adapters with varying ranks.
The kernels operate on non-contiguous memory and align with the memory pool design for efficient batched inference.
S-LoRA TP (Tensor Parallelism)
A novel tensor parallelism strategy is introduced to parallelize across multiple GPUs with minimal communication overhead.
It schedules communications on small intermediate tensors and fuses large ones with the base model's communications.
LoRA introduces low-rank additive matrices to each layer of the base model during fine-tuning.
For a pre-trained weight matrix W of shape h × d, LoRA updates it as W' = W + AB, where A (h × r) and B (r × d) are low-rank matrices with rank r << min(h, d).
The forward pass after applying LoRA becomes: h = xW' = x(W + AB) = xW + xAB.
LoRA is typically applied only to the query, key, value, and output projection matrices in the self-attention module.
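As a minimal illustration of the formula above (a sketch with assumed dimensions, not code from the paper), the following NumPy snippet shows that computing xW + xAB gives the same result as using the merged weight W' = W + AB:

```python
import numpy as np

# Minimal sketch of the LoRA forward pass h = xW + xAB; dimensions are illustrative.
h_dim, d_dim, r = 1024, 1024, 16          # input dim, output dim, adapter rank (r << min(h, d))
x = np.random.randn(1, h_dim)             # one token's activation
W = np.random.randn(h_dim, d_dim)         # frozen base weight
A = np.random.randn(h_dim, r)             # low-rank factor A
B = np.random.randn(r, d_dim)             # low-rank factor B

out = x @ W + (x @ A) @ B                 # base path plus low-rank update
assert np.allclose(out, x @ (W + A @ B))  # same result as the merged weight W' = W + AB
```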
LLMs have high computational and memory demands due to their large parameter sizes.
The inference process involves iterative autoregressive decoding, requiring storing hidden states (KV cache) which adds to memory overhead.
Serving requests with varying sequence lengths dynamically is challenging.
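A back-of-the-envelope calculation makes the memory pressure concrete. The snippet below assumes the dimensions of a typical 7B-parameter model (32 layers, hidden size 4096, fp16 values); the exact figures are an illustration rather than numbers from the paper:

```python
# KV-cache footprint per token: keys and values for every layer, stored in fp16.
num_layers, hidden_size, bytes_per_value = 32, 4096, 2
kv_per_token = 2 * num_layers * hidden_size * bytes_per_value
print(kv_per_token / 1024, "KiB per token")                          # 512 KiB
print(kv_per_token * 2048 / 2**30, "GiB for a 2048-token sequence")  # ~1 GiB
```

At this scale a handful of long sequences can consume gigabytes of GPU memory, which is why KV cache management directly limits batch size and throughput.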
S-LoRA is evaluated by serving Llama-7B/13B/30B/70B models.
It can serve thousands of LoRA adapters on a single GPU or across multiple GPUs with small overhead.
Compared to Hugging Face PEFT, S-LoRA improves throughput by up to 30×.
Compared to vLLM with naïve LoRA serving support, S-LoRA improves throughput by up to 4× and increases the number of served adapters by several orders of magnitude.
S-LoRA addresses the challenges of serving numerous LoRA adapters at scale by introducing efficient memory management (Unified Paging), optimized computation kernels for heterogeneous batching, and a novel tensor parallelism strategy for multi-GPU parallelization.
The system achieves significant improvements in throughput and the number of adapters served compared to existing libraries and serving systems.
S-LoRA's batching strategy aims to support online and high-throughput serving of many LoRA adapters simultaneously.
Instead of merging adapter weights into the base model (as suggested in the original LoRA paper), S-LoRA computes the additional term xAB on the fly.
This approach avoids weight duplication and enables batching of the more costly xW operation across different adapters.
S-LoRA batches the computation of the base model using GEMM and employs custom CUDA kernels to execute the additional xAB for all adapters separately.
Custom CUDA kernels are implemented for efficient computation without padding, considering the heterogeneity of sequence lengths and adapter ranks.
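The following pure-Python sketch captures the batching idea (it is a reference formulation with assumed names and shapes, not S-LoRA's CUDA kernel interface): the base xW is computed as one GEMM over the whole batch, while each request's xAB uses its own adapter, whose rank may differ, without padding:

```python
import numpy as np

h, d = 1024, 1024
W = np.random.randn(h, d)                               # shared base weight
adapters = {                                            # adapter_id -> (A, B); ranks may differ
    "a0": (np.random.randn(h, 8),  np.random.randn(8,  d)),
    "a1": (np.random.randn(h, 32), np.random.randn(32, d)),
}
batch = [                                               # each request carries its own adapter id
    {"x": np.random.randn(5, h), "adapter": "a0"},      # 5 tokens, rank-8 adapter
    {"x": np.random.randn(3, h), "adapter": "a1"},      # 3 tokens, rank-32 adapter
]

x_all = np.concatenate([req["x"] for req in batch])     # (8, h)
base_out = x_all @ W                                    # single batched GEMM for the base model

outputs, offset = [], 0
for req in batch:                                       # per-request low-rank path
    A, B = adapters[req["adapter"]]
    n = req["x"].shape[0]
    outputs.append(base_out[offset:offset + n] + (req["x"] @ A) @ B)
    offset += n
```

In S-LoRA this per-request loop is replaced by the custom kernels, which perform the same gather-and-multiply over non-contiguous GPU memory.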
To enhance batching efficiency, S-LoRA proposes adapter clustering, which prioritises batching requests that use the same adapter.
By using fewer adapters in a running batch, more memory can be allocated to the KV cache, enabling larger batch sizes and potentially higher throughput.
However, adapter clustering involves trade-offs, such as potential impact on average latency or fairness among adapters.
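A simplified sketch of the clustering idea (an illustration of the trade-off described above, not the paper's scheduler): when forming a batch, requests are admitted greedily, but the number of distinct adapters in the batch is capped so that more memory remains for the KV cache.

```python
def cluster_batch(pending, max_adapters, max_requests):
    """Greedily pick requests whose adapter is already active, capping distinct adapters."""
    batch, active = [], set()
    for req in pending:
        if len(batch) == max_requests:
            break
        if req["adapter"] in active or len(active) < max_adapters:
            batch.append(req)
            active.add(req["adapter"])
    return batch                      # requests not picked wait for a later batch

pending = [{"id": i, "adapter": f"lora-{i % 5}"} for i in range(20)]
print(cluster_batch(pending, max_adapters=2, max_requests=8))
```

Requests whose adapters fall outside the cap are deferred, which is exactly where the latency and fairness trade-offs mentioned above arise.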
S-LoRA applies an admission control strategy to sustain good SLO attainment when traffic exceeds the serving system's capacity.
The serving system is characterised by a service level objective (SLO) specifying the desired latency for processing requests.
An early abort strategy is implemented to mimic admission control, estimating the set of latest requests that can be served within the SLO and serving them in the order of arrival time.
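The early-abort rule can be sketched as follows (an interpretation of the description above, with an assumed per-request service-time estimate, not the actual implementation):

```python
def early_abort(arrival_times, now, slo=5.0, est_service=2.0):
    """Keep only the requests that can still finish within the SLO, in arrival order."""
    admitted, backlog = [], 0.0
    for arrival in sorted(arrival_times):               # arrival order
        projected_latency = (now - arrival) + backlog + est_service
        if projected_latency <= slo:
            admitted.append(arrival)
            backlog += est_service
        # requests that would inevitably miss the SLO are aborted instead of served late
    return admitted

print(early_abort([90.0, 94.0, 96.0, 98.0, 99.0], now=100.0))   # only the latest arrivals survive
```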
S-LoRA generalizes PagedAttention (introduced in vLLM) to Unified Paging, which supports dynamically loading LoRA adapters.
Unified Paging uses a unified memory pool to store the KV caches and adapter weights in a paged fashion.
This approach reduces fragmentation and balances the dynamically changing size of the KV caches and adapter weights.
All LoRA adapters are stored in the main memory, and only the adapters needed for the currently running batch are fetched to the GPU memory during inference.
The maximum number of adapters that can be served is bounded by the main memory size.
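The sketch below illustrates the unified pool at a high level (page size, ownership bookkeeping, and method names are assumptions, not S-LoRA's API): one set of fixed-size pages is shared between KV-cache tensors and adapter weights, so whichever is currently larger can claim more of the pool.

```python
class UnifiedPagePool:
    """Toy unified memory pool: pages hold either KV-cache blocks or adapter-weight blocks."""
    def __init__(self, num_pages):
        self.free_pages = list(range(num_pages))
        self.owner = {}                          # page_id -> ("kv", seq_id) or ("adapter", adapter_id)

    def alloc(self, owner, n_pages):
        if len(self.free_pages) < n_pages:
            raise MemoryError("pool exhausted: evict adapters or preempt sequences")
        pages = [self.free_pages.pop() for _ in range(n_pages)]
        for p in pages:
            self.owner[p] = owner
        return pages                             # non-contiguous page ids; kernels index through them

    def free(self, owner):
        for p in [p for p, o in self.owner.items() if o == owner]:
            del self.owner[p]
            self.free_pages.append(p)

pool = UnifiedPagePool(num_pages=1024)
kv_pages = pool.alloc(("kv", "seq-0"), n_pages=4)          # KV-cache pages for a running sequence
ad_pages = pool.alloc(("adapter", "lora-7"), n_pages=16)   # pages for an adapter fetched from host memory
pool.free(("adapter", "lora-7"))                           # release the adapter once it leaves the batch
```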
S-LoRA adopts the iteration-level scheduling batching strategy from Orca, scheduling requests at the token level and incorporating new requests into the running batch if space is available.
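A toy version of that scheduling loop is shown below (the capacity check and per-step decoding are stand-ins; the real scheduler must also respect KV-cache and adapter memory):

```python
def serve(output_lengths, max_batch_size=2):
    """Iteration-level scheduling: requests join and leave the batch between decoding steps."""
    waiting = [{"id": i, "remaining": n} for i, n in enumerate(output_lengths)]
    running, steps = [], 0
    while running or waiting:
        while waiting and len(running) < max_batch_size:     # admit new requests when there is room
            running.append(waiting.pop(0))
        for req in running:                                  # one decoding iteration for the whole batch
            req["remaining"] -= 1
        running = [r for r in running if r["remaining"] > 0] # finished requests leave immediately
        steps += 1
    return steps

print(serve([3, 5, 2, 4]))   # requests with different output lengths share decoding iterations
```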
S-LoRA introduces a new tensor parallelism strategy to efficiently decouple the base model and LoRA adapters.
The details of this strategy will be discussed in Section 6 of the paper.
S-LoRA employs a batching strategy that separates the computation of the base model and LoRA adapters, using custom CUDA kernels for efficient computation.
Adapter clustering and admission control techniques are applied to enhance batching efficiency and sustain good SLO attainment under high traffic.
The system also leverages a unified memory pool (Unified Paging) to manage KV caches and adapter weights dynamically, reducing fragmentation and enabling the serving of a large number of adapters bounded by the main memory size.
The chart illustrates the proposed tensor parallelism partition strategy for batched LoRA computation.
The upper box shows the base model's Megatron-LM partition strategy:
The first weight matrix (W1) is column-partitioned.
The second weight matrix (W2) is row-partitioned.
An all-reduce communication is required to accumulate the partial sum from distributed devices.
The lower box depicts the partitioning strategy for the added LoRA computation:
Matrices A1 and B1 for the adapter of the first weight matrix (W1) are column-partitioned.
An all-gather operation collects the intermediate results.
Matrices A2 and B2 for the adapter of the second weight (W2) are row-partitioned and column-partitioned, respectively.
An all-reduce operation sums up the intermediate results.
The result from the LoRA computation is added to that from the base model (add_2).
A single all-reduce operation accumulates the final results, fusing the all-gather operation for matmul_4 with the final all-reduce to optimize communication.
Different colours represent various partition strategies, including column partition, row partition, partial sum, and replication. The per-GPU shape of each tensor is annotated in gray. B is the number of tokens, h is the input dimension, N is the number of devices, d is the hidden size, and r is the adapter rank.
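The whole scheme can be checked end to end with a small NumPy simulation (toy dimensions, devices modelled as list entries, and the activation between the two layers omitted; this is a sketch of the partitioning described above, not the paper's implementation):

```python
import numpy as np

B, h, d, r, N = 8, 16, 32, 4, 4            # tokens, input dim, hidden size, adapter rank, devices
rng = np.random.default_rng(0)
X  = rng.normal(size=(B, h))
W1 = rng.normal(size=(h, d)); A1 = rng.normal(size=(h, r)); B1 = rng.normal(size=(r, d))
W2 = rng.normal(size=(d, h)); A2 = rng.normal(size=(d, r)); B2 = rng.normal(size=(r, h))

ref = X @ (W1 + A1 @ B1) @ (W2 + A2 @ B2)  # reference: merged LoRA weights on one device

# Base model (Megatron-LM): W1 column-partitioned, W2 row-partitioned
W1_parts = np.split(W1, N, axis=1); W2_parts = np.split(W2, N, axis=0)
# LoRA for W1: A1 and B1 column-partitioned
A1_parts = np.split(A1, N, axis=1); B1_parts = np.split(B1, N, axis=1)
# LoRA for W2: A2 row-partitioned, B2 column-partitioned
A2_parts = np.split(A2, N, axis=0); B2_parts = np.split(B2, N, axis=1)

# x A1 is computed per column shard, then collected by an all-gather (concatenation here)
xA1 = np.concatenate([X @ A1_parts[i] for i in range(N)], axis=1)            # B x r
# Per-device first-layer output: base shard plus LoRA shard (add_1), column-partitioned
H_parts = [X @ W1_parts[i] + xA1 @ B1_parts[i] for i in range(N)]            # each B x d/N
# x A2 produces partial sums that an all-reduce accumulates (summation here)
xA2 = sum(H_parts[i] @ A2_parts[i] for i in range(N))                        # B x r
# Base second layer: partial sums accumulated by the final all-reduce
Y_base = sum(H_parts[i] @ W2_parts[i] for i in range(N))                     # B x h
# LoRA second layer: column-partitioned, collected by the all-gather fused with that all-reduce
Y_lora = np.concatenate([xA2 @ B2_parts[i] for i in range(N)], axis=1)       # B x h

assert np.allclose(Y_base + Y_lora, ref)                                     # add_2 recovers the full result
```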
This strategy aims to align the partition strategies of inputs and outputs of the added LoRA computation with those of the base model, minimizing communication costs by avoiding unnecessary communications and fusing some communications.
The partition strategy can be easily adapted to the self-attention layer.
Similar to the Megatron-LM strategy, the head dimension of the self-attention layer is partitioned.
The query-key-value projection weight matrix is treated as W1, and the output projection weight matrix is treated as W2.
For this self-attention case, let N be the number of devices, B the number of tokens, h the hidden size, and r the adapter rank.
The communication cost of the base model is one all-reduce, or (2(N-1)Bh) / N.
The communication cost of the added LoRA computation is three all-gather for query, key, and value projections, and one all-reduce for the output projection, totaling (5(N-1)Br) / N.
The additional communication cost introduced by LoRA is negligible compared to the communication cost of the base model because r << h.
This is achieved by carefully scheduling communications on the small intermediate tensors of LoRA computation and fusing communications with the base model.
In terms of memory usage, the strategy is optimal because all weight matrices are partitioned among all devices, and there is no replicated weight matrix.
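Plugging representative numbers into the two cost formulas above shows how small the LoRA term is. The values below are assumptions for illustration (h = 4096, roughly the Llama-7B hidden size, and r = 16); the B and (N-1)/N factors cancel in the ratio:

```python
h, r = 4096, 16
ratio = (5 * r) / (2 * h)        # (5(N-1)Br/N) / (2(N-1)Bh/N) simplifies to 5r / 2h
print(f"LoRA adds about {ratio:.1%} extra communication")   # roughly 1%
```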
The proposed tensor parallelism strategy for batched LoRA inference effectively distributes the computation and memory usage of the additional LoRA adapters across multiple GPUs.
By aligning the partition strategies with the base model and carefully scheduling and fusing communications, the additional overhead introduced by LoRA is minimized. This enables efficient multi-GPU inference of large transformer models with multiple LoRA adapters.
The experiments are conducted using the Llama model series (Llama-7B, Llama-13B, Llama-30B, Llama-70B) with various adapter configurations.
The hardware setup includes single NVIDIA A10G GPU (24GB), single A100 GPU (40GB/80GB), and multiple A100 GPUs (40GB/80GB).
The baselines for comparison include Hugging Face PEFT, vLLM-packed, and variants of S-LoRA.
Metrics used for evaluation include throughput, average request latency, average first token latency, SLO attainment, and user satisfaction.
Synthetic workload traces are generated using the Gamma process with various combinations of parameters (n, α, R, cv).
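One way to realise such a trace (an interpretation of the R and cv parameters above, not the paper's exact generator) is to draw inter-arrival times from a Gamma distribution with mean 1/R and coefficient of variation cv for each adapter:

```python
import numpy as np

def gamma_arrivals(n_adapters, rate, cv, duration, seed=0):
    """Per-adapter request arrival times from a Gamma process with the given rate and cv."""
    rng = np.random.default_rng(seed)
    shape, scale = 1.0 / cv**2, cv**2 / rate        # mean inter-arrival = 1/rate, cv as requested
    traces = {}
    for adapter_id in range(n_adapters):
        t, arrivals = 0.0, []
        while t < duration:
            t += rng.gamma(shape, scale)
            arrivals.append(t)
        traces[adapter_id] = arrivals
    return traces

traces = gamma_arrivals(n_adapters=4, rate=2.0, cv=1.0, duration=10.0)   # ~20 requests per adapter
```

In the paper, n sets the number of adapters and α shapes how traffic is skewed across them; the sketch above treats every adapter identically.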
S-LoRA can serve up to 2,000 adapters simultaneously with minimal overhead for the added LoRA computation.
vLLM-packed can only serve fewer than 5 adapters due to GPU memory constraints and has lower throughput due to missed batching opportunities.
PEFT lacks advanced batching methods and memory management, resulting in significantly worse performance compared to S-LoRA.
S-LoRA achieves up to 4x higher throughput than vLLM-packed and up to 30x higher than PEFT while supporting a significantly larger number of adapters.
S-LoRA outperforms its variants (S-LoRA-bmm and S-LoRA-no-unify-mem) in terms of throughput and latency, demonstrating the effectiveness of the memory pool and custom kernels.
S-LoRA's throughput remains stable once the number of adapters reaches a certain threshold, indicating its scalability.
Real-world serving traces are constructed by downsampling from the traces of LMSYS Chatbot Arena.
The results on real workloads show a similar pattern to the synthetic workloads, confirming S-LoRA's strong performance in real-world scenarios.