Practical Tips for Fine-tuning LMs Using LoRA (Low-Rank Adaptation)
This is a terrific article from the genius Sebastian Raschka, PhD.
The article provides valuable insights and lessons learned from the author's experiments with Low-rank Adaptation (LoRA), a widely used technique for efficiently training custom large language models (LLMs).
The main takeaways include the consistency of LoRA training outcomes across multiple runs, the trade-offs of using QLoRA (quantized LoRA) for memory savings, and the minimal impact of optimizer choice on LLM fine-tuning.
The author also discusses the importance of applying LoRA across all layers, adjusting the LoRA rank and alpha value, and the feasibility of fine-tuning 7 billion parameter models on a single GPU.
Additionally, the article addresses common questions related to LoRA, such as the significance of the dataset, the effectiveness of LoRA for domain adaptation, and strategies for avoiding overfitting.
The author compares LoRA to full fine-tuning and RLHF (Reinforcement Learning with Human Feedback), highlighting the memory efficiency and performance of LoRA.
The article also explores the possibility of combining multiple sets of LoRA weights and discusses the concept of Layer-wise Optimal Rank Adaptation.
Consistency in LLM Training: Even though there is some inherent randomness when training language models on GPUs, the results are usually very consistent across multiple training runs.
QLoRA for Memory Efficiency: QLoRA is a good choice when GPU memory is your biggest constraint. It saves about a third of the memory but makes training roughly 39% slower, so it's worth it mainly when memory is the bottleneck (see the loading sketch after this list).
Optimizer Choice in Fine-Tuning: It doesn't make a big difference which optimizer you choose (AdamW, SGD with a scheduler, AdamW with a scheduler). SGD by itself isn't as good, but the others are all pretty similar.
Adam Optimizer and Memory Usage: The Adam optimizer keeps two extra values for every trainable parameter. This sounds expensive, but for LLM fine-tuning it doesn't raise memory usage much, because most of the memory goes to the large matrix multiplications rather than to the optimizer state.
Multi-Epoch Training and Static Datasets: Training on the same dataset multiple times (multi-epoch training) might not help and could even make the model worse, probably because it starts to overfit the data.
Application of LoRA: To make the model work its best, use LoRA on all the layers, not just the Key and Value matrices.
Adjusting LoRA Parameters: It's important to choose the right LoRA rank and alpha value. A good rule of thumb is to make alpha twice as big as the rank.
Fine-tuning 7 Billion Parameter Models: You can fine-tune these big models in just a few hours on a single GPU with 14 GB of memory. But it's hard to make an LLM do well on all benchmark tasks with just one dataset; you might need to use different datasets or tools.
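To make the QLoRA point above concrete, here is a minimal sketch of loading a 7 billion parameter base model in 4-bit precision using the Hugging Face transformers and bitsandbytes libraries. The article itself used different tooling, and the checkpoint name and quantization settings here are illustrative assumptions, not values from the article:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA idea: keep the frozen base model in 4-bit to save memory;
# only the small LoRA adapter weights are trained in higher precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization used by QLoRA
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # hypothetical 7B checkpoint; any causal LM works
    quantization_config=bnb_config,
    device_map="auto",
)
```

The memory saving comes from storing the frozen base weights in 4-bit; the roughly 39% slowdown comes from dequantizing them on the fly during the forward and backward passes.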
The article talks about experiments where LoRA was first used only on Key and Value weight matrices in transformer layers.
Using it on Query weight matrices, projection layers, and other linear layers too makes the number of trainable parameters much bigger (from about 4.2 million to over 20 million for a model with 7 billion parameters).
This uses more memory (16.62 GB instead of 14.18 GB) but can make the model perform noticeably better.
The author says they only tried two settings (LoRA for just the query and value matrices, and LoRA for all layers) and suggests that future experiments should look at other combinations, like what happens if you use LoRA for projection layers.
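As a hedged sketch of how those two settings might look with the Hugging Face peft library: the module names below (q_proj, v_proj, and so on) follow Llama-style naming and are assumptions, since the article used its own training code and other architectures name their layers differently; the rank and alpha values here are placeholders.

```python
from peft import LoraConfig

# Setting 1: LoRA only on the query and value projections
# (roughly 4.2 M trainable parameters on a 7 B model)
qv_only = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Setting 2: LoRA on all linear layers, as the article recommends
# (over 20 M trainable parameters, more memory, better results)
all_linear = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```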
Balancing LoRA Hyperparameters - Rank (r) and Alpha (α)
The article explains the importance of the scaling coefficient in LoRA, which uses the rank parameter (r) and another hyperparameter α (alpha).
The LoRA update is scaled by α / r, so a larger scaling factor gives the LoRA weights more influence on the model's output. The author tried different rank values and found that setting α to twice the rank usually gives the best results. This was especially clear when r was set to 256, where the best α was found to be 512.
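The role of the scaling factor is easiest to see in a minimal, self-contained PyTorch sketch of a LoRA-wrapped linear layer. This is not the article's implementation; the initialisation and default values are simplified assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a low-rank update scaled by alpha / r."""

    def __init__(self, base: nn.Linear, r: int = 256, alpha: int = 512):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                     # pretrained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r                        # the scaling factor discussed above

    def forward(self, x):
        # output = W x + (alpha / r) * B A x
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

With r=256 and α=512 the scaling factor is 2, which matches the "alpha twice the rank" rule of thumb above.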
Training 7 Billion Parameter Models on a Single GPU
One of the big benefits of LoRA, as the article points out, is that it lets you fine-tune big models (like a model with 7 billion parameters) on just one GPU.
Using QLoRA with the best settings (r=256 and α=512) and an AdamW optimizer, you can fine-tune a model this big in about 3 hours on an A100 GPU, even with a big training dataset (like the Alpaca dataset with 50,000 examples).
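Putting the pieces together, a hedged sketch of this configuration with the peft library might look as follows. It builds on the 4-bit base model loaded in the earlier sketch; the learning rate, dropout, and target module names are illustrative assumptions, not values reported in the article.

```python
import torch
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,                      # the best-performing pairing reported in the article
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,                   # assumed value
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)   # base_model: the 4-bit model loaded earlier
model.print_trainable_parameters()

# AdamW over only the trainable LoRA parameters
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=3e-4,                             # assumed learning rate
)
```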