GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
This March 2024 paper addresses the memory challenges in training large language models (LLMs) and proposes a novel approach called Gradient Low-Rank Projection (GaLore) to reduce memory usage while maintaining performance.
The growing size of weights and optimizer states in LLMs leads to significant memory requirements.
Pre-training a LLaMA 7B model from scratch with a batch size of one requires at least 58 GB of memory, making it infeasible on consumer-level GPUs such as an NVIDIA RTX 4090 with 24 GB of memory.
Low-rank adaptation (LoRA) reduces trainable parameters and optimizer states by adding a trainable low-rank matrix to the frozen pre-trained weight in each layer.
LoRA and its variant ReLoRA have limitations, such as underperforming full-rank training, requiring full-rank warm-up, and altering training dynamics.
GaLore is a training strategy that allows full-parameter learning while being more memory-efficient than common low-rank adaptation methods.
The key idea is to leverage the slow-changing low-rank structure of the gradient matrix G, rather than approximating the weight matrix itself as low-rank.
GaLore computes two projection matrices P and Q to project the gradient matrix G into a low-rank form P⊤GQ, reducing the memory cost of optimizer states.
Occasional updates of P and Q (e.g., every 200 iterations) incur minimal amortized additional computational cost.
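In equation form, the update sketched in the paper looks roughly like this, where ρ_t denotes the entry-wise stateful optimizer update (e.g., Adam's) and η the learning rate; in practice only one of the two projectors is used, as discussed later:

```latex
R_t = P_t^\top G_t Q_t, \qquad
\tilde{R}_t = \rho_t(R_t), \qquad
W_{t+1} = W_t + \eta \, P_t \tilde{R}_t Q_t^\top
```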
GaLore is more memory-efficient than LoRA, using up to 30% less memory during pre-training.
8-bit GaLore, combined with 8-bit optimizers and layer-wise weight update techniques, achieves performance comparable to its full-rank counterpart while using less than 10% of the memory for optimizer states.
GaLore enables, for the first time, the feasibility of pre-training a LLaMA 7B model from scratch on a single GPU with 24 GB memory (e.g., NVIDIA RTX 4090) without costly memory offloading techniques.
GaLore keeps low memory throughout the entire training, without requiring full-rank training warm-up like ReLoRA.
GaLore is used to fine-tune pre-trained LLMs on the GLUE benchmark, achieving comparable or better results than existing low-rank methods.
When fine-tuning RoBERTa-Base on GLUE tasks with a rank of 4, GaLore outperforms LoRA.
As a gradient projection method, GaLore is independent of the choice of optimizers and can be easily plugged into existing ones with only two lines of code.
GaLore works for popular optimizers such as AdamW, 8-bit Adam, and Adafactor, and its performance is insensitive to the very few hyperparameters it introduces.
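The paper illustrates this plug-in property with a short pseudocode loop; the sketch below is a paraphrase of that idea in PyTorch, where `project`, `project_back`, and `optimizer_update` are hypothetical helpers standing in for the P/Q projections and the underlying Adam/Adafactor update, not functions from any released library:

```python
import torch

def galore_style_step(model, optimizer_update, project, project_back, lr=1e-3):
    """One GaLore-style step: the optimizer only ever sees low-rank gradients."""
    for weight in model.parameters():
        if weight.grad is None or weight.grad.ndim != 2:
            continue                              # GaLore targets 2-D weight matrices
        grad = weight.grad                        # full-rank gradient, shape (m, n)
        lor_grad = project(grad)                  # original space -> compact space
        lor_update = optimizer_update(lor_grad)   # Adam/Adafactor tracks only low-rank stats
        update = project_back(lor_update)         # compact space -> original space
        weight.data.add_(update, alpha=-lr)       # apply the full-rank update
```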
The paper provides theoretical justification for the low-rankness of gradient updates and convergence analysis of GaLore.
The paper discusses several methodologies for memory-efficient optimization in training large language models (LLMs). Here's a simplified comparison and contrast of the key approaches:
LoRA reduces memory footprint by maintaining a low-rank weight adaptor for each layer.
It introduces additional low-rank adaptors (A and B) to the fixed weight matrix (W0).
LoRA and its variants have limitations, such as underperforming full-rank training and requiring full-rank warm-up.
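Concretely, LoRA's re-parameterisation of an m × n weight matrix can be written as (standard LoRA notation, not specific to this paper):

```latex
W = W_0 + BA, \qquad B \in \mathbb{R}^{m \times r}, \; A \in \mathbb{R}^{r \times n}, \; r \ll \min(m, n)
```

Only A and B receive gradients, so the trainable parameters and optimizer states scale with r(m + n) instead of mn.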
Subspace learning optimizes model weights within a low-rank subspace.
It leverages the finding that learning primarily occurs within a significantly low-dimensional parameter subspace.
This notion has been widely used in various domains of machine learning.
Projected gradient descent (PGD) is a traditional optimization method that studies gradients in the vector space.
GaLore is related to PGD but considers the specific gradient form that appears in training multi-layer neural networks.
GaLore proves properties of the gradients in the matrix space, while traditional PGD treats the objective as a general black-box nonlinear function.
Various methods have been proposed to reduce the memory cost of gradient statistics for adaptive optimization algorithms.
Adafactor achieves sub-linear memory cost by factorizing the second-order statistics using a row-column outer product.
Quantization is widely used to reduce the memory cost of optimizer states.
Fused gradient computation reduces the memory cost of storing weight gradients during training.
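Of these, Adafactor's factorization is the easiest to state concretely: for an m × n gradient G_t, it keeps (exponential moving averages of) the row and column sums of the squared gradient instead of the full second-moment matrix, roughly as follows (reproduced from the Adafactor formulation rather than this paper):

```latex
\hat{V}_t \approx \frac{R_t C_t}{\mathbf{1}_m^\top R_t}, \qquad
R_t = (G_t \odot G_t)\,\mathbf{1}_n \in \mathbb{R}^{m \times 1}, \quad
C_t = \mathbf{1}_m^\top (G_t \odot G_t) \in \mathbb{R}^{1 \times n}
```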
GaLore is a training strategy that allows full-parameter learning while being more memory-efficient than low-rank adaptation methods.
It leverages the slow-changing low-rank structure of the gradient matrix G.
GaLore computes projection matrices P and Q to project the gradient matrix G into a low-rank form, reducing the memory cost of optimizer states.
Unlike LoRA, GaLore explicitly utilizes low-rank updates instead of introducing additional low-rank adaptors, preserving the original training dynamics.
GaLore operates independently of the optimizers, as they directly receive the low-rank gradients without knowing their full-rank counterparts.
In summary, LoRA and subspace learning focus on optimizing model weights within a low-rank subspace, while GaLore leverages the low-rank structure of the gradient matrix to reduce memory cost. PGD is a traditional optimization method, and memory-efficient optimization techniques like Adafactor and quantization aim to reduce the memory cost of optimizer states. GaLore differs from LoRA by explicitly utilizing low-rank updates and preserving the original training dynamics, making it more memory-efficient while maintaining performance.
GaLore allows switching across low-rank subspaces during training to learn full-rank weights without increasing memory footprint.
The weight updates are accumulated within each subspace, and the projectors (P and Q) are re-initialized when switching to a new subspace.
The switching frequency (T) becomes a hyperparameter, with a sweet spot existing between too frequent and too infrequent changes.
The computational overhead induced by SVD for subspace switching is negligible compared to other memory-efficient training techniques.
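A minimal sketch of the subspace-switching logic, assuming a single left projector P taken from a truncated SVD of the current gradient (function and variable names here are illustrative, not the paper's reference code):

```python
import torch

def maybe_refresh_projector(grad, P, step, rank=128, update_freq=200):
    """Re-initialise the projector every `update_freq` steps (the paper's T);
    otherwise keep projecting into the current subspace."""
    if P is None or step % update_freq == 0:
        # A truncated SVD of the current gradient defines the new subspace.
        U, S, Vh = torch.linalg.svd(grad.float(), full_matrices=False)
        P = U[:, :rank].to(grad.dtype)        # left singular vectors, shape (m, r)
    return P

# Inside the training loop (illustrative usage):
#   P = maybe_refresh_projector(weight.grad, P, step)
#   low_rank_grad = P.T @ weight.grad         # shape (r, n): what the optimizer tracks
```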
GaLore significantly reduces the memory cost of optimizers that rely on component-wise gradient statistics, such as Adam.
By projecting the gradient G into its low-rank form R, Adam's entry-wise gradient statistics (its first and second moments) only need to be tracked for the much smaller projected gradient.
GaLore can be applied to other optimizers (e.g., Adafactor) with similar update rules and memory requirements for gradient statistics.
To achieve the best memory-performance trade-off, GaLore uses only one projection matrix (P or Q) based on the dimensions of the weight matrix.
GaLore requires less memory than LoRA during training, as it does not need to store a separate low-rank factorization.
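To make the saving concrete, here is a rough element count for Adam's state on a single m × n weight, with and without a rank-r one-sided projection (my own back-of-the-envelope arithmetic, ignoring quantization and the weights themselves):

```python
def adam_state_elements(m, n, r=None):
    """Elements held for one (m, n) weight: Adam's two moment buffers,
    plus the projection matrix P when a rank-r GaLore projection is used."""
    if r is None:
        return 2 * m * n                      # exp_avg + exp_avg_sq, full rank
    return 2 * r * n + m * r                  # low-rank moments + projector P of shape (m, r)

# Example: a 4096 x 11008 LLaMA-7B-style MLP weight with rank 128:
#   full-rank Adam: 2 * 4096 * 11008              ~ 90.2M elements
#   GaLore:         2 * 128 * 11008 + 4096 * 128  ~ 3.3M elements
```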
GaLore is compatible with existing memory-efficient optimization techniques, such as 8-bit optimizers and per-layer weight updates.
The 8-bit Adam optimizer maintains 32-bit optimizer performance at a fraction of the original memory footprint, and GaLore can be applied directly on top of its implementation.
Per-layer weight updates are adopted in GaLore to further reduce memory footprint by performing weight updates during backpropagation.
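A sketch of how per-layer updates can be wired up, assuming bitsandbytes' 8-bit Adam and PyTorch 2.1+ post-accumulate-grad hooks (this mirrors the general technique, not necessarily the paper's exact implementation):

```python
import torch
import bitsandbytes as bnb

def attach_per_layer_updates(model, lr=1e-3):
    """Give each parameter its own 8-bit Adam and step it as soon as its
    gradient is accumulated, so each layer's gradient can be freed immediately
    instead of being held for the whole model until a global optimizer.step()."""
    optimizers = []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        opt = bnb.optim.Adam8bit([p], lr=lr)
        optimizers.append(opt)

        def update_now(param, opt=opt):
            opt.step()
            param.grad = None                 # free this layer's gradient right away

        p.register_post_accumulate_grad_hook(update_now)
    return optimizers
```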
GaLore introduces very few additional hyperparameters: rank (r), subspace change frequency (T), and scale factor (α).
The rank (r) is also present in LoRA, while the subspace change frequency (T) is specific to GaLore.
The scale factor (α) controls the strength of the low-rank update and does not depend on the rank (r), unlike LoRA's scale factor (α/r).
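Put together, the extra knobs might look like this (values and key names are purely illustrative, not the released implementation's exact arguments):

```python
galore_config = {
    "rank": 128,             # r: rank of the gradient projection
    "update_proj_gap": 200,  # T: how often the projector is re-computed via SVD
    "scale": 0.25,           # alpha: multiplier applied to the projected-back update
}
```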
GaLore's memory-efficient training techniques, such as low-rank subspace composition and memory-efficient optimization, enable it to learn full-rank weights while significantly reducing the memory footprint.
GaLore is compatible with existing optimization methods, such as 8-bit Adam and per-layer weight updates, further enhancing its memory efficiency. The introduction of very few additional hyperparameters makes GaLore easy to use and tune for optimal performance.
In conclusion, this paper introduces GaLore, a novel memory-efficient approach for pre-training and fine-tuning large language models.
GaLore employs gradient low-rank projection to significantly reduce the memory footprint of training, cutting optimizer-state memory usage by up to 65.5% compared to full-rank training with standard optimizers.
Extensive experiments on pre-training LLaMA models up to 7 billion parameters and fine-tuning on the GLUE benchmark demonstrate that GaLore maintains comparable performance to full-rank training while utilizing substantially less memory.
Notably, GaLore enables pre-training 7B models within the memory constraints of consumer GPUs like the RTX 4090, facilitating more accessible large model training.
The success of GaLore highlights the potential of gradient low-rank projection techniques for memory-efficient training of large models.
Looking ahead, future research can explore applying GaLore to other model architectures, further improving memory efficiency through quantization or specialized projection matrices, and enabling elastic distributed training on consumer hardware.
Ultimately, GaLore represents a promising step towards democratising the training of large language models by reducing the substantial computational resources traditionally required. By making large model training more accessible, GaLore could foster broader innovation and applications in natural language processing and beyond.