ReLoRA: High-Rank Training Through Low-Rank Updates
This December 2023 paper introduces ReLoRA, a method for efficiently training large neural networks using low-rank updates.
The authors argue that, despite the current trend of training ever-larger networks with hundreds of billions of parameters, it remains unclear whether such overparametrized models are necessary and why they work as well as they do.
ReLoRA aims to address this issue by demonstrating that low-rank updates can be used to train high-rank networks efficiently, potentially challenging the current scaling laws that govern large neural networks.
The paper focuses on applying ReLoRA to pre-training transformer language models with up to 350 million parameters, achieving performance comparable to regular full-rank training.
The authors suggest that the efficiency of ReLoRA increases with the model size, making it a promising approach for training multi-billion-parameter networks more efficiently.
The paper also discusses the complex relationship between overparametrization and the trainability and generalization of neural networks, referencing concepts such as the Lottery Ticket Hypothesis and parameter-efficient fine-tuning methods like LoRA (Low-Rank Adaptation) and Compacter.
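For context, LoRA keeps the pretrained weight frozen and learns only a low-rank additive update; in the usual notation (assumed here, with $r$ the chosen rank and $s$ a fixed scaling factor):

$$
W' = W + s\,W_A W_B, \qquad W_A \in \mathbb{R}^{m \times r},\; W_B \in \mathbb{R}^{r \times n},\; r \ll \min(m, n),
$$

so the trainable parameter count drops from $mn$ to $r(m+n)$ and the update $s\,W_A W_B$ has rank at most $r$.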
ReLoRA builds upon these ideas with a procedure that increases the effective rank of the overall update through periodic restarts (merging the learned low-rank factors into the main weights and reinitializing them), partial optimizer resets, and a jagged cosine learning rate schedule, sketched in code below.
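To make these moving parts concrete, here is a minimal PyTorch-style sketch of one ReLoRA layer and its restart machinery. This is not the authors' released implementation: the class and function names (`ReLoRALinear`, `merge_and_reinit`, `partial_optimizer_reset`, `jagged_cosine_lr`) and all hyperparameter values are illustrative assumptions.

```python
import math

import torch
import torch.nn as nn


class ReLoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update s * (W_A @ W_B)."""

    def __init__(self, base: nn.Linear, rank: int = 128, scale: float = 1.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # only the low-rank factors are trained
        self.scale = scale
        self.W_A = nn.Parameter(torch.randn(base.in_features, rank) * 0.02)
        self.W_B = nn.Parameter(torch.zeros(rank, base.out_features))

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.W_A @ self.W_B)

    @torch.no_grad()
    def merge_and_reinit(self):
        # ReLoRA "restart": fold the learned low-rank update into the frozen
        # weight, then reinitialize the factors so the next training segment
        # learns a fresh low-rank update in a (potentially) new subspace.
        self.base.weight += self.scale * (self.W_A @ self.W_B).T
        self.W_A.normal_(std=0.02)
        self.W_B.zero_()


@torch.no_grad()
def partial_optimizer_reset(optimizer, params, keep_fraction=0.01):
    # Prune most of the Adam moments of the low-rank parameters at a restart.
    # The magnitude-based rule below is a simplified stand-in, not the paper's code.
    for p in params:
        state = optimizer.state.get(p, {})
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                buf = state[key]
                k = max(1, int(keep_fraction * buf.numel()))
                threshold = buf.abs().flatten().kthvalue(buf.numel() - k + 1).values
                buf.mul_((buf.abs() >= threshold).to(buf.dtype))


def jagged_cosine_lr(step, total_steps, restart_every, rewarm_steps=100, base_lr=1e-3):
    # Global cosine decay, with the learning rate dropped and quickly re-warmed
    # after every restart -- the "jagged" shape referred to above.
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    steps_since_restart = step % restart_every
    rewarm = min(1.0, (steps_since_restart + 1) / rewarm_steps)
    return base_lr * cosine * rewarm
```

A training loop built on this sketch would run a fixed number of steps per segment, call `merge_and_reinit()` on every ReLoRA layer, prune the corresponding optimizer state, and continue under the jagged schedule; the paper additionally begins with a short stretch of regular full-rank training as a warm start.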
The authors provide a mathematical foundation for ReLoRA, explaining how it expands on the basic idea of LoRA by allowing multiple restarts, so that the total rank of the accumulated update grows over time.
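The rank argument rests on the subadditivity of matrix rank: summing independently trained low-rank updates can yield a higher-rank total update. In symbols (notation assumed here, with one pair of factors per restart segment):

$$
\operatorname{rank}\!\left(\sum_{t=1}^{N} s\,W_A^{t} W_B^{t}\right) \;\le\; \sum_{t=1}^{N} \operatorname{rank}\!\left(W_A^{t} W_B^{t}\right) \;\le\; N r,
$$

and the left-hand side is not constrained to stay at $r$: when the segments learn updates in sufficiently different subspaces, the accumulated update can approach rank $Nr$, which is what lets a sequence of rank-$r$ updates train a high-rank network.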
The paper reports on experiments with transformer language models, emphasizing the efficiency of ReLoRA in terms of both computational resources and training time.
Overall, the paper presents ReLoRA as an innovative approach to efficiently training large-scale neural networks, particularly transformers, by combining low-rank updates with specific training techniques.
The authors suggest that their findings could have significant implications for the scaling laws that govern large neural networks and contribute to a better understanding of how to efficiently scale up these models.