Batch Size and Model Loss
Last updated
Copyright Continuum Labs - 2023
The relationship between batch size and model loss is complex.
Early findings suggest that increasing the batch size can hurt performance; however, adjusting the learning rate together with the batch size can yield similar performance across a range of batch sizes.
The 2018 paper "Don’t Decay the Learning Rate, Increase the Batch Size" provides several key insights and recommendations regarding the relationship between learning rate, batch size, and model performance in stochastic gradient descent (SGD) optimisation:
Decaying the learning rate during training is equivalent to increasing the batch size in terms of the model's performance on the test set. This is because both strategies reduce the scale of random fluctuations (noise scale) in the SGD dynamics.
The noise scale is defined as $g = \epsilon\left(\frac{N}{B} - 1\right) \approx \frac{\epsilon N}{B}$ (for $B \ll N$), where $N$ is the training set size, $B$ is the batch size, and $\epsilon$ is the learning rate. Reducing this noise scale during training is beneficial, and can be achieved by either decaying the learning rate or increasing the batch size.
Increasing the batch size instead of decaying the learning rate can significantly reduce the number of parameter updates required to train a model, leading to shorter training times and improved computational efficiency.
The learning rate and batch size can be scaled together according to the linear scaling rule: $B \propto \epsilon$, that is, batch size proportional to learning rate. By increasing the learning rate and scaling the batch size accordingly, the number of parameter updates can be further reduced without sacrificing model performance.
The momentum coefficient $m$ can also be increased to reduce the number of parameter updates, by scaling the batch size as $B \propto 1/(1 - m)$. However, this may lead to a slight reduction in test accuracy.
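The points above can be checked with a few lines of arithmetic. The sketch below assumes the approximate form of the noise scale for a batch much smaller than the training set, extended with momentum as g ≈ εN / (B(1 − m)). The function name `noise_scale` and all numeric values are illustrative assumptions, not taken from the paper.

```python
def noise_scale(lr, train_size, batch_size, momentum=0.0):
    """Approximate SGD noise scale g = lr * N / (B * (1 - m)), assuming B << N."""
    return lr * train_size / (batch_size * (1.0 - momentum))


N = 50_000  # hypothetical training set size

base = noise_scale(lr=0.1, train_size=N, batch_size=128)

# Dividing the learning rate by 5 and multiplying the batch size by 5
# shrink the noise scale by the same factor.
decayed_lr = noise_scale(lr=0.1 / 5, train_size=N, batch_size=128)
bigger_batch = noise_scale(lr=0.1, train_size=N, batch_size=128 * 5)

# Linear scaling rule (B proportional to lr): scaling both leaves g unchanged.
scaled_together = noise_scale(lr=0.1 * 5, train_size=N, batch_size=128 * 5)

# Momentum rule (B proportional to 1 / (1 - m)): moving from m = 0.9 to
# m = 0.95 halves (1 - m), so doubling the batch size leaves g unchanged.
m_low = noise_scale(lr=0.1, train_size=N, batch_size=128, momentum=0.9)
m_high = noise_scale(lr=0.1, train_size=N, batch_size=256, momentum=0.95)

print(base, decayed_lr, bigger_batch, scaled_together, m_low, m_high)
```

Under this approximation `decayed_lr` and `bigger_batch` agree, which is the sense in which the two schedules are interchangeable.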
Stochastic Gradient Descent (SGD) is an optimisation algorithm commonly used in machine learning to update the model parameters (weights) in the direction of the negative gradient of the loss function. The key steps in SGD are:
1. Initialise the model parameters randomly.
2. For each training iteration:
   a. Sample a mini-batch of examples from the training set.
   b. Compute the average gradient of the loss function with respect to the model parameters over the mini-batch.
   c. Update the model parameters by taking a step in the direction of the negative gradient, scaled by the learning rate.
3. Repeat step 2 until convergence or for a fixed number of iterations.
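The steps above can be sketched in plain Python. This is a minimal illustration rather than a production optimiser: it fits a one-dimensional linear model with a half squared error loss, and the function name, data, and hyperparameter values are all assumptions chosen for the example.

```python
import random


def sgd_linear_regression(data, lr=0.05, batch_size=8, steps=500, seed=0):
    """Mini-batch SGD for y = w * x + b with loss 0.5 * (prediction - y) ** 2."""
    rng = random.Random(seed)
    # Initialise the model parameters randomly.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    # Repeat for a fixed number of iterations.
    for _ in range(steps):
        # a. Sample a mini-batch of examples from the training set.
        batch = rng.sample(data, batch_size)
        # b. Average the per-example gradients over the mini-batch.
        grad_w = sum((w * x + b - y) * x for x, y in batch) / batch_size
        grad_b = sum((w * x + b - y) for x, y in batch) / batch_size
        # c. Step along the negative gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical noiseless data generated from y = 3x - 2.
data_rng = random.Random(1)
data = [(x, 3.0 * x - 2.0) for x in (data_rng.uniform(-2, 2) for _ in range(200))]

w, b = sgd_linear_regression(data)
print(w, b)  # should land close to 3 and -2
```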
The learning rate determines the size of the steps taken in the parameter space during each update.
A higher learning rate leads to larger steps, while a lower learning rate results in smaller steps.
The batch size is the number of training examples used to compute the gradient in each iteration.
Larger batch sizes provide a more accurate estimate of the true gradient but require more computation per update.
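The trade-off between batch size and gradient accuracy can be seen empirically. The sketch below, which uses made-up data and a hypothetical helper `gradient_spread`, draws many mini-batches at a fixed parameter value and measures how much the mini-batch gradient estimate varies from draw to draw.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical noisy data scattered around y = 2x.
data = [(x, 2.0 * x + rng.gauss(0, 1)) for x in (rng.uniform(-1, 1) for _ in range(1000))]


def per_example_grad(w, x, y):
    """Gradient of 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def gradient_spread(batch_size, w=0.0, trials=2000):
    """Standard deviation of the mini-batch gradient estimate across random batches."""
    estimates = []
    for _ in range(trials):
        batch = rng.sample(data, batch_size)
        estimates.append(sum(per_example_grad(w, x, y) for x, y in batch) / batch_size)
    return statistics.pstdev(estimates)


spread_small = gradient_spread(batch_size=4)
spread_large = gradient_spread(batch_size=64)
print(spread_small, spread_large)
```

The larger batch gives a visibly tighter estimate, at the cost of sixteen times as many per-example gradient evaluations for each update.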
The authors show that the convergence of SGD is governed by the noise scale, which depends on both the learning rate and the batch size.
By carefully adjusting these hyperparameters during training, the noise scale can be reduced, leading to faster convergence and more efficient training.
The proposed strategies of increasing the batch size and scaling the learning rate and momentum coefficient enable practitioners to significantly reduce the number of parameter updates required to train a model without sacrificing performance.
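To make the saving in parameter updates concrete, the sketch below counts updates for two hypothetical three-phase schedules: one that keeps the batch size fixed (and would decay the learning rate by a factor of 5 at each phase boundary), and one that keeps the learning rate fixed and multiplies the batch size by 5 instead. The dataset size, phase lengths, and batch sizes are assumptions for illustration, not figures from the paper.

```python
import math


def count_updates(train_size, epochs_per_phase, batch_sizes):
    """Total parameter updates when each phase runs at a fixed batch size."""
    return sum(
        epochs * math.ceil(train_size / batch)
        for epochs, batch in zip(epochs_per_phase, batch_sizes)
    )


N = 50_000                       # hypothetical training set size
epochs_per_phase = [60, 60, 40]  # hypothetical phase lengths in epochs

# Schedule A: batch size stays at 128 while the learning rate is decayed.
decay_lr_updates = count_updates(N, epochs_per_phase, [128, 128, 128])

# Schedule B: learning rate stays fixed while the batch size grows 5x per phase.
grow_batch_updates = count_updates(N, epochs_per_phase, [128, 640, 3200])

print(decay_lr_updates, grow_batch_updates)
```

Both schedules pass over the same number of training examples, but the growing-batch schedule performs fewer than half as many parameter updates in this example.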
The 2023 paper "Rethinking Learning Rate Tuning in the Era of Large Language Models" addresses the challenges and opportunities of learning rate tuning in the era of Large Language Models (LLMs).
The authors argue that existing learning rate policies, primarily designed for traditional deep neural networks (DNNs), may not work well for LLM fine-tuning due to the unique characteristics of LLMs, such as high model complexity and expensive training costs.
The paper makes three main contributions:
It revisits existing learning rate policies to analyze the critical challenges of learning rate tuning for LLMs.
It presents LRBench++, a benchmarking tool for learning rate policies, to facilitate learning rate tuning for both traditional DNNs and LLMs.
It conducts experimental analyses using LRBench++ to demonstrate the key differences between LLM fine-tuning and traditional DNN training, validating their analysis.
The paper highlights an important issue in the era of LLMs: existing learning rate tuning strategies need to be reassessed and adapted to the unique characteristics of these models.
The authors identify the key differences between LLM fine-tuning and traditional DNN training, such as the much higher model complexity (billions vs. millions of parameters), prohibitively expensive training costs, different model initialization (pre-trained vs. random), fewer training epochs, and different evaluation strategies.
LRBench++ is the paper's practical contribution: a benchmarking tool designed specifically for learning rate tuning, usable for both traditional DNNs and LLMs.
It lets researchers and practitioners evaluate and compare different learning rate policies and their impact on the training or fine-tuning process.
LRBench++ provides a unified framework for defining, implementing, and evaluating various learning rate policies, including formula-based, state-based, and exploration-based policies.
The tool integrates with popular deep learning frameworks, such as TensorFlow and PyTorch, making it easy to incorporate into existing training/fine-tuning pipelines.
LRBench++ allows users to define custom metrics for evaluating the performance of learning rate policies, such as the validation loss, accuracy, or computational cost.
The tool provides visualization capabilities to analyze the behavior of different learning rate policies during the training/fine-tuning process, helping users gain insights into the optimization paths and the impact of learning rate values on model performance.
LRBench++ also includes a collection of pre-defined learning rate policies and benchmark datasets, enabling users to quickly compare and evaluate different policies on standard tasks.
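To make the policy categories more concrete, the sketch below shows one reading of what formula-based and state-based policies look like in code. It is not the LRBench++ API: every function, class, and parameter name here is hypothetical, and exploration-based policies are left out.

```python
import math


def step_decay(base_lr, step, drop_every, factor=0.1):
    """Formula-based: the learning rate is a fixed function of the step count."""
    return base_lr * factor ** (step // drop_every)


def cosine_decay(base_lr, step, total_steps):
    """Formula-based: anneal from base_lr towards 0 along half a cosine wave."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))


class ReduceOnPlateau:
    """State-based: the learning rate reacts to the observed validation loss."""

    def __init__(self, base_lr, factor=0.5, patience=2):
        self.lr = base_lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss):
        if val_loss < self.best:
            # Improvement: remember it and reset the patience counter.
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
            if self.bad_evals >= self.patience:
                # No improvement for `patience` evaluations: shrink the rate.
                self.lr *= self.factor
                self.bad_evals = 0
        return self.lr


policy = ReduceOnPlateau(base_lr=0.1)
history = [policy.update(loss) for loss in [1.0, 0.8, 0.8, 0.9, 0.7]]
print(history)
```

A benchmark along these lines would run the same training job under each policy and compare a chosen metric, such as validation loss or total training cost.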
The widely cited 2017 paper "On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima" by Keskar et al. investigates why deep learning models degrade in performance when trained with large batch sizes.
The authors aim to understand the cause of this generalization gap and provide numerical evidence to support their findings.
The authors observe that when training deep learning models with large batch sizes, there is a significant drop in the model's ability to generalise to unseen data (testing accuracy), despite achieving similar performance on the training data as models trained with small batch sizes.
The paper provides numerical evidence that suggests large-batch methods tend to converge to sharp minimisers of the training function, which are characterised by a significant number of large positive eigenvalues in the Hessian matrix.
In contrast, small-batch methods converge to flat minimisers, which have numerous small eigenvalues.
The authors use parametric plots to visualise the loss function landscape around the minimisers obtained by small-batch and large-batch methods.
These plots demonstrate that the large-batch minimisers are significantly sharper than the small-batch minimisers.
To quantify the sharpness of a minimiser, the authors propose a metric based on the maximum value the loss function attains within a small neighbourhood of the minimiser, relative to the loss at the minimiser itself.
They use this metric to compare the sharpness of minimisers obtained by small-batch and large-batch methods, confirming that large-batch methods lead to sharper minimisers.
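A one-dimensional toy version of such a sharpness measure can be sketched as follows. It assumes the form (max f within the neighbourhood − f at the minimiser) / (1 + f at the minimiser) × 100, with the neighbourhood radius scaled by (|x| + 1). The brute-force grid search is a simplification that is only practical for a toy problem, and the two test functions are made up for illustration.

```python
def sharpness(f, x_star, eps=1e-3, grid=201):
    """Largest relative increase of f within a small interval around x_star, in percent."""
    radius = eps * (abs(x_star) + 1.0)
    # Evenly spaced candidate points covering [x_star - radius, x_star + radius].
    candidates = [x_star - radius + 2 * radius * i / (grid - 1) for i in range(grid)]
    worst = max(f(x) for x in candidates)
    return 100.0 * (worst - f(x_star)) / (1.0 + f(x_star))


def flat_loss(x):
    """Hypothetical loss with a wide, flat minimum at 0."""
    return 0.5 * x ** 2


def sharp_loss(x):
    """Hypothetical loss with a narrow, sharp minimum at 0."""
    return 50.0 * x ** 2


flat_score = sharpness(flat_loss, 0.0, eps=0.1)
sharp_score = sharpness(sharp_loss, 0.0, eps=0.1)
print(flat_score, sharp_score)
```

Both functions have the same minimiser and the same minimum value, yet the steeper one scores about a hundred times higher, which is the kind of gap the metric is meant to expose.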
The paper shows that there is a batch size threshold above which the model's generalisation performance drops significantly.
This threshold varies depending on the network architecture and dataset.
The authors hypothesise that the noise in the stochastic gradient used by small-batch methods helps the iterates escape the basins of attraction of sharp minimisers, leading to convergence towards flatter minimisers that generalise better.
They support this hypothesis through experiments involving warm-starting large-batch training with iterates obtained from small-batch training.
Discussion and future directions
The paper discusses the implications of its findings and raises several questions for future research, such as proving that large-batch methods converge to sharp minimisers, understanding the relative density of sharp and flat minima, and designing algorithms or architectures that can steer large-batch methods away from sharp minimisers.
Throughout the paper, the authors provide extensive numerical experiments on various deep learning architectures and datasets to support their claims.
They also explore potential remedies to the generalisation problem of large-batch methods, such as data augmentation, conservative training, and adversarial training, but find that these approaches do not completely solve the issue.
In conclusion, this paper sheds light on the generalisation gap observed in large-batch training for deep learning and provides empirical evidence that convergence to sharp minimisers is a primary cause of this phenomenon. The findings have significant implications for the development of efficient training methods for deep learning models, as large-batch training is crucial for leveraging parallelism and reducing training time.