Optimiser
The choice of optimiser can impact the speed and stability of the fine-tuning process.
Popular optimisers for fine-tuning include Adam, AdamW, and SGD with momentum.
Each optimiser has its own hyperparameters, such as momentum and the weight decay rate, which may need to be adjusted for the specific task and model.
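As a rough illustration of how these hyperparameters are exposed in practice, the sketch below constructs the three optimisers with PyTorch; the placeholder model and all values are illustrative, not recommendations.

```python
import torch
from torch import nn, optim

# Placeholder model; any nn.Module works the same way.
model = nn.Linear(768, 2)

# Illustrative hyperparameter values only; good settings depend on the task and model.
adam  = optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.999))
adamw = optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
sgd   = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, weight_decay=0.01)
```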
The AdamW optimiser is a variant of the standard Adam optimiser and is widely used when training deep neural networks. It is particularly effective for large models because of the way it handles weight decay.
Adam stands for "Adaptive Moment Estimation". It combines the advantages of two other popular optimisers, AdaGrad and RMSProp. Adam maintains a per-parameter learning rate and adapts it using estimates of the first moment of the gradients (a running mean, capturing the direction and magnitude of recent changes) and the second moment (a running mean of the squared gradients, capturing how large recent gradients have been).
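In symbols, a minimal sketch of Adam's per-parameter update at step t looks like the following, where g_t is the gradient, β₁ and β₂ are the decay rates of the two moment estimates, α is the learning rate, ε is a small stabilising constant, and the hats denote the bias-corrected estimates discussed later in this section:

```latex
% Running estimates of the gradient moments at step t (g_t is the gradient):
m_t = \beta_1\, m_{t-1} + (1 - \beta_1)\, g_t        % first moment: mean of recent gradients
v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, g_t^2      % second moment: mean of recent squared gradients

% Per-parameter update with learning rate \alpha and small constant \epsilon:
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
```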
What is the AdamW optimiser?
AdamW is an optimisation algorithm that is based on the popular Adam optimiser, which is commonly used for deep learning applications.
The "W" in AdamW stands for "Weight decay," which is a technique used to prevent overfitting in machine learning models.
Weight decay involves adding a regularisation term to the loss function that penalises large weight values. This helps to prevent the model from overfitting to the training data by encouraging it to learn simpler patterns that generalise better to new data.
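For example, with an L2 penalty the regularised objective takes roughly the following form, where L is the original loss, θ are the model weights, and λ controls the strength of the penalty:

```latex
L_{\text{total}}(\theta) = L(\theta) + \frac{\lambda}{2}\,\lVert \theta \rVert^2
```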
The AdamW optimiser modifies Adam by decoupling weight decay from the gradient-based update: instead of folding an L2 penalty into the gradients (which standard Adam then rescales with its adaptive learning rates), the decay is applied directly to the model parameters in the update rule. This has been shown to improve the performance of deep learning models, particularly when the data is noisy or the model architecture is complex.
Like Adam, AdamW also applies a bias correction to the estimates of the first and second moments of the gradients, compensating for the fact that these running averages are initialised at zero. This helps the optimiser take sensibly scaled steps early in training and converge more quickly and reliably.
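A sketch of the decoupled update, in the spirit of Loshchilov and Hutter's "Decoupled Weight Decay Regularization": the decay term λθ is applied to the weights directly, outside the adaptive rescaling by the moment estimates:

```latex
\theta_t = \theta_{t-1} - \alpha \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
```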
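The bias-corrected estimates used in the updates above are computed as follows, where t is the current step; the correction compensates for the moment estimates being initialised at zero and therefore biased towards zero early in training:

```latex
\hat{m}_t = \frac{m_t}{1 - \beta_1^{\,t}}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^{\,t}}
```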
Overall, the AdamW optimiser is a powerful tool for optimising deep learning models, and it has been shown to be effective in a wide range of applications. It is commonly used for tasks such as image classification, object detection, and natural language processing.