Learning Rate Scheduler

Key Considerations with Learning Rate Scheduling in Neural Network Training

What is the Learning Rate in Deep Learning?

Neural networks have many hyperparameters that affect the model’s performance.

One of the essential hyperparameters is the learning rate (LR), which determines how much the model weights change between training steps. In the simplest case, the LR value is a fixed value between 0 and 1.

However, choosing the correct LR value can be challenging.

On the one hand, a large learning rate can help the algorithm converge quickly. If it is too large, however, the algorithm may bounce around the minimum without reaching it, or even overshoot it and diverge.

On the other hand, a small learning rate lets the optimizer settle more precisely into the minimum. If it is too small, however, the optimizer may take too long to converge or get stuck on a plateau.
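
To make the trade-off concrete, here is a minimal sketch that runs plain gradient descent on the toy function f(w) = w². The function, the starting point, and the four learning rates are illustrative assumptions, not values from any real training run.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimise f(w) = w**2 with plain gradient descent; return every visited w."""
    path = [w0]
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # standard gradient descent update
        path.append(w)
    return path


for lr in (0.01, 0.4, 0.95, 1.1):
    final = gradient_descent(lr)[-1]
    print(f"lr={lr}: |w| after 20 steps = {abs(final):.4g}")
```

On this toy problem the smallest rate is still far from the minimum after 20 steps, the moderate rate converges almost immediately, a rate of 0.95 flips sign on every step while shrinking only slowly, and a rate of 1.1 moves further away on every step.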

Importance of Learning Rate

The learning rate is a vital hyperparameter in neural network training. It determines the size of the steps taken during the optimisation process and can significantly impact the convergence and performance of the model.

Dynamic Adjustment via Scheduling

Instead of using a fixed learning rate, it's common to adjust it dynamically during training. This approach, known as learning rate scheduling, adapts the learning rate based on certain criteria or over time.

Warmup Period

Many learning rate schedules start with a warmup period. During this phase, the learning rate increases linearly from a lower initial value to the base learning rate. The warmup period helps in stabilising the training process early on.

Types of Schedules – Cosine Scheduler

One popular method is the cosine scheduler, which adjusts the learning rate following a cosine curve. After the warmup period, the learning rate decreases following a cosine pattern, which can help in finer convergence and potentially avoid local minima.
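
As a rough illustration of warmup followed by cosine decay, the schedule can be written as a plain function of the step index. The function name, its parameters, and the choice to reach the base learning rate on the last warmup step are assumptions made for this sketch; libraries differ in these details.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at `step` (0-indexed): linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 4, 9, 10, 55, 100):
    print(step, warmup_cosine_lr(step, base_lr=0.1, warmup_steps=10, total_steps=100))
```

With these illustrative settings the rate climbs from 0.01 to 0.1 during the first ten steps, is roughly half the base rate midway through the decay phase, and reaches min_lr at the final step.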

Configuration and Implementation

To implement learning rate scheduling, one needs to define a schedule function, which takes into account the total number of training epochs, warmup periods, and other hyperparameters like base learning rate.

This scheduling function is then integrated with the optimizer used in the training process.

Impact on Training Loop

The training loop must accommodate the dynamic changes in the learning rate. This involves updating the learning rate at each step or epoch according to the schedule.
Adjustments in the training function and other related components (like state initialisation) are necessary to ensure that the dynamically changing learning rate is applied correctly.
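
One way to wire such a schedule into a PyTorch training loop is LambdaLR, which multiplies the optimizer's base learning rate by whatever factor a user-supplied function returns. The sketch below is only an illustration: the toy linear model, the random batches, and the warmup_steps and total_steps values are assumptions, and the schedule is advanced once per training step rather than once per epoch.

```python
import math

import torch
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 1)                      # toy model, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

warmup_steps, total_steps = 5, 50                  # illustrative values


def lr_multiplier(step):
    """Factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps           # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))  # cosine decay


scheduler = LambdaLR(optimizer, lr_lambda=lr_multiplier)

lr_history = []
for step in range(total_steps):
    inputs, targets = torch.randn(8, 4), torch.randn(8, 1)   # random stand-in batch
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()
    optimizer.step()                                # update weights with the current rate
    lr_history.append(optimizer.param_groups[0]["lr"])
    scheduler.step()                                # then advance the schedule
```

Recording the rate from optimizer.param_groups, as done here, is a simple way to check that the schedule actually being applied matches the intended one.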

Influence on Training Dynamics and Performance

A well-designed learning rate schedule can lead to faster convergence, better generalization, and improved overall performance of the model.
The choice of schedule and its parameters should be tailored to the specific characteristics of the training data and the neural network architecture.

In summary, learning rate scheduling is a sophisticated technique to enhance the training of deep neural networks.

It involves starting with a warm-up phase followed by a dynamic adjustment of the learning rate, often following specific patterns like a cosine curve.

This approach requires careful integration into the training loop and has a significant impact on the model's learning dynamics and eventual performance.

A learning rate schedule is used to adjust the learning rate during training dynamically.

Common schedules include stepwise decay, exponential decay, and cosine annealing.

Fine-tuning often benefits from a learning rate schedule that reduces the learning rate over time, ensuring that the model converges to a good solution.

Types of Learning Rate Schedulers

One solution to help the algorithm converge quickly to an optimum is to use a learning rate scheduler.

A learning rate scheduler adjusts the learning rate according to a predefined schedule during the training process.

Usually, the learning rate is set to a higher value at the beginning of training to allow faster convergence. As training progresses, the learning rate is reduced so that the optimizer can settle into the optimum, which tends to give better performance. Reducing the learning rate over the training process is also known as annealing or decay.

The number of available learning rate schedulers can be overwhelming.

The sections below aim to give you an overview of how different pre-defined learning rate schedulers in PyTorch adjust the learning rate during training.

See the PyTorch documentation for more details on the learning rate schedulers.

Learning Rate Adjustment in PyTorch

Learning rate is a crucial hyperparameter in deep learning that determines the step size at which the model's weights are updated during optimisation.

Adjusting the learning rate throughout the training process can significantly impact the model's convergence and performance.

PyTorch provides various learning rate schedulers in the torch.optim.lr_scheduler module to dynamically adjust the learning rate based on different strategies.

General Guidelines

When using learning rate schedulers in PyTorch, it's important to follow these general guidelines:

- Apply the learning rate scheduler after the optimizer's update step, that is, call scheduler.step() after optimizer.step(). If the scheduler is stepped first, the first value of the schedule is skipped.
- Chain multiple schedulers together to combine their effects. Each scheduler is applied one after the other on the learning rate obtained by the preceding scheduler.

Adjusting Learning Rate

PyTorch offers several learning rate schedulers that adjust the learning rate based on the number of epochs or iterations.

Here are a few commonly used schedulers:

ExponentialLR

The ExponentialLR scheduler exponentially decays the learning rate by a factor of gamma every epoch.

Here's an example:

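from torch import optim
from torch.optim.lr_scheduler import ExponentialLR

# model, dataset, loss_fn and num_epochs are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(num_epochs):
    for inputs, target in dataset:
        optimizer.zero_grad()
        output = model(inputs)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()  # once per epoch, after the optimizer updates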

In this example, the learning rate starts at 0.01 and is multiplied by gamma (0.9) after each epoch. So, the learning rate decays exponentially over time.
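
The resulting values can be worked out without PyTorch, since the rate after a given number of epochs is simply the initial rate times gamma raised to that number. The helper name below is hypothetical and exists only for this sketch.

```python
def exponential_lr(base_lr, gamma, epoch):
    """Learning rate in effect during `epoch` (0-indexed) under exponential decay."""
    return base_lr * gamma ** epoch


for epoch in (0, 1, 2, 10):
    print(epoch, exponential_lr(0.01, 0.9, epoch))
```

With lr=0.01 and gamma=0.9, the rate is 0.01 in epoch 0, 0.009 in epoch 1, 0.0081 in epoch 2, and about 0.0035 by epoch 10.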

MultiStepLR

The MultiStepLR scheduler decays the learning rate by a factor of gamma at specified milestones during training. Here's an example:

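from torch import optim
from torch.optim.lr_scheduler import MultiStepLR

# model, dataset, loss_fn and num_epochs are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(num_epochs):
    for inputs, target in dataset:
        optimizer.zero_grad()
        output = model(inputs)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler.step()  # once per epoch, after the optimizer updates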

In this example, the learning rate starts at 0.01 and is multiplied by gamma (0.1) at epochs 30 and 80. This allows for a step-wise decay of the learning rate at specific points during training.
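
The same arithmetic can be sketched in plain Python: the rate in a given epoch is the initial rate times gamma once for every milestone already reached. The helper name is hypothetical.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate during `epoch`: one factor of gamma per milestone already reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


for epoch in (0, 29, 30, 79, 80):
    print(epoch, multistep_lr(0.01, [30, 80], 0.1, epoch))
```

With the settings from the example, the rate is 0.01 up to epoch 29, 0.001 from epoch 30 to 79, and 0.0001 from epoch 80 onwards.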

ReduceLROnPlateau

The ReduceLROnPlateau scheduler reduces the learning rate when a specified metric (e.g., validation loss) has stopped improving.

This is useful when the model's performance plateaus during training. Here's an example:

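from torch import optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# model, dataset, loss_fn, num_epochs and validate are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

for epoch in range(num_epochs):
    for inputs, target in dataset:
        optimizer.zero_grad()
        output = model(inputs)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()

    val_loss = validate(model)
    scheduler.step(val_loss)  # pass the monitored metric to the scheduler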

In this example, the ReduceLROnPlateau scheduler monitors the validation loss. If the validation loss has not improved for more than patience (10) consecutive epochs, the learning rate is multiplied by factor (0.1). This helps the model to fine-tune its parameters when it reaches a plateau.
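
The plateau logic can be approximated with a small stand-alone class. This is a deliberately simplified sketch, not PyTorch's implementation: the class name is hypothetical, and the threshold, cooldown, and min_lr options of the real scheduler are left out.

```python
class PlateauTracker:
    """Simplified stand-in for plateau-based decay (no threshold, cooldown, or min_lr)."""

    def __init__(self, lr, factor=0.1, patience=10):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss          # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1          # no improvement this epoch
        if self.bad_epochs > self.patience:
            self.lr *= self.factor        # plateau detected: shrink the rate
            self.bad_epochs = 0
        return self.lr


tracker = PlateauTracker(lr=0.01, factor=0.1, patience=2)
losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]   # made-up validation losses
lrs = [tracker.step(loss) for loss in losses]
print(lrs)
```

With patience=2, the first two epochs without improvement are tolerated and the rate drops from 0.01 to 0.001 on the third.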

Chaining Schedulers

PyTorch allows you to chain multiple learning rate schedulers together to combine their effects.

Each scheduler is applied one after the other on the learning rate obtained by the preceding scheduler. Here's an example:

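from torch import optim
from torch.optim.lr_scheduler import ExponentialLR, MultiStepLR

# model, dataset, loss_fn and num_epochs are assumed to be defined elsewhere
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(num_epochs):
    for inputs, target in dataset:
        optimizer.zero_grad()
        output = model(inputs)
        loss = loss_fn(output, target)
        loss.backward()
        optimizer.step()
    scheduler1.step()  # both schedulers act on the same optimizer
    scheduler2.step()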

In this example, the learning rate is first adjusted by the ExponentialLR scheduler, and then the resulting learning rate is further adjusted by the MultiStepLR scheduler. This allows for more complex learning rate scheduling strategies.
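
Because both schedulers multiply the same underlying rate, the combined effect is the product of the two individual factors. The hypothetical helper below sketches that arithmetic in plain Python, with defaults taken from the example above.

```python
from bisect import bisect_right


def chained_lr(base_lr, epoch, exp_gamma=0.9, milestones=(30, 80), step_gamma=0.1):
    """Learning rate during `epoch` when exponential and multi-step decay are combined."""
    exp_factor = exp_gamma ** epoch                              # ExponentialLR part
    step_factor = step_gamma ** bisect_right(milestones, epoch)  # MultiStepLR part
    return base_lr * exp_factor * step_factor


for epoch in (0, 1, 29, 30, 80):
    print(epoch, chained_lr(0.01, epoch))
```

The rate shrinks by 10% every epoch and additionally drops by a factor of ten at epochs 30 and 80.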

Conclusion

Learning rate adjustment is a powerful technique to optimize the training process of deep learning models.

PyTorch provides various learning rate schedulers in the torch.optim.lr_scheduler module, allowing you to dynamically adjust the learning rate based on different strategies such as exponential decay, step-wise decay, or plateau-based reduction.

By following the general guidelines of applying schedulers after the optimizer's update step and chaining multiple schedulers together, you can fine-tune the learning rate throughout the training process to improve model convergence and performance.

Remember to experiment with different learning rate scheduling strategies and hyperparameters to find the optimal configuration for your specific problem and model architecture.

Detailed Review of LR Schedulers

StepLR

StepLR is a learning rate scheduler in PyTorch that decays the learning rate by a fixed factor (gamma) every specified number of epochs (step_size).

It is commonly used to reduce the learning rate at regular intervals during training.

How it works

The StepLR scheduler is initialised with the following parameters:
- optimizer: The optimizer whose learning rate will be adjusted.
- step_size: The number of epochs after which the learning rate will be decayed.
- gamma: The factor by which the learning rate will be multiplied at each decay step. Default is 0.1.
- last_epoch: The index of the last epoch. Default is -1.

During training, after each epoch, you call the step() method of the scheduler to update the learning rate.
The scheduler checks if the current epoch is a multiple of step_size. If it is, the learning rate of each parameter group in the optimizer is multiplied by gamma.
The updated learning rate is used for the next epoch.

Example

Here's an example of how to use the StepLR scheduler:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# Create your model and optimizer
model = ...
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Create the StepLR scheduler
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Training loop
for epoch in range(100):
    # Train your model
    train(...)
    
    # Validate your model
    validate(...)
    
    # Update the learning rate
    scheduler.step()

In this example:

The initial learning rate is set to 0.1.
The StepLR scheduler is created with step_size=30 and gamma=0.1.
During training, after every 30 epochs, the learning rate will be multiplied by 0.1.
So, the learning rate will be:
- 0.1 for epochs 0 to 29
- 0.01 for epochs 30 to 59
- 0.001 for epochs 60 to 89
- ...and so on
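
The values in this list follow from a one-line formula: the initial rate times gamma raised to the number of completed step_size blocks. The helper name is hypothetical.

```python
def step_lr(base_lr, step_size, gamma, epoch):
    """Learning rate during `epoch` (0-indexed) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 29, 30, 59, 60, 90):
    print(epoch, step_lr(0.1, step_size=30, gamma=0.1, epoch=epoch))
```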

Conclusion

The StepLR scheduler is a simple yet effective way to decay the learning rate at regular intervals during training. By adjusting the learning rate, it can help improve the convergence and generalization of your model. Experiment with different step_size and gamma values to find the optimal settings for your specific problem.

ConstantLR in PyTorch

The ConstantLR scheduler in PyTorch is an unusual learning rate scheduler that scales the learning rate of each parameter group by a small constant factor until a pre-defined milestone (total_iters) is reached.

After reaching the milestone, the learning rate returns to the optimizer's base value and stays there for the rest of the training.

Why Use a Learning Rate Scheduler That Goes Up?

In most cases, learning rate schedulers are used to gradually decrease the learning rate over the course of training.

This is based on the idea that as the model converges towards a minimum, smaller learning rates allow for fine-tuning and prevent overshooting the optimal solution.

However, there are some scenarios where increasing the learning rate can be beneficial:

Warmup: In some cases, starting with a very low learning rate and increasing it can help the model converge faster and reach a better solution. This is known as a warmup phase. The ConstantLR scheduler can be used to implement a simple stepped warmup by setting the factor to a value less than 1 and total_iters to the number of warmup steps: the rate is held at the reduced value and then jumps to the base value. For a gradual linear ramp, PyTorch provides the LinearLR scheduler.

Escaping Local Minima: If a model gets stuck in a suboptimal local minimum during training, increasing the learning rate can help it escape and explore other regions of the parameter space. By temporarily increasing the learning rate, the model can potentially jump out of the local minimum and find a better solution.

Cyclical Learning Rates: Some advanced learning rate scheduling techniques involve alternating between high and low learning rates in a cyclical manner. The idea is that the high learning rates allow for exploration, while the low learning rates allow for fine-tuning. The ConstantLR scheduler can be used as a building block to create such cyclical schedules.

How ConstantLR Works

The ConstantLR scheduler is initialized with the following parameters:
- optimizer: The optimizer whose learning rate will be adjusted.
- factor: The constant factor by which the learning rate will be scaled until the milestone. Default is 1./3.
- total_iters: The number of steps (epochs) during which the learning rate is scaled by the factor. Default is 5.
- last_epoch: The index of the last epoch. Default is -1.

During training, after each epoch, you call the step() method of the scheduler to update the learning rate.
While the current epoch is less than total_iters, the learning rate of each parameter group is held at the base learning rate multiplied by the factor. The factor is applied once, not compounded every epoch.
Once the current epoch reaches total_iters, the scaling is removed and the learning rate returns to the base value for the subsequent epochs.

Example

Here's an example of how to use the ConstantLR scheduler:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ConstantLR

# Create your model and optimizer
model = ...
optimizer = optim.SGD(model.parameters(), lr=0.05)

# Create the ConstantLR scheduler
scheduler = ConstantLR(optimizer, factor=0.5, total_iters=4)

# Training loop
for epoch in range(100):
    # Train your model
    train(...)
    
    # Validate your model
    validate(...)
    
    # Update the learning rate
    scheduler.step()

In this example:

The initial learning rate is set to 0.05.
The ConstantLR scheduler is created with factor=0.5 and total_iters=4.
During training:
- For epochs 0 to 3, the learning rate is the base value multiplied by the factor (0.5), which gives 0.025.
- For epochs 4 and beyond, the learning rate returns to the base value of 0.05 and stays there.
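
Expressed as a plain function, the schedule has only two values. The helper name is hypothetical and the sketch mirrors the example above rather than PyTorch's internals.

```python
def constant_lr(base_lr, factor, total_iters, epoch):
    """Learning rate during `epoch`: scaled before the milestone, base value afterwards."""
    return base_lr * factor if epoch < total_iters else base_lr


for epoch in range(6):
    print(epoch, constant_lr(0.05, factor=0.5, total_iters=4, epoch=epoch))
```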

Conclusion

The ConstantLR scheduler in PyTorch is an unusual learning rate scheduler that multiplies the learning rate by a constant factor until a pre-defined milestone is reached. While decreasing learning rates are more common, there are scenarios where increasing the learning rate can be beneficial, such as warmup phases, escaping local minima, or implementing cyclical learning rate schedules.

When using the ConstantLR scheduler, it's important to carefully choose the factor and total_iters values based on your specific problem and training dynamics. Experimentation and monitoring of the model's performance are crucial to determine if increasing the learning rate is indeed beneficial for your task.

Keep in mind that the ConstantLR scheduler is just one tool in the toolbox of learning rate scheduling techniques.

It can be combined with other schedulers or used as a building block for more complex scheduling strategies. The choice of learning rate scheduler depends on the characteristics of your problem, the model architecture, and the desired training behavior.

CosineAnnealingLR in PyTorch

The CosineAnnealingLR scheduler in PyTorch adjusts the learning rate of each parameter group using a cosine annealing schedule. It is based on the idea of gradually decreasing the learning rate over the course of training, following a cosine function.

Why CosineAnnealingLR is Popular

The CosineAnnealingLR scheduler has gained popularity for several reasons:

Smooth Learning Rate Decay: The cosine annealing schedule provides a smooth and gradual decrease in the learning rate. This allows the model to fine-tune its parameters as it approaches the end of training, potentially leading to better convergence and generalization.

Improved Convergence: By gradually reducing the learning rate, the CosineAnnealingLR scheduler helps the model converge to a good solution. The decreasing learning rate allows the model to take smaller steps towards the minimum of the loss function, reducing the risk of overshooting or oscillating around the minimum.

Automatic Learning Rate Adjustment: The CosineAnnealingLR scheduler automatically adjusts the learning rate based on the number of iterations or epochs. Once T_max and eta_min are chosen, no manual intervention is needed during training, although the initial learning rate still has to be tuned.

Cyclic Learning Rates: The cosine annealing schedule can be extended to implement cyclic learning rates. By periodically resetting the learning rate to a higher value and then annealing it again, the model can escape from local minima and explore different regions of the parameter space. This can lead to better generalization and robustness.
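
A cyclic variant can be sketched by restarting the cosine curve every fixed number of epochs. The helper below is a hypothetical plain-Python illustration with a constant period; in PyTorch this behaviour is provided by the separate CosineAnnealingWarmRestarts scheduler rather than by CosineAnnealingLR.

```python
import math


def cosine_warm_restart_lr(base_lr, eta_min, period, epoch):
    """Cosine decay that jumps back to base_lr every `period` epochs."""
    t_cur = epoch % period                      # epochs since the last restart
    cosine = math.cos(math.pi * t_cur / period)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


for epoch in (0, 5, 9, 10, 15, 20):
    print(epoch, cosine_warm_restart_lr(0.1, 0.001, period=10, epoch=epoch))
```

With a period of 10, the rate decays from 0.1 towards 0.001 over epochs 0 to 9 and jumps back to 0.1 at epochs 10 and 20.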

How CosineAnnealingLR Works

The CosineAnnealingLR scheduler adjusts the learning rate based on the following equation:

η_t = η_min + (1/2) * (η_max - η_min) * (1 + cos(T_cur / T_max * π))

Where:

η_t is the learning rate at the current iteration t.
η_min is the minimum learning rate (specified by the eta_min argument).
η_max is the initial learning rate (set to the learning rate of the optimizer).
T_cur is the number of iterations (scheduler steps) completed so far.
T_max is the maximum number of iterations (specified by the T_max argument).

The scheduler works as follows

The learning rate starts at η_max and gradually decreases following the cosine function.
At each iteration, the learning rate is updated based on the above equation.
When T_cur reaches T_max, the learning rate has decayed to η_min. CosineAnnealingLR itself does not restart: if training continues past T_max, the learning rate follows the cosine back up towards η_max. For periodic restarts, PyTorch provides the separate CosineAnnealingWarmRestarts scheduler.

The CosineAnnealingLR scheduler provides a smooth decay of the learning rate, allowing the model to fine-tune its parameters towards the end of training.
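
The equation can be evaluated directly to see the shape of the curve. The function below is a plain-Python sketch of the formula, not PyTorch's internal implementation, and the values η_max = 0.1, η_min = 0.001, and T_max = 100 are illustrative.

```python
import math


def cosine_annealing_lr(eta_max, eta_min, t_cur, t_max):
    """Evaluate the cosine annealing formula at step `t_cur`."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))


for t in (0, 25, 50, 75, 100):
    print(t, cosine_annealing_lr(0.1, 0.001, t, 100))
```

The rate starts at η_max, passes the midpoint between η_max and η_min halfway through, and reaches η_min at T_max. The decay is slow at both ends and fastest in the middle.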

Example Usage

Here's an example of how to use the CosineAnnealingLR scheduler in PyTorch:

import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# Create your model and optimizer
model = ...
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Create the CosineAnnealingLR scheduler
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)

# Training loop
for epoch in range(100):
    # Train your model
    train(...)
    
    # Update the learning rate
    scheduler.step()

In this example, the initial learning rate is set to 0.1, and the CosineAnnealingLR scheduler is created with T_max=100 and eta_min=0.001.

During training, the learning rate will be adjusted based on the cosine annealing schedule, starting from 0.1 and gradually decreasing towards 0.001 over the course of the 100 epochs, since scheduler.step() is called once per epoch.

Conclusion

The CosineAnnealingLR scheduler is a popular choice for adjusting the learning rate during training due to its smooth decay, improved convergence, and automatic learning rate adjustment.

By gradually reducing the learning rate following a cosine function, it allows the model to fine-tune its parameters as training approaches its end.

When using the CosineAnnealingLR scheduler, it's important to choose appropriate values for T_max and eta_min based on your specific problem and training dynamics.

Experimentation and monitoring of the model's performance are crucial to determine the optimal settings.

Keep in mind that the CosineAnnealingLR scheduler is just one of many learning rate scheduling techniques available in PyTorch.

Depending on your problem and desired training behaviour, other schedulers like StepLR, MultiStepLR, or custom schedulers may be more suitable.
