# Learning Rate Scheduler

Key Considerations with Learning Rate Scheduling in Neural Network Training

### What is the Learning Rate in Deep Learning?

Neural networks have many hyperparameters that affect the model’s performance.

One of the essential hyperparameters is the learning rate (LR), which *determines how much the model weights change between training steps*. In the simplest case, the LR value is a fixed value between 0 and 1.

However, choosing the correct LR value can be challenging.

On the one hand, a large learning rate can help the algorithm converge quickly. But if it is too large, it can cause the algorithm to bounce around the minimum without reaching it, or even jump over it entirely.

On the other hand, a small learning rate can converge more precisely to the minimum. However, if it is too small, the optimizer may take too long to converge or get stuck on a plateau.

**Importance of Learning Rate**

The learning rate is a vital hyperparameter in neural network training. It determines the size of the steps taken during the optimisation process and can significantly impact the convergence and performance of the model.

**Dynamic Adjustment via Scheduling**

Instead of using a fixed learning rate, it's common to *adjust it dynamically during training*. This approach, known as learning rate scheduling, adapts the learning rate based on certain criteria or over time.

**Warmup Period**

Many learning rate schedules start with a warmup period. During this phase, the learning rate increases linearly from a lower initial value to the base learning rate. The warmup period helps in stabilising the training process early on.

**Types of Schedules – Cosine Scheduler**

One popular method is the cosine scheduler, which adjusts the learning rate following a cosine curve. After the warmup period, the learning rate decreases following a cosine pattern, which can help in finer convergence and potentially avoid local minima.

**Configuration and Implementation**

To implement learning rate scheduling, one needs to define a schedule function, which takes into account the total number of training epochs, warmup periods, and other hyperparameters like base learning rate.

This scheduling function is then integrated with the optimizer used in the training process.
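As a sketch, such a schedule function might look like the following. The function name and signature are illustrative, not taken from any particular library; it combines the linear warmup and cosine decay described above:

```python
import math

def cosine_schedule_with_warmup(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Illustrative schedule: linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The returned value would then be written into the optimizer's parameter groups (or passed to the framework's scheduler hook) at each step.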

**Impact on Training Loop**

The training loop must accommodate the dynamic changes in the learning rate. This involves updating the learning rate at each step or epoch according to the schedule.

Adjustments in the training function and other related components (like state initialisation) are necessary to ensure that the dynamically changing learning rate is applied correctly.

**Influence on Training Dynamics and Performance**

A well-designed learning rate schedule can lead to faster convergence, better generalization, and improved overall performance of the model.

The choice of schedule and its parameters should be tailored to the specific characteristics of the training data and the neural network architecture.

In summary, learning rate scheduling is a sophisticated technique to enhance the training of deep neural networks.

It involves starting with a warm-up phase followed by a dynamic adjustment of the learning rate, often following specific patterns like a cosine curve.

This approach requires careful integration into the training loop and has a significant impact on the model's learning dynamics and eventual performance.

A learning rate schedule is used to dynamically adjust the learning rate during training.

Common schedules include stepwise decay, exponential decay, and cosine annealing.

Fine-tuning often benefits from a learning rate schedule that reduces the learning rate over time, ensuring that the model converges to a good solution.

### Types of Learning Rate Schedulers

One solution to help the algorithm converge quickly to an optimum is to use a **learning rate scheduler.**

A learning rate scheduler adjusts the learning rate according to a predefined schedule during the training process.

Usually, the learning rate is set to a higher value at the beginning of the training to allow faster convergence. As the training progresses, the learning rate is reduced to enable convergence to the optimum, leading to better performance. Reducing the learning rate over the training process is also known as annealing or decay.

The number of different learning rate schedulers can be overwhelming.

The documentation below aims to give you an overview of how different pre-defined learning rate schedulers in PyTorch adjust the learning rate during training.

See the PyTorch documentation for more details on the individual learning rate schedulers.

### Learning Rate Adjustment in PyTorch

Learning rate is a crucial hyperparameter in deep learning that determines the step size at which the model's weights are updated during optimisation.

Adjusting the learning rate throughout the training process can significantly impact the model's convergence and performance.

PyTorch provides various learning rate schedulers in the `torch.optim.lr_scheduler` module to dynamically adjust the learning rate based on different strategies.

### General Guidelines

When using learning rate schedulers in PyTorch, it's important to follow these general guidelines:

Apply the learning rate scheduler after the optimizer's update step. This ensures that the learning rate is adjusted based on the updated model weights.

Chain multiple schedulers together to combine their effects. Each scheduler is applied one after the other on the learning rate obtained by the preceding scheduler.

### Adjusting Learning Rate

PyTorch offers several learning rate schedulers that adjust the learning rate based on the number of epochs or iterations.

Here are a few commonly used schedulers:

#### ExponentialLR

The `ExponentialLR` scheduler exponentially decays the learning rate by a factor of `gamma` every epoch.

Here's an example:
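A minimal, self-contained version of the example described in the text; the `nn.Linear` model and the bare 100-epoch loop are placeholders for a real training setup:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(100):
    # ... forward pass, loss computation, loss.backward() ...
    optimizer.step()
    scheduler.step()  # lr <- lr * 0.9
```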

In this example, the learning rate starts at 0.01 and is multiplied by `gamma` (0.9) after each epoch. So, the learning rate decays exponentially over time.

#### MultiStepLR

The `MultiStepLR` scheduler decays the learning rate by a factor of `gamma` at specified milestones during training. Here's an example:
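A minimal sketch matching the values discussed below; the model and loop are placeholders:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import MultiStepLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(100):
    # ... training step ...
    optimizer.step()
    scheduler.step()  # lr is multiplied by 0.1 at epochs 30 and 80
```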

In this example, the learning rate starts at 0.01 and is multiplied by `gamma` (0.1) at epochs 30 and 80. This allows for a step-wise decay of the learning rate at specific points during training.

#### ReduceLROnPlateau

The `ReduceLROnPlateau` scheduler reduces the learning rate when a specified metric (e.g., validation loss) has stopped improving.

This is useful when the model's performance plateaus during training. Here's an example:
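A minimal sketch of the setup discussed below. The constant `val_loss` stands in for a real validation metric; in practice you would compute it on a held-out set each epoch:

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10)

for epoch in range(100):
    # ... training step ...
    optimizer.step()
    val_loss = 1.0  # stand-in for a real validation loss that has plateaued
    scheduler.step(val_loss)  # reduce lr after `patience` epochs without improvement
```

Note that unlike the other schedulers, `ReduceLROnPlateau` takes the monitored metric as an argument to `step()`.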

In this example, the `ReduceLROnPlateau` scheduler monitors the validation loss. If the validation loss does not improve for `patience` (10) epochs, the learning rate is reduced by a factor of `factor` (0.1). This helps the model to fine-tune its parameters when it reaches a plateau.

### Chaining Schedulers

PyTorch allows you to chain multiple learning rate schedulers together to combine their effects.

Each scheduler is applied one after the other on the learning rate obtained by the preceding scheduler. Here's an example:
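A minimal sketch of chaining, using the two schedulers discussed in this document (the model and loop are placeholders):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ExponentialLR, MultiStepLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)
scheduler1 = ExponentialLR(optimizer, gamma=0.9)
scheduler2 = MultiStepLR(optimizer, milestones=[30, 80], gamma=0.1)

for epoch in range(100):
    # ... training step ...
    optimizer.step()
    scheduler1.step()  # exponential decay every epoch
    scheduler2.step()  # additional step-wise decay at the milestones
```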

In this example, the learning rate is first adjusted by the `ExponentialLR` scheduler, and then the resulting learning rate is further adjusted by the `MultiStepLR` scheduler. This allows for more complex learning rate scheduling strategies.

### Conclusion

Learning rate adjustment is a powerful technique to optimize the training process of deep learning models.

PyTorch provides various learning rate schedulers in the `torch.optim.lr_scheduler` module, allowing you to dynamically adjust the learning rate based on different strategies such as exponential decay, step-wise decay, or plateau-based reduction.

By following the general guidelines of applying schedulers after the optimizer's update step and chaining multiple schedulers together, you can fine-tune the learning rate throughout the training process to improve model convergence and performance.

Remember to experiment with different learning rate scheduling strategies and hyperparameters to find the optimal configuration for your specific problem and model architecture.

### Detailed Review of LR Schedulers

### StepLR

`StepLR` is a learning rate scheduler in PyTorch that decays the learning rate by a fixed factor (`gamma`) every specified number of epochs (`step_size`).

It is commonly used to reduce the learning rate at regular intervals during training.

#### How it works

The `StepLR` scheduler is initialised with the following parameters:

- `optimizer`: The optimizer whose learning rate will be adjusted.
- `step_size`: The number of epochs after which the learning rate will be decayed.
- `gamma`: The factor by which the learning rate will be multiplied at each decay step. Default is 0.1.
- `last_epoch`: The index of the last epoch. Default is -1.

During training, after each epoch, you call the `step()` method of the scheduler to update the learning rate. The scheduler checks if the current epoch is a multiple of `step_size`. If it is, the learning rate of each parameter group in the optimizer is multiplied by `gamma`. The updated learning rate is used for the next epoch.

#### Example

Here's an example of how to use the `StepLR` scheduler:
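A minimal, self-contained version of the example described below (the model and the bare loop are placeholders for a real training setup):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... train for one epoch ...
    optimizer.step()
    scheduler.step()  # lr is multiplied by 0.1 every 30 epochs
```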

In this example:

The initial learning rate is set to 0.1.

The `StepLR` scheduler is created with `step_size=30` and `gamma=0.1`. During training, after every 30 epochs, the learning rate will be multiplied by 0.1.

So, the learning rate will be:

- 0.1 for epochs 0 to 29
- 0.01 for epochs 30 to 59
- 0.001 for epochs 60 to 89
- ...and so on

### Conclusion

The `StepLR` scheduler is a simple yet effective way to decay the learning rate at regular intervals during training. By adjusting the learning rate, it can help improve the convergence and generalization of your model. Experiment with different `step_size` and `gamma` values to find the optimal settings for your specific problem.

### ConstantLR in PyTorch

The `ConstantLR` scheduler in PyTorch is an unusual learning rate scheduler that multiplies the learning rate of each parameter group by a small constant factor until a pre-defined milestone (`total_iters`) is reached.

After reaching the milestone, the learning rate returns to its base value and remains constant for the rest of the training.

#### Why Use a Learning Rate Scheduler That Goes Up?

In most cases, learning rate schedulers are used to gradually decrease the learning rate over the course of training.

This is based on the idea that as the model converges towards a minimum, smaller learning rates allow for fine-tuning and prevent overshooting the optimal solution.

However, there are some scenarios where increasing the learning rate can be beneficial:

**Warmup:** In some cases, starting with a very low learning rate and gradually increasing it can help the model converge faster and reach a better solution. This is known as a warmup phase. The `ConstantLR` scheduler can be used to implement a warmup by setting the `factor` to a value less than 1 and `total_iters` to the number of warmup steps.

**Escaping Local Minima:** If a model gets stuck in a suboptimal local minimum during training, increasing the learning rate can help it escape and explore other regions of the parameter space. By temporarily increasing the learning rate, the model can potentially jump out of the local minimum and find a better solution.

**Cyclical Learning Rates:** Some advanced learning rate scheduling techniques involve alternating between high and low learning rates in a cyclical manner. The idea is that the high learning rates allow for exploration, while the low learning rates allow for fine-tuning. The `ConstantLR` scheduler can be used as a building block to create such cyclical schedules.

### How ConstantLR Works

The `ConstantLR` scheduler is initialized with the following parameters:

- `optimizer`: The optimizer whose learning rate will be adjusted.
- `factor`: The constant factor by which the learning rate will be multiplied until the milestone. Default is 1/3.
- `total_iters`: The number of steps (epochs) for which the learning rate will be multiplied by the factor. Default is 5.
- `last_epoch`: The index of the last epoch. Default is -1.

During training, after each epoch, you call the `step()` method of the scheduler to update the learning rate. If the current epoch is less than `total_iters`, the learning rate of each parameter group in the optimizer is multiplied by the `factor`. Once the current epoch reaches `total_iters`, the multiplication stops, and the learning rate returns to its base value for the subsequent epochs.

#### Example

Here's an example of how to use the `ConstantLR` scheduler:
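A minimal, self-contained version of the example described below (the model and loop are placeholders):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ConstantLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.05)
scheduler = ConstantLR(optimizer, factor=0.5, total_iters=4)

# Epochs 0-3 run at 0.05 * 0.5 = 0.025; from epoch 4 on, lr is back to 0.05.
for epoch in range(10):
    # ... training step ...
    optimizer.step()
    scheduler.step()
```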

In this example:

The initial learning rate is set to 0.05.

The `ConstantLR` scheduler is created with `factor=0.5` and `total_iters=4`. During training:

- For epochs 0 to 3, the learning rate is multiplied by the factor (0.5), reducing it to 0.025.
- For epochs 4 and beyond, the learning rate remains constant at 0.05.

### Conclusion

The `ConstantLR` scheduler in PyTorch is an unusual learning rate scheduler that multiplies the learning rate by a constant factor until a pre-defined milestone is reached. While decreasing learning rates are more common, there are scenarios where increasing the learning rate can be beneficial, such as warmup phases, escaping local minima, or implementing cyclical learning rate schedules.

When using the `ConstantLR` scheduler, it's important to carefully choose the `factor` and `total_iters` values based on your specific problem and training dynamics. Experimentation and monitoring of the model's performance are crucial to determine if increasing the learning rate is indeed beneficial for your task.

Keep in mind that the `ConstantLR` scheduler is just one tool in the toolbox of learning rate scheduling techniques. It can be combined with other schedulers or used as a building block for more complex scheduling strategies. The choice of learning rate scheduler depends on the characteristics of your problem, the model architecture, and the desired training behavior.

### CosineAnnealingLR in PyTorch

The `CosineAnnealingLR` scheduler in PyTorch adjusts the learning rate of each parameter group using a cosine annealing schedule. It is based on the idea of gradually decreasing the learning rate over the course of training, following a cosine function.

#### Why CosineAnnealingLR is Popular

The `CosineAnnealingLR` scheduler has gained popularity for several reasons:

**Smooth Learning Rate Decay:** The cosine annealing schedule provides a smooth and gradual decrease in the learning rate. This allows the model to fine-tune its parameters as it approaches the end of training, potentially leading to better convergence and generalization.

**Improved Convergence:** By gradually reducing the learning rate, the `CosineAnnealingLR` scheduler helps the model converge to a good solution. The decreasing learning rate allows the model to take smaller steps towards the minimum of the loss function, reducing the risk of overshooting or oscillating around the minimum.

**Automatic Learning Rate Adjustment:** The `CosineAnnealingLR` scheduler automatically adjusts the learning rate based on the number of iterations or epochs. This eliminates the need for manual learning rate tuning and makes it easier to use in practice.

**Cyclic Learning Rates:** The cosine annealing schedule can be extended to implement cyclic learning rates. By periodically resetting the learning rate to a higher value and then annealing it again, the model can escape from local minima and explore different regions of the parameter space. This can lead to better generalization and robustness.

### How CosineAnnealingLR Works

The `CosineAnnealingLR` scheduler adjusts the learning rate based on the following equation:

`η_t = η_min + (1/2) · (η_max − η_min) · (1 + cos(π · T_cur / T_max))`

Where:

- `η_t` is the learning rate at the current iteration `t`.
- `η_min` is the minimum learning rate (specified by the `eta_min` argument).
- `η_max` is the initial learning rate (set to the learning rate of the optimizer).
- `T_cur` is the number of iterations since the last restart.
- `T_max` is the maximum number of iterations (specified by the `T_max` argument).

#### The scheduler works as follows

The learning rate starts at `η_max` and gradually decreases following the cosine function. At each iteration, the learning rate is updated based on the above equation. When `T_cur` reaches `T_max`, the learning rate is reset to `η_max`, and the cycle repeats.

The `CosineAnnealingLR` scheduler provides a smooth and periodic decay of the learning rate, allowing the model to fine-tune its parameters and potentially escape from suboptimal solutions.

### Example Usage

Here's an example of how to use the `CosineAnnealingLR` scheduler in PyTorch:
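A minimal, self-contained version of the example described below (the model and loop are placeholders for a real training setup):

```python
import torch
from torch import nn, optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=0.001)

for epoch in range(100):
    # ... training step ...
    optimizer.step()
    scheduler.step()  # lr follows the cosine curve from 0.1 down to 0.001
```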

In this example, the initial learning rate is set to 0.1, and the `CosineAnnealingLR` scheduler is created with `T_max=100` and `eta_min=0.001`.

During training, the learning rate will be adjusted based on the cosine annealing schedule, starting from 0.1 and gradually decreasing towards 0.001 over the course of 100 iterations.

### Conclusion

The `CosineAnnealingLR` scheduler is a popular choice for adjusting the learning rate during training due to its smooth and periodic decay, improved convergence, and automatic learning rate adjustment.

By gradually reducing the learning rate following a cosine function, it allows the model to fine-tune its parameters and potentially escape from suboptimal solutions.

When using the `CosineAnnealingLR` scheduler, it's important to choose appropriate values for `T_max` and `eta_min` based on your specific problem and training dynamics.

Experimentation and monitoring of the model's performance are crucial to determine the optimal settings.

Keep in mind that the `CosineAnnealingLR` scheduler is just one of many learning rate scheduling techniques available in PyTorch. Depending on your problem and desired training behaviour, other schedulers like `StepLR`, `MultiStepLR`, or custom schedulers may be more suitable.
