Learning rate decay

A technique to gradually reduce the Learning rate of Gradient Descent over the course of training

We start with a relatively high learning rate so the model converges faster, then reduce it as training progresses.

The idea is to bring the model closer to a good solution while helping to prevent overfitting, since a smaller learning rate stops the model from making parameter updates that are too large.

One example would be to reduce the learning rate by a fixed factor at regular intervals, i.e. α ← kα every N iterations, where k is a decay factor between 0 and 1.
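A minimal Python sketch of this step-decay rule, assuming an illustrative starting rate alpha0, decay factor k, and interval N (the function name step_decay and all values are placeholders, not from a specific library):

```python
def step_decay(alpha0, k, N, iteration):
    """Learning rate after step decay: alpha0 is multiplied by k once every N iterations."""
    return alpha0 * (k ** (iteration // N))

# Illustrative values: start at 0.1 and halve the learning rate every 1000 iterations.
for it in (0, 999, 1000, 2500, 5000):
    print(it, step_decay(alpha0=0.1, k=0.5, N=1000, iteration=it))
```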

A high decay rate may help prevent overshooting a minimum, but it can also cause premature convergence, since α tends towards zero faster.
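To make that trade-off concrete, here is a small, self-contained comparison of a mild and an aggressive decay factor under the same step schedule; all values are illustrative:

```python
alpha0, N = 0.1, 1000  # illustrative starting rate and decay interval

for k in (0.9, 0.5):
    # Learning rate sampled every 2000 iterations up to 10000.
    rates = [alpha0 * k ** (i // N) for i in range(0, 10001, 2000)]
    # The aggressive factor (k = 0.5) drives alpha towards zero much faster,
    # which can stall progress (premature convergence) before a good minimum is reached.
    print(f"k={k}: {[round(r, 6) for r in rates]}")
```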

Adaptive optimisers, such as Adam or RMSProp, already adjust the effective learning rate dynamically.
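For reference, a hedged PyTorch sketch (assuming PyTorch is installed; the model and hyperparameters are placeholders): Adam adapts its per-parameter step sizes on its own, and an explicit step-decay schedule such as StepLR can still be layered on top of the base learning rate if desired.

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)                              # placeholder model
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # Adam adapts per-parameter step sizes itself

# An explicit step-decay schedule can still be combined with it:
# the base learning rate is multiplied by 0.5 every 10 epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    x, y = torch.randn(32, 10), torch.randn(32, 1)    # dummy batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                   # advance the decay schedule once per epoch
```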

Examples