Gradient-based learning rate control
This refers to adaptive methods that adjust the learning rate dynamically based on the gradients observed during training. By modifying the learning rate according to past gradients, they allow for adaptive step sizes.
Gradients are used because they measure the rate of change of the loss at each iteration; their magnitude therefore indicates how much learning is still happening.
For instance, a gradient-based decay of the learning rate works as follows (see the sketch after this list):
- With large gradients, we are still making significant progress, so the learning rate should stay high to let the algorithm keep adjusting quickly.
- With small gradients, we are close to convergence, so only small updates should be made to the parameters.
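Below is a minimal sketch of this idea, assuming we scale a base learning rate by the ratio of the current gradient norm to a reference norm and clip the result; the function name and the `ref_norm`, `min_scale`, and `max_scale` parameters are illustrative choices, not a standard API.

```python
import numpy as np

def gradient_scaled_lr(base_lr, grad, ref_norm=1.0, min_scale=0.1, max_scale=1.0):
    """Scale the learning rate with the gradient norm: large gradients keep
    the step near base_lr, small gradients shrink it toward base_lr * min_scale."""
    scale = np.linalg.norm(grad) / ref_norm
    return base_lr * float(np.clip(scale, min_scale, max_scale))

# Toy quadratic objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([5.0, -3.0])
base_lr = 0.5
for step in range(20):
    grad = w                                # gradient of the toy objective
    lr = gradient_scaled_lr(base_lr, grad)  # large gradient -> lr stays high
    w -= lr * grad                          # small gradient -> small updates
print(w)  # ends up close to the optimum at the origin
```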
However, there is also the inverse approach (sketched below):
- With large gradients, the learning rate is kept low to prevent exploding gradients.
- With small gradients, the learning rate is kept high to prevent vanishing gradients.
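A minimal sketch of the inverse scheme, assuming the base learning rate is divided by the gradient norm; the function name and the `eps` parameter are illustrative. This inverse relationship is the same intuition behind per-parameter adaptive optimizers such as AdaGrad and RMSProp, which divide by an accumulated gradient statistic rather than the raw norm.

```python
import numpy as np

def inverse_scaled_lr(base_lr, grad, eps=1e-8):
    """Inverse scaling: divide the base learning rate by the gradient norm,
    so large gradients take small steps and small gradients take larger ones.
    eps guards against division by zero when the gradient is near zero."""
    return base_lr / (np.linalg.norm(grad) + eps)

base_lr = 0.1
print(inverse_scaled_lr(base_lr, np.array([50.0, -30.0])))  # large gradient -> small step
print(inverse_scaled_lr(base_lr, np.array([0.01, 0.02])))   # small gradient -> large step
```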