Gradient Descent
Starting with an initial, non-optimal parameter set, gradient descent repeatedly updates the parameters in the direction of the negative gradient of the loss function until it converges to a local minimum.
See Mean Square Error#Derivation for the full derivation.
Procedure
- Init $a$ and $b$ with some values
- For $k$ iterations, we apply the update rules:
$$\frac{\partial L}{\partial a} = -\frac{2}{N}\sum_{i=1}^{N} x_i\,\big(y_i - (a x_i + b)\big) \qquad \frac{\partial L}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} \big(y_i - (a x_i + b)\big)$$
$$a \leftarrow a - \alpha\,\frac{\partial L}{\partial a} \qquad b \leftarrow b - \alpha\,\frac{\partial L}{\partial b}$$
Upon tracking changes on each iteration, if the values of the parameter updates $|\alpha\,\partial L/\partial a|$ and $|\alpha\,\partial L/\partial b|$ both fall below a small threshold $\delta$, we stop early.
def gradient_descent_linreg(a_0, b_0, x, y, alpha=1e-5, delta=1e-5, max_count=1000):
    a, b = a_0, b_0
    N = len(x)
    change = float("Inf")
    counter = 0
    losses = []
    while change > delta:
        # Gradients of the MSE loss with respect to a and b
        D_a = -2 / N * sum([x_i * (y_i - (a * x_i + b)) for x_i, y_i in zip(x, y)])
        D_b = -2 / N * sum([y_i - (a * x_i + b) for x_i, y_i in zip(x, y)])
        # Update rules
        a -= alpha * D_a
        b -= alpha * D_b
        print(f"Gradients: {D_a} {D_b}")
        print(f"New values for (a, b): ({a}, {b})")
        # Stop once both parameter updates become smaller than delta
        change = max(abs(alpha * D_a), abs(alpha * D_b))
        print(f"Change: {change}")
        loss = loss_mse(a, b, x, y)  # see Mean Square Error for loss_mse
        losses.append(float(loss))
        print(f"Loss: {loss}")
        counter += 1
        if counter > max_count:
            break
    return a, b, losses
Using the sklearn library:
from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(sk_inputs, sk_outputs)
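For reference, a small self-contained sketch (the toy data values below are made up) showing how the inputs could be shaped and how the fitted slope and intercept, analogous to $a$ and $b$ above, can be read back:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data standing in for the x, y lists used above
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 8.1]

sk_inputs = np.array(x).reshape(-1, 1)   # sklearn expects shape (n_samples, n_features)
sk_outputs = np.array(y)

reg = LinearRegression().fit(sk_inputs, sk_outputs)
print(reg.coef_[0], reg.intercept_)      # slope and intercept, analogous to a and b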
Improving gradient descent
- In the presence of local minima, the GD algorithm might converge to a poor minimum (e.g. due to a "wrong" starting point). This happens when the loss function is not convex. To alleviate this, we can use Momentum
- We can improve convergence efficiency with Learning rate decay: use a larger learning rate at the start to encourage exploration of the loss landscape, then reduce it in later training iterations for finer adjustments and to prevent overshooting
- When working with noisy data, or with parameters that may require different learning rates, we can try controlling the learning rate with Gradient-based learning rate control
Most advanced GD algorithms combine gradient-based learning rate control, learning rate decay and some momentum formula; the hyperparameters themselves are still tuned by humans. A minimal sketch combining momentum and learning rate decay follows.
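This is an illustration only, not any particular library's implementation: the linear-regression update from above extended with a momentum accumulator and a simple exponential learning rate decay. The helper name gd_momentum_linreg, the momentum coefficient beta and the decay factor are arbitrary choices for this note.

def gd_momentum_linreg(a_0, b_0, x, y, alpha=1e-3, beta=0.9, decay=0.99, iters=1000):
    # Keep an exponentially weighted average of past gradients (momentum)
    # and step in that direction; alpha is decayed every iteration.
    a, b = a_0, b_0
    N = len(x)
    v_a, v_b = 0.0, 0.0
    for t in range(iters):
        D_a = -2 / N * sum(x_i * (y_i - (a * x_i + b)) for x_i, y_i in zip(x, y))
        D_b = -2 / N * sum(y_i - (a * x_i + b) for x_i, y_i in zip(x, y))
        v_a = beta * v_a + (1 - beta) * D_a   # momentum accumulators
        v_b = beta * v_b + (1 - beta) * D_b
        lr = alpha * (decay ** t)             # learning rate decay
        a -= lr * v_a
        b -= lr * v_b
    return a, b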
Batch Gradient Descent
With each Backpropagation we perform 1 parameter update using all $N$ samples:
- We formulate predictions for all $N$ dataset samples
- We use all $N$ predictions to compute the value of the Loss function, averaging errors over all predictions made across all samples
- Perform Backpropagation using this averaged loss
Most optimisers, like the Adam optimiser, use BGD.
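For concreteness, a vectorised sketch of a single full-batch update for the same linear model (illustrative only; the helper name full_batch_step is made up, and it is just the loop body of gradient_descent_linreg written with NumPy):

import numpy as np

def full_batch_step(a, b, x, y, alpha=1e-5):
    # One parameter update computed from ALL N samples at once
    x, y = np.asarray(x), np.asarray(y)
    N = len(x)
    residual = y - (a * x + b)
    D_a = -2 / N * np.sum(x * residual)
    D_b = -2 / N * np.sum(residual)
    return a - alpha * D_a, b - alpha * D_b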
Smaller batch GDs
Instead of 1 update every $N$ samples, we can perform 1 update per sample (Stochastic Gradient Descent).
Otherwise, we can use subsets of $n < N$ samples (mini-batches), giving 1 update per mini-batch.
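A minimal mini-batch sketch for the same linear model (illustrative; the helper name minibatch_gd_linreg, the default batch size and epoch count are arbitrary):

import random

def minibatch_gd_linreg(a_0, b_0, x, y, alpha=1e-3, batch_size=32, epochs=10):
    # One parameter update per mini-batch; batch_size=1 recovers stochastic GD,
    # batch_size=len(x) recovers full-batch GD.
    a, b = a_0, b_0
    data = list(zip(x, y))
    for _ in range(epochs):
        random.shuffle(data)                  # reshuffle samples each epoch
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            n = len(batch)
            D_a = -2 / n * sum(x_i * (y_i - (a * x_i + b)) for x_i, y_i in batch)
            D_b = -2 / n * sum(y_i - (a * x_i + b) for x_i, y_i in batch)
            a -= alpha * D_a
            b -= alpha * D_b
    return a, b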