Gradient Descent

Starting with an initial, non-optimal parameter set, gradient descent repeatedly updates the parameters in the direction of the negative gradient of the loss function until it converges to a local minimum.

See Mean Square Error#Derivation for the full derivation.

Procedure

  1. Init a and b with some values
  2. For n iterations, we apply the update rules:
$$a \leftarrow a + \frac{2\alpha}{N}\sum_{i=1}^{N} x_i \left(y_i - (a x_i + b)\right)$$
$$b \leftarrow b + \frac{2\alpha}{N}\sum_{i=1}^{N} \left(y_i - (a x_i + b)\right)$$

By tracking these changes on each iteration, we can stop early: once the updates to a and b are both smaller than a threshold δ, we break out of the GD loop.

def loss_mse(a, b, x, y):
	# Mean squared error of the fit y ≈ a*x + b (see Mean Square Error#Derivation)
	return sum([(y_i - (a*x_i + b))**2 for x_i, y_i in zip(x, y)]) / len(x)

def gradient_descent_linreg(a_0, b_0, x, y, alpha = 1e-5, delta = 1e-5, max_count = 1000):
	a, b = a_0, b_0
	N = len(x)
	change = float("inf")
	counter = 0
	losses = []

	while change > delta:
		# Gradients of the MSE loss with respect to a and b
		D_a = -2 / N * sum([x_i * (y_i - (a*x_i + b)) for x_i, y_i in zip(x, y)])
		D_b = -2 / N * sum([y_i - (a*x_i + b) for x_i, y_i in zip(x, y)])
		# Step in the direction of the negative gradient
		a -= alpha * D_a
		b -= alpha * D_b

		print(f"Gradients: {D_a} {D_b}")
		print(f"New values for (a, b): ({a}, {b})")
		# Early stop once the largest parameter update falls below delta
		change = max(abs(alpha * D_a), abs(alpha * D_b))
		print(f"Change: {change}")
		loss = loss_mse(a, b, x, y)
		losses.append(float(loss))
		print(f"Loss: {loss}")
		counter += 1
		if counter > max_count:
			break
	return a, b, losses
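
A quick usage sketch on toy data (the data values and learning rate below are illustrative, not from the notes above):

# Toy data roughly following y = 2x + 1, chosen only to illustrate the call
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 9.1, 10.8]

a, b, losses = gradient_descent_linreg(0.0, 0.0, x, y, alpha=1e-2, max_count=5000)
print(a, b)        # estimated slope and intercept
print(losses[-1])  # final MSE loss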

Using the scikit-learn library:

from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(sk_inputs, sk_outputs)
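
LinearRegression computes the least-squares fit directly, so its coefficients serve as a sanity check for the GD estimates. A minimal sketch, reusing the illustrative toy data from above (scikit-learn expects a 2-D feature matrix, so sk_inputs is a column of x values):

import numpy as np
from sklearn.linear_model import LinearRegression

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # same illustrative toy data as above
y = [3.1, 4.9, 7.2, 9.1, 10.8]

sk_inputs = np.array(x).reshape(-1, 1)  # column of features
sk_outputs = np.array(y)

reg = LinearRegression().fit(sk_inputs, sk_outputs)
print(reg.coef_[0], reg.intercept_)  # comparable to a and b from gradient_descent_linreg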

Improving gradient descent

  1. In the presence of local minima, the GD algorithm might converge to the wrong minimum (e.g. due to a "wrong" starting point). This occurs when the loss function is not convex. To alleviate this, we can use Momentum (see the sketch after this list)
  2. We can improve convergence efficiency with Learning rate decay: a larger learning rate at the start encourages exploration of the loss landscape, and reducing it in later training iterations allows finer adjustments and prevents overshooting (also sketched below)
  3. When working with parameters that may require different learning rates, or with noisy data, we can control the learning rate with Gradient-based learning rate control
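
Minimal sketches of the first two ideas (the function names and constants are illustrative, not from the notes above):

# Illustrative momentum update for a single parameter:
# the velocity is an exponentially decaying average of past steps,
# which smooths the trajectory and helps roll past shallow local minima
def momentum_step(a, velocity, grad_a, alpha=1e-2, beta=0.9):
	velocity = beta * velocity - alpha * grad_a
	return a + velocity, velocity

# Illustrative learning rate decay schedule: large steps early, smaller steps later
def decayed_alpha(alpha0, step, decay=0.01):
	return alpha0 / (1 + decay * step)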

Most advanced GD algorithms combine gradient-based learning rate controls, learning rate decay and some form of momentum. The remaining hyperparameters are typically tuned by hand.
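
A minimal sketch of how these pieces can be layered in a single update step (illustrative only; this is not the exact formula of any named optimiser):

import math

def combined_step(param, velocity, grad, grad_sq_sum, step,
                  alpha0=1e-2, decay=0.01, beta=0.9, eps=1e-8):
	# Learning rate decay: shrink the base step size as training progresses
	alpha = alpha0 / (1 + decay * step)
	# Gradient-based control: scale the step per parameter by accumulated gradient magnitude
	grad_sq_sum += grad ** 2
	scaled_alpha = alpha / (math.sqrt(grad_sq_sum) + eps)
	# Momentum: smooth the update with an exponentially decaying average of past steps
	velocity = beta * velocity - scaled_alpha * grad
	return param + velocity, velocity, grad_sq_sum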

Batch Gradient Descent

With each Backpropagation pass we perform 1 parameter update using all N samples in the dataset. This is called (Full) Batch Gradient Descent. This is computationally intensive, as every update requires computing and averaging the gradients over all N samples, which becomes slow and memory-hungry for large datasets.

Most optimisers, such as the Adam optimiser, use BGD.

Smaller batch GDs

Instead of 1 update every N samples, we can use 1 update per sample through Stochastic Gradient Descent.

Alternatively, we can perform each update on a randomly drawn subset of N′ samples from the dataset, where N′ < N. This is called mini-batch stochastic gradient descent.
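
A minimal sketch of one epoch of mini-batch SGD for the same linear model (the batch size and shuffling scheme below are illustrative choices, not from the text above):

import random

def minibatch_sgd_epoch(a, b, x, y, alpha=1e-2, batch_size=32):
	# Shuffle once per epoch, then walk over the data in chunks of batch_size.
	# batch_size = 1 recovers plain stochastic gradient descent;
	# batch_size = len(x) recovers full-batch gradient descent.
	indices = list(range(len(x)))
	random.shuffle(indices)
	for start in range(0, len(indices), batch_size):
		batch = indices[start:start + batch_size]
		n = len(batch)
		D_a = -2 / n * sum(x[i] * (y[i] - (a * x[i] + b)) for i in batch)
		D_b = -2 / n * sum(y[i] - (a * x[i] + b) for i in batch)
		a -= alpha * D_a
		b -= alpha * D_b
	return a, b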