Gradient Descent

Starting with an initial, non-optimal parameter set, gradient descent repeatedly updates the parameters in the direction of the negative gradient of the loss function until it converges to a local minimum.

See Mean Square Error#Derivation for the full derivation.

Procedure

  1. Init a and b with some values
  2. For n iterations, we apply the update rules:
$$a \leftarrow a + \frac{2\alpha}{N}\sum_{i=1}^{N} x_i \left(y_i - (a x_i + b)\right)$$
$$b \leftarrow b + \frac{2\alpha}{N}\sum_{i=1}^{N} \left(y_i - (a x_i + b)\right)$$

By tracking these changes on each iteration, we can stop early: once the updates to a and b are both smaller than a threshold δ, we break out of the GD loop.

def loss_mse(a, b, x, y):
	# Mean squared error of the fit y ≈ a*x + b (see Mean Square Error#Derivation)
	return sum([(y_i - (a*x_i + b))**2 for x_i, y_i in zip(x, y)]) / len(x)

def gradient_descent_linreg(a_0, b_0, x, y, alpha = 1e-5, delta = 1e-5, max_count = 1000):
	a, b = a_0, b_0
	N = len(x)
	change = float("inf")
	counter = 0
	losses = []

	while change > delta:
		# Gradients of the MSE loss with respect to a and b
		D_a = -2 / N * sum([x_i * (y_i - (a*x_i + b)) for x_i, y_i in zip(x, y)])
		D_b = -2 / N * sum([y_i - (a*x_i + b) for x_i, y_i in zip(x, y)])
		# Step in the direction of the negative gradient
		a -= alpha * D_a
		b -= alpha * D_b

		print(f"Gradients: {D_a} {D_b}")
		print(f"New values for (a, b): ({a}, {b})")
		# Early stop once the largest parameter update falls below delta
		change = max(abs(alpha * D_a), abs(alpha * D_b))
		print(f"Change: {change}")
		loss = loss_mse(a, b, x, y)
		losses.append(float(loss))
		print(f"Loss: {loss}")
		counter += 1
		if counter > max_count:
			break
	return a, b, losses
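
A quick usage sketch on toy data (the data values and learning rate below are illustrative, not from the notes above):

# Toy data roughly following y = 2x + 1, chosen only to illustrate the call
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 9.1, 10.8]

a, b, losses = gradient_descent_linreg(0.0, 0.0, x, y, alpha=1e-2, max_count=5000)
print(a, b)        # estimated slope and intercept
print(losses[-1])  # final MSE loss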

Using the scikit-learn library:

from sklearn.linear_model import LinearRegression

reg = LinearRegression().fit(sk_inputs, sk_outputs)
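
LinearRegression computes the least-squares fit directly, so its coefficients serve as a sanity check for the GD estimates. A minimal sketch, reusing the illustrative toy data from above (scikit-learn expects a 2-D feature matrix, so sk_inputs is a column of x values):

import numpy as np
from sklearn.linear_model import LinearRegression

x = [1.0, 2.0, 3.0, 4.0, 5.0]  # same illustrative toy data as above
y = [3.1, 4.9, 7.2, 9.1, 10.8]

sk_inputs = np.array(x).reshape(-1, 1)  # column of features
sk_outputs = np.array(y)

reg = LinearRegression().fit(sk_inputs, sk_outputs)
print(reg.coef_[0], reg.intercept_)  # comparable to a and b from gradient_descent_linreg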

Improving gradient descent

  1. In the presence of local minima, the GD algorithm might converge to the wrong minimum (e.g. due to a "wrong" starting point). This occurs when the loss function is not convex. To alleviate this, we can use Momentum (see the sketch after this list)
  2. We can improve convergence efficiency with Learning rate decay: a larger learning rate at the start encourages exploration of the loss landscape, and reducing it in later training iterations allows finer adjustments and prevents overshooting (also sketched below)
  3. When working with parameters that may require different learning rates, or with noisy data, we can control the learning rate with Gradient-based learning rate control
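
Minimal sketches of the first two ideas (the function names and constants are illustrative, not from the notes above):

# Illustrative momentum update for a single parameter:
# the velocity is an exponentially decaying average of past steps,
# which smooths the trajectory and helps roll past shallow local minima
def momentum_step(a, velocity, grad_a, alpha=1e-2, beta=0.9):
	velocity = beta * velocity - alpha * grad_a
	return a + velocity, velocity

# Illustrative learning rate decay schedule: large steps early, smaller steps later
def decayed_alpha(alpha0, step, decay=0.01):
	return alpha0 / (1 + decay * step)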

Most advanced GD algorithms combine gradient-based learning rate controls, learning rate decay and some form of momentum. The remaining hyperparameters are typically tuned by hand.
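
A minimal sketch of how these pieces can be layered in a single update step (illustrative only; this is not the exact formula of any named optimiser):

import math

def combined_step(param, velocity, grad, grad_sq_sum, step,
                  alpha0=1e-2, decay=0.01, beta=0.9, eps=1e-8):
	# Learning rate decay: shrink the base step size as training progresses
	alpha = alpha0 / (1 + decay * step)
	# Gradient-based control: scale the step per parameter by accumulated gradient magnitude
	grad_sq_sum += grad ** 2
	scaled_alpha = alpha / (math.sqrt(grad_sq_sum) + eps)
	# Momentum: smooth the update with an exponentially decaying average of past steps
	velocity = beta * velocity - scaled_alpha * grad
	return param + velocity, velocity, grad_sq_sum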

Batch Gradient Descent

With each Backpropagation pass we perform 1 parameter update using all N samples in the dataset. This is called (Full) Batch Gradient Descent. This is computationally intensive, as every update requires computing and averaging the gradients over all N samples, which becomes slow and memory-hungry for large datasets.

Most optimisers, such as the Adam optimiser, use BGD.

Smaller batch GDs

Instead of 1 update every N samples, we can use 1 update per sample through Stochastic Gradient Descent.

Alternatively, we can perform each update on a randomly drawn subset of N′ samples from the dataset, where N′ < N. This is called mini-batch stochastic gradient descent.
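
A minimal sketch of one epoch of mini-batch SGD for the same linear model (the batch size and shuffling scheme below are illustrative choices, not from the text above):

import random

def minibatch_sgd_epoch(a, b, x, y, alpha=1e-2, batch_size=32):
	# Shuffle once per epoch, then walk over the data in chunks of batch_size.
	# batch_size = 1 recovers plain stochastic gradient descent;
	# batch_size = len(x) recovers full-batch gradient descent.
	indices = list(range(len(x)))
	random.shuffle(indices)
	for start in range(0, len(indices), batch_size):
		batch = indices[start:start + batch_size]
		n = len(batch)
		D_a = -2 / n * sum(x[i] * (y[i] - (a * x[i] + b)) for i in batch)
		D_b = -2 / n * sum(y[i] - (a * x[i] + b) for i in batch)
		a -= alpha * D_a
		b -= alpha * D_b
	return a, b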