Backpropagation

An algorithm used to train Neural Networks

It works hand in hand with Gradient Descent: backpropagation uses the chain rule to compute the gradients of the loss, and the network learns by adjusting the weights of its connections along those gradients.

It tells us how to adjust the weight and bias matrices.

Solving this optimisation problem with GD:

$$
W_1, b_1, W_2, b_2 = \underset{W_1, b_1, W_2, b_2}{\arg\min}\left[\frac{1}{M}\left(Y_{pred}(X, W_1, b_1, W_2, b_2) - Y\right)^2\right] = \underset{W_1, b_1, W_2, b_2}{\arg\min}\left[\frac{1}{M}\left(W_2(W_1 X + b_1) + b_2 - Y\right)^2\right]
$$
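As a concrete illustration, here is a minimal NumPy sketch of this forward pass and loss, assuming scalar parameters W1, b1, W2, b2 and a toy batch of M samples (all names and values below are hypothetical, not part of the notes):

```python
import numpy as np

# Toy data: M samples of a simple regression problem (assumed for illustration)
M = 100
X = np.random.randn(M)           # inputs
Y = 3.0 * X + 1.0                # targets

# Scalar parameters, matching the notation W2(W1 X + b1) + b2
W1, b1, W2, b2 = 0.5, 0.0, 0.5, 0.0

# Forward pass: Y_pred = W2 (W1 X + b1) + b2
hidden = W1 * X + b1
Y_pred = W2 * hidden + b2

# Mean-squared-error loss: (1/M) * sum((Y_pred - Y)^2)
loss = np.mean((Y_pred - Y) ** 2)
print(loss)
```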

Denoting the error term:

$$
\epsilon = Y_{pred}(X, W_1, b_1, W_2, b_2) - Y = W_2(W_1 X + b_1) + b_2 - Y
$$

Using the chain rule, we can compute the derivative of the loss function with respect to the weights:

$$
\frac{\partial L}{\partial W_2} = \frac{\partial L}{\partial Y_{pred}}\frac{\partial Y_{pred}}{\partial W_2}
$$

$$
\frac{\partial L}{\partial Y_{pred}} = \frac{2}{M}\left(Y_{pred}(X, W_1, b_1, W_2, b_2) - Y\right) = \frac{2\epsilon}{M}
$$

$$
\frac{\partial Y_{pred}}{\partial W_2} = W_1 X + b_1
$$
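Continuing the sketch above, these factors could be computed numerically as follows (still assuming scalar parameters, with the per-sample error summed over the batch of M samples):

```python
# Error term eps = Y_pred - Y, one entry per sample
eps = Y_pred - Y

# dL/dY_pred = (2/M) * eps  and  dY_pred/dW2 = W1*X + b1 = hidden,
# so the chain rule gives dL/dW2 as the sum of per-sample products.
dL_dYpred = (2.0 / M) * eps
dL_dW2 = np.sum(dL_dYpred * hidden)
dL_db2 = np.sum(dL_dYpred)       # dY_pred/db2 = 1
```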

Hence, the gradient descent update rules for W2 and b2 are:

$$
W_2 \leftarrow W_2 - \frac{2\epsilon\alpha}{M}\left(W_1 X + b_1\right)
$$

$$
b_2 \leftarrow b_2 - \frac{2\epsilon\alpha}{M}
$$

where α is the learning rate
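In code, the corresponding update step might look like this (continuing the earlier sketch; the value of alpha is arbitrary and only for illustration):

```python
alpha = 0.1   # learning rate (arbitrary choice for this sketch)

# Gradient descent step for the output-layer parameters
W2 = W2 - alpha * dL_dW2    # W2 <- W2 - (2*alpha/M) * sum(eps * (W1*X + b1))
b2 = b2 - alpha * dL_db2    # b2 <- b2 - (2*alpha/M) * sum(eps)
```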

Similarly, applying the chain rule for W1 and b1:

$$
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial Y_{pred}}\frac{\partial Y_{pred}}{\partial W_1} = \left(\frac{2\epsilon}{M}\right)\left(W_2 X\right)
$$

$$
\frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial Y_{pred}}\frac{\partial Y_{pred}}{\partial b_1} = \left(\frac{2\epsilon}{M}\right) W_2
$$

$$
W_1 \leftarrow W_1 - \frac{2\epsilon\alpha}{M} W_2 X
$$

$$
b_1 \leftarrow b_1 - \frac{2\epsilon\alpha}{M} W_2
$$
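Putting the forward pass, the backpropagated gradients, and the updates together, a self-contained training loop under the same scalar-parameter assumptions might look like the sketch below (toy data, step count, and learning rate are all assumed, not taken from the notes):

```python
import numpy as np

# Toy regression data: M samples (assumed for illustration)
M, alpha, steps = 100, 0.05, 500
X = np.random.randn(M)
Y = 3.0 * X + 1.0

W1, b1, W2, b2 = 0.5, 0.0, 0.5, 0.0

for _ in range(steps):
    # Forward pass: Y_pred = W2 (W1 X + b1) + b2
    hidden = W1 * X + b1
    Y_pred = W2 * hidden + b2
    eps = Y_pred - Y                         # error term

    # Backpropagated gradients (chain rule, as derived above)
    dL_dW2 = (2.0 / M) * np.sum(eps * hidden)
    dL_db2 = (2.0 / M) * np.sum(eps)
    dL_dW1 = (2.0 / M) * np.sum(eps * W2 * X)
    dL_db1 = (2.0 / M) * np.sum(eps * W2)

    # Gradient descent updates
    W1 -= alpha * dL_dW1
    b1 -= alpha * dL_db1
    W2 -= alpha * dL_dW2
    b2 -= alpha * dL_db2

# Final loss after training
print(np.mean((W2 * (W1 * X + b1) + b2 - Y) ** 2))
```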