Adam optimiser

Adaptive Moment Estimation: a very widely used optimiser that combines Momentum and RMSProp, achieving good performance with relatively little hyperparameter tuning.

We combine two running averages:

First Moment Estimate $m_t$

This represents the mean of the gradients and is similar to standard momentum, as it accumulates past gradients in an exponentially weighted moving average.

This allows Adam to accelerate learning in relevant directions.

$$V_{W_2} = (1-\beta_1)\,\frac{\partial L}{\partial W_2} + \beta_1 V_{W_2}, \qquad m_t = (1-\beta_1)\,g_t + \beta_1 m_{t-1}$$

where $g_t$ is the gradient at time step $t$; in our network's notation this is $\frac{\partial L}{\partial W_2}$.
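As a quick standalone illustration (a sketch, separate from the network code below), the exponentially weighted average smooths noisy gradients while preserving their overall direction:

```python
import numpy as np

beta1 = 0.9
m = 0.0  # first moment, initialised at zero

rng = np.random.default_rng(0)
for t in range(1, 6):
    g = 1.0 + 0.5 * rng.standard_normal()  # noisy gradient centred on 1.0
    m = beta1 * m + (1 - beta1) * g
    print(f"t={t}: g={g:+.3f}, m={m:+.3f}")
```

The average responds smoothly rather than jumping with each noisy gradient. Notice how it ramps up from zero in the first few steps; this is exactly the bias that the correction term later compensates for.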

Second Moment Estimate $v_t$

This represents the variance of the gradients (more precisely, their uncentred second moment) and is similar to RMSProp, scaling the learning rate based on the magnitude of past gradients.

Parameters with consistently large gradients therefore receive smaller effective updates, while parameters with small gradients receive larger ones. Combined with the momentum term, we get the best of both worlds.

$$S_{W_2} = (1-\beta_2)\left(\frac{\partial L}{\partial W_2}\right)^{2} + \beta_2 S_{W_2}, \qquad v_t = (1-\beta_2)\,g_t^2 + \beta_2 v_{t-1}$$
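Another standalone sketch (the values are illustrative, not the network's) shows how two parameters with very different gradient magnitudes end up with very different effective step sizes:

```python
import numpy as np

beta2 = 0.999
eps = 1e-6
alpha = 0.01

v = np.zeros(2)
g = np.array([10.0, 0.1])  # one parameter with large gradients, one with small

for _ in range(100):
    v = beta2 * v + (1 - beta2) * g**2

# Effective per-parameter step size: alpha / (sqrt(v) + eps)
print(alpha / (np.sqrt(v) + eps))  # smaller step for the large-gradient parameter
```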

Parameter Update

Again, we include $\epsilon = 10^{-6}$ to prevent division by zero, keeping it outside the square root.

$$C_{W_2} = \alpha\,\frac{V_{W_2}}{\sqrt{S_{W_2}} + \epsilon}, \qquad W_2 \leftarrow W_2 - C_{W_2}$$
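Putting both moments and the update rule together, a minimal single-parameter Adam step could look like the sketch below (the function name and signature are illustrative, not part of the network code):

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-6):
    """One hypothetical Adam update for a single parameter array."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-style)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    param = param - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

In the network itself the same logic is applied to each of the four trainable parameters. Note that the gradients below carry a leading minus sign, so the final update adds rather than subtracts: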
```python
# Gradients of the loss w.r.t. each trainable parameter (note the leading
# minus sign, which makes the parameter updates below additive)
grad_W2 = (-1/m) * np.dot(A1.T, dL_dZ2)
grad_W1 = (-1/m) * np.dot(inputs.T, dL_dZ1)
grad_b2 = (-1/m) * np.sum(dL_dZ2, axis=0, keepdims=True)
grad_b1 = (-1/m) * np.sum(dL_dZ1, axis=0, keepdims=True)

# Unpack the first-moment (V) and second-moment (S) running averages
V_W2, V_W1, V_b2, V_b1, S_W2, S_W1, S_b2, S_b1 = G_list

# First moment: exponentially weighted average of the gradients (momentum)
V_W2 = beta1 * V_W2 + (1 - beta1) * grad_W2
V_W1 = beta1 * V_W1 + (1 - beta1) * grad_W1
V_b2 = beta1 * V_b2 + (1 - beta1) * grad_b2
V_b1 = beta1 * V_b1 + (1 - beta1) * grad_b1
V_W2_norm = V_W2 / (1 - beta1**iteration_number)  # bias correction
V_W1_norm = V_W1 / (1 - beta1**iteration_number)
V_b2_norm = V_b2 / (1 - beta1**iteration_number)
V_b1_norm = V_b1 / (1 - beta1**iteration_number)

# Second moment: exponentially weighted average of the squared gradients
S_W2 = beta2 * S_W2 + (1 - beta2) * grad_W2**2
S_W1 = beta2 * S_W1 + (1 - beta2) * grad_W1**2
S_b2 = beta2 * S_b2 + (1 - beta2) * grad_b2**2
S_b1 = beta2 * S_b1 + (1 - beta2) * grad_b1**2
S_W2_norm = S_W2 / (1 - beta2**iteration_number)  # bias correction
S_W1_norm = S_W1 / (1 - beta2**iteration_number)
S_b2_norm = S_b2 / (1 - beta2**iteration_number)
S_b1_norm = S_b1 / (1 - beta2**iteration_number)
G_list = [V_W2, V_W1, V_b2, V_b1, S_W2, S_W1, S_b2, S_b1]

# Adam update rules (additive because the gradients were negated above)
eps = 1e-6
self.W2 += alpha * V_W2_norm / (np.sqrt(S_W2_norm) + eps)
self.W1 += alpha * V_W1_norm / (np.sqrt(S_W1_norm) + eps)
self.b2 += alpha * V_b2_norm / (np.sqrt(S_b2_norm) + eps)
self.b1 += alpha * V_b1_norm / (np.sqrt(S_b1_norm) + eps)

# Update the loss
self.CE_loss(inputs, outputs)
return G_list
```
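For this to work, the running averages must start at zero and `iteration_number` must start at 1, otherwise the bias-correction denominator $1 - \beta^t$ is zero. A hypothetical set-up (the parameter shapes and the loop body are illustrative, not the network's actual training loop):

```python
import numpy as np

# Illustrative parameter shapes for a small two-layer network
W1, b1 = np.zeros((3, 4)), np.zeros((1, 4))
W2, b2 = np.zeros((4, 2)), np.zeros((1, 2))
params = [W2, W1, b2, b1]

# Eight zero-initialised running averages: V_W2..V_b1 followed by S_W2..S_b1
G_list = [np.zeros_like(p) for p in params + params]

# iteration_number must start at 1 so that 1 - beta**t is non-zero
for iteration_number in range(1, 101):
    pass  # e.g. G_list = update(inputs, outputs, G_list, iteration_number)
```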

We compute all eight S and V variables in this example (one V/S pair for each of the four trainable parameters) and include a bias correction of the moving averages. Because both averages are initialised at zero, they are biased towards zero during the first iterations; dividing by $1 - \beta^t$ compensates for this, which is why the update depends on the iteration number.
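To see why this correction matters, take the very first iteration with $\beta_1 = 0.9$ and $m_0 = 0$:

$$m_1 = \beta_1 m_0 + (1-\beta_1)\,g_1 = 0.1\,g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1^{1}} = \frac{0.1\,g_1}{0.1} = g_1$$

Without the correction the first updates would be roughly ten times too small; as $t$ grows, $1 - \beta^t \to 1$ and the correction fades out.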