Adam optimiser

Adaptive Moment Estimation: a very widely used optimiser that combines Momentum and RMSProp, achieving good performance with relatively little hyperparameter tuning.

We combine two running averages:

First Moment Estimate $m_t$

This represents the mean of the gradients and is similar to standard momentum, as it accumulates past gradients in an exponentially weighted moving average.

This allows Adam to accelerate learning in relevant directions.

$$V_{W_2} = (1-\beta_1)\,\frac{\partial L}{\partial W_2} + \beta_1 V_{W_2}, \qquad m_t = (1-\beta_1)\,g_t + \beta_1 m_{t-1}$$

where $g_t$ is the gradient at time step $t$; in our network's notation this is $\frac{\partial L}{\partial W_2}$.
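As a quick standalone illustration (a sketch, separate from the network code below), the exponentially weighted average smooths noisy gradients while preserving their overall direction:

```python
import numpy as np

beta1 = 0.9
m = 0.0  # first moment, initialised at zero

rng = np.random.default_rng(0)
for t in range(1, 6):
    g = 1.0 + 0.5 * rng.standard_normal()  # noisy gradient centred on 1.0
    m = beta1 * m + (1 - beta1) * g
    print(f"t={t}: g={g:+.3f}, m={m:+.3f}")
```

The average responds smoothly rather than jumping with each noisy gradient. Notice how it ramps up from zero in the first few steps; this is exactly the bias that the correction term later compensates for.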

Second Moment Estimate $v_t$

This represents the variance of the gradients (more precisely, their uncentred second moment) and is similar to RMSProp, scaling the learning rate based on the magnitude of past gradients.

Parameters with consistently large gradients therefore receive smaller effective updates, while parameters with small gradients receive larger ones. Combined with the momentum term, we get the best of both worlds.

$$S_{W_2} = (1-\beta_2)\left(\frac{\partial L}{\partial W_2}\right)^{2} + \beta_2 S_{W_2}, \qquad v_t = (1-\beta_2)\,g_t^2 + \beta_2 v_{t-1}$$
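Another standalone sketch (the values are illustrative, not the network's) shows how two parameters with very different gradient magnitudes end up with very different effective step sizes:

```python
import numpy as np

beta2 = 0.999
eps = 1e-6
alpha = 0.01

v = np.zeros(2)
g = np.array([10.0, 0.1])  # one parameter with large gradients, one with small

for _ in range(100):
    v = beta2 * v + (1 - beta2) * g**2

# Effective per-parameter step size: alpha / (sqrt(v) + eps)
print(alpha / (np.sqrt(v) + eps))  # smaller step for the large-gradient parameter
```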

Parameter Update

Again, we include $\epsilon = 10^{-6}$ to prevent division by zero, keeping it outside the square root.

$$C_{W_2} = \alpha\,\frac{V_{W_2}}{\sqrt{S_{W_2}} + \epsilon}, \qquad W_2 \leftarrow W_2 - C_{W_2}$$
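Putting both moments and the update rule together, a minimal single-parameter Adam step could look like the sketch below (the function name and signature are illustrative, not part of the network code):

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-6):
    """One hypothetical Adam update for a single parameter array."""
    m = beta1 * m + (1 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2    # second moment (RMSProp-style)
    m_hat = m / (1 - beta1**t)               # bias correction
    v_hat = v / (1 - beta2**t)
    param = param - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

In the network itself the same logic is applied to each of the four trainable parameters. Note that the gradients below carry a leading minus sign, so the final update adds rather than subtracts: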
```python
# Gradients of the loss w.r.t. each trainable parameter (note the leading
# minus sign, which makes the parameter updates below additive)
grad_W2 = (-1/m) * np.dot(A1.T, dL_dZ2)
grad_W1 = (-1/m) * np.dot(inputs.T, dL_dZ1)
grad_b2 = (-1/m) * np.sum(dL_dZ2, axis=0, keepdims=True)
grad_b1 = (-1/m) * np.sum(dL_dZ1, axis=0, keepdims=True)

# Unpack the first-moment (V) and second-moment (S) running averages
V_W2, V_W1, V_b2, V_b1, S_W2, S_W1, S_b2, S_b1 = G_list

# First moment: exponentially weighted average of the gradients (momentum)
V_W2 = beta1 * V_W2 + (1 - beta1) * grad_W2
V_W1 = beta1 * V_W1 + (1 - beta1) * grad_W1
V_b2 = beta1 * V_b2 + (1 - beta1) * grad_b2
V_b1 = beta1 * V_b1 + (1 - beta1) * grad_b1
V_W2_norm = V_W2 / (1 - beta1**iteration_number)  # bias correction
V_W1_norm = V_W1 / (1 - beta1**iteration_number)
V_b2_norm = V_b2 / (1 - beta1**iteration_number)
V_b1_norm = V_b1 / (1 - beta1**iteration_number)

# Second moment: exponentially weighted average of the squared gradients
S_W2 = beta2 * S_W2 + (1 - beta2) * grad_W2**2
S_W1 = beta2 * S_W1 + (1 - beta2) * grad_W1**2
S_b2 = beta2 * S_b2 + (1 - beta2) * grad_b2**2
S_b1 = beta2 * S_b1 + (1 - beta2) * grad_b1**2
S_W2_norm = S_W2 / (1 - beta2**iteration_number)  # bias correction
S_W1_norm = S_W1 / (1 - beta2**iteration_number)
S_b2_norm = S_b2 / (1 - beta2**iteration_number)
S_b1_norm = S_b1 / (1 - beta2**iteration_number)
G_list = [V_W2, V_W1, V_b2, V_b1, S_W2, S_W1, S_b2, S_b1]

# Adam update rules (additive because the gradients were negated above)
eps = 1e-6
self.W2 += alpha * V_W2_norm / (np.sqrt(S_W2_norm) + eps)
self.W1 += alpha * V_W1_norm / (np.sqrt(S_W1_norm) + eps)
self.b2 += alpha * V_b2_norm / (np.sqrt(S_b2_norm) + eps)
self.b1 += alpha * V_b1_norm / (np.sqrt(S_b1_norm) + eps)

# Update the loss
self.CE_loss(inputs, outputs)
return G_list
```
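For this to work, the running averages must start at zero and `iteration_number` must start at 1, otherwise the bias-correction denominator $1 - \beta^t$ is zero. A hypothetical set-up (the parameter shapes and the loop body are illustrative, not the network's actual training loop):

```python
import numpy as np

# Illustrative parameter shapes for a small two-layer network
W1, b1 = np.zeros((3, 4)), np.zeros((1, 4))
W2, b2 = np.zeros((4, 2)), np.zeros((1, 2))
params = [W2, W1, b2, b1]

# Eight zero-initialised running averages: V_W2..V_b1 followed by S_W2..S_b1
G_list = [np.zeros_like(p) for p in params + params]

# iteration_number must start at 1 so that 1 - beta**t is non-zero
for iteration_number in range(1, 101):
    pass  # e.g. G_list = update(inputs, outputs, G_list, iteration_number)
```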

We compute all eight S and V variables in this example (one V/S pair for each of the four trainable parameters) and include a bias correction of the moving averages. Because both averages are initialised at zero, they are biased towards zero during the first iterations; dividing by $1 - \beta^t$ compensates for this, which is why the update depends on the iteration number.
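To see why this correction matters, take the very first iteration with $\beta_1 = 0.9$ and $m_0 = 0$:

$$m_1 = \beta_1 m_0 + (1-\beta_1)\,g_1 = 0.1\,g_1, \qquad \hat{m}_1 = \frac{m_1}{1 - \beta_1^{1}} = \frac{0.1\,g_1}{0.1} = g_1$$

Without the correction the first updates would be roughly ten times too small; as $t$ grows, $1 - \beta^t \to 1$ and the correction fades out.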