Adam optimiser
Adaptive Moment Estimation: a very widely used optimiser that combines Momentum and RMSProp. It achieves good performance with relatively little hyperparameter tuning.
We combine two running averages (summarised compactly below):
- an exponentially decaying running average of past gradients
- an exponentially decaying running average of past squared gradients
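Putting the two together gives the standard Adam update (Kingma & Ba, 2015), written here with the same $V$/$S$ naming as the code further down; $g_t$ is the gradient at step $t$ and $\theta$ a parameter:

$$V_t = \beta_1 V_{t-1} + (1 - \beta_1)\, g_t \qquad S_t = \beta_2 S_{t-1} + (1 - \beta_2)\, g_t^2$$
$$\hat{V}_t = \frac{V_t}{1 - \beta_1^t} \qquad \hat{S}_t = \frac{S_t}{1 - \beta_2^t} \qquad \theta_t = \theta_{t-1} - \frac{\alpha\,\hat{V}_t}{\sqrt{\hat{S}_t} + \epsilon}$$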
First Moment Estimate
This represents the mean of the gradients and is similar to standard momentum, as it accumulates past gradients in an exponentially weighted moving average.
This allows Adam to accelerate learning in relevant directions.
- % Lecture notes write the first moment estimate ("first parameter") in their own notation; in standard form it is
  $$V_t = \beta_1 V_{t-1} + (1 - \beta_1)\, g_t$$
  where $\beta_1$ is typically 0.9, a decay rate that controls how much past gradients influence the current update.
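A minimal sketch (illustrative only, not the class code further down) of why the bias correction $V_t/(1 - \beta_1^t)$ matters: the running average starts at zero, so the raw estimate is biased towards zero in the first few iterations.

```python
import numpy as np

beta1 = 0.9
grad = np.array([1.0, -2.0])   # pretend the gradient is the same every step
V = np.zeros_like(grad)        # first moment starts at zero -> biased early on

for t in range(1, 6):
    V = beta1*V + (1 - beta1)*grad
    V_hat = V/(1 - beta1**t)   # bias correction
    print(t, V, V_hat)         # V_hat equals the true mean (the gradient) from step 1
```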
Second Moment Estimate
This represents the (uncentred) variance of the gradients and is similar to RMSProp: it scales the learning rate based on the magnitude of past gradients.
Parameters with large gradients therefore get smaller updates, and parameters with small gradients get larger updates, so we get the best of both worlds:
$$S_t = \beta_2 S_{t-1} + (1 - \beta_2)\, g_t^2$$
where $\beta_2$ is typically 0.999.
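Another small illustrative sketch (again not the class code): two parameters with very different gradient magnitudes end up taking similarly sized steps once the update is divided by $\sqrt{\hat{S}_t}$.

```python
import numpy as np

beta2, eps, alpha = 0.999, 1e-6, 0.01
grad = np.array([10.0, 0.1])   # one large-gradient and one small-gradient parameter
S = np.zeros_like(grad)

for t in range(1, 6):
    S = beta2*S + (1 - beta2)*grad**2
    S_hat = S/(1 - beta2**t)   # bias correction

step = alpha*grad/(np.sqrt(S_hat) + eps)
print(step)                    # roughly [0.01, 0.01]: both parameters move by a similar amount
```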
Parameter Update
Again, we include a small constant $\epsilon$ in the denominator to avoid division by zero:
- $\theta \leftarrow \theta - \alpha\,\dfrac{\hat{V}_t}{\sqrt{\hat{S}_t} + \epsilon}$
! This lecture notes version has no indication of bias correction!
- $\theta \leftarrow \theta - \alpha\,\dfrac{V_t}{\sqrt{S_t + \epsilon}}$
? Why remove the $\epsilon$ from the square root?
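For reference, a minimal self-contained sketch of one Adam step for a single parameter array, using the bias-corrected form above (the function name `adam_step` and its default values are my own, not from the lecture notes). Unlike the class code below, this subtracts the gradient directly; the class code bakes the minus sign into `grad_*` and therefore uses `+=`.

```python
import numpy as np

def adam_step(theta, grad, V, S, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array; t is the 1-based step count."""
    V = beta1*V + (1 - beta1)*grad       # first moment (momentum-like)
    S = beta2*S + (1 - beta2)*grad**2    # second moment (RMSProp-like)
    V_hat = V/(1 - beta1**t)             # bias corrections
    S_hat = S/(1 - beta2**t)
    theta = theta - alpha*V_hat/(np.sqrt(S_hat) + eps)
    return theta, V, S
```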
grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)      # gradients include the minus sign,
grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)  # so the updates below use +=
grad_b2 = (-1/m)*np.sum(dL_dZ2, axis=0, keepdims=True)
grad_b1 = (-1/m)*np.sum(dL_dZ1, axis=0, keepdims=True)
# Momentum and gradient decay/normalization for each parameter
V_W2, V_W1, V_b2, V_b1, S_W2, S_W1, S_b2, S_b1 = G_list
# First moment estimates (momentum-like exponentially weighted averages)
V_W2 = beta1*V_W2 + (1 - beta1)*grad_W2
V_W1 = beta1*V_W1 + (1 - beta1)*grad_W1
V_b2 = beta1*V_b2 + (1 - beta1)*grad_b2
V_b1 = beta1*V_b1 + (1 - beta1)*grad_b1
# Bias correction of the first moments (iteration_number starts at 1)
V_W2_norm = V_W2/(1 - beta1**iteration_number)
V_W1_norm = V_W1/(1 - beta1**iteration_number)
V_b2_norm = V_b2/(1 - beta1**iteration_number)
V_b1_norm = V_b1/(1 - beta1**iteration_number)
# Second moment estimates (RMSProp-like averages of squared gradients)
S_W2 = beta2*S_W2 + (1 - beta2)*grad_W2**2
S_W1 = beta2*S_W1 + (1 - beta2)*grad_W1**2
S_b2 = beta2*S_b2 + (1 - beta2)*grad_b2**2
S_b1 = beta2*S_b1 + (1 - beta2)*grad_b1**2
# Bias correction of the second moments
S_W2_norm = S_W2/(1 - beta2**iteration_number)
S_W1_norm = S_W1/(1 - beta2**iteration_number)
S_b2_norm = S_b2/(1 - beta2**iteration_number)
S_b1_norm = S_b1/(1 - beta2**iteration_number)
# Carry the uncorrected running averages over to the next iteration
G_list = [V_W2, V_W1, V_b2, V_b1, S_W2, S_W1, S_b2, S_b1]
# Gradient descent update rules (Adam step; eps avoids division by zero)
eps = 1e-6
self.W2 += alpha*V_W2_norm/(np.sqrt(S_W2_norm) + eps)
self.W1 += alpha*V_W1_norm/(np.sqrt(S_W1_norm) + eps)
self.b2 += alpha*V_b2_norm/(np.sqrt(S_b2_norm) + eps)
self.b1 += alpha*V_b1_norm/(np.sqrt(S_b1_norm) + eps)
# Update loss
self.CE_loss(inputs, outputs)
return G_list
We compute all eight running averages (a V and an S buffer for each of W2, W1, b2, b1) and return them in G_list so they carry over to the next iteration.
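One practical note on the code above: in standard Adam the moments start at zero, so `G_list` should be initialised to eight zero arrays (one `V_*` and one `S_*` buffer per parameter), and `iteration_number` must start at 1, otherwise the bias-correction denominator is zero. A standalone check of that edge case:

```python
beta1 = 0.9
print(1 - beta1**0)   # 0.0 -> dividing by this (iteration_number = 0) would fail
print(1 - beta1**1)   # 0.1 -> the first valid bias-correction factor
```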