AdaGrad
Adaptive Gradient algorithm: adapts the Learning rate for each parameter individually, based on the accumulated sum of its past squared gradients. This is a form of Gradient-based learning rate control
This scales down the learning rate for frequently updated parameters, while keeping a relatively higher learning rate for parameters that are updated less often
For each parameter $\theta_i$, AdaGrad accumulates the squared gradients over time:

$$G_{t,i} = G_{t-1,i} + g_{t,i}^2$$

where:
- $g_{t,i} = \nabla_{\theta_i} L(\theta_t)$ is the gradient of the loss function with respect to $\theta_i$
- $G_{t,i}$ is the accumulated sum of the squared gradients of each parameter, such as $G_{W}$ and $G_{b}$ for a layer's weights and biases

We then adjust the learning rate per parameter:

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}$$

where $\eta$ is the global learning rate and $\epsilon$ is a small constant (e.g. $10^{-6}$) that prevents division by zero.

The parameter update rule is thus:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$$

Alternatively, in vector form with element-wise operations:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
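A minimal NumPy sketch of a single AdaGrad step for one parameter array (the names `adagrad_step`, `theta`, `grad`, `G`, `eta`, and `eps` are illustrative, not taken from the network code below):

import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-6):
    # Accumulate the squared gradient, then take a per-element scaled step
    G = G + grad**2
    theta = theta - eta * grad / np.sqrt(G + eps)
    return theta, G

# Keep one accumulator per parameter, initialised to zeros
theta = np.array([0.5, -1.2])
G = np.zeros_like(theta)
grad = np.array([0.1, 0.0])   # the second component receives no update this step
theta, G = adagrad_step(theta, grad, G)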
This works well for sparse data, such as in NLP and recommendation systems, because:
- Rarely updated parameters retain a relatively high effective learning rate and can still learn effectively (e.g. word embeddings for uncommon words, which receive few updates); see the short sketch after this list
- Frequently updated parameters take progressively smaller steps, which helps prevent Overfitting
- This can be considered a form of Regularisation
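A small numeric sketch of this behaviour (the values are hypothetical): two parameters share the same global learning rate, but the one that receives a gradient on every step ends up with a much smaller effective step size than the one that is rarely updated.

import numpy as np

eta, eps = 0.1, 1e-6
G = np.zeros(2)   # accumulators for a "frequent" and a "rare" parameter

for step in range(100):
    grad = np.array([1.0,                                # frequent: non-zero gradient every step
                     1.0 if step % 20 == 0 else 0.0])    # rare: non-zero only every 20 steps
    G += grad**2
    effective_lr = eta / np.sqrt(G + eps)

print(effective_lr)   # roughly [0.01, 0.045]: the rare parameter keeps a larger step size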
def backward(self, inputs, outputs, G_list, alpha=1e-5):
    # Assumes `import numpy as np` and that the class defines W1, W2, b1, b2,
    # sigmoid and CE_loss elsewhere
    # Get the number of samples in the dataset
    m = inputs.shape[0]
    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)
    # Backpropagate the cross-entropy error through the sigmoid activations
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2 * A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1 * A1*(1 - A1)
    # Batch-averaged gradients (stored with a negative sign, so the
    # AdaGrad update below adds them)
    grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)
    grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)
    grad_b2 = (-1/m)*np.sum(dL_dZ2, axis=0, keepdims=True)
    grad_b1 = (-1/m)*np.sum(dL_dZ1, axis=0, keepdims=True)
    # Accumulate the squared gradients for each parameter (AdaGrad)
    G_W2, G_W1, G_b2, G_b1 = G_list
    G_W2 += grad_W2**2
    G_W1 += grad_W1**2
    G_b2 += grad_b2**2
    G_b1 += grad_b1**2
    G_list = [G_W2, G_W1, G_b2, G_b1]
    # AdaGrad update: each parameter gets its own scaled learning rate
    eps = 1e-6
    self.W2 += alpha*grad_W2/(np.sqrt(G_W2 + eps))
    self.W1 += alpha*grad_W1/(np.sqrt(G_W1 + eps))
    self.b2 += alpha*grad_b2/(np.sqrt(G_b2 + eps))
    self.b1 += alpha*grad_b1/(np.sqrt(G_b1 + eps))
    # Update loss
    self.CE_loss(inputs, outputs)
    return G_list
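A hypothetical usage sketch (the class name `NeuralNetwork` and the data variables `X`, `y` are placeholders, not from the note): the squared-gradient accumulators are initialised to zeros with the same shapes as the parameters, then threaded through `backward` on every epoch.

# Assumes `import numpy as np` and a class exposing W1, W2, b1, b2 and backward()
net = NeuralNetwork()

# One accumulator per parameter, in the order backward() expects:
# [G_W2, G_W1, G_b2, G_b1], all starting at zero
G_list = [np.zeros_like(net.W2), np.zeros_like(net.W1),
          np.zeros_like(net.b2), np.zeros_like(net.b1)]

for epoch in range(1000):
    # backward() updates the weights in place and returns the grown accumulators
    G_list = net.backward(X, y, G_list, alpha=1e-2)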