AdaGrad

Adaptive Gradient algorithm: adapts the learning rate for each individual parameter based on the sum of that parameter's past squared gradients. This is a form of gradient-based learning rate control.

This scales down the learning rates of frequently updated parameters while maintaining relatively higher learning rates for parameters that are updated less often.

For each parameter $W_i$, AdaGrad maintains a cumulative sum of squared gradients:

$$G_{W_i} \leftarrow G_{W_i} + \left(\frac{\partial L}{\partial W_i}\right)^2$$

where $G_{W_i}$ is the running sum of squared gradients and $\frac{\partial L}{\partial W_i}$ is the gradient of the loss $L$ with respect to $W_i$.

We then adjust the learning rate $\alpha_{W_i}$ for each parameter as:

$$\alpha_{W_i} = \frac{\alpha}{\sqrt{G_{W_i} + \epsilon}}$$

where $\epsilon$ is a small constant, usually around $10^{-6}$, that prevents division by zero.
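For intuition, take an assumed base rate of $\alpha = 0.1$ (numbers chosen purely for illustration). A frequently updated parameter whose accumulator has grown to $G_{W_i} = 100$ gets an effective rate of about $0.1/\sqrt{100} = 0.01$, while a rarely updated one with $G_{W_i} = 1$ keeps roughly $0.1/\sqrt{1} = 0.1$.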

The parameter update rule is thus:

$$W_i \leftarrow W_i - \alpha_{W_i} \frac{\partial L}{\partial W_i}$$

Alternatively, written in vector form over all parameters at step $t$, with $\eta$ the base learning rate and the scaling applied element-wise:

$$W_{t+1} = W_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot \nabla L$$
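A minimal NumPy sketch of a single AdaGrad step; the function name adagrad_step and the default hyperparameters are illustrative, not from the original:

import numpy as np

def adagrad_step(W, grad, G, alpha=0.01, eps=1e-6):
    # Accumulate the squared gradient: G <- G + (dL/dW)^2
    G = G + grad**2
    # Scale the step per parameter: W <- W - alpha / sqrt(G + eps) * dL/dW
    W = W - alpha * grad / np.sqrt(G + eps)
    return W, G

# Toy usage: parameters with zero gradient are left untouched,
# parameters with large accumulated gradients take smaller steps.
W, G = np.zeros(3), np.zeros(3)
W, G = adagrad_step(W, np.array([0.5, 0.0, -0.2]), G)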

This makes AdaGrad well suited to sparse data, as in NLP and recommender systems, because parameters tied to rare features accumulate few squared gradients, so their effective learning rates stay relatively large while frequently updated parameters are damped.
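To see the effect on sparse features, here is a toy comparison (all numbers assumed for illustration): one parameter receives a gradient every step, the other only every twentieth step, and the rarely updated one retains a much larger effective learning rate.

import numpy as np

alpha, eps = 0.1, 1e-6
G = np.zeros(2)  # accumulators for a "dense" and a "sparse" parameter

for step in range(100):
    dense_grad = 0.5                               # updated every step
    sparse_grad = 0.5 if step % 20 == 0 else 0.0   # updated rarely
    G += np.array([dense_grad, sparse_grad])**2

print(alpha / np.sqrt(G + eps))  # ~[0.02, 0.089]: the sparse parameter keeps a larger rate

The backward method below applies the same accumulate-and-scale pattern to each weight and bias of a two-layer sigmoid network trained with a cross-entropy loss.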

def backward(self, inputs, outputs, G_list, alpha = 1e-5):
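    # Assumes numpy is imported as np at module level and that sigmoid() and
    # CE_loss() are methods defined elsewhere on this class.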
    # Get the number of samples in dataset
    m = inputs.shape[0]

    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)

    # Compute error term
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2 * A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1 * A1*(1 - A1)

    # Parameter gradients, averaged over the batch (note the leading minus sign: these are negative gradients)
    grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)
    grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)
    grad_b2 = (-1/m)*np.sum(dL_dZ2, axis = 0, keepdims = True)
    grad_b1 = (-1/m)*np.sum(dL_dZ1, axis = 0, keepdims = True)

    # AdaGrad accumulators: running sums of squared gradients for each parameter
    G_W2, G_W1, G_b2, G_b1 = G_list
    G_W2 += grad_W2**2
    G_W1 += grad_W1**2
    G_b2 += grad_b2**2
    G_b1 += grad_b1**2
    G_list = [G_W2, G_W1, G_b2, G_b1]

    # AdaGrad parameter updates: each parameter gets its own scaled learning rate
    # (+= because grad_* already hold the negative gradients)
    eps = 1e-6
    self.W2 += alpha*grad_W2/(np.sqrt(G_W2 + eps))
    self.W1 += alpha*grad_W1/(np.sqrt(G_W1 + eps))
    self.b2 += alpha*grad_b2/(np.sqrt(G_b2 + eps))
    self.b1 += alpha*grad_b1/(np.sqrt(G_b1 + eps))

    # Update loss
    self.CE_loss(inputs, outputs)
    return G_list
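
A usage sketch for the method above (the names net, X_train, and y_train are placeholders assumed here, not from the original): the accumulators are initialized to zeros once and threaded through successive calls so the squared-gradient history persists across updates.

# net is assumed to be an instance of the class defining backward() above,
# with self.W1, self.W2, self.b1, self.b2 already initialized; X_train and
# y_train are placeholder training arrays.
G_list = [np.zeros_like(net.W2), np.zeros_like(net.W1),
          np.zeros_like(net.b2), np.zeros_like(net.b1)]

for epoch in range(100):
    G_list = net.backward(X_train, y_train, G_list, alpha=1e-2)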