AdaGrad
Adaptive Gradient algorithm: adapts the Learning rate for each parameter individually, based on the accumulated sum of its past squared gradients. This is a form of Gradient-based learning rate control
This scales down the learning rate for frequently updated parameters, while keeping a relatively higher learning rate for parameters that are updated less often
For each parameter $\theta_i$, AdaGrad accumulates the squared gradients over time:

$$G_{t,i} = G_{t-1,i} + g_{t,i}^2$$

where:
- $g_{t,i} = \nabla_{\theta_i} L(\theta_t)$ is the gradient of the loss function with respect to $\theta_i$
- $G_{t,i}$ is the accumulated sum of the squared gradients of each parameter, such as $G_{W}$ and $G_{b}$ for a layer's weights and biases

We then adjust the learning rate per parameter:

$$\eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}$$

where $\eta$ is the global learning rate and $\epsilon$ is a small constant (e.g. $10^{-6}$) that prevents division by zero.

The parameter update rule is thus:

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, g_{t,i}$$

Alternatively, in vector form with element-wise operations:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$$
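A minimal NumPy sketch of a single AdaGrad step for one parameter array (the names `adagrad_step`, `theta`, `grad`, `G`, `eta`, and `eps` are illustrative, not taken from the network code below):

import numpy as np

def adagrad_step(theta, grad, G, eta=0.01, eps=1e-6):
    # Accumulate the squared gradient, then take a per-element scaled step
    G = G + grad**2
    theta = theta - eta * grad / np.sqrt(G + eps)
    return theta, G

# Keep one accumulator per parameter, initialised to zeros
theta = np.array([0.5, -1.2])
G = np.zeros_like(theta)
grad = np.array([0.1, 0.0])   # the second component receives no update this step
theta, G = adagrad_step(theta, grad, G)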
This works well for sparse data, such as in NLP and recommendation systems, because:
- Rarely updated parameters retain a relatively high effective learning rate and can still learn effectively (e.g. word embeddings for uncommon words, which receive few updates); see the short sketch after this list
- Frequently updated parameters take progressively smaller steps, which helps prevent Overfitting
- This can be considered a form of Regularisation
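A small numeric sketch of this behaviour (the values are hypothetical): two parameters share the same global learning rate, but the one that receives a gradient on every step ends up with a much smaller effective step size than the one that is rarely updated.

import numpy as np

eta, eps = 0.1, 1e-6
G = np.zeros(2)   # accumulators for a "frequent" and a "rare" parameter

for step in range(100):
    grad = np.array([1.0,                                # frequent: non-zero gradient every step
                     1.0 if step % 20 == 0 else 0.0])    # rare: non-zero only every 20 steps
    G += grad**2
    effective_lr = eta / np.sqrt(G + eps)

print(effective_lr)   # roughly [0.01, 0.045]: the rare parameter keeps a larger step size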
def backward(self, inputs, outputs, G_list, alpha=1e-5):
    # Assumes `import numpy as np` and that the class defines W1, W2, b1, b2,
    # sigmoid and CE_loss elsewhere
    # Get the number of samples in the dataset
    m = inputs.shape[0]
    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)
    # Backpropagate the cross-entropy error through the sigmoid activations
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2 * A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1 * A1*(1 - A1)
    # Batch-averaged gradients (stored with a negative sign, so the
    # AdaGrad update below adds them)
    grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)
    grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)
    grad_b2 = (-1/m)*np.sum(dL_dZ2, axis=0, keepdims=True)
    grad_b1 = (-1/m)*np.sum(dL_dZ1, axis=0, keepdims=True)
    # Accumulate the squared gradients for each parameter (AdaGrad)
    G_W2, G_W1, G_b2, G_b1 = G_list
    G_W2 += grad_W2**2
    G_W1 += grad_W1**2
    G_b2 += grad_b2**2
    G_b1 += grad_b1**2
    G_list = [G_W2, G_W1, G_b2, G_b1]
    # AdaGrad update: each parameter gets its own scaled learning rate
    eps = 1e-6
    self.W2 += alpha*grad_W2/(np.sqrt(G_W2 + eps))
    self.W1 += alpha*grad_W1/(np.sqrt(G_W1 + eps))
    self.b2 += alpha*grad_b2/(np.sqrt(G_b2 + eps))
    self.b1 += alpha*grad_b1/(np.sqrt(G_b1 + eps))
    # Update loss
    self.CE_loss(inputs, outputs)
    return G_list
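A hypothetical usage sketch (the class name `NeuralNetwork` and the data variables `X`, `y` are placeholders, not from the note): the squared-gradient accumulators are initialised to zeros with the same shapes as the parameters, then threaded through `backward` on every epoch.

# Assumes `import numpy as np` and a class exposing W1, W2, b1, b2 and backward()
net = NeuralNetwork()

# One accumulator per parameter, in the order backward() expects:
# [G_W2, G_W1, G_b2, G_b1], all starting at zero
G_list = [np.zeros_like(net.W2), np.zeros_like(net.W1),
          np.zeros_like(net.b2), np.zeros_like(net.b1)]

for epoch in range(1000):
    # backward() updates the weights in place and returns the grown accumulators
    G_list = net.backward(X, y, G_list, alpha=1e-2)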