RMSProp
Root Mean Square Propagation: similar to AdaGrad, but normalizes each update by the square root of a running average of the squared gradients rather than by the square root of the full accumulated sum of squared gradients.
This was created to address the problems with AdaGrad:
- By accumulating squared gradients over time, AdaGrad can cause the learning rate to decrease too aggressively (the effective learning rate is scaled by the inverse square root of the accumulated sum, which only grows).
- Gradients that were large early in training continue to dominate the learning-rate adjustment even once they are no longer relevant.
Instead, we use an exponentially decaying running average in which recent gradients have more influence. This prevents the effective learning rate from shrinking toward zero over long training runs, as illustrated in the sketch below.
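To see the difference concretely, here is a minimal sketch in plain NumPy with a made-up gradient stream; the variable names (`G_adagrad`, `G_rmsprop`) and the hyperparameter values are illustrative only, not taken from the network code further down.

import numpy as np

# Illustrative gradient stream: large at first, then small
grads = np.array([5.0, 4.0, 0.5, 0.1, 0.1, 0.1])

alpha, rho, eps = 0.01, 0.9, 1e-6
G_adagrad, G_rmsprop = 0.0, 0.0
for g in grads:
    G_adagrad += g**2                           # AdaGrad: the sum only grows
    G_rmsprop = rho*G_rmsprop + (1 - rho)*g**2  # RMSProp: old gradients decay away
    print(alpha/np.sqrt(G_adagrad + eps),       # effective step keeps shrinking
          alpha/np.sqrt(G_rmsprop + eps))       # effective step recovers once gradients shrink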
For each parameter $\theta$, we keep a running average of its squared gradient,

$$G_t = \rho\, G_{t-1} + (1 - \rho)\, g_t^2$$

with $g_t$ the gradient of the loss with respect to $\theta$ at step $t$ and $\rho$ the decay rate.

We then adjust the learning rate per parameter,

$$\alpha_t = \frac{\alpha}{\sqrt{G_t + \epsilon}}$$

where $\alpha$ is the base learning rate and $\epsilon$ is a small constant (e.g. $10^{-6}$) that avoids division by zero.

The parameter update rule is thus

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}}\, g_t$$

Alternatively, $\epsilon$ is sometimes added outside the square root:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \epsilon}\, g_t$$
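As a compact reference, here is a minimal sketch of the update as a standalone function (the name `rmsprop_step` and its default hyperparameters are illustrative, not part of the network code below); it applies exactly the equations above, with $\epsilon$ inside the square root.

import numpy as np

def rmsprop_step(theta, grad, G, alpha=1e-3, rho=0.9, eps=1e-6):
    # Update the running average of squared gradients
    G = rho*G + (1 - rho)*grad**2
    # Scale each parameter's step by the inverse root of its running average
    theta = theta - alpha*grad/np.sqrt(G + eps)
    return theta, G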
def backward(self, inputs, outputs, G_list, alpha=1e-5, rho=0.1):
    # Get the number of samples in the dataset
    m = inputs.shape[0]
    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)
    # Compute error terms (cross-entropy loss propagated back through each layer)
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2 * A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1 * A1*(1 - A1)
    # Compute gradients (negated here, so the updates below use +=)
    grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)
    grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)
    grad_b2 = (-1/m)*np.sum(dL_dZ2, axis=0, keepdims=True)
    grad_b1 = (-1/m)*np.sum(dL_dZ1, axis=0, keepdims=True)
    # Exponentially decaying running average of squared gradients for each parameter
    G_W2, G_W1, G_b2, G_b1 = G_list
    G_W2 = rho*G_W2 + (1 - rho)*grad_W2**2
    G_W1 = rho*G_W1 + (1 - rho)*grad_W1**2
    G_b2 = rho*G_b2 + (1 - rho)*grad_b2**2
    G_b1 = rho*G_b1 + (1 - rho)*grad_b1**2
    G_list = [G_W2, G_W1, G_b2, G_b1]
    # RMSProp parameter updates (eps inside the square root for numerical stability;
    # grad_* already carry the minus sign, hence the +=)
    eps = 1e-6
    self.W2 += alpha*grad_W2/(np.sqrt(G_W2 + eps))
    self.W1 += alpha*grad_W1/(np.sqrt(G_W1 + eps))
    self.b2 += alpha*grad_b2/(np.sqrt(G_b2 + eps))
    self.b1 += alpha*grad_b1/(np.sqrt(G_b1 + eps))
    # Update loss
    self.CE_loss(inputs, outputs)
    return G_list
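Since backward() returns the updated running averages, they have to be threaded through the training loop. A hypothetical usage sketch follows; the class name `Network`, its constructor arguments, and the data `X_train` / `y_train` are assumptions, not part of the code above.

import numpy as np

# Hypothetical: `Network` is assumed to be the class that owns backward() above
net = Network(n_inputs=4, n_hidden=8, n_outputs=1)

# Running averages start at zero, in the order backward() unpacks them:
# [G_W2, G_W1, G_b2, G_b1]
G_list = [np.zeros_like(net.W2), np.zeros_like(net.W1),
          np.zeros_like(net.b2), np.zeros_like(net.b1)]

for epoch in range(1000):
    # Feed the returned averages back in so the decaying history carries across steps
    G_list = net.backward(X_train, y_train, G_list, alpha=1e-3, rho=0.9)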