RMSProp

Root Mean Square Propagation: similar to AdaGrad, but the adaptive denominator is the square root of an exponentially decaying running average of the squared gradients rather than the square root of their accumulated sum.

This was created to address the main problem with AdaGrad: the accumulated sum of squared gradients can only grow, so the effective learning rate keeps shrinking and training eventually stalls.

Instead, RMSProp uses an exponentially decaying running average in which recent gradients have more influence. This prevents the effective learning rate from decaying toward zero.
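To make the contrast concrete, here is a minimal, hypothetical sketch (not part of the network implementation below) comparing AdaGrad's accumulated sum of squared gradients with RMSProp's running average when every step sees the same gradient:

grad = 0.5     # pretend every step produces the same gradient
rho = 0.9      # decay rate for the running average (assumed value here)

G_adagrad = 0.0
G_rmsprop = 0.0
for step in range(100):
    G_adagrad += grad**2                             # grows without bound
    G_rmsprop = rho*G_rmsprop + (1 - rho)*grad**2    # settles near grad**2

print(G_adagrad)   # 25.0  -> AdaGrad's effective step size keeps shrinking
print(G_rmsprop)   # ~0.25 -> RMSProp's effective step size levels off

Since the step size is divided by the square root of this accumulator (defined next), AdaGrad's steps shrink toward zero while RMSProp's settle at a roughly constant scale.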

For each parameter $W_i$ we maintain a running average $G_{W_i}$ of its squared gradient,

$$G_{W_i} \leftarrow (1 - \rho)\left(\frac{\partial L}{\partial W_i}\right)^2 + \rho\, G_{W_i}$$

with $\rho$ being a hyperparameter (the decay rate).

We then adjust the learning rate $\alpha_{W_i}$ for each parameter as

$$\alpha_{W_i} = \frac{\alpha}{\sqrt{G_{W_i} + \epsilon}}$$

where $\epsilon$ is a small constant, usually set to $10^{-6}$, to prevent division by zero.

The parameter update rule is thus

$$W_i \leftarrow W_i - \alpha_{W_i}\,\frac{\partial L}{\partial W_i}$$

Alternatively, in vector form,

$$W_{t+1} = W_t - \frac{\eta}{\sqrt{G_t + \epsilon}}\,\nabla L$$
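As a standalone sketch of this update (the function name rmsprop_step and its default values are illustrative assumptions, not part of the network class below):

import numpy as np

def rmsprop_step(W, grad, G, alpha=1e-3, rho=0.9, eps=1e-6):
    # Update the running average of squared gradients
    G = rho*G + (1 - rho)*grad**2
    # Scale each entry's step by the root of its running average
    W = W - alpha*grad/np.sqrt(G + eps)
    return W, G

# Example: one step on a randomly initialized parameter matrix
W = np.random.randn(3, 2)
G = np.zeros_like(W)
grad = np.random.randn(3, 2)   # stand-in for dL/dW
W, G = rmsprop_step(W, grad, G)

The same accumulators and update appear inside the backward method below, applied to each weight matrix and bias vector of a small sigmoid network.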

def backward(self, inputs, outputs, G_list, alpha=1e-5, rho=0.1):
    # G_list carries the running averages [G_W2, G_W1, G_b2, G_b1] from the
    # previous call; alpha is the base learning rate, rho the decay rate
    # Get the number of samples in dataset
    m = inputs.shape[0]

    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)

    # Backpropagate the error: derivative of the binary cross-entropy loss
    # w.r.t. the sigmoid outputs, then the error terms for each layer
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2 * A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1 * A1*(1 - A1)

    # Compute (negative) average gradients; the minus sign lets the
    # parameter updates below use +=
    grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)
    grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)
    grad_b2 = (-1/m)*np.sum(dL_dZ2, axis=0, keepdims=True)
    grad_b1 = (-1/m)*np.sum(dL_dZ1, axis=0, keepdims=True)

    # Exponentially decaying running average of squared gradients per parameter
    G_W2, G_W1, G_b2, G_b1 = G_list
    G_W2 = rho*G_W2 + (1 - rho)*grad_W2**2
    G_W1 = rho*G_W1 + (1 - rho)*grad_W1**2
    G_b2 = rho*G_b2 + (1 - rho)*grad_b2**2
    G_b1 = rho*G_b1 + (1 - rho)*grad_b1**2
    G_list = [G_W2, G_W1, G_b2, G_b1]

    # RMSProp updates with per-parameter learning rates
    eps = 1e-6
    self.W2 += alpha*grad_W2/(np.sqrt(G_W2 + eps))
    self.W1 += alpha*grad_W1/(np.sqrt(G_W1 + eps))
    self.b2 += alpha*grad_b2/(np.sqrt(G_b2 + eps))
    self.b1 += alpha*grad_b1/(np.sqrt(G_b1 + eps))

    # Update the stored cross-entropy loss
    self.CE_loss(inputs, outputs)
    return G_list
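A possible way to drive this method (a hypothetical training loop; model is assumed to be an instance of the surrounding class with its weights already initialized, and X, y the training inputs and binary targets):

import numpy as np

# The running averages start at zero and are threaded through each call,
# in the order the method unpacks them: [G_W2, G_W1, G_b2, G_b1]
G_list = [np.zeros_like(model.W2), np.zeros_like(model.W1),
          np.zeros_like(model.b2), np.zeros_like(model.b1)]

for epoch in range(1000):
    G_list = model.backward(X, y, G_list, alpha=1e-5, rho=0.1)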