RMSProp
Root Mean Square Propagation: similar to AdaGrad, but normalizes each update by the square root of a running average of the squared gradients rather than by the square root of the full accumulated sum of squared gradients.
This was created to address the problems with AdaGrad:
- By accumulating squared gradients over time, AdaGrad can cause the learning rate to decrease too aggressively (the effective learning rate is scaled by the inverse square root of the accumulated sum, which only grows).
- Gradients that were large early in training continue to dominate the learning-rate adjustment even once they are no longer relevant.
Instead, we use an exponentially decaying running average in which recent gradients have more influence. This prevents the effective learning rate from shrinking toward zero over long training runs, as illustrated in the sketch below.
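To see the difference concretely, here is a minimal sketch in plain NumPy with a made-up gradient stream; the variable names (`G_adagrad`, `G_rmsprop`) and the hyperparameter values are illustrative only, not taken from the network code further down.

import numpy as np

# Illustrative gradient stream: large at first, then small
grads = np.array([5.0, 4.0, 0.5, 0.1, 0.1, 0.1])

alpha, rho, eps = 0.01, 0.9, 1e-6
G_adagrad, G_rmsprop = 0.0, 0.0
for g in grads:
    G_adagrad += g**2                           # AdaGrad: the sum only grows
    G_rmsprop = rho*G_rmsprop + (1 - rho)*g**2  # RMSProp: old gradients decay away
    print(alpha/np.sqrt(G_adagrad + eps),       # effective step keeps shrinking
          alpha/np.sqrt(G_rmsprop + eps))       # effective step recovers once gradients shrink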
For each parameter $\theta$, we keep a running average of its squared gradient,

$$G_t = \rho\, G_{t-1} + (1 - \rho)\, g_t^2$$

with $g_t$ the gradient of the loss with respect to $\theta$ at step $t$ and $\rho$ the decay rate.

We then adjust the learning rate per parameter,

$$\alpha_t = \frac{\alpha}{\sqrt{G_t + \epsilon}}$$

where $\alpha$ is the base learning rate and $\epsilon$ is a small constant (e.g. $10^{-6}$) that avoids division by zero.

The parameter update rule is thus

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t + \epsilon}}\, g_t$$

Alternatively, $\epsilon$ is sometimes added outside the square root:

$$\theta_{t+1} = \theta_t - \frac{\alpha}{\sqrt{G_t} + \epsilon}\, g_t$$
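As a compact reference, here is a minimal sketch of the update as a standalone function (the name `rmsprop_step` and its default hyperparameters are illustrative, not part of the network code below); it applies exactly the equations above, with $\epsilon$ inside the square root.

import numpy as np

def rmsprop_step(theta, grad, G, alpha=1e-3, rho=0.9, eps=1e-6):
    # Update the running average of squared gradients
    G = rho*G + (1 - rho)*grad**2
    # Scale each parameter's step by the inverse root of its running average
    theta = theta - alpha*grad/np.sqrt(G + eps)
    return theta, G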
def backward(self, inputs, outputs, G_list, alpha=1e-5, rho=0.1):
    # Get the number of samples in the dataset
    m = inputs.shape[0]
    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)
    # Compute error terms (cross-entropy loss propagated back through each layer)
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2 * A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1 * A1*(1 - A1)
    # Compute gradients (negated here, so the updates below use +=)
    grad_W2 = (-1/m)*np.dot(A1.T, dL_dZ2)
    grad_W1 = (-1/m)*np.dot(inputs.T, dL_dZ1)
    grad_b2 = (-1/m)*np.sum(dL_dZ2, axis=0, keepdims=True)
    grad_b1 = (-1/m)*np.sum(dL_dZ1, axis=0, keepdims=True)
    # Exponentially decaying running average of squared gradients for each parameter
    G_W2, G_W1, G_b2, G_b1 = G_list
    G_W2 = rho*G_W2 + (1 - rho)*grad_W2**2
    G_W1 = rho*G_W1 + (1 - rho)*grad_W1**2
    G_b2 = rho*G_b2 + (1 - rho)*grad_b2**2
    G_b1 = rho*G_b1 + (1 - rho)*grad_b1**2
    G_list = [G_W2, G_W1, G_b2, G_b1]
    # RMSProp parameter updates (eps inside the square root for numerical stability;
    # grad_* already carry the minus sign, hence the +=)
    eps = 1e-6
    self.W2 += alpha*grad_W2/(np.sqrt(G_W2 + eps))
    self.W1 += alpha*grad_W1/(np.sqrt(G_W1 + eps))
    self.b2 += alpha*grad_b2/(np.sqrt(G_b2 + eps))
    self.b1 += alpha*grad_b1/(np.sqrt(G_b1 + eps))
    # Update loss
    self.CE_loss(inputs, outputs)
    return G_list
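Since backward() returns the updated running averages, they have to be threaded through the training loop. A hypothetical usage sketch follows; the class name `Network`, its constructor arguments, and the data `X_train` / `y_train` are assumptions, not part of the code above.

import numpy as np

# Hypothetical: `Network` is assumed to be the class that owns backward() above
net = Network(n_inputs=4, n_hidden=8, n_outputs=1)

# Running averages start at zero, in the order backward() unpacks them:
# [G_W2, G_W1, G_b2, G_b1]
G_list = [np.zeros_like(net.W2), np.zeros_like(net.W1),
          np.zeros_like(net.b2), np.zeros_like(net.b1)]

for epoch in range(1000):
    # Feed the returned averages back in so the decaying history carries across steps
    G_list = net.backward(X_train, y_train, G_list, alpha=1e-3, rho=0.9)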