Stochastic Gradient Descent

We compute predictions for a single sample, randomly drawn from the dataset, evaluate the loss function for the current model parameters, and then perform backpropagation.

def train(self, inputs, outputs, N_max = 1000, alpha = 1e-5, beta1 = 0.9, beta2 = 0.999, \
          delta = 1e-5, display = True):
    # Get number of samples
    M = inputs.shape[0]
    # List of losses, starts with the current loss
    self.losses_list = [self.CE_loss(inputs, outputs)]
    # Initialize G_list
    G_list = [0*self.W2, 0*self.W1, 0*self.b2, 0*self.b1, \
              0*self.W2, 0*self.W1, 0*self.b2, 0*self.b1]
    # Repeat iterations
    for iteration_number in range(1, N_max + 1):
        # Stochastic GD on one randomly chosen sample
        indexes = np.random.randint(0, M)
        inputs_sub = np.array([inputs[indexes, :]])
        outputs_sub = np.array([outputs[indexes, :]])
        # Backpropagate
        G_list, loss = self.backward(inputs_sub, outputs_sub, G_list, iteration_number, alpha, beta1, beta2)

Most loss functions are defined as a mean error over several samples; here the quantity we want to reduce is the cross-entropy loss averaged over the dataset.

A gradient computed from a single sample does not give a good estimate of this mean loss, but each iteration is much faster to compute.
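To see why, the sketch below compares the gradient of the mean loss over all samples with the gradient computed from a single sample, on a toy least-squares problem (the data, model, and variable names here are hypothetical and independent of the network class above):

import numpy as np

# Toy linear model with a mean squared loss, used only to illustrate how noisy
# a single-sample gradient is compared to the full-batch gradient
rng = np.random.default_rng(0)
M = 1000
X = rng.normal(size=(M, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=M)
w = np.zeros(3)                                   # current parameters

# Gradient of the mean loss over all M samples
grad_full = 2 * X.T @ (X @ w - y) / M

# Gradient computed from one randomly drawn sample
i = rng.integers(0, M)
grad_single = 2 * X[i] * (X[i] @ w - y[i])

# Averaging many single-sample gradients recovers the full-batch gradient
grad_avg = np.mean([2 * X[j] * (X[j] @ w - y[j])
                    for j in rng.integers(0, M, size=5000)], axis=0)

print("full-batch gradient:", grad_full)
print("one-sample gradient:", grad_single)        # noisy estimate
print("average of 5000 one-sample gradients:", grad_avg)

Each single-sample gradient is an unbiased but noisy estimate of the full gradient, which is exactly what makes stochastic gradient descent cheap per iteration yet erratic.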

Mini-batch Gradient Descent

Gradient descent on a subset of batch_size < M samples randomly drawn from the dataset at each iteration.

def train(self, inputs, outputs, N_max = 1000, alpha = 1e-5, beta1 = 0.9, beta2 = 0.999, \
          delta = 1e-5, batch_size = 100, display = True):
    # Get number of samples
    M = inputs.shape[0]
    # List of losses, starts with the current loss
    self.losses_list = [self.CE_loss(inputs, outputs)]
    # Initialize G_list
    G_list = [0*self.W2, 0*self.W1, 0*self.b2, 0*self.b1, \
              0*self.W2, 0*self.W1, 0*self.b2, 0*self.b1]
    # Define RNG for stochastic mini-batches (requires: from numpy.random import default_rng)
    rng = default_rng()

    # Repeat iterations
    for iteration_number in range(1, N_max + 1):
        # Select a subset of inputs and outputs with given batch size
        shuffler = rng.choice(M, size = batch_size, replace = False)
        inputs_sub = inputs[shuffler, :]
        outputs_sub = outputs[shuffler, :]

        # Backpropagate
        G_list, loss = self.backward(inputs_sub, outputs_sub, G_list, iteration_number, alpha, beta1, beta2)
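As a quick side-by-side, the sketch below isolates the two sampling strategies used in the methods above; the data arrays and their shapes are placeholders, only the indexing matters:

import numpy as np
from numpy.random import default_rng

# Placeholder dataset, only used to show the two sampling schemes
M, n_in, n_out = 1000, 4, 3
inputs = np.zeros((M, n_in))
outputs = np.zeros((M, n_out))

# Stochastic GD: one random sample per iteration
indexes = np.random.randint(0, M)
inputs_sub = np.array([inputs[indexes, :]])       # shape (1, n_in)
outputs_sub = np.array([outputs[indexes, :]])     # shape (1, n_out)

# Mini-batch GD: batch_size distinct samples per iteration
rng = default_rng()
batch_size = 100
shuffler = rng.choice(M, size=batch_size, replace=False)   # no repeated indices
inputs_batch = inputs[shuffler, :]                # shape (batch_size, n_in)
outputs_batch = outputs[shuffler, :]              # shape (batch_size, n_out)

print(inputs_sub.shape, inputs_batch.shape)       # (1, 4) (100, 4)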

Batch sizes

It is usually better to choose a batch size N that is a power of 2, i.e. N = {32, 64, 128, 256, 512}.

A larger batch size makes each update more expensive to compute, but the gradient estimate is less noisy, which generally gives more stable training.
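This trade-off can be checked numerically. The sketch below uses a hypothetical least-squares setup (not the network above) and measures how much the mini-batch gradient estimate fluctuates for each of the batch sizes listed above; the spread shrinks as the batch size grows, while the cost of each update grows roughly linearly with it:

import numpy as np

rng = np.random.default_rng(0)
M = 10_000
X = rng.normal(size=(M, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=M)
w = np.zeros(5)                                   # current parameters

def batch_gradient(idx):
    # Gradient of the mean squared error restricted to the selected samples
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / len(idx)

for batch_size in [32, 64, 128, 256, 512]:
    grads = np.array([batch_gradient(rng.choice(M, size=batch_size, replace=False))
                      for _ in range(200)])
    # Average standard deviation of the gradient estimate for this batch size
    print(batch_size, grads.std(axis=0).mean())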