Neural Networks

Definitions

NNs are computing systems inspired by the biological neural networks that constitute human brains.

They are based on a collection of connected nodes called artificial neurons.

Each connection, akin to a synapse in a biological brain, can transmit a signal to other neurons; the network thereby processes the information given as input to produce a final signal as output.

Training neural networks can be done with algorithms such as Backpropagation.

Simple Neural Net

  1. __init__ method listing trainable parameters, i.e. a weight vector W = (w₁, w₂) with 2 elements and a single scalar bias value b
  2. forward method to formulate predictions for any given inputs: yᵢ = w₁sᵢ + w₂dᵢ + b
  3. Adding a loss function: Reuse the MSE
  4. The NN could operate with any number of inputs Nx and outputs Ny by representing:
    1. W as an Nx×Ny matrix
    2. b as an Ny vector
import numpy as np

class SimpleNeuralNet():

    def __init__(self, W, b):
        self.W = W
        self.b = b
        # Loss, initialized as infinity before first calculation is made
        self.loss = float("Inf")

    def forward(self, x):
        # Wx + b operation
        Z = np.matmul(x, self.W)
        pred = Z + self.b
        return pred

    def MSE_loss(self, inputs, outputs):
        # Mean squared error over all samples
        outputs_re = outputs.reshape(-1, 1)
        pred = self.forward(inputs)
        losses = (pred - outputs_re) ** 2
        self.loss = np.sum(losses) / outputs.shape[0]
        return self.loss

simple_nn = SimpleNeuralNet(W = np.zeros(shape = (2, 1)), b = np.ones(shape = (1, 1)))

Shallow Neural Net

(Figure: Scaling up with more layers)

It will include two processing layers of the form WX + b:

  1. The first layer receives inputs with dimensionality nx and produces hidden activations with dimensionality nh (the hidden layer).
  2. The second (output) layer receives the hidden activations (of dimensionality nh) and produces outputs with dimensionality ny, matching the outputs in our dataset.
  3. Hence, W1 is a 2D nx×nh matrix with b1 a 1D nh vector, while W2 is nh×ny with b2 of size ny.
class ShallowNeuralNet():

    def __init__(self, n_x, n_h, n_y):
        # Network dimensions
        self.n_x = n_x
        self.n_h = n_h
        self.n_y = n_y
        
        # Weights and biases matrices
        self.W1 = np.random.randn(n_x, n_h)*0.1
        self.b1 = np.random.randn(1, n_h)*0.1
        self.W2 = np.random.randn(n_h, n_y)*0.1
        self.b2 = np.random.randn(1, n_y)*0.1
        
        # Loss, initialized as infinity before first calculation is made
        self.loss = float("Inf")

    def forward(self, inputs):
        # Wx + b operation for the first layer
        Z1 = np.matmul(inputs, self.W1)
        Z1_b = Z1 + self.b1
        
        # Wx + b operation for the second layer
        Z2 = np.matmul(Z1_b, self.W2)
        Z2_b = Z2 + self.b2
        return Z2_b

    def MSE_loss(self, inputs, outputs):
        # MSE loss function as before
        outputs_re = outputs.reshape(-1, 1)
        pred = self.forward(inputs)
        losses = (pred - outputs_re)**2
        self.loss = np.sum(losses)/outputs.shape[0]
        return self.loss
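
A quick usage sketch (the dimensions and random data below are arbitrary, just to check that shapes line up):

shallow_nn = ShallowNeuralNet(n_x = 2, n_h = 10, n_y = 1)
inputs = np.random.randn(100, 2)    # 100 samples with 2 features each
outputs = np.random.randn(100)      # 100 scalar targets
print(shallow_nn.MSE_loss(inputs, outputs))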

We now need a training procedure since the weights and biases are currently randomly generated (See Backpropagation).

Training Procedure

  1. Backward method - performs the calculation of gradients in matrix form and parameter adjustments using GD update rules
  2. Train method - repeatedly calls the backward method in a GD for-loop until a maximal number of iterations is reached, or convergence is observed

i.e. the procedure:

  1. Forward pass on dataset samples
  2. Compute errors and loss function
  3. Compute gradients via Backpropagation
  4. Adjust parameters with the GD update rules
  5. Repeat until convergence or max number of iterations
def train(self, inputs, outputs, N_max = 1000, alpha = 1e-5, delta = 1e-5, display = True):
  # List of losses, starts with the current loss
  self.losses_list = [self.loss]
  # Repeat iterations
  for iteration_number in range(1, N_max + 1):
    # Backpropagate
    self.backward(inputs, outputs, alpha)
    new_loss = self.loss
    # Update losses list
    self.losses_list.append(new_loss)
    # Display
    if(display):
      print("Iteration {} - Loss = {}".format(iteration_number, new_loss))
    # Check for delta value and early stop criterion
    difference = abs(self.losses_list[-1] - self.losses_list[-2])
    if(difference < delta):
      if(display):
        print("Stopping early - loss evolution was less than delta.")
      break
  else:
    # The else clause on the for loop runs only if break never triggered
    if(display):
      print("Stopping - Maximal number of iterations reached.")

import matplotlib.pyplot as plt

def show_losses_over_training(self):
  # Plot the loss curve in linear and logarithmic scale, side by side
  fig, axs = plt.subplots(1, 2, figsize = (15, 5))
  axs[0].plot(list(range(len(self.losses_list))), self.losses_list)
  axs[0].set_xlabel("Iteration number")
  axs[0].set_ylabel("Loss")
  axs[1].plot(list(range(len(self.losses_list))), self.losses_list)
  axs[1].set_xlabel("Iteration number")
  axs[1].set_ylabel("Loss (in logarithmic scale)")
  axs[1].set_yscale("log")
  # Display
  plt.show()

Symmetry

Initialising all parameters as identical constants (e.g. all zeros) is bad because every neuron then computes the same function and receives the same updates.

The lack of diversity leads to a lack of generalisation, which prevents the NN from learning complex patterns.

This happens because all weights and biases share the same starting point: the backward pass then updates the parameters identically, so they keep the same values over the course of training. Hence, we need random initialisation.

Symmetrical neural networks are vulnerable to Adversarial attacks

Model design

Initialisation is important to consider, as poor choices such as constant initialisation may lead to issues like Vanishing gradients. Instead of zero/same-constant initialisation, we should use random starting values, or variants such as Xavier initialisation.
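
As a sketch, the uniform variant of Xavier initialisation scales the sampling range by the layer's fan-in and fan-out (the helper name and RNG choice below are my assumptions, not from these notes):

def xavier_init(n_in, n_out, rng = np.random.default_rng()):
    # Xavier/Glorot uniform: keeps activation variance roughly stable across layers
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size = (n_in, n_out))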

Selecting a proper learning rate is also important to tackle Exploding gradients.

Also, if our model only consists of two linear layers (as seen above), their composition is itself linear, so the decision boundary is linear in x1 and x2 and we cannot model more complex behaviour. Hence, we need to rely on Activation functions.

Introducing non-linearity

Adding sigmoid operations after each linear operation in the forward method:

def forward(self, inputs):
  # Wx + b operation for the first layer
  Z1 = np.matmul(inputs, self.W1)
  Z1_b = Z1 + self.b1
  A1 = self.sigmoid(Z1_b)
  # Wx + b operation for the second layer
  Z2 = np.matmul(A1, self.W2)
  Z2_b = Z2 + self.b2
  y_pred = self.sigmoid(Z2_b)
  return y_pred
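
This assumes a sigmoid method on the class, which these notes do not show explicitly; a minimal version:

def sigmoid(self, x):
    # Logistic function, mapping any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))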

This requires the backward propagation to be updated.

The network now computes:

Inputs X
  → Linear #1: Z₁ = W₁X + b₁
  → Sigmoid: A₁ = σ(Z₁)
  → Linear #2: Z₂ = W₂A₁ + b₂
  → Sigmoid: A₂ = σ(Z₂)
  → Cross-Entropy Loss: L(A₂, Y)

LinReg + Sigmoid = LogReg, so LinReg Layer + Sigmoid = LogReg Layer?

Throwing formulas around:

$$
\begin{aligned}
Z_1 &= W_1 X + b_1, & A_1 &= \sigma(Z_1), & Z_2 &= W_2 A_1 + b_2, & A_2 &= \sigma(Z_2) \\
L &= -\frac{1}{N} \sum_i^N \big[ Y \ln(A_2) + (1 - Y)\ln(1 - A_2) \big]
\end{aligned}
$$

Using $\sigma'(X) = \sigma(X)(1 - \sigma(X))$:

$$
\begin{aligned}
\frac{\partial L}{\partial A_2} &= -\frac{Y}{A_2} + \frac{1 - Y}{1 - A_2} \\
\frac{\partial L}{\partial Z_2} &= \frac{\partial L}{\partial A_2}\,\frac{\partial A_2}{\partial Z_2} = \frac{\partial L}{\partial A_2}\, A_2 (1 - A_2) \\
\frac{\partial L}{\partial W_2} &= \frac{\partial L}{\partial Z_2}\,\frac{\partial Z_2}{\partial W_2} = \frac{\partial L}{\partial Z_2}\, A_1,
\qquad \frac{\partial L}{\partial b_2} = \frac{\partial L}{\partial Z_2}\,\frac{\partial Z_2}{\partial b_2} = \frac{\partial L}{\partial Z_2} \\
\frac{\partial L}{\partial A_1} &= \frac{\partial L}{\partial Z_2}\,\frac{\partial Z_2}{\partial A_1} = \frac{\partial L}{\partial Z_2}\, W_2 \\
\frac{\partial L}{\partial Z_1} &= \frac{\partial L}{\partial A_1}\,\frac{\partial A_1}{\partial Z_1} = \frac{\partial L}{\partial A_1}\, A_1 (1 - A_1) \\
\frac{\partial L}{\partial W_1} &= \frac{\partial L}{\partial Z_1}\,\frac{\partial Z_1}{\partial W_1} = \frac{\partial L}{\partial Z_1}\, X,
\qquad \frac{\partial L}{\partial b_1} = \frac{\partial L}{\partial Z_1}\,\frac{\partial Z_1}{\partial b_1} = \frac{\partial L}{\partial Z_1}
\end{aligned}
$$

Updating our backward method:

def backward(self, inputs, outputs, alpha = 1e-5):
    # Get the number of samples in dataset
    m = inputs.shape[0]

    # Forward propagate
    Z1 = np.matmul(inputs, self.W1)
    Z1_b = Z1 + self.b1
    A1 = self.sigmoid(Z1_b)
    Z2 = np.matmul(A1, self.W2)
    Z2_b = Z2 + self.b2
    A2 = self.sigmoid(Z2_b)

    # Compute error term
    dL_dA2 = -outputs/A2 + (1 - outputs)/(1 - A2)
    dL_dZ2 = dL_dA2*A2*(1 - A2)
    dL_dA1 = np.dot(dL_dZ2, self.W2.T)
    dL_dZ1 = dL_dA1*A1*(1 - A1)

    # Gradient descent update rules
    self.W2 -= (1/m)*alpha*np.dot(A1.T, dL_dZ2)
    self.W1 -= (1/m)*alpha*np.dot(inputs.T, dL_dZ1)
    self.b2 -= (1/m)*alpha*np.sum(dL_dZ2, axis = 0, keepdims = True)
    self.b1 -= (1/m)*alpha*np.sum(dL_dZ1, axis = 0, keepdims = True)

    # Update Loss
    self.CE_loss(inputs, outputs)
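
The CE_loss method called at the end is not shown in these notes; a minimal sketch matching the loss formula above, assuming binary targets of shape (m,) or (m, 1):

def CE_loss(self, inputs, outputs):
    # Binary cross-entropy: L = -(1/N) * sum(Y ln(A2) + (1 - Y) ln(1 - A2))
    outputs_re = outputs.reshape(-1, 1)
    A2 = self.forward(inputs)
    N = outputs_re.shape[0]
    self.loss = -np.sum(outputs_re*np.log(A2) + (1 - outputs_re)*np.log(1 - A2))/N
    return self.loss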

We use Activation functions to introduce non-linearity into our models. Sigmoid and tanh are common choices, but both can lead to vanishing gradients.

Hence, we can use the ReLU or Leaky ReLU functions, which are computationally efficient and avoid saturation.
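
Both are one-liners in NumPy (the 0.01 negative slope below is a common default, not prescribed by these notes):

def relu(x):
    # Passes positive values through, zeroes out the rest
    return np.maximum(0, x)

def leaky_relu(x, slope = 0.01):
    # Like ReLU, but keeps a small gradient for negative inputs
    return np.where(x > 0, x, slope*x)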

Improving gradient descent

We can improve gradient descent further (see Gradient Descent#Improving gradient descent) by introducing Momentum, Gradient-based learning rate control and Learning rate decay through different optimisation algorithms.

Examples of optimisers
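
For instance, a minimal sketch of SGD with momentum (the velocity buffer, beta value and function name are illustrative assumptions, not the notes' code):

def sgd_momentum_step(param, grad, velocity, alpha = 1e-3, beta = 0.9):
    # Momentum: move along an exponentially-decaying average of past gradients
    velocity = beta*velocity + (1 - beta)*grad
    param = param - alpha*velocity
    return param, velocity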

Improving computation rate

We can perform parameter training updates with smaller batches of the dataset rather than the entire dataset per iteration by utilising Stochastic Gradient Descent, ideally mini-batch.
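
A sketch of how the update loop changes under mini-batch SGD (the shuffling step and batch_size value are typical choices, assumed here):

M = inputs.shape[0]
indices = np.random.permutation(M)
for start in range(0, M, batch_size):
    batch = indices[start:start + batch_size]
    # One parameter update per mini-batch instead of per full dataset pass
    self.backward(inputs[batch], outputs[batch], alpha)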

Good Practices

Additional performance metrics

A low loss does not by itself mean training was successful, since the loss is merely the quantity we minimised. We should also track more interpretable performance metrics.

For instance, using accuracy:

def accuracy(self, inputs, outputs):
    # Calculate accuracy for given inputs and outputs:
    # threshold predictions at 0.5, then count matches with the targets
    pred = (self.forward(inputs) >= 0.5).astype(int)
    acc = np.sum(pred == outputs.reshape(-1, 1))/outputs.shape[0]
    return acc

def train(self, inputs, outputs, N_max = 1000, alpha = 1e-5, beta1 = 0.9, beta2 = 0.999, \
          delta = 1e-5, batch_size = 100, display = True):
    # Get number of samples
    M = inputs.shape[0]

    # List of losses, starts with the current loss
    self.losses_list = [self.CE_loss(inputs, outputs)]
    self.accuracies_list = [self.accuracy(inputs, outputs)]

Train-test-validation split

We are not here to minimise the training loss function (or maximise accuracy) but to generalise well on unseen data. So expanding on Train and test split, we introduce a validation dataset.

This will help us assess the performance of the model and how well it would generalise.
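
A minimal sketch of a 70/15/15 split using shuffled indices (the ratios and seed are arbitrary assumptions):

rng = np.random.default_rng(42)
idx = rng.permutation(inputs.shape[0])
n_train = int(0.70*len(idx))
n_val = int(0.15*len(idx))
train_idx = idx[:n_train]                 # used to fit parameters
val_idx = idx[n_train:n_train + n_val]    # used to monitor generalisation
test_idx = idx[n_train + n_val:]          # held out for the final evaluation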

Implementing early stop

In general, we should try to stop the training when the model starts losing its generalisation capabilities and begins overfitting.

Early stopping can be defined as interrupting training once the validation loss stops improving (or starts increasing) even though the training loss keeps decreasing.
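
A sketch of one such criterion as a helper function (the patience mechanism is a common convention, assumed here rather than taken from these notes):

def should_stop_early(val_losses, patience = 10):
    # Stop if the validation loss has not improved for `patience` iterations
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before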

Saver and Loader functions

Since we will often stop training too late (after the model has already lost generalisation capabilities), we should save the model parameters every K iterations.

import os
import pickle

def save(self, path_to_file, iter_num = "final"):
    # Display
    folder = path_to_file + "/" + iter_num + "/"
    print("Saving model to", folder)

    # Create the directory (and any missing parents) if it does not exist
    if(not os.path.exists(folder)):
        os.makedirs(folder)

    # Dump each parameter matrix to its own pickle file
    # (the with statement closes each file automatically)
    for name, param in [("W1", self.W1), ("W2", self.W2),
                        ("b1", self.b1), ("b2", self.b2)]:
        with open(folder + name + ".pkl", 'wb') as f:
            pickle.dump(param, f)

# Save model
self.save("./save", iter_num = str(iteration_number))

After training, load the model parameters (W1, W2, b1, b2) that correspond to the best iteration for generalisation, e.g. by manually analysing the training curves.

Saving is also used to ensure reproducibility of results and to prevent data loss.

(Figure: ShallowNeuralNet reimplemented in PyTorch)

Dense Neural Networks

A neural network with more than two hidden layers

Good practices

import torch

class DenseReLU(torch.nn.Module):
    def __init__(self, n_x, n_y):
        super().__init__()
        # Define Linear layer using the nn.Linear()
        self.fc = torch.nn.Linear(n_x, n_y)

    def forward(self, x):
        # Wx + b operation
        # Using ReLU operation as activation after
        return torch.relu(self.fc(x))

class DenseNoReLU(torch.nn.Module):
    def __init__(self, n_x, n_y):
        super().__init__()
        # Define Linear layer using the nn.Linear()
        self.fc = torch.nn.Linear(n_x, n_y)

    def forward(self, x):
        # Wx + b operation
        # No activation function
        return self.fc(x)

class DeepNeuralNet(torch.nn.Module):
    def __init__(self, n_x, n_h, n_y):
        super().__init__()
        # Define three Dense + ReLU layers, followed by one
        # Dense layer with no activation (softmax is applied
        # later, inside the cross-entropy loss)
        self.layer1 = DenseReLU(n_x, n_h[0])
        self.layer2 = DenseReLU(n_h[0], n_h[1])
        self.layer3 = DenseReLU(n_h[1], n_h[2])
        self.layer4 = DenseNoReLU(n_h[2], n_y)

        # Combine all four layers
        self.combined_layers = torch.nn.Sequential(self.layer1,
                                                   self.layer2,
                                                   self.layer3,
                                                   self.layer4)

    def forward(self, x):
        # Flatten images (transform them from 28x28
        # 2D matrices to 784 1D vectors)
        x = x.view(x.size(0), -1)
        # Pass through all four layers
        out = self.combined_layers(x)
        return out

# Initialize the model and optimizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = DeepNeuralNet(n_x=784, n_h=[80, 40, 20], n_y=10).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
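
The notes stop before the training loop; a minimal sketch of how this model would typically be trained (train_loader and the epoch count are assumptions, e.g. a DataLoader over an MNIST-style dataset of 28x28 images):

criterion = torch.nn.CrossEntropyLoss()  # applies log-softmax internally

for epoch in range(5):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()            # reset gradients from the last step
        loss = criterion(model(images), labels)
        loss.backward()                  # backpropagate
        optimizer.step()                 # Adam update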

Deeper networks can learn more complex patterns, but may be prone to overfitting.

Conversely, shallower networks may be easier to train and require less data, but may not be able to learn patterns that are as complex.