Convolutional Neural Networks
Neural networks that use convolutions, rather than purely linear (fully connected) operations, to process input images.
Rather than hand-picking kernels, the network learns which kernel values (i.e., which image-processing operations) to use in the convolution operations, which run both in parallel and in sequence.
In building a CNN for classification, the initial layers are built with Conv2d operations, and the final layers must be linear layers. This is because the final output must be a vector of per-class scores (logits) that can be converted into probabilities (e.g., with softmax).
A few layers with no (or few) trainable parameters, such as pooling, dropout, and batch normalization, were introduced for CNNs to improve robustness and efficiency and to prevent overfitting.
We can also effectively increase the size of the training dataset through data augmentation: applying various transformations (rotations, crops, flips, etc.) to the existing data.
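As a minimal sketch of how augmentation might be wired up for the MNIST model below (the specific transforms and their parameters are illustrative assumptions, not fixed choices), the transformations are applied inside the training transform pipeline. This also builds the train_loader used in the later snippets:

import torch
from torchvision import datasets, transforms

# Illustrative augmentation pipeline for MNIST; the transforms and their
# parameters are example choices, not requirements
train_transform = transforms.Compose([
    transforms.RandomRotation(10),         # random rotations up to +/-10 degrees
    transforms.RandomCrop(28, padding=2),  # random shifts via padded crops
    transforms.ToTensor(),
])

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64,
                                           shuffle=True)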
An example MNIST CNN with all of these non-trainable layers included:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MNIST_CNN_all(nn.Module):
    def __init__(self):
        super(MNIST_CNN_all, self).__init__()
        # Two convolutional layers
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        # Two fully connected layers; fc1's input size must match the
        # flattened conv output after pooling (64 channels, 14x14 maps)
        self.fc1 = nn.Linear(64*14*14, 128)  # 64*14*14 = 12544
        self.fc2 = nn.Linear(128, 10)
        # Batch normalization layers
        self.batch_norm1 = nn.BatchNorm2d(32)
        self.batch_norm2 = nn.BatchNorm2d(64)
        self.batch_norm3 = nn.BatchNorm1d(128)
        # Dropout layers
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.dropout3 = nn.Dropout(0.25)
        # MaxPool layer
        self.maxpool2d = nn.MaxPool2d(2)

    def forward(self, x):
        # Pass input through first convolutional layer
        x = self.conv1(x)
        x = F.relu(x)
        x = self.batch_norm1(x)
        x = self.dropout1(x)
        # Pass output of first conv layer through second convolutional layer
        # Pooling only once, on the second layer (we could also do it on the first one)
        x = self.conv2(x)
        x = F.relu(x)
        x = self.batch_norm2(x)
        x = self.dropout2(x)
        x = self.maxpool2d(x)  # halves spatial dims: 28x28 -> 14x14
        # Flatten output of second conv layer
        x = x.view(-1, 64*14*14)
        # Pass flattened output through first fully connected layer
        x = self.fc1(x)
        x = F.relu(x)
        x = self.batch_norm3(x)
        x = self.dropout3(x)
        # Pass output of first fully connected layer through second fully connected layer
        x = self.fc2(x)
        return x

model = MNIST_CNN_all()
for inputs, labels in train_loader:
    out = model(inputs)
    print(out.shape)    # torch.Size([batch_size, 10])
    print(labels.shape)
    break
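Note that fc2 outputs raw logits rather than probabilities. A hedged sketch of how the model could be trained and its logits converted to probabilities (the optimizer and learning rate are illustrative assumptions):

import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # applies log-softmax internally, so it expects raw logits
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # illustrative choice

model.train()  # enable dropout and batch-norm updates
for inputs, labels in train_loader:
    optimizer.zero_grad()
    logits = model(inputs)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()

model.eval()  # disable dropout, use running batch-norm statistics
with torch.no_grad():
    probs = F.softmax(model(inputs), dim=1)  # explicit probabilities for the last batch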
A quick history of remarkable CV models
- AlexNet
  - One of the first architectures to combine Conv2d with dropout, pooling, and ReLU
  - Notable for its sheer number of layers and trainable parameters
  - Trained on GPUs instead of CPUs to improve computational performance
- VGG
  - Based on AlexNet
  - Instead of AlexNet's large kernels (e.g. 11x11 with a stride of 4), it uses stacks of smaller 3x3 kernels with a stride of 1
  - Uses 3 ReLU units instead of 1
  - The decision function is more discriminative and has fewer parameters (27 instead of 49: three stacked 3x3 kernels have 3 x 9 = 27 weights per channel, versus 49 for a single 7x7 kernel)
  - Uses 1x1 convolution layers to make the decision function more non-linear without changing the receptive fields (a 1x1 convolution does not aggregate information from neighbouring pixels)
  - Fewer parameters allowed for a larger number of layers and better performance
- ResNet (Residual Networks)
  - Utilises skip connections, which simplify the network and reduce the impact of vanishing gradients (see the residual-block sketch after this list)
  - The network gradually restores the skipped layers as it learns the feature space, reusing previous inputs in deeper layers
  - Very large models of up to 152 layers can be trained
- DenseNet
  - A type of ResNet where all layers with matching feature-map sizes are connected directly to each other
  - In effect, every possible skip connection is drawn within each block
  - Allows even larger models
- Inception models
  - Built to handle salient parts of images that vary greatly in size, e.g. the subject taking up different areas of the composition
  - Uses multiple convolution filters of different sizes in parallel (see the Inception-block sketch after this list)
  - This allows the same network to process salient features at different scales
  - Multiple Inception blocks are then assembled in sequence, reusing pooling layers, batch norm layers, dropout layers, and skip connections
- EfficientNet
  - A scaling method that uniformly scales all dimensions of depth, width, and resolution using a compound coefficient; in other words, the size of the layers and the number of parameters are adjusted together
  - This solves the problem that a bigger input image needs more layers (to increase the receptive field) and more channels (to capture the finer-grained patterns it contains)
  - Pre-trained models were released in different sizes, ranging from b0 (smallest) to b7 (largest)
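As referenced above, a minimal sketch of the skip-connection idea behind ResNet. This is a simplified, hypothetical residual block, not the exact block from the paper:

class ResidualBlock(nn.Module):
    """Simplified residual block: output = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + x  # skip connection: gradients also flow through the identity
        return F.relu(out)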
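Likewise, a hedged sketch of the Inception idea of parallel filters of different sizes. This simplified, hypothetical block omits the dimension-reducing 1x1 convolutions that real Inception modules place before the larger filters:

class InceptionBlock(nn.Module):
    """Simplified Inception-style block: parallel branches, concatenated."""
    def __init__(self, in_channels, branch_channels):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),
        )

    def forward(self, x):
        # Each branch sees the same input; the different kernel sizes capture
        # salient features at different scales. The branch outputs are
        # concatenated along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.pool(x)], dim=1)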