Skip connections

Direct connections between non-adjacent layers in a neural network that bypass one or more intermediate layers.

This allows information to flow directly from earlier layers to later layers without being transformed or lost in between, and it gives gradients a short path back to the early layers during backpropagation.
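In the common residual formulation, the skip connection simply adds the block's input $x$ to the output of the skipped layers $f(x)$:

$$y = f(x) + x$$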

When training a very deep model, the gradients for the parameters of the first layers are close to zero, so those layers learn very slowly (the vanishing gradient problem).

Reason: backpropagation applies the chain rule many times in a row; multiplying many partial derivatives that have small values eventually produces a gradient close to zero for the early layers.

$$\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial A_{25}} \cdot \frac{\partial A_{25}}{\partial A_{24}} \cdots \frac{\partial A_1}{\partial W_1}$$
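A sketch of why skip connections help, assuming each block computes $A_{k+1} = f(A_k) + A_k$: the local derivative of every such block picks up an identity term, so the product of factors in the chain rule no longer has to shrink toward zero:

$$\frac{\partial A_{k+1}}{\partial A_k} = \frac{\partial f(A_k)}{\partial A_k} + I$$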

Residual Block

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    def __init__(self, n):
        super(ResidualBlock, self).__init__()

        # Conv layers: a 1x1 conv and a 3x3 conv (stride 1, padding 1),
        # both preserving the channel count and spatial size.
        self.conv1 = nn.Conv2d(n, n, 1)
        self.conv2 = nn.Conv2d(n, n, 3, 1, 1)

        # Final linear layer; assumes the incoming feature maps are 24x24.
        self.classifier = nn.Linear(n * 24 * 24, 751)

    def forward(self, x):
        # First Conv block (Conv2d + ReLU), no residual.
        out = F.relu(self.conv1(x))

        # Second Conv block, add input x as residual.
        out = F.relu(self.conv2(out)) + x

        # Flatten the feature maps into one vector per example
        out = out.view(out.size(0), -1)

        # Final Layer
        out = self.classifier(out)
        return out
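A minimal usage sketch (reusing the imports above), assuming n = 16 channels and the 24x24 spatial size implied by the classifier; the values are chosen only for illustration:

block = ResidualBlock(16)
x = torch.randn(8, 16, 24, 24)   # batch of 8 feature maps, 16 channels, 24x24
logits = block(x)                # shape: (8, 751)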