Batchnorm layer

A technique to improve the stability and performance of neural networks. Its running mean and variance statistics are not trained by backpropagation; only the scale (γ) and shift (β) parameters are learnable.

y = γ · (x − μ) / σ + β

where:

μ: mean of the activations in the current mini-batch
σ: standard deviation of the activations in the current mini-batch (in practice √(σ² + ε), with a small ε for numerical stability)
γ, β: learnable scale and shift parameters
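As a concrete sketch, this is the per-feature computation in PyTorch, assuming a 2D input of shape (batch, features); the comparison against nn.BatchNorm1d is only illustrative:

```python
import torch

x = torch.randn(32, 4)                 # mini-batch of 32 samples, 4 features

mu = x.mean(dim=0)                     # per-feature mean over the batch
var = x.var(dim=0, unbiased=False)     # per-feature (biased) variance over the batch

eps = 1e-5                             # small constant for numerical stability
gamma = torch.ones(4)                  # learnable scale, initialised to 1
beta = torch.zeros(4)                  # learnable shift, initialised to 0

y = gamma * (x - mu) / torch.sqrt(var + eps) + beta

# the built-in layer should give (nearly) the same result in training mode
bn = torch.nn.BatchNorm1d(4)
bn.train()
print(torch.allclose(y, bn(x), atol=1e-5))
```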

It normalises the activations produced by each layer for each mini-batch, resulting in a more stable distribution.

Internal covariate shift refers to the change in the distribution of network activations as the network's parameters update during training. Because the distribution of activations keeps changing, later layers must constantly adapt to new distributions, and gradients can become unstable if activations grow to extreme values.

Because the activations are normalised, the network can converge more quickly.

Just as we normalise the input data, Batchnorm normalises the activations at every layer rather than only at the input.

Similar to a Dropout layer, the module is inserted as its own layer in the model, typically after a linear or convolutional layer and before the activation function (see the sketch below).

Depending on whether the data is 1D (the output of a linear layer) or 2D (an image-shaped feature map), we use nn.BatchNorm1d or nn.BatchNorm2d respectively.
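A minimal sketch of both placements (the layer sizes here are illustrative assumptions):

```python
import torch.nn as nn

# 1D case: BatchNorm1d after a linear layer, before the activation
mlp = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),    # normalises the 64 features across the mini-batch
    nn.ReLU(),
    nn.Linear(64, 10),
)

# 2D case: BatchNorm2d after a convolutional layer, before the activation
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),    # normalises each of the 16 channels across the mini-batch
    nn.ReLU(),
)
```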

Differences in training and evaluation mode

train(): normalises with the statistics of the current mini-batch and updates the running mean and variance estimates
eval(): normalises with the running mean and variance estimates accumulated during training
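A minimal sketch of the mode switch (the model and data here are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4), nn.ReLU())
batch = torch.randn(16, 8)

# training mode: normalise with the batch statistics and update the running estimates
model.train()
out = model(batch)
print(model[1].running_mean)    # running estimates were updated by the forward pass

# evaluation mode: normalise with the stored running estimates instead
model.eval()
out = model(torch.randn(1, 8))  # works even for a single sample
```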