Batchnorm layer
Technique to improve the stability and performance of neural networks. The normalisation itself is not learned; the layer carries only two trainable parameters per feature (scale and shift).
$$y = \gamma \cdot \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$
where:
- $x$ (resp. $y$) is the input (resp. output) of the batchnorm layer
- $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the batch, respectively, calculated separately for each feature dimension in the input batch
- $\gamma$ and $\beta$ are trainable parameters of the batchnorm layer, also known as the scale and shift parameters
- $\epsilon$ is a small constant added for numerical stability
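A minimal sketch of this normalisation computed by hand (the toy batch and the $\gamma$, $\beta$ initial values below are illustrative):

```python
import torch

# Hypothetical toy batch: 4 samples, 3 features
x = torch.tensor([[1.0, 2.0, 0.0],
                  [3.0, 0.0, 1.0],
                  [5.0, 4.0, 2.0],
                  [7.0, 6.0, 3.0]])

eps = 1e-5                           # small constant for numerical stability
mu = x.mean(dim=0)                   # per-feature batch mean
var = x.var(dim=0, unbiased=False)   # per-feature batch variance
gamma = torch.ones(3)                # scale parameter (trainable, initialised to 1)
beta = torch.zeros(3)                # shift parameter (trainable, initialised to 0)

y = gamma * (x - mu) / torch.sqrt(var + eps) + beta

# With gamma = 1 and beta = 0, each feature of y has ~zero mean and unit variance
print(y.mean(dim=0))   # ≈ [0, 0, 0]
```

Since $\gamma$ and $\beta$ are learned, the network can undo the normalisation where that helps, so batchnorm does not restrict what the layer can represent.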
It normalises the activations produced by each layer for each mini-batch, resulting in a more stable distribution.
Internal covariate shift refers to the change in the distribution of network activations as the network parameters update during training. Because the distribution of activations keeps changing, later layers must constantly adapt to new distributions, and gradients can become unstable if activations grow to extreme values.
As the activations are normalised, the network can converge more quickly.
Just as we normalise input data, batchnorm normalises activations at every layer instead of only the input layer.
Similar to the Dropout layer, the module should be inserted:
- after a linear or convolutional layer
- before/after a non-linear activation function.
Depending on whether the data is 1D shaped (linear layer output) or 2D shaped (image output), we use nn.BatchNorm1d or nn.BatchNorm2d respectively.
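The placement rules above can be sketched as follows (the layer sizes and activation choice are illustrative, not prescribed by the original):

```python
import torch
import torch.nn as nn

# A small MLP: BatchNorm1d after the linear layer, before the activation
mlp = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),   # 1D: normalises each of the 64 features over the batch
    nn.ReLU(),
    nn.Linear(64, 10),
)

# A small CNN: BatchNorm2d after the conv layer
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # 2D: normalises each of the 16 channels over batch and spatial dims
    nn.ReLU(),
)

x1 = torch.randn(8, 20)            # batch of 8 feature vectors
x2 = torch.randn(8, 3, 32, 32)     # batch of 8 RGB images
print(mlp(x1).shape)   # torch.Size([8, 10])
print(cnn(x2).shape)   # torch.Size([8, 16, 32, 32])
```

Note that the `num_features` argument must match the output width of the preceding layer (64 features for the linear layer, 16 channels for the conv layer).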
Differences in training and evaluation mode
- train(): normalises using the current mini-batch's mean and variance, and updates the running estimates of both as a side effect
- eval(): normalises using the running mean and variance estimates accumulated during training, so the output no longer depends on the rest of the batch
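The train/eval difference can be observed directly; the batch below is deliberately shifted and scaled so the two modes produce visibly different outputs (sizes and seed are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)            # tracks running statistics by default

x = torch.randn(16, 4) * 3 + 5    # batch with non-zero mean and non-unit variance

bn.train()
y_train = bn(x)   # normalised with the *current batch* statistics;
                  # running_mean / running_var are updated as a side effect

bn.eval()
y_eval = bn(x)    # normalised with the *running* estimates instead

# In train mode the output is standardised w.r.t. this batch;
# in eval mode it generally is not, because different statistics are used
print(y_train.mean(dim=0))   # ≈ 0
print(y_eval.mean(dim=0))    # generally far from 0 after a single training step
```

This is why calling `model.eval()` before inference matters: in train mode the prediction for one sample would depend on whatever else happens to be in its batch.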