Multi-class classification

Unlike binary classification, a multi-class classification model has to output $N$ probability values $p_{i}$, where each $p_{i}$ is the probability that the sample belongs to class $i$.

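The $p_{i}$ should represent probabilities, so they must be non-negative and sum to 1. With classes indexed from $0$ (so $N = 10$ gives $i = 0, \dots, 9$): $$\sum^{N-1}_{i=0}p_i = 1$$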
A dense layer $Y = W X + b$ on its own cannot guarantee this, because:

Negative values may be produced (as it is a linear transformation)
These values usually don't sum up to 1 (the constraint is not enforced)
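
To make the two problems concrete, here is a minimal sketch of a dense layer in plain Python. The helper `dense` and the weights, bias, and input are hypothetical values chosen for illustration, not taken from any particular model.

```python
def dense(W, x, b):
    """Compute y = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical parameters: 3 classes, 2 input features.
W = [[1.0, -2.0],
     [0.5, 0.5],
     [-1.0, 3.0]]
b = [0.0, 1.0, -0.5]
x = [1.0, 2.0]

y = dense(W, x, b)
print(y)       # raw scores, one per class; the first one is negative
print(sum(y))  # the scores do not sum to 1
```

With these values the first score is negative and the total is 4, so the raw outputs cannot be read as probabilities.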

Both requirements are met by applying a Softmax function to the dense layer's outputs: it maps them to non-negative values that sum to 1, so the output vector can be read as the probability of the sample belonging to each class $i$. The predicted class is the one with the highest probability:

$$\text{pred} = \arg\max_{i} \, [p_{i}]$$
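
As an illustration, softmax and the arg max step can be sketched in plain Python. The `softmax` helper and the example scores are assumptions made for this sketch; subtracting the maximum before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map raw scores to values that are positive and sum to 1."""
    m = max(logits)                           # shift by the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-3.0, 2.5, 4.5]                     # hypothetical dense-layer outputs
probs = softmax(logits)

# Predicted class: the index i with the largest p_i.
pred = max(range(len(probs)), key=probs.__getitem__)

print(probs)
print(sum(probs))
print(pred)
```

Because softmax is monotonic, the arg max of the probabilities is the same index as the arg max of the raw scores.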

When implementing the model, softmax is not used as the final activation function; it is instead applied inside the loss function cross_entropy(), so the model itself outputs raw scores (logits).
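
The sketch below shows one way a loss can fold softmax into itself. It assumes that cross_entropy() behaves like PyTorch's `torch.nn.functional.cross_entropy`, which takes unnormalized logits. The pure-Python functions here are hypothetical stand-ins for illustration, not the library's implementation.

```python
import math

def log_softmax(logits):
    """Log of the softmax probabilities, computed via log-sum-exp for stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def cross_entropy(logits, target):
    """Negative log-probability of the true class, taken directly from raw logits."""
    return -log_softmax(logits)[target]

logits = [-3.0, 2.5, 4.5]                   # hypothetical model outputs (no softmax applied)
loss_correct = cross_entropy(logits, 2)     # true class is the top-scoring one
loss_wrong = cross_entropy(logits, 0)       # true class is the lowest-scoring one
print(loss_correct, loss_wrong)
```

Working in log space avoids taking the logarithm of a probability that has underflowed to zero, which is one reason the two steps are usually combined.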

Loss function

Unlike the binary (log-likelihood) cross-entropy function, which only handles 2 classes, we have to use the multi-class cross-entropy function, which generalizes the former to $N$ classes.
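
For reference, the relation between the two losses can be written out as follows. The one-hot target notation $y_i$ (1 for the true class, 0 otherwise) is an assumption introduced here; the notes do not define it. For a single sample:

$$L = -\sum^{N-1}_{i=0} y_i \log p_i$$

With $N = 2$, writing $p_1 = p$, $p_0 = 1 - p$, $y_1 = y$ and $y_0 = 1 - y$, this reduces to the binary form:

$$L = -\left[\, y \log p + (1 - y) \log (1 - p) \,\right]$$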