Multi-class classification

Unlike binary classification, a multi-class classification model has to output N probability values $$p_i$$, where each $$p_i$$ is the probability that the sample belongs to class $$i$$.

The $$p_i$$ must be valid probabilities, so each one is non-negative and, for the 10 classes indexed 0 to 9, $$\sum^{9}_{i=0}p_i = 1$$
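A dense layer $$Y = WX + b$$ cannot guarantee this on its own, because its outputs are unbounded real values: they can be negative and need not sum to 1.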
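
To make the problem concrete, here is a minimal sketch of a single dense layer evaluated in plain Python. The weights, bias, and input are made-up values for a hypothetical 3-class, 2-feature case, not taken from any trained model, and `dense` is a helper name introduced only for this illustration.

```python
# Hypothetical dense layer with 2 inputs and 3 outputs (one score per class).
W = [[1.0, -2.0],
     [0.5, 0.5],
     [-1.0, 3.0]]
b = [-1.0, 1.0, -0.5]
x = [2.0, 1.0]

def dense(W, x, b):
    """Compute y = W x + b for a single input vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

y = dense(W, x, b)
print(y)       # raw scores; one of them is negative
print(sum(y))  # the scores do not sum to 1
```

With these values the scores are -1.0, 2.5 and 0.5, so they cannot be read as probabilities directly.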

This is achieved with a Softmax function, which maps the dense layer's outputs to positive values that sum to 1. The predicted class is the one with the highest value:

$$\mathrm{pred} = \arg\max_i \, p_i$$

since the values of the output vector can be interpreted as the probabilities that a sample belongs to each class $$i$$.
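
The Softmax and argmax steps can be sketched in a few lines of plain Python. This assumes the standard definition of Softmax, where each score is exponentiated and divided by the sum of all exponentials. The input scores are hypothetical, and subtracting the maximum is a common numerical-stability trick rather than something the text above prescribes.

```python
import math

def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow in exp.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from a dense layer, one per class.
scores = [-1.0, 2.5, 0.5]
p = softmax(scores)

# Predicted class: the index i with the largest p_i.
pred = max(range(len(p)), key=lambda i: p[i])

print(p)       # probabilities, one per class
print(sum(p))  # sums to 1 up to floating-point rounding
print(pred)    # index of the most probable class
```

Because Softmax preserves the ordering of the scores, the predicted class is the same one that had the largest raw score.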

Loss function

Unlike the binary (log-likelihood) cross-entropy loss, which handles only 2 classes, here we have to use the multi-class cross-entropy loss, which generalizes the former to N classes.
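
As a rough sketch of how the two losses relate, the snippet below assumes the usual per-sample form of multi-class cross-entropy, $$L = -\sum_i y_i \log p_i$$ with a one-hot label $$y$$, and compares it with the binary form on a two-class example. The function names and the probability values are illustrative assumptions, not part of any particular library.

```python
import math

def categorical_cross_entropy(y_true, p):
    """Cross-entropy for one sample: -sum_i y_i * log(p_i), with y_true one-hot."""
    # Terms with y_i = 0 contribute nothing, so they are skipped
    # (this also avoids evaluating log(0) for those classes).
    return -sum(y_i * math.log(p_i) for y_i, p_i in zip(y_true, p) if y_i > 0)

def binary_cross_entropy(y, p):
    """Two-class special case: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Illustrative values: three classes, true class is index 1.
loss = categorical_cross_entropy([0, 1, 0], [0.1, 0.7, 0.2])

# With two classes, the multi-class form matches the binary form.
two_class = categorical_cross_entropy([0, 1], [0.3, 0.7])
binary = binary_cross_entropy(1, 0.7)

print(loss, two_class, binary)
```

With a one-hot label, only the probability assigned to the true class enters the loss, so it is small when that probability is close to 1 and grows as it approaches 0.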