Activation functions

An activation function is a function A = s(Z) applied after a linear operation Z = WX + B to introduce non-linearity into the model.
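
A minimal sketch of this, using the notation above; the sigmoid choice for s and the numeric values are just illustrations:

```python
import numpy as np

def sigmoid(z):
    # One common choice of activation: s(Z) = 1 / (1 + e^(-Z))
    return 1.0 / (1.0 + np.exp(-z))

W = np.array([[0.5, -0.3], [0.8, 0.1]])  # weights (hypothetical values)
X = np.array([[1.0], [2.0]])             # input column vector
B = np.array([[0.1], [-0.2]])            # bias

Z = W @ X + B    # linear operation Z = WX + B
A = sigmoid(Z)   # activation A = s(Z) adds the non-linearity
```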

Examples

Two activation functions, sigmoid and tanh, are often used in binary classification models; they are reminiscent of logistic regression and are easy to differentiate.
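
A sketch of both functions and their derivatives in NumPy; the closed forms s'(z) = s(z)(1 - s(z)) and tanh'(z) = 1 - tanh(z)^2 are standard results:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # s'(z) = s(z) * (1 - s(z))

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2   # tanh'(z) = 1 - tanh(z)^2
```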

However, because their outputs are squashed into [0,1] (sigmoid) or [-1,1] (tanh), training deep networks with them can cause vanishing gradients.
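
A quick illustration of why: the sigmoid derivative never exceeds 0.25, so backpropagation through many stacked sigmoid layers multiplies many small factors together. The layer count, starting input, and unit weights here are arbitrary choices for the demo:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient backpropagated through 10 stacked sigmoid layers
# (weights fixed at 1 for simplicity): each layer multiplies
# the gradient by s'(z) <= 0.25.
grad = 1.0
z = 2.0
for _ in range(10):
    s = sigmoid(z)
    grad *= s * (1.0 - s)   # sigmoid derivative, at most 0.25
    z = s                   # this layer's output feeds the next

print(grad)  # ~1e-7 after only 10 layers: the gradient has nearly vanished
```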

To allow for multi-class classification, the softmax function generalises the sigmoid: softmax(Z)_i = e^(Z_i) / Σ_j e^(Z_j), which normalises a vector of logits into a probability distribution over the classes.
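
A minimal softmax sketch; subtracting the maximum logit before exponentiating is a standard trick to avoid overflow and does not change the result, and the logit values are arbitrary:

```python
import numpy as np

def softmax(z):
    # Shift by max(z) for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical class scores
probs = softmax(logits)
print(probs, probs.sum())           # class probabilities summing to 1
```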

Out-of-syllabus