Train and test split
Divides the dataset into a training set and a test set.
Typically, the split assigns 80% of the data to training and 20% to testing.
It is important that the train and test sets follow similar distributions.
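A minimal sketch of an 80/20 split, using only the standard library. The function name train_test_split, its parameters and the toy data are illustrative assumptions, not part of these notes. Shuffling before cutting is one simple way to help both sets follow similar distributions when the data is stored in some order.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of samples and split it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)       # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(100))            # toy dataset of 100 samples
train, test = train_test_split(data)
print(len(train), len(test))       # 80 20
```

For classification data with rare classes, a plain shuffle may still leave the class proportions uneven; a stratified split would be the usual refinement.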
When estimating an average metric, such as Mean Squared Error, each set needs enough samples for the empirical average to closely approximate its theoretical value (by the law of large numbers).
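The effect of sample size can be illustrated with a small simulation. This is a sketch under assumed conditions: the prediction errors are drawn from a normal distribution with mean 0 and standard deviation 2, so the theoretical Mean Squared Error is 2² = 4. The function name empirical_mse and all parameters are hypothetical.

```python
import random

def empirical_mse(n, sigma=2.0, seed=0):
    """Average of n squared errors drawn from N(0, sigma^2)."""
    rng = random.Random(seed)
    errors = [rng.gauss(0.0, sigma) for _ in range(n)]
    return sum(e * e for e in errors) / n

# The theoretical value is sigma**2 = 4.0; the estimate tends to
# settle near it as the number of samples grows.
for n in (10, 1_000, 100_000):
    print(n, empirical_mse(n))
```

With only a handful of samples the estimate can land far from 4, which is why a very small test set gives an unreliable metric.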
Train-test-validation split
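A common split: 80% in the training set, 10% in the validation set, 10% in the test set. The validation set is used to tune and compare models, so the test set stays untouched for the final evaluation.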
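A minimal sketch of the three-way split, again with the standard library only. The function name train_val_test_split, its parameters and the toy data are illustrative assumptions.

```python
import random

def train_val_test_split(samples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a copy of samples and split it into (train, val, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]   # everything left over
    return train, val, test

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```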
Comparing train and validation metrics helps diagnose the fit:
- Poor values on both train metrics and validation metrics -> Underfitting.
- Good values on train metrics, poor values on validation metrics -> Overfitting.
- Good values on both train and validation metrics -> Great!
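The three rules above can be sketched as a small helper. It assumes the metric is an error (lower is better) and that a single cutoff, good_threshold, separates "good" from "poor". Both the cutoff and the function name diagnose_fit are hypothetical, since what counts as a good value depends on the problem.

```python
def diagnose_fit(train_error, val_error, good_threshold):
    """Classify a model from its train and validation errors (lower is better)."""
    train_ok = train_error <= good_threshold
    val_ok = val_error <= good_threshold
    if train_ok and val_ok:
        return "good fit"
    if train_ok and not val_ok:
        return "overfitting"
    if not train_ok and not val_ok:
        return "underfitting"
    # Poor on train but good on validation is not covered by the rules above.
    return "unclear"

print(diagnose_fit(0.90, 0.95, good_threshold=0.2))  # underfitting
print(diagnose_fit(0.05, 0.80, good_threshold=0.2))  # overfitting
print(diagnose_fit(0.05, 0.07, good_threshold=0.2))  # good fit
```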