In this tutorial, we will cover the concept of cross-validation, its use in machine learning, and how k-fold cross-validation works.
Cross-validation:
Cross-validation is a resampling technique that trains and evaluates a model on different subsets of the data over several iterations. It is typically used when the objective is prediction and one wants to estimate how well a predictive model will perform on unseen, real-world data.
In machine learning, cross-validation is used to compare different models, such as Support Vector Machine and Random Forest, and their parameters. It is also a good approach for tuning the hyper-parameters of models to identify the best settings and the best model for a given type of problem. K-fold is a specific way of splitting the data for training and evaluating the model. In k-fold, k refers to the number of groups (folds) that the data gets split into. For example, if k equals 10, the entire dataset gets split into 10 groups; with 1000 data points and 10 groups, each group has 100 data points. Note that data scaling and normalization need to be done after the split into training and testing data, with the scaling parameters computed from the training portion only; otherwise information from the testing data leaks into training (data leakage).
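The point about scaling after the split can be sketched with a scikit-learn pipeline. This is a minimal illustration, not a prescribed setup: the synthetic data from `make_classification`, the choice of `SVC`, and the variable names are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption): 1000 points, so k=10 gives
# 100 points per fold, as in the example above.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# The scaler sits inside the pipeline, so in every iteration it is fitted
# on the 9 training folds only and then applied to the held-out fold.
model = make_pipeline(StandardScaler(), SVC())

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(len(scores))    # one accuracy score per fold
print(scores.mean())  # average accuracy across the folds
```

Scaling the whole dataset first and cross-validating afterwards would let statistics from the test folds influence training, which is the leakage the paragraph warns about.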
The figure shows the k-fold process on the whole dataset and how the training and testing portions are selected for k-fold cross-validation. Here k is 5, so the entire dataset is split into 5 groups. In the 1st iteration, the first group is used as testing data and the remaining 4 groups are used for training. Similarly, in the 2nd iteration, the 2nd group (fold) is used for testing and the other 4 groups are used for training. Notice that the group selected for testing corresponds to the iteration number: the first group is the testing data in the 1st iteration, the second group in the 2nd iteration, and so on. This is how k-fold cross-validation works.
Splitting of Data via K-fold
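The rotation of the test fold described above can be sketched in plain Python. The helper `k_fold_indices` is a hypothetical function written for this illustration, not part of any library; it assumes unshuffled data and contiguous folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; fold i is the test set in iteration i."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 10 data points and k=5: each fold holds 2 points.
splits = list(k_fold_indices(n_samples=10, k=5))
for iteration, (train, test) in enumerate(splits, start=1):
    print(f"iteration {iteration}: test={test} train={train}")
```

In the first iteration the test indices are 0 and 1, in the second 2 and 3, and so on, so every data point is used for testing exactly once.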
- A suitable value of k can be chosen by evaluating the performance of a given model for different k values.
- For example, compare the accuracy obtained with different k values against the accuracy obtained with k=N (leave-one-out, where each model is trained on almost all the data). Accuracy may be low for small k values and saturate at some specific k value.
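A comparison across k values could look like the following sketch. The synthetic dataset, the Random Forest settings, and the particular k values are assumptions chosen to keep the example small; the k=N case is left out because it requires one model fit per data point.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Mean cross-validated accuracy for each candidate k.
mean_accuracy = {}
for k in (2, 3, 5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    mean_accuracy[k] = cross_val_score(model, X, y, cv=cv).mean()

for k, acc in mean_accuracy.items():
    print(f"k={k:2d}  mean accuracy={acc:.3f}")
```

Setting `n_splits` to the number of data points would give the k=N (leave-one-out) reference, at a much higher computational cost.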
Let's summarize the K-fold split:
- If the dataset size is 500 data points and k=10
- Data gets split into 10 folds, and each fold has (500/10 = 50) data points
- One fold gets used for test/validation and the remaining 9 (k-1) folds are used for training
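The numbers in this summary can be checked with a short sketch. The placeholder array `X` is an assumption; only its length matters here.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(500).reshape(-1, 1)  # 500 placeholder data points
kf = KFold(n_splits=10)

# (training size, testing size) for each of the 10 iterations.
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kf.split(X)]
print(fold_sizes[0])  # (450, 50)
```

Each iteration trains on 9 folds (450 points) and tests on the remaining fold (50 points).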
How to use K-folds for Cross-Validation?
- Scikit-learn provides an easy way to perform k-fold splits in Python:
from sklearn.model_selection import KFold
cv = KFold(n_splits=10, random_state=42, shuffle=True)
- Cross-validation can be performed either by
- Iterating over the training and testing splits explicitly with a for loop
- Or using sklearn.model_selection.cross_val_score, which computes the cross-validation scores
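Both options can be sketched side by side. The synthetic data and the choice of `LogisticRegression` are assumptions made for illustration; any scikit-learn estimator could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = KFold(n_splits=10, random_state=42, shuffle=True)

# Option 1: iterate over the splits explicitly.
loop_scores = []
for train_idx, test_idx in cv.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                      # train on 9 folds
    loop_scores.append(model.score(X[test_idx], y[test_idx]))  # test on 1 fold

# Option 2: let cross_val_score run the same procedure.
helper_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print(sum(loop_scores) / len(loop_scores))
print(helper_scores.mean())
```

The explicit loop is useful when extra per-fold work is needed (for example, saving predictions), while `cross_val_score` is shorter when only the scores matter.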
Summary:
- The purpose of K-fold cross-validation is:
- To check the model performance
- Not to develop/train a final model
- Use k=10 (or even higher) for small data sets and k=5 (or 3) for large data sets
- Example: You want to pick the best model among SVM, Random Forest (RF), Decision Tree (DT), and Neural Network
- Perform k-fold cross-validation and identify the one that performs the best
- Train that specific model with the corresponding best-performing hyperparameters on all data for future use.
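The model-selection workflow from this summary might be sketched as follows. The synthetic data, the default hyperparameters, and the dictionary of candidates are assumptions; the neural network is left out to keep the example fast, and hyperparameter tuning is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "SVM": SVC(),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Step 1: use k-fold cross-validation only to compare the candidates.
mean_scores = {
    name: cross_val_score(estimator, X, y, cv=cv).mean()
    for name, estimator in candidates.items()
}
best_name = max(mean_scores, key=mean_scores.get)

# Step 2: train the winning model on all the data for future use.
final_model = candidates[best_name].fit(X, y)
print(best_name, round(mean_scores[best_name], 3))
```

`cross_val_score` works on clones of each estimator, so the models fitted during cross-validation are discarded and the final fit on all the data is a separate, deliberate step.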