Saturday, 18 February 2023

What is K-fold Cross-Validation? How Cross-Validation Works on K-fold Data

This tutorial covers the concept of cross-validation, how it is used in machine learning, and how k-fold cross-validation works.

 

Cross-validation: 

Cross-validation is a resampling technique that trains and evaluates a model on different subsets of the data over several iterations. It is typically used when the objective is prediction and you want to estimate how well a predictive model will perform on unseen, real-world data.

In machine learning, cross-validation is used to compare different models, such as Support Vector Machine and Random Forest, and their parameter settings. It is also a good approach for tuning hyper-parameters, because it helps identify the best settings and the best model for a given type of problem.

K-fold is a specific way of splitting the data for training and evaluating the model. In k-fold, k is the number of groups (folds) that the data is split into. For example, with k equal to 10 the entire dataset is split into 10 groups; with 1000 data points, each group then contains 100 data points.

One caveat: data scaling and normalization must be done after the split into training and testing data. In other words, fit the scaler on the training portion only and then apply it to the test portion. Otherwise statistics of the test data influence training, and you may run into data leakage issues.
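One common way to keep scaling inside each split is to wrap the scaler and the model in a scikit-learn pipeline, so the scaler is re-fitted on the training folds in every iteration. The sketch below is only an illustration: the synthetic dataset from make_classification and the choice of SVC are assumptions, not part of the original tutorial.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data, used here only for illustration (an assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# The scaler is part of the pipeline, so in every iteration it is fitted
# on the training folds only and then applied to the held-out fold.
model = make_pipeline(StandardScaler(), SVC())

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("scores per fold:", len(scores))
print("mean accuracy:", scores.mean())
```

Scaling the whole dataset before calling cross_val_score would instead let the test folds contribute to the mean and standard deviation used for training, which is the leakage described above.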


The figure shows the k-fold process on the whole dataset, that is, how the training and testing portions are selected in k-fold cross-validation. Here k is 5, so the entire dataset is split into 5 groups. In the 1st iteration, the first group is used as testing data and the remaining 4 groups are used for training. Similarly, in the 2nd iteration the 2nd group (fold) is used for testing and the other 4 groups are used for training. Note that the group selected for testing corresponds to the iteration number: the first group in the 1st iteration, the second group in the 2nd iteration, and so on, until every group has served as the test set exactly once. This is how k-fold cross-validation works.

Figure: Splitting of data via K-fold
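The rotation of the test fold can also be seen directly in code. This is a minimal sketch on ten toy samples (the array is an assumption chosen to keep the output readable), using KFold without shuffling so that the folds are contiguous blocks:

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)      # ten toy samples with indices 0..9
cv = KFold(n_splits=5)    # no shuffling, so folds are contiguous blocks

test_folds = []
for iteration, (train_idx, test_idx) in enumerate(cv.split(data), start=1):
    # split() yields index arrays, not the data itself
    test_folds.append(test_idx.tolist())
    print(f"iteration {iteration}: test={test_idx.tolist()} train={train_idx.tolist()}")
```

In the 1st iteration the test indices are the first block, in the 2nd iteration the second block, and so on, matching the description of the figure.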


What is a good K value?
There is no strict rule for choosing the k value in k-fold cross-validation, but k = 10 is the value most often used in published papers and code. Whatever value you choose, make sure that it results in training and test sets that are large enough to be representative.
Another way to find a suitable k value is sensitivity analysis:
  • Evaluate the performance of a given model for different k values.
  • For example, compare the accuracy obtained with different k values against the accuracy of a model trained on almost all the data (k=N, i.e. leave-one-out cross-validation). The estimated accuracy may be lower for small k values and saturate at some specific k value.
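A sensitivity analysis of this kind might look like the sketch below. The synthetic dataset, the LogisticRegression model, and the list of k values are all assumptions picked for illustration; the k=N reference is computed with scikit-learn's LeaveOneOut splitter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Small synthetic dataset (an assumption) so that leave-one-out stays cheap.
X, y = make_classification(n_samples=200, n_features=10, random_state=42)
model = LogisticRegression(max_iter=1000)

# Reference: k = N, i.e. each model is trained on all data points but one.
loo_score = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy").mean()
print(f"k=N   mean accuracy={loo_score:.3f}")

# Mean accuracy for a range of smaller k values.
mean_scores = {}
for k in (2, 3, 5, 10, 20):
    cv = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    mean_scores[k] = scores.mean()
    print(f"k={k:<3d} mean accuracy={scores.mean():.3f} std={scores.std():.3f}")
```

A reasonable choice is the smallest k whose mean accuracy is already close to the k=N reference, since larger k values mean more model fits.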
 

Let's summarize the K-fold split:

  • If the dataset size is 500 data points and k=10
    • The data gets split into 10 folds, and each fold has 500/10 = 50 data points
    • In each iteration, one fold is used for test/validation and the remaining 9 (k-1) folds, i.e. 450 data points, are used for training
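The arithmetic above can be checked quickly with KFold itself. The array of zeros below is just a placeholder for 500 data points (an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((500, 1))  # 500 placeholder data points
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Record (training size, test size) for every iteration.
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X)]
print(sizes[0])
print(len(sizes))
```

Every iteration uses 450 points for training and 50 for testing, and there are 10 iterations in total.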


How to use K-folds for Cross-Validation?

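  • Scikit-learn provides an easy way to perform K-fold splits in Python:
                from sklearn.model_selection import KFold
                cv = KFold(n_splits=10, random_state=42, shuffle=True)
  • Cross-validation can then be performed either by
    • Iterating over the training and testing splits explicitly with a for loop (here X and Y are assumed to be NumPy arrays and model any scikit-learn estimator):
                            for train_idx, test_idx in cv.split(X):
                                # cv.split yields index arrays for each fold
                                model.fit(X[train_idx], Y[train_idx])            # train on the k-1 training folds
                                print(model.score(X[test_idx], Y[test_idx]))    # evaluate on the test fold
    • Or using sklearn.model_selection.cross_val_score, which runs the same loop and returns one score per fold:
                            from sklearn.model_selection import cross_val_score
                            scores = cross_val_score(model, X, Y, cv=cv, scoring='accuracy', n_jobs=-1)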

Figure: Cross-validation on K-folds

Repeat the above training and evaluation process for every candidate model and its hyperparameter settings to find the best one.
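As a rough sketch of that repetition, the same KFold object can be reused to score several candidate models on identical splits. The synthetic data and the particular models and settings below are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only for illustration (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# One splitter shared by all models, so every model sees the same folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

# Mean cross-validated accuracy per model.
results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    for name, model in candidates.items()
}

best_name = max(results, key=results.get)
for name, score in results.items():
    print(f"{name}: {score:.3f}")
print("best:", best_name)
```

Using one shared splitter with a fixed random_state keeps the comparison fair, because differences in score then come from the models and not from different splits.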


Summary:

  • The purpose of K-fold cross-validation is:
    • To check the model performance 
    • Not to develop/train a final model
  • Use k=10 (or even higher) for small data sets and k=5 (or 3) for large data sets
  • Example: You want to pick the best model among SVM, Random Forest (RF), Decision Tree (DT), and Neural Network
    • Perform k-fold cross-validation and identify the one that performs the best
    • Train that specific model with the corresponding best-performing hyperparameters on all data for future use.
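For a single model family, the last two steps (finding the best hyperparameters by cross-validation, then retraining on all data) can be sketched with scikit-learn's GridSearchCV, which refits the best setting on the full dataset when refit=True. The dataset and the parameter grid below are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Synthetic data, used here only for illustration (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Hypothetical hyperparameter grid for the SVM.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each combination is scored with k-fold cross-validation; with refit=True
# the best combination is then trained once more on all of X and y.
search = GridSearchCV(SVC(), param_grid, cv=cv, scoring="accuracy", refit=True)
search.fit(X, y)

final_model = search.best_estimator_  # trained on all data, ready for future use
print("best parameters:", search.best_params_)
print("mean CV accuracy of best setting:", search.best_score_)
```

The cross-validation scores are only used to choose the setting; the model kept for future predictions is the one refitted on the complete dataset.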



