Beginner Tutorial: Introduction to KFold Cross-Validation in Machine Learning.

Photo by Franki Chamaki on Unsplash.

Target Audience

This tutorial is written for beginners in machine learning who are looking for better ways to validate their models. You need to be familiar with Python and libraries such as Pandas and Scikit-Learn to follow this tutorial.

Tutorial Outline

The tutorial covers the following:

  1. What is Cross-Validation?
  2. Steps involved in Cross-Validation
  3. Types of Cross-Validation
  4. Implementing KFold cross-validation with Python


What is Cross-Validation?

Cross-validation is a technique in applied machine learning where we train and test ML algorithms on different subsets of our data to see how well they will perform on unseen data. It is especially useful when the available data is limited, because every observation ends up being used for both training and testing.


Steps involved in Cross-Validation

  1. Split the data into training and testing sets and evaluate the algorithm's performance.
  2. Regroup the data into new training and testing sets and re-evaluate the algorithm's performance.
  3. Take the average of the results obtained from each evaluation.
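The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the target values are made up, the helper k_fold_indices is hypothetical, and the "model" simply predicts the mean of the training values.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


# Toy target values (hypothetical, not the tutorial's data set)
y = [52, 60, 61, 70, 75, 58, 66, 80, 72, 64]

fold_errors = []
for test_idx in k_fold_indices(len(y), k=5):
    # Steps 1 and 2: hold out one fold for testing, train on the rest
    train = [y[i] for i in range(len(y)) if i not in test_idx]
    test = [y[i] for i in test_idx]
    prediction = mean(train)  # the "model" predicts the training mean
    fold_errors.append(mean(abs(t - prediction) for t in test))

# Step 3: average the per-fold results
average_error = mean(fold_errors)
print(fold_errors, average_error)
```

Each observation is tested exactly once, and the final number summarises all five evaluations rather than a single one.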


Types of Cross-Validation

The following are the various types of cross-validation:

  1. Leave-one-out cross-validation (LOOCV): One observation is left out to test the ML algorithm, while all the remaining observations train it. This operation repeats until each observation has been used once to test the algorithm.

  2. Leave-P-out cross-validation: P observations are left out to test the model, while the remaining n minus P observations train the ML algorithm. The operation repeats until every possible group of P observations has been used to test the model.

  3. K-Fold cross-validation: The data is divided into k sections referred to as folds. The algorithm is trained on k minus one folds (k – 1) and tested on the left-out fold. A different fold is then selected to test the algorithm, while the remaining folds train it. The process continues until every fold has been used once to test the algorithm.

  4. Stratified K-Fold cross-validation: You may be wondering what the difference is between Stratified K-Fold and K-Fold cross-validation. In plain K-Fold cross-validation, the folds are not guaranteed to contain the same percentage of each class present in the data. Stratified K-Fold cross-validation solves this by ensuring that each fold contains approximately the same proportion of each class as the full data set.
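To make the differences between these types concrete, here is a minimal sketch using the splitter classes in sklearn.model_selection. The ten-row array and the imbalanced labels are made-up values chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, LeavePOut, StratifiedKFold

# Hypothetical toy data: 10 observations, 8 of class 0 and 2 of class 1
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

# Number of train/test rounds each strategy produces
n_loo = LeaveOneOut().get_n_splits(X)      # one round per observation
n_lpo = LeavePOut(p=2).get_n_splits(X)     # one round per pair of observations
n_kfold = KFold(n_splits=5).get_n_splits(X)
print(n_loo, n_lpo, n_kfold)

# Count the minority-class rows that land in each test fold
kfold_minority = [int(y[test].sum()) for _, test in KFold(n_splits=2).split(X)]
strat_minority = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=2).split(X, y)]
print(kfold_minority, strat_minority)
```

Without shuffling, plain KFold puts both minority rows in the same test fold, while StratifiedKFold places one in each. Note that stratification relies on class labels, so it applies to classification targets.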

K-Fold Cross-Validation in Python

About the data set

The data set contains the following information about the students:

  1. School.
  2. School setting.
  3. School type.
  4. School teaching method.
  5. School classroom type.
  6. Number of students in the classroom.
  7. ID number.
  8. Gender.
  9. If the student qualifies for free lunch or not.
  10. Pretest scores.
  11. Post-test scores.

This data set presents a regression task, where you predict the post-test scores based on the features listed above.

Download the data set from here

Loading your data

To load the data set, use the Pandas library and its DataFrame structure. Pandas is used to manipulate and analyse data. It helps you see your data at a glance and in detail.

# Import pandas
import pandas as pd

# Load dataset
test_scores = pd.read_csv('test_scores.csv')

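# Print the first five rows of the dataset
print(test_scores.head())

Check for missing values in the dataset using:

# True for any column that contains at least one missing value
test_scores.isnull().any()
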

school             False
school_setting     False
school_type        False
classroom          False
teaching_method    False
n_student          False
student_id         False
gender             False
lunch              False
pretest            False
posttest           False

No missing data.

Next, drop the student_id column because it won't be needed when predicting the post-test scores, and assign the resulting data back to the test_scores variable:

test_scores = test_scores.drop(columns = 'student_id')

Next, separate the data set into features and target:

X = test_scores.drop(columns = 'posttest')

y = test_scores['posttest']

The X variable holds the features used in predicting the post-test scores, while the y variable holds the post-test scores.

Select all features that hold categorical values, so they can be converted to numerical values:

cat_cols = [i for i in X.columns if X[i].dtype == 'object']

print(cat_cols)

['school', 'school_setting', 'school_type', 'classroom', 'teaching_method', 'gender', 'lunch']

Convert the categorical values to numerical values using the built-in pd.get_dummies() function:

X_values = pd.get_dummies(X, columns = cat_cols, drop_first = True)
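To see what pd.get_dummies() does with drop_first = True, here is a small sketch on a made-up two-column frame. The frame and its values are hypothetical and only mimic the shape of the tutorial's data.

```python
import pandas as pd

# Hypothetical toy frame, not the tutorial's data set
toy = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female'],
    'pretest': [60, 70, 80],
})

# Each category becomes an indicator column; drop_first=True drops the
# first category ('Female' here), since the remaining column implies it
toy_encoded = pd.get_dummies(toy, columns=['gender'], drop_first=True)
print(toy_encoded)
```

The numeric pretest column passes through unchanged, and gender is replaced by a single gender_Male indicator. Dropping one category per feature avoids redundant columns, which matters for models such as linear regression.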

Next, import the algorithms and libraries needed for the cross-validation procedure:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.metrics import mean_squared_error, r2_score

Before you work on the data set using KFold, see how well the models score with the train_test_split() method, so you can compare those scores with the KFold cross-validation scores.

Separate the data into training and testing sets:

X_train, X_test, y_train, y_test = train_test_split(X_values, y, test_size=0.25)

The training set holds 75% of the data and the testing set holds the remaining 25%.
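The split is random, so the rows in each set change from run to run unless you fix a seed. A minimal sketch on made-up arrays, where random_state=0 is an arbitrary choice that is not part of the tutorial's code:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows with 2 features each
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# random_state fixes the shuffle so the same rows are selected every run
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed reproduces the same test rows
_, _, _, y_te_again = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
print((y_te == y_te_again).all())
```

With 20 rows, 15 go to training and 5 to testing, matching the 75%/25% split described above.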

Create instances of the algorithms:

lin_model = LinearRegression()
dec_tree = DecisionTreeRegressor()
random_reg = RandomForestRegressor(n_estimators=100)

Use the .fit() method to fit these algorithms on the training data:

lin_model.fit(X_train, y_train)
dec_tree.fit(X_train, y_train)
random_reg.fit(X_train, y_train)

Then, test the algorithms by predicting the post-test scores based on the data contained in X_test:

lin_predicted = lin_model.predict(X_test)
dec_predicted = dec_tree.predict(X_test)
random_predicted = random_reg.predict(X_test)

See the performance of the algorithms by comparing the actual post-test scores y_test to the predicted ones using the r2 score metric:

# Score each model against its own predictions
lin_score_r2 = r2_score(y_test, lin_predicted)
dec_score_r2 = r2_score(y_test, dec_predicted)
random_score_r2 = r2_score(y_test, random_predicted)

# Print the results
print(f' Linear Regression: {lin_score_r2} \n Decision Tree Regressor: {dec_score_r2} \n Random Forest Regressor: {random_score_r2}')


 Linear Regression: 0.9551506264255508 
 Decision Tree Regressor: 0.914948254556947 
 Random Forest Regressor: 0.914948254556947

The train_test_split method gives r2 scores of 0.955 and 0.915 for Linear Regression and Decision Tree Regressor, respectively. The Random Forest figure in this recorded run (0.915) is identical to the Decision Tree figure because that run scored the decision tree's predictions for both models. Running the code above scores the random forest's own predictions, so expect a different value.
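If the r2 metric is new to you, it compares the model's squared errors with those of a baseline that always predicts the mean of the actual values. The helper r2_manual below is a hypothetical plain-Python sketch of that formula on made-up numbers, not scikit-learn's implementation.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted scores
actual = [3, 5, 7]
predicted = [2.5, 5, 7.5]

score = r2_manual(actual, predicted)
print(score)  # 0.9375
```

A score of 1.0 means perfect predictions, and 0.0 means the model does no better than always predicting the mean.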

Run the KFold validation method on the data and see the scores it will produce.

First, put the models and their names into a list so you can iterate over it using a for loop:

models = [
    ('Linear Regression', LinearRegression()),
    ('Decision Tree Regressor', DecisionTreeRegressor()),
    ('Random Forest Regressor', RandomForestRegressor(n_estimators=100)),
]

Next, implement KFold cross-validation where the number of splits (folds) is 10:

for name, model in models:
    # Shuffle the rows, then split them into 10 folds
    kfold = KFold(n_splits=10, shuffle=True)
    # Train and score the model once per fold
    cv_results = cross_val_score(model, X_values, y, cv=kfold, scoring='r2')
    # Report the average r2 score across the 10 folds
    print(f'{name}:  {cv_results.mean()}')


Linear Regression:  0.9578643445613857
Decision Tree Regressor:  0.9195822415600567
Random Forest Regressor:  0.9440996974796947

The KFold cross-validation method gives r2 scores of 0.958, 0.920, and 0.944 for Linear Regression, Decision Tree Regressor, and Random Forest Regressor, respectively.
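cross_val_score returns one score per fold, so you can look at the spread as well as the mean. A minimal sketch, assuming synthetic data from make_regression in place of the tutorial's data set and an arbitrary random_state for reproducibility:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption, not the tutorial's data set)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Fixing random_state makes the shuffled folds identical on every run
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_val_score(LinearRegression(), X_demo, y_demo, cv=kfold, scoring='r2')

# One r2 score per fold, followed by the mean and standard deviation
print(np.round(cv_results, 3))
print(f'mean={cv_results.mean():.3f}, std={cv_results.std():.3f}')
```

A small standard deviation suggests the model's score does not depend much on which rows were held out.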

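The KFold scores for Linear Regression and the Decision Tree Regressor are slightly higher than their train_test_split scores. The more important difference is not that the algorithms perform better, but that each KFold score is an average over 10 different train/test splits, which makes it a more reliable estimate than a score from a single split.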
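A score from a single train_test_split depends on which rows happen to land in the test set. The sketch below repeats the split with five different seeds to show that variation. It assumes synthetic data from make_regression rather than the tutorial's data set, and the seeds are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, not the tutorial's data set)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=25.0, random_state=0)

split_scores = []
for seed in range(5):
    # A different seed sends different rows to the test set
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    split_scores.append(r2_score(y_te, model.predict(X_te)))

print([round(s, 3) for s in split_scores])
print(f'spread: {max(split_scores) - min(split_scores):.3f}')
```

Averaging over folds, as KFold does, smooths out this split-to-split variation.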


Congratulations on finishing this tutorial!

You have successfully learned about cross-validation, its performance compared to the traditional splitting method, and how to apply it when training and testing an ML algorithm on real-world data.

Practice is needed to become better. Challenge yourself, take on more data sets and apply the concepts learned here.

I'm interested to see what you will build!

Feel free to reach out to me on Twitter (myrtleXY) if you have any questions.