Introduction to Hyperparameter Tuning in Machine Learning.

Strive for continuous improvement, instead of perfection. - Kim Collins.

Data scientists and machine learning engineers often have to decide which model and which parameter values will produce the best accuracy once the data has been preprocessed. Hyperparameter tuning is the process of finding the best parameter values for a model. In this article, you will learn several methods of hyperparameter tuning and how to implement them. A working understanding of Python and data science libraries like NumPy will be useful in following this article through.

Hyperparameter Tuning using the Cross-Validation Score Function

cross_val_score is a function in sklearn.model_selection that lets you tune your model manually. It takes in the algorithm with its parameters, the independent features (X), and the dependent feature (y) of your data. It then performs cross-validation with the given input and returns n accuracy scores, where n is the number of splits in cross-validation. See the example below.

Code Example

Import the Pandas and NumPy libraries to begin with.

import pandas as pd
import numpy as np

The BMI index dataset from Kaggle will be used in this article. You can download it here. The goal of this dataset is to predict a person's Body Mass Index (BMI) based on gender, height, and weight. In the real world, these features may be insufficient, but for this tutorial, they will suffice.

First, read the dataset and print out the first five rows to see how the data is structured.

bmi_data = pd.read_csv('bmi.csv')
bmi_data.head()



Next, replace the Male and Female values in the Gender column with 0 and 1, respectively, as most machine learning algorithms only work on numerical data. Then display the first five rows again using bmi_data.head().

bmi_data = bmi_data.replace(to_replace=['Male', 'Female'], value=[0, 1])
bmi_data.head()
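
As an alternative sketch, the same encoding can be limited to the Gender column with Series.map, so that no other column is touched by accident. The small DataFrame below is made-up sample data, not the Kaggle file, and the name sample is only for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration; not taken from the Kaggle dataset.
sample = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Height": [174, 189, 185],
    "Weight": [96, 87, 110],
})

# Map the two categories to integers, changing only the Gender column.
sample["Gender"] = sample["Gender"].map({"Male": 0, "Female": 1})
print(sample)
```

Any category missing from the mapping dictionary becomes NaN, which makes unexpected values easy to spot.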



Next, separate the independent and dependent features before training a model. The dependent feature (y) is the one you want to predict, and its value depends on the other variables, while the independent features (X) are those used to predict it.

Index is the dependent variable in the dataset, while Gender, Height, and Weight are the independent variables. You can perform this separation using the Pandas .iloc[] indexer.

X = bmi_data.iloc[:, 0:3]  # independent variables
y = bmi_data.iloc[:, 3]    # dependent variable
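
Selecting by position works, but it silently breaks if the column order changes. A minimal sketch of the same split using column names is shown here, assuming the columns are called Gender, Height, Weight, and Index as described above. The rows are made-up sample data, and sample, X_named, and y_named are illustrative names.

```python
import pandas as pd

# Made-up sample rows; only the column names follow the article's dataset.
sample = pd.DataFrame({
    "Gender": [0, 1, 1],
    "Height": [174, 189, 185],
    "Weight": [96, 87, 110],
    "Index": [4, 2, 4],
})

feature_columns = ["Gender", "Height", "Weight"]
X_named = sample[feature_columns]  # independent features
y_named = sample["Index"]          # dependent feature

print(X_named.shape, y_named.shape)
```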

Now you are ready to tune your model.

Import the cross_val_score function from sklearn.model_selection.

from sklearn.model_selection import cross_val_score

Next, import the Decision Tree Classifier algorithm for training, testing, and building the model.

from sklearn.tree import DecisionTreeClassifier

Next, input the algorithm, its parameters, the independent and dependent features, and the number of splits in cross-validation(cv).

The min_samples_split and max_depth parameters of the Decision Tree Classifier will be varied to find which value(s) give better accuracy.

cross_val = cross_val_score(
    DecisionTreeClassifier(min_samples_split=3, max_depth=3),
    X, y, cv=5)

Use NumPy's np.mean() function to display the average of the result.

print(cross_val, np.mean(cross_val))



The mean accuracy was 0.654 for a min_samples_split of 3 and max_depth of 3.

Run this same code but with a min_samples_split of 5 and a max_depth of 7, to check if there will be any improvement.

cross_val_ = cross_val_score(DecisionTreeClassifier(min_samples_split = 5, max_depth = 7), X, y, cv = 5)
print(cross_val_, np.mean(cross_val_))



The accuracy went up from an average of 0.654 to an average score of 0.8. Quite impressive.

You could keep on choosing different values for different parameters to see which produces the best fit, but that would take a lot of time. Using for loops makes things more complex, because you would have to write another nested for statement for every additional parameter you choose to add. Luckily, sklearn.model_selection comes with a class called GridSearchCV, which allows you to select as many parameters and values as you want, helping you tune your model with a few lines of code.
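
To make the cost of the manual approach concrete, here is a rough sketch of the nested-loop version for two parameters. It uses synthetic data from make_classification as a stand-in for the BMI dataset so that it runs on its own. The names X_demo, y_demo, and manual_results are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without bmi.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

manual_results = {}
for depth in [1, 3, 5]:          # one loop per parameter...
    for split in [2, 3, 4]:      # ...so every new parameter adds a nesting level
        model = DecisionTreeClassifier(
            max_depth=depth, min_samples_split=split, random_state=0)
        scores = cross_val_score(model, X_demo, y_demo, cv=5)
        manual_results[(depth, split)] = np.mean(scores)

# Pick the (max_depth, min_samples_split) pair with the highest mean score.
best_combo = max(manual_results, key=manual_results.get)
print(best_combo, manual_results[best_combo])
```

A third parameter would need a third loop and a three-part dictionary key, which is the bookkeeping that a grid search takes care of for you.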

Hyperparameter Tuning using GridSearchCV

To get started, import GridSearchCV from sklearn.model_selection.

from sklearn.model_selection import GridSearchCV

GridSearchCV takes in the model you want to use on your data, followed by a dictionary containing your parameters and their values, then the number of cross-validation splits you want to perform.

Next, create an instance of the GridSearchCV object and input the values stated above to it.

clf_decision_tree = GridSearchCV(DecisionTreeClassifier(), {
    'max_depth' : [1,3,5,7,9],
    'min_samples_split' : [2,3,4],
}, cv=5, return_train_score=False)

The max_depth parameter takes in 5 values, while the min_samples_split parameter takes in 3, so the grid holds 5 × 3 = 15 combinations. Because GridSearchCV evaluates all of these combinations in a single call, it is more convenient than calling cross_val_score once per combination.
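
The size of a grid can be checked before running a search. The sketch below uses ParameterGrid from sklearn.model_selection to list every combination for the dictionary above. The names param_grid, combinations, and n_fits are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "max_depth": [1, 3, 5, 7, 9],
    "min_samples_split": [2, 3, 4],
}

# Every pairing of one max_depth value with one min_samples_split value.
combinations = list(ParameterGrid(param_grid))

# Each combination is trained once per cross-validation split (cv=5 here).
n_fits = len(combinations) * 5
print(len(combinations), n_fits)
```

Counting the fits this way gives a quick estimate of how long a search will take before you commit to it.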

Next, fit the data using the .fit() method and display the result using the .cv_results_ attribute.

clf_decision_tree.fit(X, y)
clf_decision_tree.cv_results_


Pass the result into a DataFrame for better visualization using the Pandas DataFrame() constructor.

grid_results = pd.DataFrame(clf_decision_tree.cv_results_)



The .cv_results_ attribute returns values such as the mean test score, the parameters, the individual accuracy of each split performed, and so on. You can view all the columns it returns using the .columns attribute. Not all of these values are needed here, just the parameters and their mean test scores. To get these, select the params and mean_test_score columns.

grid_results[['params', 'mean_test_score']]



The result above shows that the best accuracy is achieved when max_depth is set to 9 and min_samples_split is set to 2.
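
Instead of scanning the table by eye, the best combination can also be read from the fitted search object through its best_params_ and best_score_ attributes. The following is a minimal sketch on synthetic data, not the BMI dataset. The names X_demo, y_demo, and search are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without bmi.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [1, 3, 5], "min_samples_split": [2, 3, 4]},
    cv=5,
)
search.fit(X_demo, y_demo)

# The parameter combination with the highest mean test score, and that score.
print(search.best_params_)
print(search.best_score_)
```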

GridSearchCV is a great tool when you are confident that the best accuracy can be reached with a small set of candidate values. Its limitation is that it evaluates every combination, so it becomes computationally expensive over a wide range of values, for example when min_samples_split takes every number from 2 to 100. To solve this problem, RandomizedSearchCV is used whenever hyperparameter tuning is performed over a wide range of values.

Hyperparameter Tuning using RandomizedSearchCV

RandomizedSearchCV works like GridSearchCV, except that instead of trying every combination it tunes the model on n randomly chosen combinations of values, where n is the number of iterations you choose.

Get started by importing RandomizedSearchCV from sklearn.model_selection.

from sklearn.model_selection import RandomizedSearchCV

Next, create an instance of RandomizedSearchCV and pass in the parameters similar to that of GridSearchCV. The number of iterations parameter n_iter will be set to two. You can use any value of your choice.

random_cv = RandomizedSearchCV(DecisionTreeClassifier(), {
    'max_depth' : [1,3,5],
    'min_samples_split' : [2,3,4],
}, cv=5, n_iter=2)

Next, fit your data to random_cv and display the results in a DataFrame for easy visualization.

random_cv.fit(X, y)
random_cv_df = pd.DataFrame(random_cv.cv_results_)



Index the params and mean_test_score values to have a clearer picture of the data.

random_cv_df[['params', 'mean_test_score']]



In this run, max_depth values of 2 and 9 and min_samples_split values of 3 and … gave accuracies of 0.58 and 0.838.

Rerun the code.



Repeating the search sampled max_depth values of 3 and 9 and min_samples_split values of 4 and 2, giving accuracies of 0.58 and 0.834. The results do not differ much, but you get the idea.
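
Because the combinations are drawn at random, each rerun can test different values. The sketch below shows two related options, using synthetic data rather than the BMI dataset: passing random_state so that a search is repeatable, and passing scipy.stats.randint distributions so that a wide range such as 2 to 100 can be sampled without listing every value. The helper run_search and the names X_demo, y_demo, first, and second are illustrative.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without bmi.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

def run_search(seed):
    """Run a small randomized search and return the sampled combinations."""
    search = RandomizedSearchCV(
        DecisionTreeClassifier(random_state=0),
        {
            "max_depth": randint(1, 10),           # integers 1 to 9
            "min_samples_split": randint(2, 101),  # integers 2 to 100
        },
        n_iter=5,
        cv=5,
        random_state=seed,  # fixes which combinations are drawn
    )
    search.fit(X_demo, y_demo)
    return search.cv_results_["params"]

first = run_search(seed=42)
second = run_search(seed=42)
print(first == second)
```

With the same seed, both runs should sample the same five combinations. Leaving random_state unset gives the varying behaviour described above.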

Choosing the best model with the right Parameters

Finding the appropriate model is as important as finding the best parameters. Both GridSearchCV and RandomizedSearchCV can be used to find the best model with its best parameters, but GridSearchCV will be used in this article.

DecisionTreeClassifier, RandomForestClassifier, and AdaBoostClassifier are the three algorithms to be tested on this dataset, along with two of their parameters.

Import the algorithm libraries to get started. You can exclude importing DecisionTreeClassifier as it has been done above.

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

Next, create a list containing instances of the algorithms and the parameters to be tested embedded in a dictionary.

best_model = [
    {
        'model' : DecisionTreeClassifier(),
        'parameters': {
            # a float min_samples_split is read as a fraction of the samples
            'min_samples_split' : [1.0, 3, 5],
            'max_depth' : [2, 4, 6]
        }
    },
    {
        'model' : AdaBoostClassifier(),
        'parameters': {
            # 'SAMME.R' may not be accepted by newer scikit-learn releases
            'algorithm' : ['SAMME', 'SAMME.R'],
            'learning_rate' : [1, 2, 3]
        }
    },
    {
        'model' : RandomForestClassifier(),
        'parameters': {
            'min_samples_split' : [1.0, 3, 5],
            'class_weight' : ['balanced', 'balanced_subsample']
        }
    }
]

To learn more about the parameters of each algorithm, see the scikit-learn documentation.

Next, go through the list and input the model and its parameters to GridSearchCV for tuning of the model. Also, append the best parameters and scores to a list.

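best_model_params = []
best_model_scores = []
for model in best_model:
    clf_best = GridSearchCV(model['model'], model['parameters'], cv=5, return_train_score=False)
    clf_best.fit(X, y)
    best_model_params.append(clf_best.best_params_)
    best_model_scores.append(clf_best.best_score_)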

Next, print the best_model_params and the best_model_scores lists.

print(best_model_params)
print(best_model_scores)


RandomForestClassifier returned the best accuracy with class_weight set to balanced and min_samples_split to 5.
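
Rather than comparing the printed lists by eye, the winning model can be picked programmatically. The following is a rough sketch of that idea on synthetic data, with a reduced set of models and parameters so that it runs quickly. The names candidates, summary, best_name, best_params, and best_score are illustrative and are not part of the code above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data so the sketch runs without bmi.csv.
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

candidates = [
    {"model": DecisionTreeClassifier(random_state=0),
     "parameters": {"max_depth": [2, 4, 6]}},
    {"model": RandomForestClassifier(n_estimators=20, random_state=0),
     "parameters": {"min_samples_split": [3, 5]}},
]

summary = []
for candidate in candidates:
    search = GridSearchCV(candidate["model"], candidate["parameters"], cv=5)
    search.fit(X_demo, y_demo)
    # Keep the model's class name next to its best parameters and score.
    summary.append((type(candidate["model"]).__name__,
                    search.best_params_, search.best_score_))

# The entry with the highest cross-validated score wins.
best_name, best_params, best_score = max(summary, key=lambda row: row[2])
print(best_name, best_params, best_score)
```

Storing the model name together with its score avoids having to match positions across two separate lists.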


Hyperparameter tuning is needed when building a model because it helps you find the parameter values that produce the best accuracy. It also gives you an edge when participating in competitions on platforms like Kaggle and Zindi, placing you higher on the leaderboard.

Moving forward, you can read on other preprocessing methods such as

I hope you learned the basics of Hyperparameter Tuning and how to implement it on data. Tuning your parameters will enable you to get the best out of your model.

If you have any questions, please don't hesitate to reach me on Twitter: @ee_ephraim.