Understanding Overfitting and Underfitting in Machine Learning.

Understanding Overfitting and Underfitting in Machine Learning.

It's all about finding the right fit when building your model.

Introduction

When building models, it is common practice to evaluate performance of the model. Model accuracy is a metric used for this. This metric checks how well an algorithm performed over a given data, and from the accuracy score of the training and test data, we can determine if our model is high bias or low bias, high variance or low variance, underfitting, or overfitting.

You need to be familiar with building machine learning models to understand this article well.

Bias

Bias refers to the model accuracy when tested with the training set. When building a model generally, we split our data into training test and testing sets. The algorithm, let's say Linear Regression, learns from the training set, while the testing set checks how well the algorithm learned the data.

A model has a high bias when the accuracy of the training set is low and a low bias when the accuracy of the training set is high. Sounds counterintuitive? Let’s simplify it with this scenario.

Two developers, Mark and Bill, were invited for a coding test. They received an email the night before containing practice questions to help prepare for the test. The next day they show up and are presented with the exact questions from the night before. Mark scored 60% on the test, while Bill scored 96%. We can agree that Mark had a higher error margin than Bill. Therefore, we can conclude that Mark has a higher bias than Bill because he scored lower. Here’s how it relates to a dataset.

After being trained on the training set, which usually makes up 70% of the data, we use the exact data to test the model, similar to how Mark and Bill, being tested with the same questions from the previous night. On testing, if the model accuracy is 60%, the accuracy is low and, the error margin, 40%, is high.

Bias is directly related to the error margin from testing our data with the training set. A high error margin of 40% means our model is high bias. A training accuracy of 96%, with an error margin of 4%, has a low Bias because the error margin is low.

Variance

To explain the concept of variance, let’s refer to our previous example. Mark and Bill scored 60% and 96% on the questions from the night before. The recruiter then decided to administer a different set of questions, similar to the previous questions but not the same. On grading, Mark had a score of 56%, while Bill had a score of 66%. We see here that the difference between Mark’s previous score (60%) and the current one (56%) is not much, while the difference between Bill’s former score (96%) and the current one (56%) is high.

To know the variance of a model, we need to evaluate it using the test set, which comprises the remaining 30% of the dataset. If the accuracy of the training set is 60%, and the accuracy with the test set is 56%, we can say that the model has low variance. Why? Because the numerical difference between the accuracy score of the training set and the test set is minimum (4%). So, there you have it, variance is the difference between the accuracy of the training set and the test set. If the difference is relatively high, the model has a high variance. Else, it has low variance. I hope you got that?

Once you understand bias and variance, the concept of underfitting and overfitting will be a walk in the park.

Underfitting

Using the scenario above, Mark is an example of an underfitting model. Underfitting occurs when the algorithm loosely learns the dataset. Meaning, when given a training set to learn from, it does not do a good job when learning the patterns in the dataset, therefore, given its poor performance when evaluated against both the training set and the test set. The diagram below gives a graphical explanation of underfitting.

1.png

The line above failed to classify the two different sets of data well, which results in the poor accuracy of the model. Underfitting models are highly biased due to the high error margin from the accuracy score of the training set. They also have low variance because the model performs poorly on the training and test set. Hence, the difference in training and test set is little.

Overfitting.

Overfitting occurs when the model learns both the details and the noise in the data. Overfitting is similar to the case of Bill. Bill performed well when presented with questions from the previous night but poorly when given new but similar questions. The diagram below gives a graphical explanation of overfitting.

UNDERFITTING..png

The model here fits the data too well, which results in poor performance when tested with data it has not see. Overfitting models have a low bias because they learned the training data too well, resulting in high training accuracy, low error margin, and a low bias. These models have high variance because they perform so well on the training data and poorly on the test data, resulting in a high difference in their accuracies.

Conclusion

In summary, you do not want your model to be under-fitted or overfit. Your goal when building a model should be to develop a generalized model that performs well both on data it has seen (training data) and data it has not seen (test data) to give accurate predictions.

Moving forward, you can learn about the solutions to underfitting and overfitting, such as

I hope you learned something about bias, variance, overfitting and, underfitting. Thank you for taking the time to read through. Make sure to share this with anyone who will need it, and till we meet again, stay safe.

You can find me on Twitter.