Introduction
In this post we will dig into four very common metrics for evaluating machine learning models and their performance. The metrics we will go through are Accuracy, Precision, Recall and F1 Score. All of the steps needed to calculate these metrics are simple and easy to understand. However, we are going to introduce a little bit of statistical lingo that can seem confusing at first, but take it slow and we will get through this like a breeze!
The Task
You just landed a great job at Aarons Animal Classifiers Inc. Your task is to evaluate four different animal image classification models. The metrics you have been asked to evaluate the models with are Accuracy, Precision, Recall and F1 score. To test the models you have been given six images that AI models sometimes get confused by (inspiration comes from a meme we saw somewhere). The models we are looking at are so-called binary classification models. This simply means that each model can output one of two answers, positive or negative (in our case, animal or not animal).
Introducing the Animals and Not Animals
Below we see the six images that we are going to work with today and their corresponding correct classifications.
True/False, Positive/Negative.
Before we start looking at the metrics we first need to establish some basics. A benefit of the six images (our dataset) is that we know the associated category, animal or not animal, for each image. If we put these into a table with predicted categories as columns and actual categories as rows, we have made ourselves a confusion matrix.
The confusion matrix, together with the terms in its cells (true positives/negatives and false positives/negatives), is super important for understanding the metrics later. Luckily they are quite simple to understand once we populate them with examples.
Above we see what the confusion matrix looks like when we populate it with the information we currently have. Since we already know which images are of animals and which are not, we only get true positives and true negatives. If we had instead predicted that the blueberry muffin was an animal, that would have been a false positive. If we had predicted that the shepherd dog was a mop, that would have been a false negative.
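If you prefer code to tables, here is a minimal sketch (plain Python, no libraries assumed) of how the four cells can be tallied from a list of actual labels and a list of predictions:

```python
# Minimal sketch: tally the four confusion-matrix cells for a binary
# animal / not-animal classifier. True stands for "animal".
def confusion_cells(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)          # true positives
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)  # true negatives
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)      # false positives
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)      # false negatives
    return tp, tn, fp, fn

# Using the known labels as the "predictions" we only get true
# positives and true negatives, just like in the table above.
actual = [True, True, True, False, False, False]  # 3 animals, 3 not animals
print(confusion_cells(actual, actual))            # (3, 3, 0, 0)
```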
Okay, now that we have covered the concepts of the confusion matrix, let us head into the metrics and how they work.
Accuracy
Accuracy is the metric most commonly used in everyday talk. Accuracy answers the question “Out of all the predictions we made, how many were correct?”
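In terms of the confusion-matrix cells, accuracy is the number of correct predictions (true positives plus true negatives) divided by the total number of predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)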
As we will see later, accuracy is a blunt measure and can sometimes be misleading.
Precision
Precision is a metric that gives you the proportion of true positives to the total number of positive predictions the model makes. It answers the question “Out of all the positive predictions we made, how many were actually positive?”
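In confusion-matrix terms, that question becomes:

Precision = TP / (TP + FP)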
Recall
Recall focuses on how good the model is at finding all the positives. Recall is also called the true positive rate and answers the question “Out of all the data points that are actually positive, how many did we correctly predict as positive?”
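In confusion-matrix terms:

Recall = TP / (TP + FN)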
As you can see from the definitions, precision and recall are tightly connected.
F1 Score
F1 Score is a measure that combines recall and precision. As we have seen, there is a trade-off between precision and recall; F1 can therefore be used to measure how effectively our models make that trade-off.
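Concretely, F1 is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)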
One important feature of the F1 score is that the result is zero if either of the components (precision or recall) falls to zero. It thereby heavily penalizes extremely low values of either component.
Machine learning model evaluation
Now that we have a basic understanding of what each metric does, let us look at how to evaluate some models and their predictive output using the metrics. We will look at four different models. You can follow all the calculations made and try them out yourself.
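If you want to follow the calculations in code, here is a minimal Python sketch (the function name and the dictionary output are just choices for this post, not part of any particular library) that turns the four confusion-matrix cells into the four metrics:

```python
def evaluate(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall and F1 from the confusion-matrix cells."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when there are no positive
    # predictions (precision) or no actual positives (recall).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

The counts for each of the four models below can be plugged straight into this function, for example evaluate(tp=3, tn=0, fp=3, fn=0) for model 1.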
Model 1 (Classifies all images as animal)
This is an important case to consider. Given our very balanced dataset, a model that just predicts everything to be positive gives us an accuracy of 50% and, more importantly, a Recall of 100% as there are no false negatives. This model is of course not useful at all as it will classify every kind of image as an animal. If the balance between animals and not animals had been different, we would see quite significant differences in the metrics (more on that in the next blog post).
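Assuming the three-animal/three-not-animal split described above, classifying everything as animal gives 3 true positives, 3 false positives and nothing else, so:

Accuracy = 3 / 6 = 50%
Precision = 3 / 6 = 50%
Recall = 3 / 3 = 100%
F1 = 2 × (0.5 × 1.0) / (0.5 + 1.0) ≈ 67%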
Model 2 (Classifies all images as not animal)
This is also an interesting type of model as it just classifies everything as not animal (negative). Here we see that recall becomes zero since there are no true positives but there are three false negatives (animals predicted not to be animals). We also see an interesting example of the relationship between Recall and Precision: as a model becomes more conservative with its positive predictions, precision tends towards 100% while recall drops towards zero (strictly speaking, precision is undefined for this model since it makes no positive predictions at all). This model is of course also useless, and since recall is zero, the F1 score of zero gives us a good indication of the model's uselessness.
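Classifying everything as not animal gives 3 true negatives and 3 false negatives, so:

Accuracy = 3 / 6 = 50%
Precision = 0 / 0 (undefined, since there are no positive predictions; often reported as 0)
Recall = 0 / 3 = 0%
F1 = 0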
Model 3 (Overpredicts images as not animal)
Now we are starting to look at models that are actually doing something useful. This model manages to correctly classify all the non-animals as well as the barn owl and the chihuahua. It does, however, think that the sheep dog is a mop. Precision becomes 100% here as the model does not produce any false positives, but we do see that Recall is impaired by the false negative.
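Here the model produces 2 true positives, 3 true negatives and 1 false negative (the sheep dog), so:

Accuracy = 5 / 6 ≈ 83%
Precision = 2 / 2 = 100%
Recall = 2 / 3 ≈ 67%
F1 = 2 × (1 × 2/3) / (1 + 2/3) = 80%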
Model 4 (Overpredicts images as animal)
The last model is similar to model 3, but this one correctly classifies all the animals (true positives) while mistaking the mop for an animal (false positive). Note that recall is now 100% as the model does not produce any false negatives. This is the model that produces the highest F1 score (86%).
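Here the model produces 3 true positives, 2 true negatives and 1 false positive (the mop), so:

Accuracy = 5 / 6 ≈ 83%
Precision = 3 / 4 = 75%
Recall = 3 / 3 = 100%
F1 = 2 × (0.75 × 1.0) / (0.75 + 1.0) ≈ 86%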
In conclusion
Hopefully these examples have shed some light on how to evaluate classification models with these common metrics. We have not stated which of the four models is the best. This is intentional, as the interpretation of these metrics should always be done together with a consideration of the use case. Model 4 produces the highest F1 score but has the drawback of classifying mops as sheep dogs. For certain use cases that might be fine, but for others it could be better to use a model that tends to underpredict the animals but instead has a higher precision (as in model 3).