Intuition behind ROC-AUC score

Gaurav Dembla
Towards Data Science
10 min read · Dec 9, 2020

Photo by fabio on Unsplash

In Machine Learning, a classification problem refers to predictive modeling where a class label needs to be predicted for a given observation (record/data point). For example, based on input features such as weather information (humidity, temperature, cloudy/sunny, wind speed, etc.) and the time of year, predict whether it is going to “rain” or “not rain” (output variable) today in your city.

The ROC-AUC score is one of the major metrics used to assess the performance of a classification model. But what does it conceptually mean? In this blog, we will go through the intuition behind the ROC-AUC score and briefly touch upon a situation that contrasts the ROC-AUC score with the Log-loss score, another metric used heavily in assessing the performance of classification algorithms.

In case you seek an understanding of prediction probability and/or the log-loss score in the context of a binary classifier, feel free to go through my blog Intuition behind Log-loss score.

This blog strives to answer the following questions.

  1. What does ROC-AUC score imply?
  2. How to interpret ROC-AUC values?
  3. How is ROC-AUC conceptually different from Log-loss score?
  4. How is ROC-AUC calculated?
  5. What do the terms Confusion Matrix, TPR, FPR, ROC and AUC mean?

What does ROC-AUC score imply?

ROC-AUC is indicative of the degree of separability/distinction or intermingling/crossover between the predictions of the two classes. The higher the score, the greater the distinction and the lower the crossover between the predictions of the two classes.

Let’s consider 10 data samples, 7 of which belong to the positive class “1” and 3 to the negative class “0”. An algorithm A predicts probabilities for the given 10 samples in the following manner. Given the predictions, we can easily find a probability threshold (for example, at 0.3) and classify every data point correctly under each class. Since there is a clear separation between the predictions of the two classes, the ROC-AUC for the classification algorithm is 1.

Image by Author
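To make this concrete, here is a minimal sketch using scikit-learn’s roc_auc_score. The labels and probabilities below are illustrative stand-ins for algorithm A’s predictions (7 positives scored cleanly above all 3 negatives), not the exact numbers from the figure.

from sklearn.metrics import roc_auc_score

# 3 negatives followed by 7 positives, with every positive scored above every negative
y_true = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
y_prob = [0.05, 0.15, 0.25, 0.45, 0.55, 0.65, 0.75, 0.85, 0.90, 0.95]

print(roc_auc_score(y_true, y_prob))  # 1.0, since the two classes are perfectly separated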

Now, let’s say algorithm C predicted probabilities in the following manner, moving the third data point further to the right by assigning a probability of 0.3 and moving the fourth data point further to the left by assigning a probability of 0.2.

Image by Author

There is no probability threshold that can neatly classify all the data points correctly. Irrespective of the threshold value (be it 0.15 or 0.35), we get at least one data point on the incorrect side of the threshold line. Hence, ROC-AUC score is less than 1 (precisely, 0.952).

Image by Author

In the same vein, let’s explore the following results of algorithm D. Again, there is no demarcation that can be achieved without classifying some data points incorrectly.

Image by Author

If we choose 0.05 as the threshold, then we risk classifying two data points incorrectly (the blue ones on the right of the threshold line). If we choose 0.45 as the threshold, then we classify one data point incorrectly (the orange one on the left of the line). In other words, algorithm D has the potential to classify a greater number of data points incorrectly, and hence its ROC-AUC (0.905) is even lower than that of algorithm C (0.952).

Image by Author
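One way to see where 0.952 and 0.905 come from: ROC-AUC equals the fraction of (positive, negative) pairs in which the positive data point receives the higher predicted probability. With 7 positives and 3 negatives there are 7 x 3 = 21 such pairs, so one misordered pair gives 20/21 ≈ 0.952 and two misordered pairs give 19/21 ≈ 0.905. The sketch below uses illustrative probabilities (not the exact values from the figures) to reproduce those two scores.

import itertools

def pairwise_auc(pos_probs, neg_probs):
    # Fraction of (positive, negative) pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in itertools.product(pos_probs, neg_probs))
    return wins / (len(pos_probs) * len(neg_probs))

# One misordered pair (a negative at 0.3 outranks a positive at 0.2) -> 20/21
print(pairwise_auc([0.2, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9], [0.05, 0.1, 0.3]))   # ~0.952

# Two misordered pairs -> 19/21
print(pairwise_auc([0.1, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], [0.05, 0.2, 0.3]))    # ~0.905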

Higher crossover/intermingling lowers the ROC-AUC score, indicating less-than-perfect separability/distinction between the predictions of the two classes.

Let’s look at the results of algorithm E that classified the data points as follows. Can you guess the ROC-AUC score of this classification model?

Image by Author

It would seem to be 1 since there is a clear gap between the predictions of the two classes. However, what is not so apparent is that this is an example of perfect crossover.

Notice that the algorithm is doing a very bad job of predicting labels: while class 1 data points are being predicted far away from a probability of 1, class 0 data points are being predicted close to a probability of 1. The ROC-AUC of the model is 0, indicating intermingling to such an extent that the predictions have crossed over completely to the other side.
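A quick way to confirm this: feed predictions with a perfect crossover (every negative scored above every positive) into roc_auc_score. The values below are made up for illustration.

from sklearn.metrics import roc_auc_score

# Perfect crossover: the 7 positives all score lower than the 3 negatives
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.80, 0.85, 0.90]

print(roc_auc_score(y_true, y_prob))  # 0.0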

The higher the crossover/intermingling, the lower the ROC-AUC score. Hence, a higher ROC-AUC is always regarded as better than a lower one.

How to interpret ROC-AUC values?

The score varies from 0 to 1, the latter being the desirable goal for any classification problem. A binary classifier that just makes random guesses would be correct approximately 50% of the time, thereby garnering a ROC-AUC score of roughly 0.5.

While 1 denotes a clear distinction between the predictions of the two classes, 0 denotes a complete crossover. Anything less than 1 indicates some sort of intermingling/crossover of the predictions. A binary classifier is useful only when it achieves a ROC-AUC score greater than 0.5, and as near to 1 as possible. If a classifier yields a score less than 0.5, it simply means that the model is performing worse than a random classifier and, hence, is of no use.

Generally, anything above 0.8 is considered good, but the desirable score varies from application to application. That said, a score of 1 is usually an impossible feat to achieve in practical situations.
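If you want to sanity-check the random-guess baseline yourself, here is a small simulation with made-up random labels and scores; the exact output varies with the seed but hovers around 0.5.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=100_000)  # random binary labels
y_prob = rng.random(size=100_000)          # scores produced by pure guessing

print(roc_auc_score(y_true, y_prob))       # approximately 0.5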

Contrasting ROC-AUC with Log-loss

Let’s compare the results of two algorithms A and B. The only difference in the predictions of the two algorithms is that the fourth data point (from left) garners a probability of 0.4 with algorithm A as opposed to 0.3 with algorithm B.

Image by Author

In each algorithm, there is a clear demarcation between the predictions of the two classes (although B has a narrower gap than A). Hence, the ROC-AUC score for both models is 1. However, since algorithm B predicts the fourth data point farther from its actual value of 1, its log-loss rises. Hence, the log-loss score of algorithm B (0.342) is greater than that of algorithm A (0.313).
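The effect is easy to reproduce. The vectors below are illustrative stand-ins for A and B (the exact probabilities are in the figure): they are identical except for the fourth data point, which still sits above every negative in both cases, so the ROC-AUC stays at 1 while the log-loss differs.

from sklearn.metrics import roc_auc_score, log_loss

y_true  = [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
probs_A = [0.05, 0.10, 0.15, 0.40, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95]
probs_B = [0.05, 0.10, 0.15, 0.30, 0.60, 0.70, 0.80, 0.85, 0.90, 0.95]

print(roc_auc_score(y_true, probs_A), roc_auc_score(y_true, probs_B))  # 1.0 and 1.0
print(log_loss(y_true, probs_A), log_loss(y_true, probs_B))            # B's log-loss is higher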

A rise or decline in the log-loss score does not necessarily mean a change in the ROC-AUC score. As long as the degree of distinction between the predictions of the two classes remains the same, the ROC-AUC stands still.

However, an increase in ROC-AUC typically goes hand in hand with a decrease in log-loss. Why? Because a higher ROC-AUC indicates a better degree of separation, which in turn implies that the predictions are being pulled towards their respective classes (0 or 1), thereby lowering the log-loss score.

Unlike log-loss score, ROC-AUC doesn’t take into account how far the predictions (probabilities) are from the respective actual classes (0 or 1). It only cares about how much of a separation/distinction or intermingling/crossover exists between the predictions of the two classes.

While Log-loss indicates how far the predictions are from their respective classes, ROC-AUC indicates separability or intermingling of the predictions of the two classes on a probability scale.

The following is a summary snapshot of all the predictions (algorithms A through E) that we have looked at so far.

Image by Author

How is ROC-AUC calculated?

Before we go through the derivation of ROC-AUC, we need to understand the meaning of the following terms.

i. Confusion Matrix

ii. True/False Positive/Negative

iii. TPR (True Positive Rate) and FPR (False Positive Rate)

iv. ROC (Receiver Operating Characteristic) graph

v. AUC (Area Under the Curve)

Confusion Matrix

Once the model predicts the probability of each data point belonging to class 1 (aka the positive class), it assigns each point to one of the two classes based on a threshold probability. For example, if two observations are assigned probabilities of 0.25 and 0.55, and the threshold probability is 0.5, then the first one will be assigned class 0 (negative) and the second one class 1 (positive).

We can then assess the performance of the classification by comparing predicted values against actual values and summarizing the results in a 2x2 matrix as shown below.

Image by Author
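In scikit-learn, this matrix can be produced directly from thresholded probabilities. The labels and probabilities below are made up for illustration; note that scikit-learn places TN in the top-left cell and TP in the bottom-right.

from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 1]                         # actual classes
y_prob = [0.25, 0.55, 0.40, 0.80, 0.35]          # predicted probabilities of class 1
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # apply the 0.5 threshold

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))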

While TN (True Negative) refers to the number of data points that belong to the negative class (0) and have been correctly predicted as negatives, FP (False Positive) refers to the observations that belong to the negative class but have been incorrectly predicted as positives (1). Likewise, TP (True Positive) counts the positives correctly predicted as positives, and FN (False Negative) counts the positives incorrectly predicted as negatives.

Based on these four base numbers, we can derive two metrics that form the foundation of the ROC graph: TPR (True Positive Rate) and FPR (False Positive Rate).

TPR and FPR

TPR (aka Recall or Sensitivity) measures the percentage of actual positives that have been correctly identified as positives. In essence, the metric answers the question — When it’s actually positive, how often does the model predict so? Note that the denominator represents the actual number of positives in the dataset.

FPR measures the percentage of actual negatives that have been incorrectly predicted as positives. The metric answers the question — When it’s actually negative, how often does the model predict otherwise? Note that the denominator represents the actual number of negatives in the dataset.
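In formula form, using the four cells of the confusion matrix:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)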

Let’s calculate TPR and FPR on the classification results of an algorithm F using a probability threshold of 0.4. Everything on the left of the threshold probability is predicted as the negative class and everything on the right as the positive class.

Image by Author

In the figure above, the two orange observations on the left are incorrectly assigned to the negative class (False Negatives), and the one blue observation on the right is incorrectly assigned to the positive class (False Positive). We get the following confusion matrix and TPR/FPR values. The base measures (TP and FN) used for the calculation of TPR are highlighted in a dashed oval below; the same applies to FPR.

Image by Author
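Here is a small sketch that reproduces this calculation. Assuming the same 7-positive / 3-negative setup as before, the probabilities below are illustrative stand-ins for algorithm F, arranged so that the 0.4 threshold yields 2 False Negatives and 1 False Positive.

import numpy as np

def tpr_fpr(y_true, y_prob, threshold):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_prob = [0.30, 0.35, 0.55, 0.65, 0.80, 0.85, 0.90, 0.10, 0.20, 0.50]

print(tpr_fpr(y_true, y_prob, 0.4))  # TPR = 5/7 ~ 0.714, FPR = 1/3 ~ 0.333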

ROC Graph

As we increase the probability threshold for the trained classification model, we allow fewer observations to be predicted under the positive class, thereby decreasing both TPs and FPs and, in turn, the values of TPR and FPR (remember that the denominators of both formulae remain constant for a given set of training/hold-out data since they represent the actual number of positives and negatives, respectively).

Let’s revisit the results of algorithm F we just discussed above. Upon changing the probability threshold to 0.75 this time, we get TPR and FPR values of 0.429 and 0.0 respectively, which are lower than what we witnessed for a threshold of 0.4 a while back.

Image by Author

If we plot the two pairs of TPR and FPR, we get the following graph.

Image by Author

When we generate and plot TPR/FPR pairs for each probability threshold (from 0 to 1), we get a graph known as the Receiver Operating Characteristic (aka ROC) curve.

Image by Author

ROC is a commonly used graph that summarizes the performance of a classifier over all possible probability thresholds. It is generated by plotting TPR (y-axis) against FPR (x-axis) while we vary the threshold value used for assigning observations to a given class.
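scikit-learn’s roc_curve does this threshold sweep for you, returning one (FPR, TPR) pair per candidate threshold. Reusing the illustrative algorithm-F stand-in from above:

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_prob = [0.30, 0.35, 0.55, 0.65, 0.80, 0.85, 0.90, 0.10, 0.20, 0.50]

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # one point per threshold
plt.plot(fpr, tpr, marker="o")
plt.xlabel("False Positive Rate (FPR)")
plt.ylabel("True Positive Rate (TPR)")
plt.title("ROC curve")
plt.show()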

ROC-AUC Score

The ROC-AUC score is basically the area under the green line, i.e. the ROC curve, and hence the name Area Under the Curve (aka AUC).

The dashed diagonal line in the center (where TPR and FPR are always equal) represents an AUC of 0.5 (notice that the dashed line divides the graph into two halves). Recall from our earlier discussion that a model that randomly guesses the predictions has an AUC of 0.5. If the AUC is greater than 0.5, the model is better than random guessing. Hence, the larger the area under the ROC curve (AUC), the better the model. Alternatively, the closer the curve sits to the top-left corner, the better the model.
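Numerically, the area can be obtained by integrating the curve returned by roc_curve (scikit-learn’s auc helper uses the trapezoidal rule), and it matches roc_auc_score computed directly from the probabilities. Continuing with the illustrative stand-in from above:

from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_prob = [0.30, 0.35, 0.55, 0.65, 0.80, 0.85, 0.90, 0.10, 0.20, 0.50]

fpr, tpr, _ = roc_curve(y_true, y_prob)
print(auc(fpr, tpr))                  # area under the ROC curve
print(roc_auc_score(y_true, y_prob))  # same value, computed directly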

Let’s see what the ROC graph for algorithm A looks like.

Image by Author

As expected, the curve covers the entire area of the graph, thereby delivering an AUC of 1. Such a curve is possible only when there is a clear segregation between the predictions of the two classes. How? When there is a clear distinction/separability, there is a set of probability thresholds within that gap for which TPR is 100% (i.e. of all the positives, the model has identified them all as positives) and FPR is 0% (i.e. of all the negatives, the model has not misclassified any as positive), which corresponds to the top-left corner point of the ROC graph.

Image by Author

Remember the results of algorithm E, which had a ROC-AUC score of 0?

Image by Author

Can you guess what the ROC graph of the predictions would look like for algo E? Think about it before peeking at the answer at the end of the blog.

When you plan to embark on your pursuit of advanced (inferential) statistics, feel free to check out my other article on Central Limit Theorem as well. The concept underpins almost every application of advanced statistics.

Should you have any question or feedback, feel free to leave a comment here. You can also reach me through my LinkedIn profile.

Here is the ROC graph of algo E.

