On calculating AUC

Win-Vector Blog 2016-10-08

Recently Microsoft Data Scientist Bob Horton wrote a very nice article on ROC plots. We expand on this a bit and discuss some of the issues in computing “area under the curve” (AUC).

R has a number of ROC/AUC packages; for example ROCR, pROC, and plotROC. But it is instructive to see how ROC plots are produced and how AUC can be calculated. Bob Horton’s article showed how elegantly the points on the ROC plot are expressed in terms of sorting and cumulative summation.

The next step is computing AUC. Obviously computing area is a solved problem. The issue is how you deal with interpolating between points and the conventions of what to do with data that has identical scores. An elegant interpretation of the usual tie breaking rules is: for every point on the ROC curve we must have either all of the data above a given score threshold or none of the data above a given score threshold. This is the issue alluded to when the original article states:

This brings up another limitation of this simple approach; by assuming that the rank order of the outcomes embodies predictive information from the model, it does not properly handle sequences of cases that all have the same score.

This problem is quite easy to explain with an example. Consider the following data.

d <- data.frame(pred=c(1,1,2,2),y=c(FALSE,FALSE,TRUE,FALSE))print(d)## pred     y## 1    1 FALSE## 2    1 FALSE## 3    2  TRUE## 4    2 FALSE

Using code adapted from the original article we can quickly get an interesting summary.

ord <- order(d$pred, decreasing=TRUE) # sort by prediction reversedlabels <- d$y[ord]data.frame(TPR=cumsum(labels)/sum(labels),            FPR=cumsum(!labels)/sum(!labels),           labels=labels,           pred=d$pred[ord])##   TPR       FPR labels pred## 1   1 0.0000000   TRUE    2## 2   1 0.3333333  FALSE    2## 3   1 0.6666667  FALSE    1## 4   1 1.0000000  FALSE    1

The problem is: we need to take all of the points with the same prediction score as an atomic unit (we take all of them or none of them). Notice also TPR is always 1 (an undesirable effect).

We do not really want rows 1 and 3 in our plot or area calculations. In fact the values in row 1 and 3 are not fully determined as they can vary depending on details of tie breaking in the sorting (though the values recorded in rows 2 and 4 can not so vary). Also (especially after deleting rows) we may need to add in ideal points with (FPR,TPR)=(0,0) and (FPR,TPR)=(1,1) to complete our plot and area calculations.

What we want is a plot where ties are handled. Such plots look like the following:

# devtools::install_github('WinVector/WVPlots')library('WVPlots') # see: https://github.com/WinVector/WVPlotsWVPlots::ROCPlot(d,'pred','y',TRUE,'example plot')

NewImage

There is a fairly elegant way to get the necessary adjusted plotting frame: use differencing (the opposite of cumulative sums) to find where the pred column changes, and limit to those rows.

The code is as follows (also found in our sigr library here):

calcAUC <- function(modelPredictions,yValues) {  ord <- order(modelPredictions, decreasing=TRUE)  yValues <- yValues[ord]  modelPredictions <- modelPredictions[ord]  x <- cumsum(!yValues)/max(1,sum(!yValues)) # FPR = x-axis  y <- cumsum(yValues)/max(1,sum(yValues))   # TPR = y-axis  # each point should be fully after a bunch of points or fully before a  # decision level. remove dups to achieve this.  dup <- c(modelPredictions[-1]>=modelPredictions[-length(modelPredictions)],           FALSE)  # And add in ideal endpoints just in case (redundancy here is not a problem).  x <- c(0,x[!dup],1)  y <- c(0,y[!dup],1)  # sum areas of segments (triangle topped vertical rectangles)  n <- length(y)  area <- sum( ((y[-1]+y[-n])/2) * (x[-1]-x[-n]) )  area}

This correctly calculates the AUC.

# devtools::install_github('WinVector/sigr')library('sigr') # see: https://github.com/WinVector/sigrcalcAUC(d$pred,d$y)## [1] 0.8333333

I think this extension maintains the spirit of the original. We have also shown how complexity increases as you move from code known to work on a particular data set at hand, to library code that may be exposed to data with unanticipated structures or degeneracies (this is why Quicksort, which has an elegant description, often has monstrous implementations; please see here for a rant on that topic).