R style tip: prefer functions that return data frames

Win-Vector Blog 2014-06-06

While following up on Nina Zumel’s excellent “Trimming the Fat from glm() Models in R” I got to thinking about code style in R. And I realized: you can make your code much prettier by designing more of your functions to return data.frames. That may seem needlessly heavyweight, but it has a lot of downstream advantages.

The usual mental model of R’s basic types starts with scalar/atomic types like double-precision numbers. R doesn’t routinely expose such a type to users: what we think of as numbers in R are actually length-one arrays or vectors. So you can easily write functions like the following:

    typical <- function(x) { mean(x) }
    print(typical(c(1,2,3,4)))
    ## [1] 2.5
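As a quick check of the point that R has no separate scalar type, here is a minimal sketch that inspects a single number. The variable name x is only for illustration.

```r
x <- 2.5

# A lone number is already a vector of length one.
print(length(x))
print(is.vector(x))

# Indexing it like a vector works, and c() around it changes nothing.
print(x[1])
print(identical(x, c(2.5)))
```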

Eventually you want functions that return more than one result, and the standard R solution is to use a named list:

    typical <- function(x) { list(mean=mean(x),median=median(x)) }
    print(typical(c(1,2,3,4)))
    ## $mean
    ## [1] 2.5
    ##
    ## $median
    ## [1] 2.5

Consider, however, returning a data.frame instead of a list:

    typical <- function(x) { data.frame(mean=mean(x),median=median(x)) }
    print(typical(c(1,2,3,4)))
    ##   mean median
    ## 1  2.5    2.5

This allows convenient, for-loop-free batch code using plyr’s adply() function:

    library(plyr)
    d <- list(x=c(1,2,3,4),y=c(5,6,700))
    print(adply(d,1,typical))
    ##    X1  mean median
    ##  1  x   2.5    2.5
    ##  2  y 237.0    6.0

All of your results are collected into a single result data.frame, with no explicit for-loop. You also get real flexibility: in addition to returning multiple columns, your underlying function can safely return multiple (or even varying numbers of) rows. We don’t use this extra power in this small example.
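To give a feel for varying row counts, here is a minimal sketch. The helper aboveMean() and the column names group and value are hypothetical, and base R's do.call(rbind, lapply(...)) stands in for the adply() call so the snippet runs without plyr.

```r
# Hypothetical helper: one row per element of x that exceeds mean(x),
# so the number of rows returned depends on the data.
aboveMean <- function(x) {
  data.frame(value = x[x > mean(x)])
}

d <- list(x = c(1, 2, 3, 4), y = c(5, 6, 700))

# Base-R stand-in for adply(): apply the function to each piece,
# label the rows with the piece's name, then stack everything.
pieces <- lapply(names(d), function(k) {
  r <- aboveMean(d[[k]])
  data.frame(group = rep(k, nrow(r)), r)
})
res <- do.call(rbind, pieces)
print(res)
```

Here x contributes two rows and y contributes one, and the caller never has to know that in advance.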

We did need to handle multiple rows when generating run-timings of the step() function applied to an lm() model. The microbenchmark suite runs an expression many times to get a distribution of run times (run times are notoriously unstable, so you should always report a distribution of them, or a summary of that distribution). We ended up building a function called timeStep() which timed a step-wise regression of a given size. The data.frame wrapping allowed us to easily collect and organize the many repetitions applied at many different problem sizes in a single call to adply():

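    library(plyr)
    library(microbenchmark)
    # dTrainB is a data.frame defined elsewhere in the original script
    timeStep <- function(n) {
      # stack n/nrow(dTrainB) copies of dTrainB to get about n rows
      dTraini <- adply(1:(n/dim(dTrainB)[[1]]),1,function(x) dTrainB)
      modeli <- lm(y~xN+xC,data=dTraini)
      # one output row per microbenchmark repetition
      data.frame(n=n,stepTime=microbenchmark(step(modeli,trace=0))$time)
    }
    plotFrameStep <- adply(seq(1000,10000,1000),1,timeStep)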

(See here for the actual code this extract came from, and here for the result.)

This is much more succinct than the original for-loop solution (which requires a lot of needless packing and then unpacking) or the per-column sapply solution (which depends on the underlying timing returning only one row and one column, something that should be thought of not as natural but as a very limited special case). With the richer data.frame data structure you are not forced to organize your computation as an explicit sequence over rows or an explicit sequence over columns. You can treat things as abstract batches where intermediate functions don’t need complete details on row or column structures (making them more reusable).
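The contrast can be sketched in a self-contained way. Everything here is an illustrative assumption rather than the original code: timeSort() is a hypothetical stand-in for timeStep(), system.time() replaces microbenchmark, and do.call(rbind, lapply(...)) replaces adply() so that no packages are needed.

```r
# Hypothetical stand-in for timeStep(): time sorting n random numbers
# 'reps' times and return one row per repetition.
timeSort <- function(n, reps = 3) {
  elapsed <- vapply(seq_len(reps), function(i) {
    x <- runif(n)
    system.time(sort(x))[["elapsed"]]
  }, numeric(1))
  data.frame(n = n, elapsed = elapsed)
}

sizes <- c(1000, 2000, 3000)

# Explicit for-loop: each result is unpacked into per-column vectors,
# which are then packed back into a data.frame at the end.
nCol <- numeric(0)
tCol <- numeric(0)
for (n in sizes) {
  r <- timeSort(n)
  nCol <- c(nCol, r$n)
  tCol <- c(tCol, r$elapsed)
}
loopFrame <- data.frame(n = nCol, elapsed = tCol)

# Batch form: map the data.frame-returning function and stack the results.
batchFrame <- do.call(rbind, lapply(sizes, timeSort))
print(dim(batchFrame))
```

The loop version has to know every column that timeSort() returns. The batch version does not, so adding a column to timeSort() requires no change at the call site.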

In many cases data.frame-returning functions allow more powerful code, as they allow multiple return values (the columns) and multiple/varying return instances (the rows). Adding such functions to your design toolbox allows for better code with a better-designed separation of concerns between code components. It also sets things up in a very plyr-friendly format.

Note: Nina Zumel pointed out that some complex structures (like complete models) cannot always be safely returned in data.frames, so you would need to use lists in that case.

An interesting example of this is POSIXlt. Compare print(class(as.POSIXlt(Sys.time()))), print(class(data.frame(t=as.POSIXlt(Sys.time()))$t)), and d <- data.frame(t=0); d$t <- as.POSIXlt(Sys.time()); print(class(d$t)).
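The three expressions above, laid out as a small script so the results can be compared side by side. The variable names now, bare and built are only for illustration. As far as I know, the data.frame() constructor converts POSIXlt to POSIXct while direct column assignment leaves the POSIXlt value as it is, but it is worth running this on your own R version to confirm.

```r
now <- Sys.time()

# 1. A bare POSIXlt value.
bare <- as.POSIXlt(now)
print(class(bare))

# 2. The same kind of value passed through the data.frame() constructor.
built <- data.frame(t = as.POSIXlt(now))
print(class(built$t))

# 3. The same kind of value assigned into an existing column.
d <- data.frame(t = 0)
d$t <- as.POSIXlt(now)
print(class(d$t))
```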