Efficient accumulation in R

Win-Vector Blog 2015-07-28

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, ScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic, so many R users are uncertain which methods of accumulating results are efficient and which are not.

Accumulating wheat (Photo: Cyron Ray Macey, some rights reserved)

In this latest “R as it is” we will quickly become expert at efficiently accumulating results in R.

A number of applications (most notably simulation) require the incremental accumulation of results prior to processing. For our example, suppose we want to collect rows of data one by one into a data frame. Take the mkRow function below as a simple example source that yields a row of data each time we call it.

mkRow <- function(nCol) {
  x <- as.list(rnorm(nCol))
  # make row mixed types by changing first column to string
  x[[1]] <- ifelse(x[[1]]>0,'pos','neg')
  names(x) <- paste('x',seq_len(nCol),sep='.')
  x
}
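For example, a single call might look like the following (a small usage sketch; the seed and column count are chosen only for illustration, and the numeric values vary from run to run):

set.seed(2015)    # only so the example is repeatable; any seed works
mkRow(4)          # a named list: x.1 is 'pos' or 'neg', x.2 through x.4 are numeric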

The obvious “for-loop” solution is to collect or accumulate many rows into a data frame by repeated application of rbind. This looks like the following function.

mkFrameForLoop <- function(nRow,nCol) {
  d <- c()
  for(i in seq_len(nRow)) {
    ri <- mkRow(nCol)
    di <- data.frame(ri,
                     stringsAsFactors=FALSE)
    d <- rbind(d,di)
  }
  d
}

This would be the solution most familiar to many non-R programmers. The problem is: in R the above code is incredibly slow.

In R all common objects are usually immutable and cannot change. So when you write an assignment like “d <- rbind(d,di)” you are usually not actually adding a row to an existing data frame, but constructing a new data frame that has an additional row. This new data frame replaces your old data frame in your current execution environment (R execution environments are mutable, to implement such changes). This means that to accumulate or add n rows incrementally to a data frame, as in mkFrameForLoop, we actually build n different data frames of sizes 1,2,...,n. As we do work copying each row in each data frame (since in R data frame columns can potentially be shared, but not rows) we pay the cost of processing n*(n+1)/2 rows of data. So: no matter how expensive creating each row is, for large enough n the time wasted re-allocating rows (again and again) during the repeated rbinds eventually dominates the calculation time. For large enough n you are wasting most of your time in the repeated rbind steps.
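To get a sense of how quickly that wasted work grows, here is a quick back-of-the-envelope calculation (the particular values of n are chosen only for illustration):

# total rows processed by n incremental rbinds: 1 + 2 + ... + n = n*(n+1)/2
n <- c(10, 100, 1000, 10000)
data.frame(rowsCollected = n, rowsProcessed = n*(n+1)/2)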

To repeat: it isn’t just that accumulating rows one by one is “a bit less efficient than the right way for R”. Accumulating rows one by one becomes arbitrarily slower than the right way (which should only need to manipulate n rows to collect n rows into a single data frame) as n gets large. Note: it isn’t that beginning R programmers don’t know what they are doing; it is that they are designing to the reasonable expectation that a data frame is row-oriented and that R objects are mutable. The fact is R data frames are column-oriented and R structures are largely immutable (despite the syntax appearing to signal the opposite), so the optimal design is not what one might expect.

Given this how does anyone ever get real work done in R? The answers are:

  • Experienced R programmers avoid the for-loop structure seen in mkFrameForLoop.
  • In some specialized situations (where value visibility is sufficiently limited) R can avoid a number of the unnecessary user-specified calculations by actual in-place mutation (which means R can in some cases change things when nobody is looking, so only observable object semantics are truly immutable).

The most elegant way to avoid the problem is to use R’s lapply (or list apply) function as shown below:

mkFrameList <- function(nRow,nCol) {
  d <- lapply(seq_len(nRow),function(i) {
    ri <- mkRow(nCol)
    data.frame(ri,
               stringsAsFactors=FALSE)
  })
  do.call(rbind,d)
}

What we did is take the contents of the for-loop body and wrap them in a function. This function is then passed to lapply, which creates a list of rows. We then batch apply rbind to these rows using do.call. It isn’t that the for-loop is slow (which many R users mistakenly believe); it is that the incremental collection of results into a data frame is slow, and that is one of the steps the lapply method avoids. While you may always prefer lapply to for-loops for stylistic reasons, it is important to understand when lapply is in fact quantitatively better than a for-loop (and to know when a for-loop is in fact acceptable).
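As a quick usage sketch (the sizes here are arbitrary), the function is called like any other:

d <- mkFrameList(1000, 5)
dim(d)     # 1000 rows, 5 columns
class(d)   # "data.frame"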

If you don’t want to learn about lapply you can write fast code by collecting the rows in a list as below.

mkFrameForList <- function(nRow,nCol) {
  d <- as.list(seq_len(nRow))
  for(i in seq_len(nRow)) {
    ri <- mkRow(nCol)
    di <- data.frame(ri,
                     stringsAsFactors=FALSE)
    d[[i]] <- di
  }
  do.call(rbind,d)
}

The above code still uses a familiar for-loop notation and is in fact fast. Below is a comparison of the time (in milliseconds) for each of the above algorithms to assemble data frames of various sizes. The quadratic cost of the first method is seen in the slight upward curvature of its smoothing line.

Execution time (ms) for collecting a number of rows (x-axis) for each of the three methods discussed. Slowest is the incremental for-loop accumulation.
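If you want a rough feel for the difference yourself, base R’s system.time is enough (this is only a sketch, not the exact benchmarking code behind the plot, and the sizes are arbitrary):

nRow <- 2000
nCol <- 10
system.time(mkFrameForLoop(nRow, nCol))   # quadratic in nRow: slowest by far
system.time(mkFrameList(nRow, nCol))      # linear in nRow
system.time(mkFrameForList(nRow, nCol))   # linear in nRow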

The reason mkFrameForList is fast is that in some situations R can avoid creating new objects and instead manipulate data in place. In this case the list “d” is not re-created each time we add an additional element, but is in fact mutated or changed in place. However this in-place change (without which the above code would again be unacceptably slow) depends critically on the list value associated with “d” having very limited visibility. Even copying this value to another variable or passing it to another function can break the visibility heuristic and cause arbitrarily expensive object copying. The fragility of the visibility heuristic is best illustrated with an even simpler example.

Consider the following code that returns a vector of the squares of the first n positive integers.

computeSquares <- function(n,messUpVisibility) {
  # pre-allocate v
  # (doesn't actually help!)
  v <- 1:n
  if(messUpVisibility) {
    vLast <- v
  }
  # print details of v
  .Internal(inspect(v))
  for(i in 1:n) {
    v[[i]] <- i^2
    if(messUpVisibility) {
      vLast <- v
    }
    # print details of v
    .Internal(inspect(v))
  }
  v
}

Now of course part of the grace of R is that we would never have to write such a function. We could do this very quickly using vector notation such as seq_len(n)^2. But let us work with this notional example.

Below is the result of running computeSquares(5,FALSE). In particular look at the lines printed by the .Internal(inspect(v)) statements and at the first field of these lines (which is the address of the value “v” refers to).

computeSquares(5,FALSE)
## @7fdf0f2b07b8 13 INTSXP g0c3 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,3,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,4,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,16,5
## @7fdf0e2ba740 14 REALSXP g0c4 [NAM(1)] (len=5, tl=0) 1,4,9,16,25
## [1]  1  4  9 16 25

Notice that the address v refers to changes only once (when the value type changes from integer to real). After the one change the address remains constant (@7fdf0e2ba740) and the code runs fast as each pass of the for-loop alters a single position in the value referred to by v without any object copying.

Now look what happens if we re-run with messUpVisibility:

computeSquares(5,TRUE)
## @7fdf0ec410e0 13 INTSXP g0c3 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0d9718e0 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,2,3,4,5
## @7fdf0d971bb8 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,3,4,5
## @7fdf0d971c88 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,4,5
## @7fdf0d978608 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,16,5
## @7fdf0d9788e0 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 1,4,9,16,25
## [1]  1  4  9 16 25

Setting messUpVisibility causes the value referenced by “v” to also be referenced by a new variable named “vLast”. Evidently this small change is enough to break the visibility heuristic, as we see the address of the value “v” refers to changes after each update, meaning we have triggered a lot of expensive object copying. So we should consider the earlier for-loop code a bit fragile, as small changes in object visibility can greatly change performance.

The thing to remember is: for the most part R objects are immutable. So code that appears to alter R objects is often actually simulating mutation by expensive copies. This is the concrete reason that functional transforms (like lapply) should be preferred to incremental or imperative appends.

R code for all examples in this article can be found here.