For loops in R can lose class information

Win-Vector Blog 2016-03-25

Did you know R’s for() loop control structure drops class annotations from vectors?

Consider the following R code demonstrating three uses of a for-loop that one would expect to behave very similarly.

dates <- c(as.Date('2015-01-01'),as.Date('2015-01-02'))

for(ii in seq_along(dates)) {
  di <- dates[ii]
  print(di)
}
## [1] "2015-01-01"
## [1] "2015-01-02"

for(di in as.list(dates)) {
  print(di)
}
## [1] "2015-01-01"
## [1] "2015-01-02"

for(di in dates) {
  print(di)
}
## [1] 16436
## [1] 16437

Notice that in the third for loop the di values print as numbers. This is because iterating over the dates this way loses the class annotations. To me this is a huge, undesirable surprise (given that indexing does not lose class information). Remember that with the class information missing, many more behaviors than just printing may be broken. The third loop is the “most natural” as it doesn’t introduce an index or re-process the vector prior to iterating. But the third loop does not seem safe to use with code that depends on class annotations being preserved.
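To make the loss visible without relying on print(), here is a small sketch that records class() inside the indexed loop and inside the direct loop. The variable names seen_by_index and seen_directly are only illustrative.

```r
dates <- c(as.Date('2015-01-01'), as.Date('2015-01-02'))

# Class of each element when we index into the vector ourselves.
seen_by_index <- character(0)
for (ii in seq_along(dates)) {
  seen_by_index <- c(seen_by_index, class(dates[ii]))
}

# Class of the loop variable when for() walks the vector directly.
seen_directly <- character(0)
for (di in dates) {
  seen_directly <- c(seen_directly, class(di))
}

print(seen_by_index)
## [1] "Date" "Date"
print(seen_directly)
## [1] "numeric" "numeric"
```

Any code in the loop body that dispatches on class (format(), weekdays(), and so on) sees a plain number in the second case.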

The workarounds are shown prior to the failure: introduce an index, or convert the vector into a list.
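A further option, not among the loops above, is to re-attach the class inside the loop body. This is only a sketch: it assumes the class attribute is the only thing that needs restoring, which holds for Date but not for types that carry other attributes (factor levels, for example).

```r
dates <- c(as.Date('2015-01-01'), as.Date('2015-01-02'))

out <- character(0)
for (di in dates) {
  # for() handed us a bare number; copy the class back from the source vector.
  class(di) <- class(dates)
  out <- c(out, format(di))
}
print(out)
## [1] "2015-01-01" "2015-01-02"
```

Indexing or as.list() is probably still the safer habit, since those keep whatever the vector's own methods preserve.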

Notice that vapply() also loses the class info, even when you explicitly supply it:

vapply(dates,function(x) {x+0},as.Date(0, origin='1970-01-01'))
## [1] 16436 16437

I understand the vapply() case, as is.vector(dates) is false (due to the class annotation) and one would expect vapply() to return something for which is.vector() is true. But the for-loop behavior is a real head-scratcher that took a while to believe was actually happening when a partner ran into it.
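The is.vector() claim is easy to check, and there are at least two ways to get a Date vector back out of an apply-style call. The sketch below is one possible approach, and the names shifted and shifted2 are illustrative only.

```r
dates <- c(as.Date('2015-01-01'), as.Date('2015-01-02'))

print(is.vector(dates))           # class attribute present
## [1] FALSE
print(is.vector(unclass(dates)))  # same data with the class removed
## [1] TRUE

# Option A: let vapply() return plain numbers, then re-attach the class.
shifted <- vapply(dates, function(x) as.numeric(x + 1), numeric(1))
class(shifted) <- class(dates)
print(shifted)
## [1] "2015-01-02" "2015-01-03"

# Option B: keep the Date results in a list and combine them with c().
shifted2 <- do.call(c, lapply(dates, function(x) x + 1))
print(shifted2)
## [1] "2015-01-02" "2015-01-03"
```

Option A keeps vapply()'s type checking on the numeric payload. Option B relies on c() having a method for the class in question, which it does for Date.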

In a lot of languages for(di in dates) { ... } is roughly syntactic sugar for for(ii in seq_along(dates)) { di <- dates[ii]; ... } (assuming ii is not used elsewhere in the code, i.e. we have a so-called “hygienic” substitution). So it is a big surprise that these two code fragments behave so differently in R. It also makes R a bit harder to teach.

I didn’t anticipate these issues from R’s basic documentation which says:

For each element in vector the variable name is set to the value of that element and statement1 is evaluated.

I am guessing this is to be read with the additional understanding that dates[i] is not just the value in the i-th position of dates but in fact a length-1 vector with the same class annotation as dates containing that value (so think of [] as performing extra work to copy over this information). Of course the for loop can’t be working only through “naked” values, as R doesn’t expose scalars to user code; therefore any values a for-loop iterates over must also appear re-wrapped in a vector structure.
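That reading can be checked directly. The sketch below compares what [] returns with what the loop variable holds; the names first and bare are illustrative.

```r
dates <- c(as.Date('2015-01-01'), as.Date('2015-01-02'))

# Indexing with [] returns a length-1 vector that still carries the class.
first <- dates[1]
print(length(first))
## [1] 1
print(class(first))
## [1] "Date"

# The for() loop variable is also a length-1 vector, but a bare one:
# here it has no attributes at all, so no class annotation survives.
bare <- logical(0)
for (di in dates) {
  bare <- c(bare, length(di) == 1 && is.null(attributes(di)))
}
print(bare)
## [1] TRUE TRUE
```

So both forms hand back a length-1 vector. The difference is only whether the attributes are copied along with the value.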

If anybody has some good teaching material on this I’d love to see it.