You don’t need to understand pointers to program using R

Win-Vector Blog 2014-04-02

R is a statistical analysis package based on writing short scripts or programs (versus being based on GUIs like spreadsheets or directed workflow editors). I say “writing short scripts” because R’s programming language (itself a dialect of the language S) is a bit of an oddity that you really wouldn’t be using except that it gives you access to superior analytics data structures (R’s data.frame and treatment of missing values) and deep, ready-to-go statistical libraries. For longer pure programming tasks you are better off using something else (be it Python, Ruby, Java, C++, JavaScript, Go, ML, Julia, or something else). However, the S language has one feature that makes it pleasant to learn (despite any warts): it can be initially used and taught without having to worry about the semantics of references or pointers.

In our new book (Practical Data Science with R) we didn’t get into the lack of pointers for a purely didactic reason. To tell a general audience (perhaps one new to scripting or programming) that they don’t need to know about pointers, we would have to first explain what pointers are (somewhat losing the cognitive savings). We settled for demonstrating R’s (primarily) call by value semantics for functions (which we already needed to explain) with the following example:

    > vec <- c(1,2)
    > fun <- function(v) { v[[2]]<-5; print(v)}
    > fun(vec)
    [1] 1 5
    > print(vec)
    [1] 1 2

Notice how the mutation (changing an entry to 5) does not escape the function as a side effect. Because R is a bit of a kitchen sink (everything and its opposite is pretty much available) we had to cautiously title this example “R behaves like a call-by-value language” in our book (R in fact has a number of sharable reference structures including environments, ReferenceClasses, lazy evaluation systems like promises/delayedAssign, and more). (The ugly [[]] notation is something we recommend as it catches a few more errors than the more common [] notation. For details please see appendix A of our book.)
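As a minimal sketch of the contrast (not an example from the book; the names set_second, set_in_env, changed, env, loose and strict are made up for illustration), here is a vector passed to a function next to an environment passed to a function, plus one case we are aware of where [[]] is stricter than []:

```r
# Atomic vectors behave like values: the caller's copy is untouched.
set_second <- function(v) { v[[2]] <- 5; v }
vec <- c(1, 2)
changed <- set_second(vec)
print(vec)                      # [1] 1 2

# Environments are shared references: the mutation escapes the function.
set_in_env <- function(e) { assign('x', 5, envir = e); invisible(NULL) }
env <- new.env()
assign('x', 1, envir = env)
set_in_env(env)
print(get('x', envir = env))    # [1] 5

# [[ ]] signals an error for an out-of-range index, while [ ] quietly returns NA.
loose <- vec[3]
strict <- tryCatch(vec[[3]], error = function(err) 'error')
```

The vector survives the call unchanged, while the environment does not. That is why the book title hedges with “behaves like.”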

What we didn’t discuss is that you get this sort of change isolation and safety in R in just about every situation (not just when binding values to function arguments). Here is another example (this time not from the book):

    > vec <- c(1,2)
    > v2 <- vec
    > v2[[2]] <- 5
    > print(v2)
    [1] 1 5
    > print(vec)
    [1] 1 2

Unlike many languages the assignment “v2 <- vec” does not end up with vec and v2 as references (or pointers) entangled to the same object. Instead they behave as if they are two different objects. This does prevent using these two symbols to communicate results (a legitimate programming practice) but it also prevents a whole host of errors and confusions that beginning programmers run into in the presence of such shared mutability. R protects the programmer by treating objects directly without exposing the additional ideas of references or pointers. Many ideal functional programming languages more directly expose references but mitigate their danger by insisting on immutable structures; but this requires the user to learn (in addition to data handling, statistics and programming) the fairly alien discipline of composing immutable data structures.
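The same isolation appears to hold for the larger structures you actually work with. A small sketch, assuming a toy data.frame and list (the names d, d2, lst and lst2 are ours, chosen only for illustration):

```r
# Assigning a data.frame to a second name gives an independent value.
d <- data.frame(x = c(1, 2), y = c('a', 'b'), stringsAsFactors = FALSE)
d2 <- d
d2$x[[2]] <- 5
print(d2$x)   # [1] 1 5
print(d$x)    # [1] 1 2

# Lists behave the same way, even when the change is to a nested element.
lst <- list(a = c(1, 2))
lst2 <- lst
lst2$a[[2]] <- 5
print(lst$a)  # [1] 1 2
```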

We encourage beginning programmers to think of programs as organizing sequences of transformations over data. So the simpler (and fewer) the mutations are, the easier it is to reason about programs. When you program in R you are mostly working with values and not variables (which is good, as it leaves you more time to think about data). So, as much as we complain about R, it is in fact a good choice for teaching, analysis, data science and even basic scripting tasks.
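To make “sequences of transformations” concrete, here is a small sketch with made-up data (raw, cleaned, scaled and result are hypothetical names). Each step produces a new value and nothing upstream is altered:

```r
raw <- c(3, NA, 7, 10)
cleaned <- raw[!is.na(raw)]        # drop missing values
scaled <- cleaned / max(cleaned)   # rescale so the largest entry is 1
result <- round(scaled, 2)
print(result)                      # [1] 0.3 0.7 1.0
print(raw)                         # [1]  3 NA  7 10
```

Because every step leaves its input alone, any intermediate value can be inspected or reused later without worrying about what happened to it in between.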

However, you do eventually have to deal with the unpleasant details of side-effects and shared mutability. One place where R doesn’t hide the sharp edges from you is in closures (the structure R uses to represent the context of a function). Consider the following code puzzle: what gets printed?

    # make an array of 3 functions
    f <- vector('list',3)
    # set the i'th function to return i
    for(i in 1:length(f)) {
      f[[i]] <- function() { i }
    }
    # apply the functions using a different loop variable
    for(j in 1:length(f)) {
      print(f[[j]]())
    }

Note this is one place where you really do need to use the uglier [[]] notation. In the current version of R (3.0.2), if you try to use [] you get the error message “cannot coerce type ‘closure’ to vector of type ‘list’.” But the puzzle is: what do you expect to be printed? If R were binding the value of i into the i’th function you would expect to see the sequence “1,2,3.” Instead, each function looks up i in the environment it captured when it was created, and all three functions captured the same environment. So this code in fact prints “3,3,3”, as this is the value i has after the first loop is finished. This is unfortunate, as a lot of productive programming patterns depend on capturing safe isolated values, not capturing entangled references.
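One way to convince yourself of the sharing is to poke at it directly. The sketch below rebuilds the puzzle and then checks two things (same_env and after_reassign are names we introduce just for this check): that the functions have the very same environment, and that a later change to i shows through.

```r
f <- vector('list', 3)
for (i in 1:length(f)) {
  f[[i]] <- function() { i }
}

# All three closures were created in, and so captured, one and the same environment.
same_env <- identical(environment(f[[1]]), environment(f[[3]]))
print(same_env)         # [1] TRUE

# Reassigning i afterwards changes what every one of the functions returns.
i <- 10
after_reassign <- f[[1]]()
print(after_reassign)   # [1] 10
```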

This sort of puzzle may seem unpleasant and unnatural, but when pointers (and other sorts of shared references) are involved you are forced to solve this sort of puzzle to understand the meaning or semantics of a code fragment or program. It is because these puzzles are laborious that languages like R emphasize isolation, so there is much less to worry about when you try to compose useful data transformations.

Closures and environments are very powerful tools (many of R’s features are built in terms of them). And their shared mutability is a huge source of confusion in many programming languages (JavaScript also has this issue, and Java only allows closures to capture final variables to try and cut down on some of the possible interference). To get the behavior we want (each function capturing the current value of i in its closure and not sharing a common reference) we can write the following code:

    f <- vector('list',3)
    for(i in 1:length(f)) {
      f[[i]] <- function() { i }
      e <- new.env()
      assign('i',i,envir=e)
      environment(f[[i]]) <- e
    }
    for(j in 1:length(f)) {
      print(f[[j]]())
    }

And this prints 1,2,3 as we would hope. Note we are now in very deep programming ground (closures being at least as confusing to beginners as pointers) and no longer even thinking about data. We have to admit: we really counted to 3 the hard way.
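For what it is worth, there are somewhat shorter routes to the same place. The sketch below is an alternative we did not use above, not a claim about the one right way: a small function factory (make_const is a hypothetical name) that uses force() to pin the value down, and a loop that uses local() to give each function its own private environment.

```r
# A function factory: each call to make_const gets its own environment holding k.
make_const <- function(k) {
  force(k)            # evaluate the argument now rather than lazily later
  function() { k }
}
f <- lapply(1:3, make_const)
results <- vapply(f, function(g) g(), integer(1))
print(results)        # [1] 1 2 3

# The same idea inside a for loop: local() evaluates its body in a fresh environment.
g <- vector('list', 3)
for (i in 1:length(g)) {
  g[[i]] <- local({ k <- i; function() { k } })
}
print(g[[1]]())       # [1] 1
```

Either way we are still reasoning about environments rather than data, so the moral stands.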