R Tip: Think in Terms of Values

Win-Vector Blog 2018-04-02

R tip: first organize your tasks in terms of data, values, and desired transformation of values, not initially in terms of concrete functions or code.

I know I write a lot about coding in R. But it is in the service of supporting statistics, analysis, predictive analytics, and data science.

R without data is like going to the theater to watch the curtain go up and down.

(Adapted from Ben Katchor’s Julius Knipl, Real Estate Photographer: Stories, Little, Brown, and Company, 1996, page 72, “Excursionist Drama 2”.)

Usually you come to R to work with data. If you think and plan in terms of data and values (including introducing more data to control processing) you will usually work in a much faster, more explainable, and more maintainable fashion.

A simple example

Let’s start with a typical dplyr example. Suppose we wish to select two columns (in this case c("name", "height")) from a data.frame (in this case dplyr::starwars). This is accomplished easily as we show below.

library("dplyr")

starwars %>%
  select(name, height)
# # A tibble: 87 x 2
#    name               height
#    <chr>               <int>
#  1 Luke Skywalker        172
#  2 C-3PO                 167
#  3 R2-D2                  96
#  4 Darth Vader           202
#  5 Leia Organa           150
#  6 Owen Lars             178
#  7 Beru Whitesun lars    165
#  8 R5-D4                  97
#  9 Biggs Darklighter     183
# 10 Obi-Wan Kenobi        182
# # ... with 77 more rows

Our advice

In practice we recommend coding only after you have decided what you are going to do and what parameters specify your steps.

Once you get to coding, in our opinion intent is much clearer if you make things explicit. For example, if you are working with magrittr pipes: make the pipe input argument explicit with “.” (please see R Tip: Make Arguments Explicit in magrittr/dplyr Pipelines). And if you are working with dplyr::select(): make the argument roles explicit. We suggest collecting the column names into a separate group to show that their role is different from the role of the incoming data.frame. At first this explicitness unfortunately reduces legibility, as our code then looks like the following.

starwars %>%
  select(., one_of( c("name", "height") ))

Note this is not a criticism of one_of(); it is discomfort at needing something like one_of(). And I fully admit: the popular dplyr style of not including the first argument in pipelines does not have the legibility problem; I myself introduced that problem by insisting on an explicit data argument. However, I have found that explicit arguments make it much easier for students to learn how to use dplyr functions both inside and outside pipelines. I also feel the explicit documentation of arguments has a number of down-stream advantages.
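
As a small sketch of that point (the variable names columns, r1, and r2 are ours, introduced only for illustration): with the data argument written out, the call used inside the pipeline is the same call used outside it, with “.” swapped for the data.frame.

```r
library("dplyr")

# the columns we want, held as an ordinary character vector
columns <- c("name", "height")

# inside a pipeline, with the data argument written out as "."
r1 <- starwars %>%
  select(., one_of(columns))

# outside a pipeline: the same call, with the data.frame in place of "."
r2 <- select(starwars, one_of(columns))

# compare the two results
identical(r1, r2)
```

Moving between the two forms is then a mechanical substitution rather than a change of calling convention.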

Minimize your reliance on implicit convention. What is obvious to you when writing the code may not be obvious to others, and may be something you don’t remember later. Along these lines we have a mini-style guide for effectively using dplyr with and without pipelines here.

Our specific legibility issue is just a matter of the nested “one_of(c("", ...))” construct being a bit clumsy. If we use an adapted version of select() that expects the list of columns to come in as a vector (as is typical for values in R) and use a vector constructor that does not need the quotes (such as qc(), please see R Tip: Use qc() For Fast Legible Quoting) we get a pipeline that is both very explicit (so more self-documenting) and quite convenient and legible:

library("wrapr")
library("seplyr")

starwars %>%
  select_se(., qc(name, height))

select_se() stands for “select standard evaluation”, meaning it is an adaptation of select() that expects to be supplied the set of columns as a vector value. This function has a two-argument interface (data and vector of columns) and is simple to describe and reason about. qc() itself is a non-standard (or name capturing) interface. Capturing names is all qc() does, so using it documents the user’s intent to capture names. If one does not mind the quotes, one can avoid qc() entirely and write code such as the following.

columns <- c("name", "height")
select_se(starwars, columns)

The above is simple, as it should be. select_se() is a function that expects two values and we call it supplying two values. This may seem less magical than “starwars %>% select(name, height)” (which involves piping, hidden function arguments, and name capture), and if so that is a good thing. Selecting a few columns is a basic task, so it should not require a lot of cognitive load.
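
To illustrate why holding the column list as an ordinary value pays off, here is a minimal base-R sketch. It assumes the built-in mtcars data.frame in place of starwars, and the names wanted, d1, and d2 are hypothetical. The same vector is checked, used, and extended with no special interface.

```r
# the column selection is plain data: a character vector
wanted <- c("mpg", "cyl")

# because it is a value, we can validate it before use
stopifnot(all(wanted %in% colnames(mtcars)))

# use it to select columns (drop = FALSE keeps the result a data.frame)
d1 <- mtcars[, wanted, drop = FALSE]

# re-use and extend it with ordinary vector operations
d2 <- mtcars[, c(wanted, "hp"), drop = FALSE]

colnames(d1)
# [1] "mpg" "cyl"

colnames(d2)
# [1] "mpg" "cyl" "hp"
```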

Even better than more variations on tool interfaces is more tools to capture values that can be used and re-used in many ways later.

Value capturing tools

Our group has been developing some simple tools for conveniently capturing values from the user. The idea is that with these you get most of the convenience of having non-standard interfaces in many places, without the additional complexity of depending on non-standard interfaces being everywhere.

The trouble with nonstandard evaluation is that it doesn’t follow standard evaluation rules …

—Peter Dalgaard (about nonstandard evaluation in the curve() function) R-help (June 2011)

As quoted in the fortunes package.

Standard evaluation interfaces (or value oriented interfaces) are generally preferred because their primary property is referential transparency. Referential transparency is when expressions can be replaced by their evaluated values without changing outcomes. Sequentially replacing expressions with values is program evaluation.
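
A minimal base-R sketch of the distinction follows. The functions select_cols() and capture_name() are hypothetical helpers written only for this illustration, and the built-in mtcars data.frame stands in for real data. With the value-oriented function a variable and its value are interchangeable; with the name-capturing function they are not.

```r
# a value-oriented (standard evaluation) function:
# it only ever sees the value of cols
select_cols <- function(d, cols) {
  d[, cols, drop = FALSE]
}

# a name-capturing (non-standard evaluation) function:
# it looks at the expression the caller typed, not its value
capture_name <- function(x) {
  deparse(substitute(x))
}

cols <- c("mpg", "cyl")

# standard evaluation: replacing the variable by its value changes nothing
a <- select_cols(mtcars, cols)
b <- select_cols(mtcars, c("mpg", "cyl"))
identical(a, b)
# [1] TRUE

# non-standard evaluation: the same replacement changes the result
n1 <- capture_name(cols)             # the text of the variable name
n2 <- capture_name(c("mpg", "cyl"))  # the text of the whole expression
identical(n1, n2)
# [1] FALSE
```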

But, away from theory in the large and back to programming in the small. Let’s conclude with a few tools that make constructing useful values easier.

We have already seen qc(), which stands for “quoting concatenate”. It is used as follows.

v <- qc(name, height)

print(v)
# [1] "name"   "height"

dput(v)
# c("name", "height")

qc() can also be used to construct named vectors, which are very useful as maps.

map <- qc(a = A, b = B)

print(map)
#   a   b 
# "A" "B"
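
As a brief sketch of what “useful as maps” can mean, here is a base-R example. It builds the same named vector with c() so it runs without wrapr; the names keys and d are hypothetical. Indexing the named vector by a vector of keys performs the lookup.

```r
# the same named vector as above, built with base R
map <- c(a = "A", b = "B")

# look up several keys at once by indexing with a character vector
keys <- c("b", "a", "b")
unname(map[keys])
# [1] "B" "A" "B"

# the same idea renames the columns of a data.frame
d <- data.frame(a = 1:2, b = 3:4)
colnames(d) <- unname(map[colnames(d)])
colnames(d)
# [1] "A" "B"
```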

We also have a “print as paste-able code” function map_to_char(), which is a bit more convenient (for simple structures) than dput().

dput(map)
# structure(c("A", "B"), .Names = c("a", "b"))

map_to_char(map)
# [1] "c('a' = 'A', 'b' = 'B')"

We also have build_frame(), which is a convenience for typing in simple small data.frames directly in row-oriented form (similar in intent to tibble::tribble()):

d <- build_frame(
  "name", "value" |
  "a"   , 1       |
  "b"   , 2       )

print(d)
#   name value
# 1    a     1
# 2    b     2

The end of the first row is indicated by almost any infix operator (we used “|”). More details on working with build_frame() can be found here.

The draw_frame() function can render small simple data.frames into paste-able form. This is a great way to capture and share examples (without dates or other complex or annotated types).

cat(draw_frame(d))
# build_frame(
#    "name", "value" |
#    "a"   , 1       |
#    "b"   , 2       )

dput(d)
# structure(list(name = c("a", "b"), value = c(1, 2)), .Names = c("name", 
# "value"), row.names = c(NA, -2L), class = "data.frame")

Strip off the comment #-marks and you can paste the draw_frame() presentation into other work as legible code.

For data.frames that are purely string valued, we have qchar_frame(), which is essentially qc() for data.frames.

d <- qchar_frame(
  name, value |
  a   , x     |
  b   , y     )

print(d)
#   name value
# 1    a     x
# 2    b     y

cat(draw_frame(d))
# build_frame(
#    "name", "value" |
#    "a"   , "x"     |
#    "b"   , "y"     )

The cdata package uses pure-character data.frames for pivot/un-pivot control structures, and thus can make good use of qchar_frame().

Conclusion

In conclusion: sometimes when you think you need more code, you actually just need to move more of your intent into data and values. In R it pays to treat as much as you can as values (data, selections, configuration, and even results).