Parametric variable names and dplyr

Win-Vector Blog 2016-12-05

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.

[Image: Rube Goldberg machine, 1931 (public domain).]

Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it were parametric.

The Problem

dplyr is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr has a complete parametric interface with the “underbar forms” (for example: using dplyr::filter_ instead of dplyr::filter). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.
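To see the difficulty, consider filtering on a column whose name arrives as a string. With the underbar forms of the era this required a standard-evaluation incantation such as the following (a sketch using the `lazyeval::interp` pattern from the dplyr non-standard-evaluation vignette; column and variable names here are illustrative):

```r
library('dplyr')
library('lazyeval')

d <- data.frame(Sepal_Length = c(5.8, 5.7))
colname <- 'Sepal_Length'   # the column name arrives as a string

# parametric filter via the standard-evaluation ("underbar") form:
# interp() substitutes the actual column symbol for the placeholder v
d %>% filter_(lazyeval::interp(~ v > 5.75, v = as.name(colname)))
```

Compare this with the direct `filter(Sepal_Length > 5.75)` one would write when the column name is known in advance.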

As an attempted work-around replyr now supplies an adapter that applies a mapping from column names you have (which can be supplied parametrically) to concrete column names you wish you had (which would allow you to write dplyr code simply in terms of known or assumed column names).

It is easier to show than explain.

An Example

First we set up our libraries and type in some notional data as our example:

# install.packages('devtools')                 # Run this if you don't already have devtools
# devtools::install_github('WinVector/replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
d <- data.frame(Sepal_Length=c(5.8,5.7),
                Sepal_Width=c(4.0,4.4),
                Species='setosa',
                rank=c(1,2))
print(d)
#   Sepal_Length Sepal_Width Species rank
# 1          5.8         4.0  setosa    1
# 2          5.7         4.4  setosa    2

Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):

nmap <- c('GroupColumn','ValueColumn','RankColumn')
names(nmap) <- c('Species','Sepal_Length','rank')
print(nmap)
#       Species  Sepal_Length          rank
# "GroupColumn" "ValueColumn"  "RankColumn"
dtmp <- replyr_renameRestrictCols(d,nmap)
print(dtmp)
#   GroupColumn ValueColumn RankColumn
# 1      setosa         5.8          1
# 2      setosa         5.7          2

At this point you do know the column names (they are the ones you picked) and can write nice neat dplyr. You can then do your work:

# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero

Notice we were able to use dplyr::mutate without needing dplyr::mutate_ (and without a trip to Stack Overflow to look up the lazy-eval notation yet again; imagine the joy of never having to write “dplyr::mutate_(.dots=stats::setNames(ni,ti))” again).

Once you have your desired result you restore the original names of the restricted column set:

invmap <- names(nmap)
names(invmap) <- as.character(nmap)
print(invmap)
#    GroupColumn    ValueColumn     RankColumn
#      "Species" "Sepal_Length"         "rank"
replyr_renameRestrictCols(dtmp,invmap)
#   Species Sepal_Length rank
# 1  setosa          5.8    0
# 2  setosa          5.7    1

If you haven’t worked a lot with dplyr this won’t look that interesting. If you do work a lot with dplyr you may have been asking for something like this for quite a while. If you use dplyr::*_ you will love replyr::replyr_renameRestrictCols. Be aware: replyr::replyr_renameRestrictCols is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.
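Putting the pieces together, the pattern lets one wrap concrete dplyr code in a function that is parametric in its column names. Here is a sketch (decrementRank is an illustrative name, not part of replyr; note that replyr_renameRestrictCols also restricts to the mapped columns, so any unmapped columns are dropped):

```r
library('dplyr')
library('replyr')

# decrement a rank column, parametric in the column name
# (rankColumn is supplied as a string by the caller)
decrementRank <- function(d, rankColumn) {
  nmap <- stats::setNames('RankColumn', rankColumn)  # name we have -> name we wish we had
  dtmp <- replyr_renameRestrictCols(d, nmap)         # note: drops non-mapped columns
  dtmp <- dtmp %>% mutate(RankColumn = RankColumn - 1)
  invmap <- stats::setNames(names(nmap), as.character(nmap))
  replyr_renameRestrictCols(dtmp, invmap)            # restore the original name
}

decrementRank(d, 'rank')
```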

A Proposal

I feel the replyr::replyr_renameRestrictCols interface represents the correct design for a better dplyr-based adapter.

I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr:

  1. view_as(df,columnNameMap) takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
  2. unview() removes the view_as annotation.
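Under this proposal the earlier example could be written as follows (a hypothetical sketch: view_as and unview are the proposed operators, not functions that exist in dplyr):

```r
# hypothetical sketch: view_as/unview are proposed operators, not real dplyr functions
d %>%
  view_as(c('Species'      = 'GroupColumn',
            'Sepal_Length' = 'ValueColumn',
            'rank'         = 'RankColumn')) %>%
  mutate(RankColumn = RankColumn - 1) %>%   # work in terms of the mapped names
  unview()                                  # drop the view; original names return
```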

Obviously there is an issue of nested views; I would suggest maintaining the views as a stack and applying the composite transformation implied by the stack of mapping specifications. I am assuming dplyr does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).

I would also suggest that dplyr::arrange be enhanced to have a visible annotation (just the column names it has arranged by) that allows the user to check if the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr data sources would support three primary annotations:

  1. Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
  2. Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (just removes annotation, does not undo arrangement; annotation also removed by any operation that re-orders the data, such as join), and viewed by Xdplyr::arrangement.
  3. Column Views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.

The “Xdplyr::” items are the extensions that are being proposed.

An Alternative Proposal

Another possibility would be some sort of “let” statement that controls name bindings for the duration of a block of code.

I’ll call this the “let block proposal.” The advantage of “let” is that the block goes in and out of scope in an orderly manner; the disadvantage is that the re-namings are not shared with called functions.

Using such a statement we would write our above example calculation as:

let(
  list(RankColumn='rank',GroupColumn='Species'),
  {
    # pretend this block is a huge sequence of complicated and expensive operations.
    d %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
  })

The idea is that the items 'rank' and 'Species' could be passed in parametrically (notice the let specification is essentially invmap, so we could just pass that in). This isn’t quite R’s “with” statement, as we are binding names to names rather than names to values. Essentially we are asking for a macro facility that is compatible with dplyr remote data sources (and the non-standard evaluation methods used to capture variable names).

It turns out gtools::strmacro is nearly what we need. For example, the following works:

gtools::strmacro(
  RankColumn='rank',
  expr={ d %>% mutate(RankColumn=RankColumn-1) })()

But the above stops just short of taking in the original column names parametrically. The following does not work:

RankColumnName <- 'rank'
gtools::strmacro(
  RankColumn=RankColumnName,
  expr={ d %>% mutate(RankColumn=RankColumn-1) })()

I was able to adapt code from gtools::strmacro to create a working let-block, implemented as replyr::let:

replyr::let(
  alias=invmap,
  expr={
    d %>% mutate(RankColumn=RankColumn-1)
  })()
#   Sepal_Length Sepal_Width Species rank
# 1          5.8         4.0  setosa    0
# 2          5.7         4.4  setosa    1

Conclusion

I feel the above methods will make working with parameterized variables in dplyr much easier.