Parametric variable names and dplyr
Win-Vector Blog 2016-12-05
When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R
libraries that assume you know the variable names. The R
data manipulation library dplyr
currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_
), but their use can get tricky.
Better support for parametric treatment of variable names would be a boon to dplyr
users. To this end the replyr
package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr
code to be used as if it were parametric.
The Problem
dplyr
is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr
has a complete parametric interface with the “underbar forms” (for example: using dplyr::filter_
instead of dplyr::filter
). However, the underbar notation (and the necessary details of specifying lazy evaluation of formulas) rapidly becomes difficult to use.
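To make the difficulty concrete, here is a small sketch in the dplyr 0.5-era lazyeval style (d0 and colName are illustrative names, not part of the example that follows): filtering on a column whose name only arrives at run time.

library('dplyr')
d0 <- data.frame(x = c(1, 5, 10))
colName <- 'x'   # the column name arrives in a variable
# the underbar form plus lazyeval::interp, just to express "x > 2"
d0 %>% filter_(lazyeval::interp(~ v > 2, v = as.name(colName)))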
As an attempted work-around, replyr
now supplies an adapter that applies a mapping from column names you have (which can be supplied parametrically) to concrete column names you wish you had (which would allow you to write dplyr
code simply in terms of known or assumed column names).
It is easier to show than explain.
An Example
First we set up our libraries and type in some notional data as our example:
# install.packages('devtools') # Run this if you don't already have devtools
# devtools::install_github('WinVector/replyr') # Run this if you don't already have replyr
library('dplyr')
library('replyr')
d <- data.frame(Sepal_Length=c(5.8,5.7),
                Sepal_Width=c(4.0,4.4),
                Species='setosa',
                rank=c(1,2))
print(d)
 #   Sepal_Length Sepal_Width Species rank
 # 1          5.8         4.0  setosa    1
 # 2          5.7         4.4  setosa    2
Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):
nmap <- c('GroupColumn','ValueColumn','RankColumn')
names(nmap) <- c('Species','Sepal_Length','rank')
print(nmap)
 #       Species  Sepal_Length          rank
 # "GroupColumn" "ValueColumn"  "RankColumn"
dtmp <- replyr_renameRestrictCols(d,nmap)
print(dtmp)
 #   GroupColumn ValueColumn RankColumn
 # 1      setosa         5.8          1
 # 2      setosa         5.7          2
At this point you do know the column names (they are the ones you picked) and can write nice neat dplyr
code. You can then do your work:
# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
Notice we were able to use dplyr::mutate
without needing to use dplyr::mutate_
(and without needing to go to Stack Overflow to look up the lazy-eval notation yet again; imagine the joy in never having to write “dplyr::mutate_(.dots=stats::setNames(ni,ti))
” ever again).
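For comparison, here is roughly what the step above would look like in the underbar style (a sketch assuming the dplyr 0.5-era lazyeval interface; not assigned back, so it does not alter dtmp):

ti <- 'RankColumn'                                           # target column name
ni <- list(lazyeval::interp(~ col - 1, col = as.name(ti)))   # the expression to compute
dtmp %>% mutate_(.dots = stats::setNames(ni, ti))            # same effect, much harder to read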
Once you have your desired result, you restore the original names of the restricted column set:
invmap <- names(nmap)
names(invmap) <- as.character(nmap)
print(invmap)
 #    GroupColumn    ValueColumn     RankColumn
 #      "Species" "Sepal_Length"         "rank"
replyr_renameRestrictCols(dtmp,invmap)
 #   Species Sepal_Length rank
 # 1  setosa          5.8    0
 # 2  setosa          5.7    1
If you haven’t worked a lot with dplyr
this won’t look that interesting. If you do work a lot with dplyr
you may have been asking for something like this for quite a while. If you use dplyr::*_ you will love replyr::replyr_renameRestrictCols
. Be aware: replyr::replyr_renameRestrictCols
is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.
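For completeness, here is a sketch (using only the calls shown above; decrementRank is a name chosen for illustration) of how the pattern wraps into a re-usable function that takes the column names as ordinary string arguments:

decrementRank <- function(data, groupColumn, valueColumn, rankColumn) {
  # map the caller's column names to the standard names used in the body
  nmap <- stats::setNames(c('GroupColumn','ValueColumn','RankColumn'),
                          c(groupColumn, valueColumn, rankColumn))
  invmap <- stats::setNames(names(nmap), as.character(nmap))
  data %>%
    replyr_renameRestrictCols(nmap) %>%
    mutate(RankColumn = RankColumn - 1) %>%   # work in terms of known names
    replyr_renameRestrictCols(invmap)         # restore the caller's names
}

# note: columns not named in the map (here Sepal_Width) are dropped by the restrict step
decrementRank(d, 'Species', 'Sepal_Length', 'rank')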
A Proposal
I feel the replyr::replyr_renameRestrictCols interface represents the correct design for a better dplyr-based adapter.
I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr
:
- view_as(df,columnNameMap) takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
- unview() removes the view_as annotation.
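Neither operator exists yet; the following is only a rough stand-alone emulation built from replyr_renameRestrictCols (a real implementation would carry the mapping along with the data as a stack of views, so unview() would not need the map re-supplied):

view_as <- function(df, columnNameMap) {
  replyr_renameRestrictCols(df, columnNameMap)
}
unview <- function(df, columnNameMap) {
  invMap <- stats::setNames(names(columnNameMap), as.character(columnNameMap))
  replyr_renameRestrictCols(df, invMap)
}

d %>%
  view_as(nmap) %>%                        # work under the standard names
  mutate(RankColumn = RankColumn - 1) %>%
  unview(nmap)                             # restore the original names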
Obviously there is an issue of nested views; I would suggest maintaining the views as a stack while using the composite transformation implied by the stack of mapping specifications. I am assuming dplyr
does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).
I would also suggest that dplyr::arrange
be enhanced to have a visible annotation (just the column names it has arranged by) that allows the user to check if the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr
data sources would support three primary annotations:
- Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
- Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (just removes the annotation, does not undo the arrangement; the annotation is also removed by any operation that re-orders the data, such as join), and viewed by Xdplyr::arrangement.
- Column Views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.
The “Xdplyr::
” items are the extensions that are being proposed.
An Alternative Proposal
Another possibility would be some sort of “let
” statement that controls name bindings for the duration of a block of code.
I’ll call this the “let block proposal.” The advantage of “let
” is that the block goes in and out of scope in an orderly manner; the disadvantage is that the re-namings are not shared with called functions.
Using such a statement we would write our above example calculation as:
let(
  list(RankColumn='rank',GroupColumn='Species'),
  {
    # pretend this block is a huge sequence of complicated and expensive operations.
    d %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
  })
The idea is that the items 'rank'
and 'Species'
could be passed in parametrically (notice the let
specification is essentially invmap
, so we could just pass that in). This isn’t quite R’s “with” statement, as we are not binding names to values, but names to names. Essentially we are asking for a macro facility that is compatible with dplyr remote data sources (and the non-standard evaluation methods used to capture variable names).
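For contrast, base R's with() binds a name to the column's values rather than to another name:

with(d, rank - 1)   # 'rank' resolves to the column's values, not to a renamed column
 # [1] 0 1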
It turns out gtools::strmacro
is nearly what we need. For example, the following works:
gtools::strmacro(
  RankColumn='rank',
  expr={
    d %>% mutate(RankColumn=RankColumn-1)
  })()
But the above stops just short of taking in the original column names parametrically. The following does not work:
RankColumnName <- 'rank'
gtools::strmacro(
  RankColumn=RankColumnName,
  expr={
    d %>% mutate(RankColumn=RankColumn-1)
  })()
I was able to adapt code from gtools::strmacro
to create a working let
-block implemented as replyr::let
:
replyr::let(
  alias=invmap,
  expr={
    d %>% mutate(RankColumn=RankColumn-1)
  })()
 #   Sepal_Length Sepal_Width Species rank
 # 1          5.8         4.0  setosa    0
 # 2          5.7         4.4  setosa    1
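And, to close the loop on the strmacro failure above, the alias can itself be built from a variable holding the column name (a small usage sketch in the same style as the call above; aliasMap and RankColumnName are illustrative names):

RankColumnName <- 'rank'                    # the column name arrives parametrically
aliasMap <- c(RankColumn = RankColumnName)  # same shape as invmap, just one entry
replyr::let(
  alias=aliasMap,
  expr={
    d %>% mutate(RankColumn=RankColumn-1)
  })()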
Conclusion
I feel the above methods will make working with parameterized variables in dplyr
much easier.