New Timings for a Grouped In-Place Aggregation Task
Win-Vector Blog 2020-01-02
I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
Roughly, the task was to add some derived per-group aggregation columns to a data set of a few million rows. In the application the groups tend to be small session logs from many users, so the groups are numerous and small.
We can create an abstract version of such data in R as follows.
set.seed(2020)
n <- 1000000
mk_data <- function(n) {
  d <- data.frame(x = rnorm(n))
  d$g <- sprintf("level_%09g",
                 sample.int(n, size = n, replace = TRUE))
  return(d)
}
d <- mk_data(n)
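With n draws from n possible IDs, sampling with replacement misses any given ID with probability (1 - 1/n)^n, which is about 1/e. The expected number of unique IDs is therefore about (1 - 1/e) n, or roughly 0.63 n (an occupancy calculation closely related to the coupon collector’s problem), for a mean group size of about 1.6 rows. So we expect lots of small groups in such data.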
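As a sanity check on the "many small groups" claim, here is a minimal sketch in base R that counts the distinct IDs produced by this sampling scheme and tabulates the group sizes. The names n_check, ids, frac_unique, frac_expected, group_sizes, and size_table are illustrative and are not part of the original benchmark; the reference value is the chance that a given ID is drawn at least once in n draws.

```r
# Check the group structure produced by sampling IDs with replacement.
# All names here are illustrative; this is not the benchmark code.
set.seed(2020)
n_check <- 1000000
ids <- sample.int(n_check, size = n_check, replace = TRUE)

# Number of distinct IDs observed, as a fraction of the possible IDs.
n_unique <- length(unique(ids))
frac_unique <- n_unique / n_check

# Reference value: a given ID is missed with probability (1 - 1/n)^n,
# which is close to exp(-1) for large n.
frac_expected <- 1 - (1 - 1 / n_check)^n_check

# Distribution of group sizes: how many groups have 1, 2, 3, ... rows.
group_sizes <- tabulate(ids, nbins = n_check)
size_table <- table(group_sizes[group_sizes > 0])

print(c(observed = frac_unique, expected = frac_expected))
print(size_table)
```

The size table should be dominated by groups of one to three rows, which is the regime the timings in this note are meant to exercise.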
Our task can be specified in rquery/rqdatatable notation as follows.
library(rqdatatable)
ops_rqdatatable <- local_td(d, name = 'd') %.>%
  extend(.,
         rn %:=% row_number(),
         cs %:=% cumsum(x),
         partitionby = 'g',
         orderby = 'x') %.>%
  order_rows(.,
             c('g', 'x'))
The key step is the extend(), which adds the new columns rn and cs in a per-g group manner in a by-x order. We feel the notation is learnable and expressive. (Note: normally we would use := for assignment, but as we are also running direct data.table examples we didn’t load this operator and instead used %:=% to stay out of data.table’s way.)
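To make the intended result concrete, here is a small base R sketch of the same computation on a five-row toy data frame. It is an illustration of the semantics only, not one of the benchmarked implementations: d_small and res are hypothetical names, and the sketch sorts first and then computes the per-group columns, which should give the same rows as extending and then ordering.

```r
# Tiny illustrative input (hypothetical values, not the benchmark data).
d_small <- data.frame(
  x = c(3, 1, 2, 5, 4),
  g = c("a", "a", "b", "b", "a"),
  stringsAsFactors = FALSE
)

# Sort by group, then by x within group (the role of order_rows()).
res <- d_small[order(d_small$g, d_small$x), , drop = FALSE]
rownames(res) <- NULL

# Per-group row number and running sum, taken in the sorted order
# (the role of extend() with partitionby = 'g' and orderby = 'x').
res$rn <- ave(res$x, res$g, FUN = seq_along)
res$cs <- ave(res$x, res$g, FUN = cumsum)

print(res)
```

For group a the sorted x values are 1, 3, 4, so rn is 1, 2, 3 and cs is 1, 4, 8; for group b the values 2, 5 give rn 1, 2 and cs 2, 7.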
We translated the same task into several different notations: data.table, dplyr, dtplyr, and data_algebra. The observed task times are given below.
Mean task run times in seconds (smaller is better)

Method Interface   Language   Data Engine   Mean run time in seconds
rqdatatable        R          data.table     3.8
data.table         R          data.table     2.1
dplyr              R          dplyr         35.1
dtplyr             R          data.table     5.1
data_algebra       Python     Pandas        17.1

What is missing is a direct Pandas timing (to confirm whether the length of the Python run time comes from data_algebra overhead or from the underlying Pandas engine).
What stands out is how fast data.table, and even the data.table-based methods, are compared to all other methods.
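The size of the gap is easier to see as ratios. The following sketch simply divides each mean run time reported above by the direct data.table time; run_times and relative are illustrative names, and the numbers are copied from the table rather than re-measured.

```r
# Mean run times in seconds, copied from the table above.
run_times <- c(
  rqdatatable  = 3.8,
  data.table   = 2.1,
  dplyr        = 35.1,
  dtplyr       = 5.1,
  data_algebra = 17.1
)

# Slowdown of each method relative to direct data.table (1.0 = same speed).
relative <- round(run_times / run_times[["data.table"]], 1)
print(sort(relative))
```

On these numbers dplyr takes roughly 17 times as long as direct data.table and data_algebra roughly 8 times as long, while the data.table-backed interfaces (rqdatatable and dtplyr) stay within about 2.5 times.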
Details of the benchmark runs (methods, code, data, versions, and so on) can be found here.