Running aggregate on wide matrices takes loooong; use nested apply or data tables instead!

R-bloggers 2025-04-04

[This article was first published on gacatag, and kindly contributed to R-bloggers.]

The aggregate function can be very useful in R: it lets you run a function (e.g. mean) within groups of rows, for each column of a matrix or data frame, and organizes the results in an easy-to-read table. However, the function takes a long time to run on very wide matrices and data frames, i.e. when the number of columns is large. In this post I demonstrate the issue and show a couple of nice alternatives that, at least for this example, cut the run time to 15% of aggregate's, and even less.
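If you haven't used aggregate this way before, here is a minimal toy illustration (with made-up data) of "mean within groups of rows, for each column": rows 1-2 form group 1 and rows 3-4 form group 2, and each column is averaged within each group.

toy <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))
aggregate(toy, by = list(Group.1 = c(1, 1, 2, 2)), FUN = mean)
#   Group.1   a  b
# 1       1 1.5 15
# 2       2 3.5 35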

I first created a wide matrix with 100 rows and 10,000 columns, holding 1,000,000 values randomly drawn from a standard normal distribution.

# The necessity to avoid wide matrices (with lots of columns)!
matWide = matrix(rnorm(1e+06), nrow = 100, ncol = 10000)

# Transform the matrix to a data frame
dfWide = as.data.frame(matWide)

I used the aggregate function to take the mean within groups of rows, for each column. The aggregate call takes about 4 seconds to run.

t1 = Sys.time()
aggRes = aggregate(dfWide, list(rep(1:10, each=10)), mean)
(timeDifAggr = difftime(Sys.time(), t1, units = "secs"))
# Time difference of 3.807029 secs

Here are the first 5 rows and columns of the result data frame, and its dimensions.

aggRes[1:5, 1:5]
#  Group.1           V1          V2         V3          V4
#1       1  0.008815372  0.56920407  0.2195522  0.68183883
#2       2  0.046319580  0.07915253  0.2732586  0.30970451
#3       3  0.154718798 -0.09157008 -0.3676212 -0.02970137
#4       4  0.491208585  0.53066464 -0.1407269  0.49633703
#5       5 -0.397868879 -0.09793382  0.4154764 -0.17150871

dim(aggRes)
# [1]    10 10001

Then I used a nested 'apply' approach (technically a tapply inside an apply function) to run the same analysis. It took significantly less time: about half a second.

t1 = Sys.time()
nestApplyRes = apply(dfWide, 2, function(x){
  return(tapply(x, rep(1:10, each=10), mean))})
nestApplyRes = data.frame(Group.1 = rownames(nestApplyRes),
                          nestApplyRes)
(timeDifNest = difftime(Sys.time(), t1, units = "secs"))
# Time difference of 0.5010331 secs

# Check if it provides exactly the same result as aggregate
all(aggRes == nestApplyRes)
# [1] TRUE
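One small caveat: nestApplyRes stores the group labels as character values taken from rownames(), so the == comparison above succeeds through type coercion. A slightly stricter check (a sketch) converts the column first and uses the tolerance-aware all.equal:

# Sketch: make Group.1 numeric, then compare ignoring attributes
nestApplyRes$Group.1 <- as.numeric(nestApplyRes$Group.1)
isTRUE(all.equal(aggRes, nestApplyRes, check.attributes = FALSE))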

Finally, I used data.table, as suggested by a few people on various forums. It took even less time to run: about 0.26 seconds.

library(data.table)
t1 = Sys.time()
# Convert to data.table and compute means in column-major order (like aggregate)
dtRes <- as.data.table(dfWide)[, lapply(.SD, function(x) mean(x)),
                               by = .(Group.1 = rep(1:10, each = 10))]
dtRes = as.data.frame(dtRes)
(timeDifDt = difftime(Sys.time(), t1, units = "secs"))
# Time difference of 0.268255 secs
all(aggRes == dtRes)
# [1] TRUE
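As a side note, the anonymous function wrapper around mean is not needed; a sketch of an equivalent, slightly leaner call (same grouping, same result) is below. To my knowledge, data.table can internally optimize the plain lapply(.SD, mean) form (its "GForce" optimization), so this version may even run a bit faster.

# Sketch: pass mean directly to lapply and reuse a precomputed grouping vector
grp <- rep(1:10, each = 10)
dtRes2 <- as.data.table(dfWide)[, lapply(.SD, mean), by = .(Group.1 = grp)]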

I also plotted the run time of each approach.

jpeg("TimeDif.jpg", res = 300, width = 800, height = 800)
par(mar = c(6.5, 2.5, 1.5, 0.5))
barplot(height = as.numeric(c(timeDifAggr, timeDifNest, timeDifDt)),
        names.arg = c("Aggregate", "Nested apply", "Data table"),
        las = 2, ylim = c(0, 4), col = heat.colors(3), ylab = "Sec")
dev.off()
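A final hedge on the numbers themselves: a single Sys.time() difference is noisy. If you want steadier estimates, averaging over repeated runs (a sketch below, using base system.time()) helps, and a package like microbenchmark goes further.

# Sketch: amortize timing noise over several repetitions
grp <- rep(1:10, each = 10)
rep_time <- system.time(for (i in 1:5) aggregate(dfWide, list(grp), mean))
rep_time["elapsed"] / 5   # mean seconds per run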

So now I’ll think twice before using the aggregate function 😒.
