Running aggregate on wide matrices takes long; use nested apply or data.table instead!
R-bloggers 2025-04-04
The aggregate function can be very useful in R: it runs a function (e.g. mean) within groups of rows, for each column of a matrix or data frame, and organizes the results in an easy-to-read table. However, the function takes long to run on very wide matrices and data frames, where the number of columns is large. In this post I demonstrate the issue and show a couple of nice solutions that, at least for this example, cut the run time down to 15% of aggregate's, and even less.
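As a minimal illustration of what aggregate does (with made-up data; the grouping column g and columns a, b are hypothetical, not part of the benchmark below):

```r
# Toy data frame: a grouping column and two value columns
df <- data.frame(g = c(1, 1, 2, 2), a = c(1, 3, 5, 7), b = c(2, 4, 6, 8))

# Mean of each value column within each group of rows
res <- aggregate(df[, c("a", "b")], by = list(Group.1 = df$g), FUN = mean)
res
#  Group.1 a b
#1       1 2 3
#2       2 6 7
```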
I first created a wide matrix with 100 rows and 10,000 columns, filled with 1,000,000 values drawn from a standard normal distribution.
# The necessity to avoid wide matrices (with lots of columns)!
matWide = matrix(rnorm(1e+06), nrow = 100, ncol = 10000)
# Transform matrix to data frame
dfWide = as.data.frame(matWide)
I used the aggregate function to take the mean within groups of rows, for each column. The aggregate call took about 4 seconds to run.
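A sketch of that aggregate call: the grouping vector rep(1:10, each = 10) (10 groups of 10 rows) and the timing variable timeDifAggr are assumed to match the ones used in the later snippets and in the plot at the end.

```r
t1 = Sys.time()
# Mean within each group of 10 rows, for every column; the unnamed
# grouping list yields the default "Group.1" column in the result
aggRes = aggregate(dfWide, by = list(rep(1:10, each = 10)), FUN = mean)
(timeDifAggr = difftime(Sys.time(), t1, units = "secs"))
# Time difference of about 4 secs
```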
Here are the first 5 rows and columns of the resulting data frame, along with its dimensions.
aggRes[1:5, 1:5]
#  Group.1           V1          V2         V3          V4
#1       1  0.008815372  0.56920407  0.2195522  0.68183883
#2       2  0.046319580  0.07915253  0.2732586  0.30970451
#3       3  0.154718798 -0.09157008 -0.3676212 -0.02970137
#4       4  0.491208585  0.53066464 -0.1407269  0.49633703
#5       5 -0.397868879 -0.09793382  0.4154764 -0.17150871
dim(aggRes)
#[1]    10 10001
Then I used a nested ‘apply’ function (technically a tapply inside an apply function) approach to run the same analysis. It took significantly less time (about half a second).
t1 = Sys.time()
nestApplyRes = apply(dfWide, 2, function(x){
  return(tapply(x, rep(1:10, each = 10), mean))
})
nestApplyRes = data.frame(Group.1 = rownames(nestApplyRes), nestApplyRes)
(timeDifNest = difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.5010331 secs
#Check if it provides exactly the same result as aggregate
all(aggRes == nestApplyRes)
#[1] TRUE
Finally, I used the data.table package, as has been suggested in several forums. It took even less time to run: about 0.26 seconds.
library(data.table)
t1 = Sys.time()
#Convert to data.table and compute group means column by column (like aggregate)
dtRes <- as.data.table(dfWide)[, lapply(.SD, mean), by = .(Group.1 = rep(1:10, each = 10))]
dtRes = as.data.frame(dtRes)
(timeDifDt = difftime(Sys.time(), t1, units = "secs"))
#Time difference of 0.268255 secs
all(aggRes == dtRes)
#[1] TRUE
I also plotted the run time of each approach.
jpeg("TimeDif.jpg", res = 300, width = 800, height = 800)
par(mar = c(6.5, 2.5, 1.5, 0.5))
barplot(height = as.numeric(c(timeDifAggr, timeDifNest, timeDifDt)),
        names.arg = c("Aggregate", "Nested apply", "Data table"),
        las = 2, ylim = c(0, 4), col = heat.colors(3), ylab = "Sec")
dev.off()
So now I'll think twice before using the aggregate function.