"Combining R's sophisticated calculations and MonetDB's excellent data access performance is a no-brainer. One gets the best of two (open source) worlds with minimal hassle." - Dr. Hannes Mühleisen
"oh wow that was fast like a cheetah with a jetpack or something" - anthony damico
why try monetdb + r
a speed test of four analysis commands on
sixty-seven million physician visit records using my personal laptop --
# calculate the sum of a single variable..
system.time( print( sum( carrier08$car_hcpcs_pmt_amt ) ) )
[1] 3477564780
   user  system elapsed
   0.00    0.00    0.04 seconds

# ..or calculate the sum, mean, median, and count of a single variable with sql
system.time( dbGetQuery( db , 'select sum( car_hcpcs_pmt_amt ) , avg( car_hcpcs_pmt_amt ) , median( car_hcpcs_pmt_amt ) , count(*) from carrier08' ) )
   user  system elapsed
   0.01    0.00   16.86 seconds

# calculate the same statistics, broken down by six age and two gender categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd , bene_age_cat_cd , sum( car_hcpcs_pmt_amt ) , avg( car_hcpcs_pmt_amt ) , median( car_hcpcs_pmt_amt ) , count(*) from carrier08 group by bene_sex_ident_cd , bene_age_cat_cd' ) )
   user  system elapsed
   0.00    0.01   36.03 seconds

# calculate the same statistics, broken down by six age, two gender,
# and 924 icd-9 diagnosis code categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd , bene_age_cat_cd , car_line_icd9_dgns_cd , sum( car_hcpcs_pmt_amt ) , avg( car_hcpcs_pmt_amt ) , median( car_hcpcs_pmt_amt ) , count(*) from carrier08 group by bene_sex_ident_cd , bene_age_cat_cd , car_line_icd9_dgns_cd' ) )
   user  system elapsed
   0.04    0.00   20.50 seconds

-- you're not using computer hardware built in 1966; you shouldn't use software written for that era, either.
how to get started with monetdb + r
install (free, open-source) monetdb on your windows computer in two - count them, two - easy steps. then, with monetdb and a few r packages installed and your database set up, you're ready to run the analysis scripts on this site.
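once your database is up and running, connecting from r looks roughly like this. a sketch, not gospel: it assumes a local monetdb daemon serving a database named `demo` with the default monetdb/monetdb credentials - swap in your own host, database name, and login.

```r
# load the monetdb connector for r
library( MonetDB.R )

# connect to a locally-running monetdb server
# (assumes a database named `demo` with the default credentials -
# substitute your own connection details)
db <-
	dbConnect(
		MonetDB.R() ,
		host = "localhost" ,
		dbname = "demo" ,
		user = "monetdb" ,
		password = "monetdb"
	)

# confirm the connection works by listing the available tables
dbListTables( db )

# disconnect when you're done
dbDisconnect( db )
```

every analysis script on this site starts from a `db` connection object like this one.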
a brief history of monetdb + r
like so many other legends, our story begins at
the 2007 r statistical programming conference in iowa.
dr. thomas lumley presented
his idea for big survey data to his contemporaries, who - in predictable contemporary form - failed to acknowledge its genius. over the next half-decade, only
ill-advised attempts were made at analyzing the big survey data beast. for work (invention's mama), i needed to access
the fifteen-million record, five year american community survey files, but since
database-backed survey objects read all
replicate weights into ram, my sixteen gigabyte desktop hissed, popped, crapped out. so
i e-mailed dr. lumley and asked for ideas. next thing i know, he had developed..
..but the folks in amsterdam caught on, assigning codesmith
dr. hannes mühleisen to the task of writing
a java-free connector. this new sql connection between r and monetdb works perfectly, but
cwi has doubled down and dr. mühleisen is building a hush-hush `
monet.frame` object designed to mimic r `data.frame` objects as closely as possible. though it's still a work in progress, you'll see `monet.frame` commands sprinkled throughout the code for the
american community survey, the
behavioral risk factor surveillance system, and the
medicare basic standalone public use files.
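if you want a taste, a `monet.frame` call looks something like this. a sketch only: it assumes an open MonetDB.R connection `db` and the `carrier08` table from the benchmarks above, and since the api is experimental, check the package documentation for the current syntax.

```r
# load the monetdb connector, which also provides monet.frame
library( MonetDB.R )

# wrap a database table in a monet.frame -
# this stores a reference to the table, not the data itself,
# so nothing gets pulled into ram
mf <- monet.frame( db , "carrier08" )

# familiar data.frame idioms get translated to sql
# and executed inside monetdb instead of in memory
nrow( mf )
mean( mf$car_hcpcs_pmt_amt )
```

the point of the design: you write `data.frame`-style r, monetdb does the heavy lifting.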
summary()
turns out, monetdb is lightning fast on any big data, not just surveys. no reason for demographers to hog all the fun. pro tip: if you're not interested in survey data and just want to use monetdb for its big data chops, start with the
bsapuf example code - that's not survey data either. for more detail, visit the monetdb website.
notes: there's a price for such fast data access. importing a table into monetdb takes a while, so prepare to let your initial import code run overnight. it's a good deal: leaving your computer on for one night sure beats running hour-long commands for
every new analysis. the
MonetDB.R and
sqlsurvey packages are experimental. the more you use them, the sooner they won't be. if you hit something you don't understand (especially a line of code that works without error in
the survey package), read the documentation carefully before contacting the author. sqlsurvey commands do not uniformly behave like
survey commands. remember, all scripts on this archive work on my 2009-era windows seven laptop (with four gigabytes of ram). by default, r reads objects into memory. when a data set is too big, the analysis scripts presented on this website work around memory limitations by connecting to either a monetdb (speedy) or
sqlite (easy-to-use) database. many government data sets are only available as
fixed-width (flat) files accompanied by a sas import script, and the big data that necessitates MonetDB.R is no exception. i've written
a variant of the read.SAScii() function to import ascii files directly into a monet database all in one step. you may notice it in the code for some of the large surveys. confidential to sas, spss, stata, sudaan users: if you start analyzing big data with r + monetdb, you will no longer have to wait around long enough to take a coffee break after running each command. for that, i apologize. :D
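that one-step import looks roughly like this. a sketch under stated assumptions: `read.SAScii.monetdb` is a helper defined in my import scripts (not a function exported by the SAScii package itself), its argument names may change, and the file paths below are hypothetical placeholders.

```r
# load the package that parses sas import scripts,
# plus the monetdb connector
library( SAScii )
library( MonetDB.R )

# locations of the fixed-width data file and the sas script
# describing its column positions
# (hypothetical paths - substitute your own)
fwf.file <- "C:/My Directory/carrier08.dat"
sas.script <- "C:/My Directory/carrier08.sas"

# read the column names, widths, and types out of the
# sas input statement, then bulk-load the ascii file
# directly into a monetdb table in one step.
# this is the overnight step mentioned above.
read.SAScii.monetdb(
	fn = fwf.file ,
	sas_ri = sas.script ,
	tablename = "carrier08" ,
	connection = db
)
```

once the table is loaded, every subsequent analysis hits the database directly - no re-import required.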