"Combining R's sophisticated calculations and MonetDB's excellent data access performance is a no-brainer. One gets the best of two (open source) worlds with minimal hassle." - Dr. Hannes Mühleisen
"oh wow that was fast like a cheetah with a jetpack or something" - anthony damico
why try monetdb + r
a speed test of four analysis commands on
sixty-seven million physician visit records using my personal laptop --
# calculate the sum of a single variable..
system.time( print( sum( carrier08$car_hcpcs_pmt_amt ) ) )
[1] 3477564780
   user  system elapsed
   0.00    0.00    0.04 seconds

# ..or calculate the sum, mean, median, and count of a single variable with sql
system.time( dbGetQuery( db , 'select sum( car_hcpcs_pmt_amt ) , avg( car_hcpcs_pmt_amt ) , median( car_hcpcs_pmt_amt ) , count(*) from carrier08' ) )
   user  system elapsed
   0.01    0.00   16.86 seconds

# calculate the same statistics, broken down by six age and two gender categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd , bene_age_cat_cd , sum( car_hcpcs_pmt_amt ) , avg( car_hcpcs_pmt_amt ) , median( car_hcpcs_pmt_amt ) , count(*) from carrier08 group by bene_sex_ident_cd , bene_age_cat_cd' ) )
   user  system elapsed
   0.00    0.01   36.03 seconds

# calculate the same statistics, broken down by six age, two gender,
# and 924 icd-9 diagnosis code categories
system.time( dbGetQuery( db , 'select bene_sex_ident_cd , bene_age_cat_cd , car_line_icd9_dgns_cd , sum( car_hcpcs_pmt_amt ) , avg( car_hcpcs_pmt_amt ) , median( car_hcpcs_pmt_amt ) , count(*) from carrier08 group by bene_sex_ident_cd , bene_age_cat_cd , car_line_icd9_dgns_cd' ) )
   user  system elapsed
   0.04    0.00   20.50 seconds

-- you're not using computer hardware built in 1966; you shouldn't use software written for that era, either.
how to get started with monetdb + r
install (free, open-source) monetdb on your windows computer in two - count them, two - easy steps. then, with monetdb and a few r packages installed and your database set up, you're ready to run the analysis scripts on this site.
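once your database is up and running, connecting from r looks roughly like this. a sketch, not gospel: it assumes a local monetdb daemon serving a database named `demo` with the default monetdb/monetdb credentials - swap in your own host, database name, and login.

```r
# load the monetdb connector for r
library( MonetDB.R )

# connect to a locally-running monetdb server
# (assumes a database named `demo` with the default credentials -
# substitute your own connection details)
db <-
	dbConnect(
		MonetDB.R() ,
		host = "localhost" ,
		dbname = "demo" ,
		user = "monetdb" ,
		password = "monetdb"
	)

# confirm the connection works by listing the available tables
dbListTables( db )

# disconnect when you're done
dbDisconnect( db )
```

every analysis script on this site starts from a `db` connection object like this one.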
a brief history of monetdb + r
like so many other legends, our story begins at
the 2007 r statistical programming conference in iowa.
dr. thomas lumley presented
his idea for big survey data to his contemporaries, who - in predictable contemporary form - failed to acknowledge its genius. over the next half-decade, only
ill-advised attempts were made at analyzing the big survey data beast. for work (invention's mama), i needed to access
the fifteen-million record, five year american community survey files, but since
database-backed survey objects read all
replicate weights into ram, my sixteen gigabyte desktop hissed, popped, crapped out. so
i e-mailed dr. lumley and asked for ideas. next thing i know, he had developed..
..but the folks in amsterdam caught on, assigning codesmith
dr. hannes mühleisen to the task of writing
a java-free connector. this new sql connection between r and monetdb works perfectly, but
cwi has doubled down and dr. mühleisen is building a hush-hush `
monet.frame` object designed to mimic r `data.frame` objects as closely as possible. though it's still a work in progress, you'll see `monet.frame` commands sprinkled throughout the code for the
american community survey, the
behavioral risk factor surveillance system, and the
medicare basic standalone public use files.
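if you want a taste, a `monet.frame` call looks something like this. a sketch only: it assumes an open MonetDB.R connection `db` and the `carrier08` table from the benchmarks above, and since the api is experimental, check the package documentation for the current syntax.

```r
# load the monetdb connector, which also provides monet.frame
library( MonetDB.R )

# wrap a database table in a monet.frame -
# this stores a reference to the table, not the data itself,
# so nothing gets pulled into ram
mf <- monet.frame( db , "carrier08" )

# familiar data.frame idioms get translated to sql
# and executed inside monetdb instead of in memory
nrow( mf )
mean( mf$car_hcpcs_pmt_amt )
```

the point of the design: you write `data.frame`-style r, monetdb does the heavy lifting.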
summary()
turns out, monetdb is lightning fast on any big data, not just surveys. no reason for demographers to hog all the fun. pro tip: if you're not interested in survey data and just want to use monetdb for its big data chops, start with the
bsapuf example code - that's not survey data either. for more detail, visit the monetdb website.
notes: there's a price for such fast data access. importing a table into monetdb takes a while, so prepare to let your initial import code run overnight. it's a good deal: leaving your computer on for one night sure beats running hour-long commands for
every new analysis. the
MonetDB.R and
sqlsurvey packages are experimental. the more you use them, the sooner they won't be. if you hit something you don't understand (especially a line of code that works without error in
the survey package), read the documentation carefully before contacting the author. sqlsurvey commands do not uniformly behave like
survey commands. remember, all scripts on this archive work on my 2009-era windows seven laptop (with four gigabytes of ram). by default, r reads objects into memory. when a data set is too big, the analysis scripts presented on this website work around memory limitations by connecting to either a monetdb (speedy) or
sqlite (easy-to-use) database. many government data sets are only available as
fixed-width (flat) files accompanied by a sas import script, and the big data that necessitates MonetDB.R is no exception. i've written
a variant of the read.SAScii() function to import ascii files directly into a monet database all in one step. you may notice it in the code for some of the large surveys. confidential to sas, spss, stata, sudaan users: if you start analyzing big data with r + monetdb, you will no longer have to wait around long enough to take a coffee break after running each command. for that, i apologize. :D
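that one-step import looks roughly like this. a sketch under stated assumptions: `read.SAScii.monetdb` is a helper defined in my import scripts (not a function exported by the SAScii package itself), its argument names may change, and the file paths below are hypothetical placeholders.

```r
# load the package that parses sas import scripts,
# plus the monetdb connector
library( SAScii )
library( MonetDB.R )

# locations of the fixed-width data file and the sas script
# describing its column positions
# (hypothetical paths - substitute your own)
fwf.file <- "C:/My Directory/carrier08.dat"
sas.script <- "C:/My Directory/carrier08.sas"

# read the column names, widths, and types out of the
# sas input statement, then bulk-load the ascii file
# directly into a monetdb table in one step.
# this is the overnight step mentioned above.
read.SAScii.monetdb(
	fn = fwf.file ,
	sas_ri = sas.script ,
	tablename = "carrier08" ,
	connection = db
)
```

once the table is loaded, every subsequent analysis hits the database directly - no re-import required.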