Lots of data != "Big Data"

R-bloggers 2013-03-28


by Joseph Rickert

When talking with data scientists and analysts who are working with large-scale data analytics platforms such as Hadoop about the best way to do some sophisticated modeling task, it is not uncommon for someone to say, "We have all of the data. Why not just use it all?" This sort of comment often initially sounds pragmatic and reasonable to almost everyone. After all, wouldn't a model based on all of the data be better than a model based on a subsample? Well, maybe not; it depends, of course, on the problem at hand as well as time and computational constraints. To illustrate the kinds of challenges that large data sets present, let's just look at something very simple using the airlines data set from the 2009 ASA challenge.

Here are some of the results for a regression of ArrDelay on CRSDepTime with a random sample of 12,283 records drawn from that data set:

# Coefficients:
#              Estimate Std. Error t value Pr(>|t|)
# (Intercept)  -0.85885    0.80224  -1.071    0.284
# CRSDepTime    0.56199    0.05564  10.100 2.22e-16

# Multiple R-squared: 0.008238
# Adjusted R-squared: 0.008157

And here are some results from the same model using 120,947,440 records:

# Coefficients:
#               Estimate Std. Error t value Pr(>|t|)
# (Intercept) -2.4021635  0.0083532  -287.6 2.22e-16 ***
# CRSDepTime   0.6990404  0.0005826  1199.9 2.22e-16 ***

# Multiple R-squared: 0.01176
# Adjusted R-squared: 0.01176
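
For reference, a coefficient table like the ones above comes from an ordinary least-squares fit of arrival delay on scheduled departure time. A minimal sketch in base R, assuming the sampled records have already been pulled into a data frame called air_sample (a hypothetical name) with ArrDelay and CRSDepTime columns:

# Simple linear regression of arrival delay on scheduled departure time
fit <- lm(ArrDelay ~ CRSDepTime, data = air_sample)
summary(fit)   # coefficient table, standard errors, R-squared
confint(fit)   # standard confidence intervals for the coefficients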

More data didn't yield an obviously better model! I don't think anyone would really find this to be much of a surprise; we are dealing with a not very good model to begin with. Nevertheless, the example does provide the opportunity to investigate how estimates of the coefficients change with sample size. The next graph shows the estimated slope coefficient plotted against sample size, with sample sizes ranging from 12,283 to 12,094,709 records. Each regression was done on a random sample that includes about 12,000 points more than the previous one. The graph also shows the standard estimate of the confidence interval for the coefficient at each point in red. Notice that after some initial instability, the coefficient estimates settle down to something close to the value of beta obtained using all of the data.
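
For readers who want to try this on data that fits in memory, here is a minimal base-R sketch of the idea: fit the same model on a sequence of nested random samples, each about 12,000 rows larger than the last, and track the slope estimate and its standard error. The data frame name air, the step size, and the number of steps are illustrative assumptions, not the code behind the figures.

set.seed(42)
step  <- 12000                       # approximate increment in sample size
steps <- 100                         # number of nested samples to fit
idx   <- sample(nrow(air))           # one random permutation of the row indices
betas <- ses <- numeric(steps)
for (k in 1:steps) {
  samp <- air[idx[1:(k * step)], ]   # first k * step rows of the permutation
  fit  <- lm(ArrDelay ~ CRSDepTime, data = samp)
  betas[k] <- coef(fit)["CRSDepTime"]
  ses[k]   <- coef(summary(fit))["CRSDepTime", "Std. Error"]
}
n <- (1:steps) * step
plot(n, betas, type = "l", xlab = "Sample size", ylab = "Estimated slope")
lines(n, betas + 2 * ses, col = "red")   # rough confidence band
lines(n, betas - 2 * ses, col = "red")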

[Figure: estimated slope coefficient vs. sample size (Final_beta_vs_N)]

The rapid approach to the full-data-set value of the coefficient is even more apparent in the following graph, which shows the difference between the estimated value of the beta coefficient at each sample size and the value obtained using all of the data. The maximum difference from the fourth sample on is 0.07, which is pretty close indeed. In cases like this, if you believed that your samples were representative of the entire data set, working with all of the data to evaluate possible models would be a waste of time and possibly counterproductive.
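
The quantity plotted in the next graph is easy to compute from a vector of slope estimates like the betas in the sketch above, given the full-data estimate (call it beta_full, a hypothetical name):

delta <- betas - beta_full    # difference from the full-data slope at each sample size
max(abs(delta[-(1:3)]))       # largest difference from the fourth sample onward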

[Figure: difference from the full-data beta estimate at each sample size (Delta_beta_final)]

I am certainly not arguing that one never wants to use all of the data. For one thing, when scoring a model or making predictions the goal is to do something with all of the records. Moreover, in more realistic modeling situations where there are thousands of predictor variables, 120M observations might not be enough data to conclude anything. A large model can digest degrees of freedom very quickly and severely limit the ability to make any kind of statistical inference. I do want to argue, however, that with large data sets the ability to work with random samples of the data confers the freedom to examine several models quickly, with considerable confidence that the results would be decent estimates of what would be obtained using the full data set.

I did the random sampling and regressions in my little example using functions from Revolution Analytics' RevoScaleR package. Initially, all of the data was read from the csv files that comprise the FAA data set into the binary .xdf file format used by the RevoScaleR package. Then the random samples were selected by using the rxDataStep function of RevoScaleR, which was designed to quickly manipulate large data sets. The code below reads the data and, for each record, draws a random integer between 1 and 9,999 and assigns it to the variable urns.

rxDataStep(inData = working.file,
           outFile = working.file,
           transforms = list(urns = as.integer(runif(.rxNumRows, 1, 10000))),
           overwrite = TRUE)

Random samples for each regression were drawn by looping through the appropriate values of the urns variable. Notice how the call to R's runif() function happens within the transforms parameter of rxDataStep. It took about 33 seconds to do the full regression on my laptop, which made it feasible to undertake the extravagant number of calculations necessary to do the 1,000 regressions in a few hours after dinner.
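
The post doesn't show the sampling loop itself, but since urns is roughly uniform on 1 to 9,999 over about 121 million rows, each urn value tags roughly 12,000 records, so selecting the rows with urns <= k yields a sample of about k * 12,000 rows. Here is a sketch of how the loop might look with RevoScaleR's rxLinMod; the rowSelection and transformObjects arguments are standard RevoScaleR features, but this is a reconstruction under those assumptions, not the code used for the post.

betas <- numeric(1000)
for (k in 1:1000) {
  # keep only the rows whose random tag falls at or below k (about k * 12,000 rows)
  fit <- rxLinMod(ArrDelay ~ CRSDepTime,
                  data = working.file,
                  rowSelection = urns <= kk,
                  transformObjects = list(kk = k))
  betas[k] <- coef(fit)["CRSDepTime"]   # assumes coef() extracts the slope as it does for lm
}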

I think there are three main takeaways from this exercise:

  1. Lots of data does not necessarily equate to “Big Data”
  2. For exploratory modeling you want to work in an environment that allows for rapid prototyping and provides the statistical tools for model evaluation and visualization. There is no better environment than R for this kind of work, and Revolution Analytics' distribution of R offers the ability to work with very large samples.
  3. The ability to draw random samples from large data sets is the way to balance accuracy against computational constraints.

To my way of thinking, the single most important capability to implement in any large-scale data platform that is going to support sophisticated analytics is the ability to quickly construct high-quality random samples.
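
As a concrete illustration of that point: once a column of uniform random tags like urns is in place, a fixed-fraction sample can be materialized in a single pass over the file by reusing the rxDataStep pattern above (the output file name here is made up):

# Pull roughly a 1% sample (urns is approximately uniform on 1..9999)
rxDataStep(inData = working.file,
           outFile = "airline_sample_1pct.xdf",
           rowSelection = urns <= 100,
           overwrite = TRUE)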

