Lots of data != "Big Data"
R-bloggers 2013-03-28
Summary:
by Joseph Rickert
When talking with data scientists and analysts who work with large-scale data analytics platforms such as Hadoop about the best way to do a sophisticated modeling task, it is not uncommon for someone to say, "We have all of the data. Why not just use it all?" At first, this sort of comment sounds pragmatic and reasonable to almost everyone. After all, wouldn’t a model based on all of the data be better than a model based on a subsample? Well, maybe not. It depends, of course, on the problem at hand...