Data Science, Machine Learning, and Statistics: what is in a name?

Win-Vector Blog 2013-04-22

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” But the essence of the complaint is a feeling of claiming novelty in putting old wine in new bottles. Rob Tibshirani nailed this type of distinction in is famous machine learning versus statistics glossary.

I would like to explain why we (the authors of this blog) often use the term data science. Nina Zumel explained being a data scientist very well, I am going to take a swipe at explaining data science.

We (the authors on this blog) label many of our articles as being about data science because we want to emphasize that the various techniques we write about are only meaningful when considered parts of a larger end to end process. The process we are interested in is the deployment of useful data driven models into production. The important components are learning the true business needs (often by extensive partnership with customers), enabling the collection of data, managing data, applying modeling techniques and applying statistics criticisms. The pre-existing term I have found that is closest to describing this whole project system is data science, so that is the term I use. I tend to use it a lot, because while I love the tools and techniques our true loyalty is to the whole process (and I want to emphasize this to our readers).

The phrase “data science” as in use it today is a fairly new term (made popular by William S. Cleveland, DJ Patil, and Jeff Hammerbacher). I myself worked in a “computational sciences” group in the mid 1990′s (this group emphasized simulation based modeling of small molecules and their biological interactions, the naming was an attempt to emphasize computation over computers). So for me “data science” seems like a good term when your work is driven by data (versus driven from computer simulations). For some people data science is considered a new calling and for others it is a faddish misrepresentation of work that has already been done. I think there are enough substantial differences in approach between traditional statistics, machine learning, data mining, predictive analytics, and data science to justify at least this much nomenclature. In this article I will try to describe (but not fully defend) my opinion.

My breakdown of the different information sciences is given below (I try to treat each with the respect it deserves, so I am certain to offend all). For this article I am most interested the fields that lean towards modeling, so I will tend to move on from topics that are not centered on this topic.

The nature of statistics

Statistics is the original computing with data. It is the field that deals with data with the most portability (it isn’t dependent on one type of physical model) and rigor. Statistics can be a pessimal field: statisticians are the masters of anticipating what can go wrong with experiments and what fallacies can be drawn from naive uses of data. Statistics has enough techniques to solve just about any problem, but it also has an inherent conservatism to it.

I often say the best source of good statistical work is bad experiments. If all experiments were well conducted, we wouldn’t need a lot of statistics. However, we live in the real world; most experiments have significant shortcomings and statistics is incredibly valuable.

Another aspect of statistics is it is the only field that really emphasizes the risks of small data. There are many other potential data problems statistics describes well (like Simpson’s paradox), but statistics is fairly unique in the information sciences in emphasizing the risks of trying to reason from small datasets. This is actually very important: datasets that are expensive to produce (such as drug trials) are necessarily small.

It is only recently that minimally curated big datasets became perceived as being inherently valuable (the earlier attitude being closer to GIGO). And in some cases big data is promoted as valuable only because it is the cheapest to produce. Often a big dataset (such as logs of all clicks seen on a search engine) is useful largely because they are a good proxy for a smaller dataset that is too expensive to actually produce (such as interviewing a good cross section of search engine users as to their actual intent).

If your business is directly producing truly valuable data (not just producing useful proxy data) you likely have small data issues. If you have any hint of a small data issue, you want to consult with a good statistician.

The nature of machine learning

In some sense machine learning rushes where statisticians fear to tread. Machine learning does have some concept of small data issues (such as knowing about over-fitting), but it is an essentially optimistic field.

The goal of machine learning is to create a predictive model that is indistinguishable from a correct model. This is an operational attitude that tends to offend statisticians who want a model that not only appears to be accurate but is in fact correct (i.e. also has some explanatory value).

My opinion is the best machine learning work is an attempt to re-phrase prediction as an optimization problem (see for example: Bennett, K. P., & Parrado-Hernandez, E. (2006). The Interplay of Optimization and Machine Learning Research. Journal of Machine Learning Research, 7, 1265–1281). Good machine learning papers use good optimization techniques and bad machine learning papers (most of them in fact) use bad out of date ad-hoc optimization techniques.

The nature of data mining

Data mining is a term that was quite hyped and now somewhat derided. One of the reasons more people use the term “data science” nowadays is they are loath to say “data mining” (though in my opinion the two activities have different goals).

The goal of data mining is to find relations in data, not to necessarily make predictions or come up with explanations. Data mining is often what I call “an x’s only enterprise” (meaning you have many driver or “independent” variables but no pre-ordained outcome or “dependent” variables) and some of the typical goals are clustering, outlier detection and characterization.

There is a sense that when it was called exploratory statistics it was considered boring, but when it was called data mining it was considered sexy. Actual exploratory statistics (as defined by Tukey) is exciting and always an important “get your hands into the data” step of any predictive analytics project.

The nature of informatics

Informatics and in particular bioinformatics are very hot terms. A lot of good data scientists (a term I will explain later) come from the bioinformatics field.

Once we separate out the portions of bioinformatics that are in fact statistics and the ones that are in fact biology we are left with data infrastructure and matching algorithms. We have the creation and management of data stores, data bases and design of efficient matching and query algorithms. This isn’t meant to be a left handed compliment: algorithms are a first love of mine and some of the matching algorithms bioinformaticians uses (like online suffix trees) are quite brilliant.

The nature of big data

Big data is a white-hot topic. The thing to remember is: it is just the infrastructure (MapReduce, Hadoop, noSQL and so on). It is the platform you perform modeling (or usually just report generation) on top of.

The nature of predictive analytics

The Wikipedia defines Predictive analytics as the “… variety of techniques from statistics, modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events.” It is a set of goals and techniques emphasizing making models. It is very close to what is also meant by data science.

I don’t tend to use the term predictive analytics because I come from a probability, simulation, algorithms and machine learning background and not from an analytics background. To my ear analytics is more associated with visualization, reporting and summarization than with modeling. I also try to use the term modeling over prediction (when I remember) as prediction often in non-technical English implies something like forecasting into the future (which is but one modeling task).

The nature of data science

The Wikipedia defines data science as a field that “incorporates varying elements and builds on techniques and theories from many fields, including math, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”

Data science is a term I use to represent the ownership and management of the entire modeling process: discovering the true business need, collecting data, managing data, building models and deploying models into production.

Conclusion

Machine learning and statistics may be the stars, but data science the whole show.