A bit of the agenda of Practical Data Science with R

Win-Vector Blog 2014-05-02

The goal of Zumel/Mount: Practical Data Science with R is to teach, through guided practice, the skills of a data scientist. We define a data scientist as the person who organizes client input, data, infrastructure, statistics, mathematics and machine learning to deploy useful predictive models into production.

Our plan to teach is to:

  • Order the material by what is expected from the data scientist.
  • Emphasize the already available bread-and-butter machine learning algorithms that most often work.
  • Provide a large set of worked examples.
  • Expose the reader to a number of realistic data sets.

Some of these choices may put off some potential readers. But it is our goal to spend our time on what a data scientist needs to do. Our point: the data scientist is responsible for end-to-end results, which is not always entirely fun. If you want to specialize in machine learning algorithms or only big data infrastructure, that is a fine goal. However, the job of the data scientist is to understand and orchestrate all of the steps (working with domain experts, curating data, using data tools, and applying machine learning and statistics).

Once you define what a data scientist is, you find that not as many people want to work as one.

We expand a few of our points below.

The data scientist must be actively involved in designing the business use of their work, the design of experiments, the collection of data, and the curation of measurements. They don’t have to be a domain expert, but they must develop domain empathy. The data scientist can delegate tasks, but in doing so must manage them. What they can’t do is assume some other kind soul has arranged everything so a brilliant moment of machine learning gets to quickly save the day. The data scientist must work hard (with others) to bring the project within a short step of success, and then, of course, decisively take that step.

Exotic machine learning tinkering is not usually the most valuable part of a data science project. We have found this can be very hard for analytically minded programmers to accept. It is a fact that there are already many good implementations of very effective vanilla machine learning algorithms freely available in many languages, working at many scales. Even if it is just for the sake of calibration, you owe it to a client to try at least one of: regression, generalized additive models, logistic regression, decision trees, random forests, support vector machines, naive Bayes, or k-nearest neighbor before trapping them in your special custom library. Once you have organized your data, trying any of these techniques is a one-liner in the right framework. You don’t get a lot of value from using any method blindly, but there is a lot of wisdom you can quickly pick up from the built-in diagnostics and the pre-existing literature of the common methods.

In all honesty, mere implementation doesn’t always give you deep understanding of the limits and consequences of methods. Implementation rapidly becomes overwhelmed with non-statistical details (run time considerations, organization of storage, choices in representation, and so on). Extracting meaning and consequences from implementation details is hard work. One brilliant exception is “The Elements of Statistical Learning” (Hastie, Tibshirani, and Friedman), where the authors emphasize estimating the quality, properties and stopping conditions of statistical machine learning algorithms. The Elements of Statistical Learning uses precisely defined implementations to drive precise evaluations of expected outcomes (whereas lesser machine learning books claim authority through mere breadth, description and typesetting, failing to actually check, run, analyze, compare, or even work in a unified notation).

In a detailed implementation you switch from thinking about machine learning and statistics to thinking a lot about coding. We love to tinker and implement (for example, logistic regression on Hadoop), but we also had a reason (paving the way for a needed feature not in most common logistic regression implementations). We learned a lot more putting corner-case data through standard implementations. A data scientist should re-implement a few machine learning algorithms for fun, but not on a client’s dime.

In our book we demonstrate how to implement two machine learning algorithms: naive Bayes and bagging. Neither method is currently considered particularly cutting edge or exciting. We included them for specific didactic reasons. The discussion of naive Bayes gave us a concrete opportunity to re-demonstrate things we had already discussed about data preparation and variable treatment. It also let us introduce some of the laws of probability that govern useful prediction. Working through bagging gave us a concrete opportunity to work through fundamental examples of controlling prediction bias and variance. Notice we are talking about issues of data and statistics, not issues of algorithms and programming.
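The variance-control idea behind bagging can be sketched in a few lines. This is an illustrative sketch, not the book's R code: it is in Python using only the standard library, and the base learner (1-nearest-neighbor regression), the synthetic data, and all function names are assumptions chosen for brevity. Each model is fit to a bootstrap resample and the predictions are averaged. For any target value, the squared error of the averaged prediction is never larger than the average squared error of the individual models.

```python
import random
from statistics import mean


def nearest_neighbor_predict(train, x):
    """1-nearest-neighbor regression: return y of the training point closest to x."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def individual_predictions(train, x, n_models=50, seed=0):
    """Fit one 1-NN model per bootstrap resample; return each model's prediction at x."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_models):
        # Bootstrap: sample len(train) rows with replacement.
        sample = rng.choices(train, k=len(train))
        preds.append(nearest_neighbor_predict(sample, x))
    return preds


def bagged_predict(train, x, n_models=50, seed=0):
    """Bagging: average the predictions of the bootstrap models."""
    return mean(individual_predictions(train, x, n_models, seed))


# Synthetic data (an assumption for illustration): y = x plus Gaussian noise.
data_rng = random.Random(1)
train = [(i / 10, i / 10 + data_rng.gauss(0, 0.3)) for i in range(30)]

query = 1.55
truth = 1.55  # the noise-free value of y at the query point

preds = individual_predictions(train, query)
bagged = bagged_predict(train, query)

avg_single_sq_err = mean([(p - truth) ** 2 for p in preds])
bagged_sq_err = (bagged - truth) ** 2

print(f"average squared error of single models: {avg_single_sq_err:.4f}")
print(f"squared error of bagged prediction:     {bagged_sq_err:.4f}")
```

The inequality between the two printed numbers holds for any data set and any seed, because averaging is a convex combination. How large the gap is depends on how much the individual models disagree.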

This may lead one to think our book is not concrete. That is not the case. Our book is example driven and very concrete. Every analysis (and almost every graph in the book) is completely demonstrated on real-scale shared example data (with complete shared code and data). Very few data science books attempt to demonstrate specific use of so many of the techniques described. We consider it a strength that many very expensive and sophisticated analyses (like random forests and radial kernel support vector machines) are one-liners. We also consider some of our uglier steps (such as showing the option setting needed to prevent sqldf from crashing R on OSX) to be high-value: they ensure we have shown how to get an analysis all the way from start to end (and not just to convenient intermediate points).

With the specific experience of what goes well and what goes poorly using standard techniques, a good data scientist can eventually come up with a new method that potentially out-performs standard methods in a given domain. This is a very high-value contribution (and one of our favorite tasks). It is quicker and easier to invent a method that works great in one domain after you see how standard methods work in that domain.

The intent of Practical Data Science with R is to be a useful concrete example of how to do data science. We limited ourselves to working in R not because a data scientist can choose to work exclusively in R (they cannot), but to limit the number of external tools and considerations we had to discuss before getting to the actual examples. To use our book you will need to work the examples (and this can be as shallow as cutting and pasting code or as deep as trying variations after the data has been loaded).

A good number of people have put in a lot of effort (researching, working, writing, refereeing, editing) to ensure that Practical Data Science with R is a good book. Whether it is the book for you depends on how it matches your interests and background. To help evaluate the book we have made available two example chapters, all data and all source code here.