Data science project planning
Win-Vector Blog 2013-03-15
Given the range of wants, diverse data sources, and required innovation and methods, it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a follow-up to the statistical ideas we shared in setting expectations in data science projects, we share a few project planning ideas from software engineering.
Our approach
The main idea we advocate is: realize project planning under uncertainty is a solved problem. Software engineers already know how to do this. You may feel data science has steps that require un-schedulable inspiration or invention, but that should not be the case. Yes, you always have to invent one or two techniques per data science project, but most data science projects do not in fact require new research or breakthroughs in machine learning.
The planning process that works is: identify common stereotypical sub-stories, components and processes. Then throw these at the problem and document which ones are appropriate. This is a deliberate combination of “the Feynman method to be a genius” and agile story-based planning. At a 30,000-foot view data science project planning looks a lot like software development project planning. Both have a lot of risk and dependent steps. The way to solve that is to get the customer more involved and make the results of the steps and the quality of those results more visible (so the customer can critique and steer prior to everything being done). So our prescription is to use established software development project planning techniques and work on breaking the data science project into specifiable components. In fact we are using “customer” in the agile project planning sense: not as the person who is paying, but the person to whom you are answerable.
Data science problems seem hard to bound. We teach analysts that they will always spend most of their time on data tubing (getting data to the project) and data cleaning (making sure the data is safe to use in the project). So characterizing these steps is a good place to start. There is an Anna Karenina principle hidden in here:
Happy families are all alike; every unhappy family is unhappy in its own way.
Or: perfect data fits in the same way all the time; dirty data always needs unique domain and client-specific knowledge and work to use. This is where to start your scoping.
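As a tiny illustration of the kind of per-column profiling that dirty data forces early in a project, here is a minimal Python/pandas sketch; the checks are generic placeholders, and deciding what to do with the answers is exactly the domain and client specific work described above.

```python
import pandas as pd

def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Report per-column facts that typically drive data-cleaning work."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "n_missing": int(s.isna().sum()),
            "n_distinct": int(s.nunique(dropna=True)),
            "example": s.dropna().iloc[0] if s.notna().any() else None,
        })
    return pd.DataFrame(rows)

# Hypothetical usage on a raw client extract:
# raw = pd.read_csv("raw_client_extract.csv")
# print(profile_columns(raw))
```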
You can plan if you can cut the project into smaller pieces. The trick is that the sub-steps may not be obvious. To help with this we suggest generating a large set of “typical steps” and then keeping or rejecting each step depending on whether it seems appropriate to the project. Some of the sub-steps come from considering the intended model structure (and factorizations of the intended model structure). Some of the sub-steps come from the nature of the data sources and plausible data flows.
What we suggest is generating reduced and simplified typical stories and then seeing what pieces of these stories apply to the project at hand. We then suggest keeping the fitting pieces (and their dependencies and consequences) as steps in a project plan. In fact we further suggest parameterizing these stories and then multiplying the potential stories and story fragments by varying the parameters. For example if a story is “validate input” you might introduce a parameter like “type of input” that is something as simple as “numeric”, “boolean” and “categorical.” You then get three stereotyped stories: “validate numeric input”, “validate boolean input” and “validate categorical input.” You can then discuss where these stories could be used in a project plan and what would be needed to implement them (in each case generating more trackable steps). The intent is not to try to write a project plan while staring at a blank sheet of paper, or to try to factor a plan like “model data” into “build first half of model” and “build second half of model.” It is instead to collect enough informative sub-stories and sub-steps that you can see that, if you completed all of the pieces, you would indeed complete the project. Then for each step design tests and demonstrations that establish success or failure. Progress and quality can then be measured, tracked and put up for criticism per step (instead of being a huge surprise at the end of the project).
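To make the “multiply stories by parameters” idea concrete, here is a minimal Python sketch; the story templates and parameter values are just the illustrative ones from the paragraph above, not a fixed catalog.

```python
from itertools import product

# Stereotyped story templates, each with named parameters to vary (illustrative only).
story_templates = [
    ("validate {input_type} input",
     {"input_type": ["numeric", "boolean", "categorical"]}),
    ("load data from {source}",
     {"source": ["database", "flat file", "API"]}),
]

def expand(template, params):
    """Multiply one story template by all combinations of its parameter values."""
    names = sorted(params)
    for values in product(*(params[n] for n in names)):
        yield template.format(**dict(zip(names, values)))

candidate_stories = [s for template, params in story_templates
                       for s in expand(template, params)]
for story in candidate_stories:
    print(story)
# Prints "validate numeric input", "validate boolean input", "validate categorical input",
# "load data from database", and so on; each candidate is then kept or rejected by hand,
# and the kept ones become trackable steps in the project plan.
```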
Known tools and techniques that should be used by data scientists
To continue our “data science should borrow more from software engineering” theme, we would like to discuss two types of tools that most software engineers routinely depend on but that are not seen often enough in data science projects.
Version control
It should be a project requirement that all analysis control files are shared and versioned. Data may be too large to version, but all documentation and procedures should be versioned. And data should always have attached meta-data columns to at least allow tracking of its origin and revision. There are many opinions on which version control system to use, but essentially there is no excuse not to use one at least as good as Git, Mercurial, BZR, Perforce or Subversion.
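As a minimal sketch of the meta-data idea, the following Python/pandas fragment attaches origin and revision columns to a freshly pulled table; the column names (`data_source`, `data_revision`, `data_pulled_at`) and the usage shown are our own invented placeholders, not an established convention.

```python
import datetime
import pandas as pd

def tag_provenance(df: pd.DataFrame, source: str, revision: str) -> pd.DataFrame:
    """Attach origin and revision meta-data columns so downstream results
    can be traced back to the data pull that produced them."""
    out = df.copy()
    out["data_source"] = source        # where the data came from
    out["data_revision"] = revision    # which revision/extract it is
    out["data_pulled_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return out

# Hypothetical usage:
# raw = pd.read_csv("customers_export.csv")
# raw = tag_provenance(raw, source="CRM customers export", revision="2013-03-14 nightly")
```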
I prefer to use Git. It has the following useful properties:
- Doesn’t write its own record-keeping files into the directories being controlled (Subversion’s habit of doing this can break some tools).
- Makes checking out a fresh copy (or multiple fresh copies) of the entire workspace easy.
- Allows work and creation of versions when disconnected from any central repository.
- Allows check-in of binary objects.
- Is cross platform (Linux, OSX and Windows).
- Has “research why did the project go wrong” features like “blame” and “bisect.”
In this day and age no researcher or developer should ever say “I had it working yesterday, but I lost that copy trying some improvements.” This is one of the many things source control software was designed to help prevent. Similarly, a data scientist should never say “I made some changes and the correlation went down, but I don’t know why.” Version control lets you run your system forward and backward in time to find the controlling change.
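To make the “forward and backward in time” point concrete, here is a hedged sketch of hunting down a correlation regression with `git bisect run`. The standard git bisect commands are real; the `build_model` helper, file names and acceptance threshold are invented for illustration.

```python
# check_correlation.py -- a test script for `git bisect run` (names below are hypothetical).
# From the project repository:
#   git bisect start
#   git bisect bad                  # current commit: correlation is too low
#   git bisect good known_good_tag  # a commit where the model was still fine
#   git bisect run python check_correlation.py
# git then checks out intermediate commits and runs this script on each one;
# exit code 0 means "good", non-zero means "bad", and git reports the first bad commit.

import sys
import numpy as np
import pandas as pd

from model import build_model        # hypothetical: the project's own model-fitting code

THRESHOLD = 0.70                      # invented acceptance threshold

def main():
    holdout = pd.read_csv("data/holdout.csv")     # hypothetical held-out data
    model = build_model("data/train.csv")
    predictions = model.predict(holdout.drop(columns=["y"]))
    corr = np.corrcoef(predictions, holdout["y"])[0, 1]
    print(f"holdout correlation: {corr:.3f}")
    sys.exit(0 if corr >= THRESHOLD else 1)

if __name__ == "__main__":
    main()
```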
Gantt charts and other planning tools
First truth: Gantt charts can be ugly, fidgety nightmares. However they solve a key problem: if you have estimates of the durations of project components, it should be mechanical to produce an estimate of the duration of the whole project. A lot of people forget the “garbage in, garbage out” principle and think Gantt charts produce estimates from nothing (and often bad estimates at that). That is not true: they (like stickies on a storyboard) combine estimates to help you see the consequences of many smaller estimates. Gantt charts are in fact a good tool.
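As a toy illustration of how component estimates combine mechanically into a whole-project estimate (the longest dependency chain sets the minimum duration), here is a short Python sketch; the task names, durations and dependencies are made up.

```python
from functools import lru_cache

# Invented example tasks: duration in days, plus which tasks each one waits on.
durations = {"get data": 5, "clean data": 8, "build model": 6,
             "validate model": 4, "write report": 3}
depends_on = {"clean data": ["get data"],
              "build model": ["clean data"],
              "validate model": ["build model"],
              "write report": ["validate model"]}

@lru_cache(maxsize=None)
def earliest_finish(task):
    """Earliest finish = task duration plus the latest finish among its prerequisites."""
    prereqs = depends_on.get(task, [])
    return durations[task] + max((earliest_finish(p) for p in prereqs), default=0)

project_duration = max(earliest_finish(t) for t in durations)
print(f"estimated project duration: {project_duration} days")   # 26 days for this toy plan
```

This is all a Gantt chart is doing under the hood: combining the estimates you gave it and making the consequences of the dependencies visible.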
[Gantt chart example from http://upload.wikimedia.org/wikipedia/en/7/73/Pert_example_gantt_chart.gif]
A lot of people prefer web tools like Jira, Basecamp or FogBugz. And these tools may be necessary once you attempt distributed time tracking (as passing around files will eventually fail). However, for actual planning I find nothing comes close to the usability of Microsoft Project or similar desktop applications like OmniPlan. I prefer desktop apps here: you just can’t plot out the consequences of different scope alternatives using a web app that takes 5 to 10 seconds per click.
However, tools are a matter of taste. The principle is: if your team is going to the trouble to estimate component times for you, you should return the favor by estimating the project schedule for them.
Takeaways
Data science is an exciting cross-disciplinary field. Most of the theory of data science is in fact statistics. Most of the infrastructure of data science is software engineering. It should not be a huge surprise that software engineering practices (not just the tools and technology) are very helpful in ensuring the success of data science projects. But I feel a reminder helps.