A must-read paper on statistical analysis of experimental data
Statistical Modeling, Causal Inference, and Social Science 2013-03-15
Russ Lyons points to an excellent article on statistical experimentation by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu, a group of software engineers (I presume) at Microsoft. Kohavi et al. write:
Online controlled experiments are often utilized to make data-driven decisions at Amazon, Microsoft . . . deployment and mining of online controlled experiments at scale—thousands of experiments now—has taught us many lessons.
The paper is well written and has excellent examples (unfortunately the substantive topics are unexciting things like clicks and revenue per user, but the general principles remain important). The ideas will be familiar to anyone with experience in practical statistics but don’t always make it into textbooks or courses, so I think many people could learn a lot from this article. I was disappointed that they didn’t cite much of the statistics literature—not even the classic Box, Hunter, and Hunter book on industrial experimentation—but that’s probably because most of the statistics literature is so theoretical.
Several of their examples rang true for me; for example, the paper has a pair of graphs illustrating how people can be fooled by “statistically significant” or “nearly significant” results.
The graphs aren’t so pretty but I guess that’s what happens when you work for Microsoft and you have to do all your graphs in Excel . . . anyway, this sort of thing is what’s behind problems like the notorious sex-ratio study. And their point about autocorrelation of cumulative averages reminded me of the “55,000 residents desperately need your help!” study that was featured in my book with Jennifer.
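If you want to see the cumulative-average problem for yourself, here's a quick simulation sketch in Python (mine, not from the paper, and with all the details made up for illustration): an A/A test where the two arms have identical distributions, but where you peek at the running z-statistic after every batch of users. Because the cumulative difference in means wanders, repeated peeking finds "significance" far more often than the nominal 5%.

```python
# Simulation sketch (not from Kohavi et al.): an A/A test, peeked at
# after every batch of users. With many interim looks, the chance of
# crossing |z| > 1.96 at least once is far above the nominal 5%.
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_users = 2000, 10_000
peek_every = 100

false_alarms = 0
for _ in range(n_sims):
    a = rng.normal(size=n_users)  # control arm
    b = rng.normal(size=n_users)  # treatment arm, same distribution
    n = np.arange(1, n_users + 1)
    # running difference of cumulative means, and its z-statistic
    # (each arm has known variance 1, so var of the difference is 2/n)
    diff = np.cumsum(b) / n - np.cumsum(a) / n
    z = diff / np.sqrt(2.0 / n)
    # peek every 100 users; declare "significance" if any peek crosses 1.96
    if np.any(np.abs(z[peek_every - 1 :: peek_every]) > 1.96):
        false_alarms += 1

print(f"false-positive rate with peeking: {false_alarms / n_sims:.2f}")
# prints roughly 0.4 in runs like this, not the nominal 0.05
```

The point is not that the z-test is broken; it's that the stopping rule matters, and a cumulative average that has "gone significant" at some interim look is exactly the kind of pattern their graphs warn about.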
I was impressed that this group of people, working for just a short period of time, came up with and recognized several problems that it took me many years to notice. Working on real problems and trying to get real answers seems to make a real difference (or so I claim, without any controlled study!). The motivations are very different in academic social science, where the goal is to get statistical significance, publish papers, and establish a name for yourself via new and counterintuitive findings. All of that is pretty much a recipe for wild goose chases.
P.S. Sorry, they do cite Box, Hunter, and Hunter—I’d missed that!