The reality is most A/B tests fail, and Facebook is here to help

Junk Charts 2014-04-16

Two years ago, Wired breathlessly extolled the virtues of A/B testing (link). Many Web companies are at the forefront of this movement, running hundreds or thousands of tests daily. The reality is that most A/B tests fail.

A/B tests fail for many reasons. Typically, business leaders consider a test to have failed when the analysis fails to support their hypothesis. "We ran all these tests varying the color of the buttons, and nothing significant ever surfaced, and it was all a waste of time!" For smaller websites, it may take weeks or even months to collect enough samples to read a test, so business managers are understandably upset when no action can be taken at its conclusion. It feels like waiting for a train that is running behind schedule.

A disappointing outcome, however, isn't the primary reason for A/B test failure. The main ways in which A/B tests fail are:

1. Bad design (or no design);

2. Bad execution;

3. Bad measurement.

These issues are often ignored or dismissed. They may not even be noticed if the engineers running the tests have not taken a proper design of experiments class. However, even though I earned an A at school, it wasn't until I started running real-world experiments that I really learned the subject. This is an area in which theory and practice are both necessary.

The Facebook Data Science team just launched an open platform for running online experiments, called PlanOut. This looks like a helpful tool for avoiding design and execution problems. I highly recommend looking into how to integrate it with your website. An overview is here, and a more technical paper (PDF) is also available. There is also a GitHub page.

The rest of this post gets into some technical, sausage-factory stuff, so be warned.

***

Bad design means the experiment is set up in such a way that it cannot provide data to answer the research question. I will give just one example here; it is one of many, many ways to fail.

Let's say you want to test changing the text on your registration button from "Sign up Now" to "Join Free". You run a standard A/B test, randomizing your visitors into two paths, everything else being the same except the button text. After you present the test result, the business owner asks to look at geographical segments. That is when you realize that your visitors are split 40% English and 60% non-English. Your site doesn't yet have translated pages, so everyone sees the English version. What's the problem?

The problem is noise in your data. Since you don't have translated pages, your registrations are very likely biased toward English speakers even though your traffic is more international. A better design would have restricted the test to English-speaking countries. Because the change in the button text is unlikely to influence non-English-speaking visitors, averaging their results into the overall registration rate dilutes the signal coming from English speakers. You are at risk of making a false-negative error.
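To see the dilution concretely, here is a small simulation with made-up registration rates (4.0% vs. 4.6% for English speakers, 1.5% for everyone else, unaffected by the button text); none of these numbers come from a real test.

```python
import random

random.seed(2014)

def registers(is_english, arm):
    """Return 1 if the visitor registers, 0 otherwise (made-up rates)."""
    if is_english:
        rate = 0.040 if arm == "A" else 0.046  # the new button text lifts English speakers
    else:
        rate = 0.015                           # non-English visitors are unaffected
    return 1 if random.random() < rate else 0

def registration_rate(n, share_english, english_only, arm):
    regs = visitors = 0
    for _ in range(n):
        is_english = random.random() < share_english
        if english_only and not is_english:
            continue  # the better design: exclude visitors the change cannot affect
        visitors += 1
        regs += registers(is_english, arm)
    return regs / visitors

n = 200_000
for english_only in (False, True):
    a = registration_rate(n, 0.40, english_only, "A")
    b = registration_rate(n, 0.40, english_only, "B")
    label = "English only" if english_only else "all traffic"
    print(f"{label:12s}  A={a:.4f}  B={b:.4f}  relative lift={(b - a) / a:+.1%}")
```

Restricting the test to English speakers recovers the full relative lift; averaging over everyone shrinks the measured lift substantially and makes it that much harder to distinguish from noise.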

The examples of failed designs are endless. I will discuss other examples in future posts. Worse than failed designs is no design at all. The Facebook team quotes Ronald Fisher: "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of."

***

Worse than bad or no design is bad execution of a good design. The essence of designed experiments is controlling the system--making sure everything else stays the same so that we can isolate the effect of the treatment being investigated. Let's just say websites are complex beasts, and bad execution is the norm rather than the exception.

Start with the experimental unit. How do you identify individuals? The simplest way is to assign user ids, but user ids are only available when the user logs in. If the same user performs activities in both logged-in and logged-out states, you're out of luck. The next available technology is the cookie, but cookies can be cleared, and they do not persist across devices or browsers. There are also caching and other technical complications--solutions to other engineering problems that adversely affect A/B test execution.

At other times, your experiment may not involve individuals at all; it may involve impressions (page views). You might want to randomize the design of a particular web page regardless of who loads it. There are no ready-made identifiers for impressions.
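A minimal sketch of the kind of fallback logic this implies (the function and return format are my own, not taken from any particular system):

```python
import uuid

def experimental_unit(user_id=None, cookie_id=None):
    """Pick the most stable identifier available for the current request."""
    if user_id is not None:
        return ("userid", str(user_id))      # survives across devices, but requires login
    if cookie_id is not None:
        return ("cookieid", cookie_id)       # survives within one browser, until cleared
    return ("impression", uuid.uuid4().hex)  # fresh id per page view, for impression-level tests
```

Note that mixing unit types within a single experiment re-creates the logged-in/logged-out problem described above, so in practice the unit type is usually fixed per experiment.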

Next is randomization, which is at the core of the A/B testing methodology. Over the years, I have come across many formulas for computing random numbers on the fly, half of which are probably not truly random. We should have a standard for this.
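One common approach, similar in spirit to what the PlanOut paper describes, is to skip on-the-fly random number generation entirely and hash a per-experiment salt together with the unit id; the sketch below is mine, not PlanOut's code:

```python
import hashlib

def assign_arm(experiment_salt, unit_id, arms=("A", "B"), weights=(0.5, 0.5)):
    """Deterministically map a unit to an arm by hashing the salt and unit id."""
    digest = hashlib.sha1(f"{experiment_salt}.{unit_id}".encode()).hexdigest()
    point = int(digest[:15], 16) / 16**15  # roughly uniform in [0, 1)
    cumulative = 0.0
    for arm, weight in zip(arms, weights):
        cumulative += weight
        if point < cumulative:
            return arm
    return arms[-1]

# The same unit always gets the same arm with no server-side state, and
# different experiments (different salts) are assigned independently.
print(assign_arm("signup_button_text", "cookie:9f2c81"))
```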

Another issue relates to understanding the structure of the website and its pathways. This is one of the toughest things to do for a large website. It's easy to unintentionally move some element of a page, not realizing that some visitors pass by that element on the way to the conversion page several steps later. Frequently, it's not negligence--people really didn't know such a pathway existed when the test was designed.

Further, if you run many simultaneous tests, it is extremely important to understand how one test might affect another. Needless to say, this issue is often swept under the rug. The typical reasoning is that if we randomize everything, we will be fine. This is a fallacy: randomized inputs guarantee balanced outputs only if none of the treatments has any effect, which I hope is not the case.
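A crude but useful check is to cross-tabulate the outcome by both assignments; the rows below are invented for illustration:

```python
from collections import defaultdict

# Each record: (arm in test 1, arm in test 2, converted?). In practice these
# come from joining the two experiments' assignment logs on the unit id.
records = [
    ("A", "A", 1), ("A", "B", 0), ("B", "A", 1),
    ("B", "B", 0), ("A", "A", 0), ("B", "B", 1),
]

cells = defaultdict(lambda: [0, 0])  # (arm1, arm2) -> [conversions, visitors]
for arm1, arm2, converted in records:
    cells[(arm1, arm2)][0] += converted
    cells[(arm1, arm2)][1] += 1

for (arm1, arm2), (conv, n) in sorted(cells.items()):
    print(f"test1={arm1} test2={arm2}  rate={conv / n:.3f}  n={n}")

# If test 1's lift looks very different within test 2's arms (and the cells
# are large enough), the two experiments interact and cannot be read separately.
```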

***

The third problem is bad measurement, which is often hidden from view. The trickiest issue is that the data you are analyzing are not what you think they are.

For example, the experiment log may only tell you what treatment the unit was supposed to receive, not what it actually received. It is considered acceptable to assume that the actual treatment is the designed treatment. Or you may have the opposite problem: you have a log of the actual treatment but you don't know what the design called for, which means you are assuming that the random assignment of treatment was properly administered. Needless to say, these assumptions are quite bold. Too bold for my taste.
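One way to avoid taking these on faith is to log both events and reconcile them; the schema below is an assumption for illustration, not any particular system's format:

```python
import json
import time

def log_event(event, experiment, unit, arm):
    """Write one JSON line per event (schema assumed for illustration)."""
    print(json.dumps({
        "ts": time.time(),
        "event": event,            # "assigned" or "exposed"
        "experiment": experiment,
        "unit": unit,
        "arm": arm,
    }))

# At assignment time: what the design says this unit should receive.
log_event("assigned", "signup_button_text", "cookie:9f2c81", "B")

# At render time: what the unit actually saw (caching and bugs can differ here).
log_event("exposed", "signup_button_text", "cookie:9f2c81", "B")

# The analysis then joins the two streams and counts mismatches, rather than
# assuming the designed treatment equals the delivered treatment.
```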

***

So Facebook came to the rescue last week when it introduced PlanOut, an internally grown platform for managing and executing online experiments. I have previously commended the Facebook Data Science team on the meticulous way they ran their experiments (link). I'm glad they are sharing their system with the rest of the world.

While I have not tried the system (we put together something similar at Vimeo, though less fully featured), I noted the following key components (a rough sketch of how a few of them fit together follows the list):

- abstracting the experimental parameters as a separate layer

- ability to target specific segments for testing

- ability to work with different levels of experimental units (user_ids, cookie_ids, ...)

- ability to set experimental fractions other than 50/50

- inclusion of standard random number generators

- integrated logging

- management interface for multiple simultaneous experiments to different segments
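
To give a flavor of the first few items, here is a rough sketch adapted from the examples in the PlanOut documentation; I have not run it, so treat the exact class and operator names as assumptions:

```python
from planout.experiment import SimpleExperiment
from planout.ops.random import WeightedChoice

class SignupButtonExperiment(SimpleExperiment):
    def assign(self, params, userid):
        # The parameter is abstracted away from the page code; the split is
        # 80/20 rather than 50/50, keyed on the user id as the unit.
        params.button_text = WeightedChoice(
            choices=["Sign up Now", "Join Free"],
            weights=[0.8, 0.2],
            unit=userid)

exp = SignupButtonExperiment(userid=42)
print(exp.get("button_text"))  # SimpleExperiment also writes an exposure log
```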

What I hope they will release in the future:

- establishing the experimental units

- quality-control charts

- monitoring reports