A bit more on sample size

Win-Vector Blog 2013-03-15

In our article What is a large enough random sample? we pointed out that if you wanted to measure a proportion to an accuracy “a” with chance of being wrong of “d” then one idea was to guarantee you had a sample size of at least:

$$n \ge \frac{-\ln(d/2)}{2 a^2}$$

This is the central question in designing opinion polls or running A/B tests. This estimate comes from a quick application of Hoeffding’s inequality, and because it has a simple form it is easy to see that accuracy is very expensive (to halve the size of the difference we are trying to measure we have to multiply the sample size by four) and that confidence is cheap (increases in the required confidence or significance of a result cost only moderately in sample size).
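To make the cost trade-off concrete, here is a minimal sketch in R that uses the same bound as the estimate function defined later in this article. The particular values of a and d are chosen only for illustration.

```r
# Hoeffding-style sample size bound, same formula as estimate() later in this article
estimate = function(a,d) { -log(d/2)/(2*a^2) }

base = estimate(0.1,0.05)           # accuracy 0.1, failure chance 0.05
halfAccuracy = estimate(0.05,0.05)  # halve the difference to be measured
tenthFailure = estimate(0.1,0.005)  # cut the failure chance by a factor of 10

print(halfAccuracy/base)  # cost of halving a
print(tenthFailure/base)  # cost of a 10x smaller d
```

Halving a quadruples the bound, while a tenfold reduction in d raises it by well under a factor of two.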

However, for high-accuracy situations (when you are trying to measure two effects that are very close to each other) the estimate suggests a sample size that is larger than is strictly necessary (as we are using a bound, not an exact formula for the required sample size). As theorists or statisticians we like to err on the side of too large a sample (guaranteeing reliability), but somebody who is paying for each entry in a poll would want a smaller size.

This article shows a function that computes the exact size needed (using R).

The bound we gave in What is a large enough random sample? is correct: a sample of the stated size always at least achieves the desired accuracy and significance goals. Sample size and significance seem a bit abstract if you forget the underlying points: with too small a sample size you can’t state a conclusion with any confidence, and with too large a sample size you have spent too much money on your experiments. This is a central issue in measurement, whether you are measuring something serious (a clinical trial, an opinion poll or an A/B test) or something silly (the statistics of a game). In addition to knowing how to estimate the significance of an experiment after the fact, you need to know how to design an experiment to achieve significance; that is largely a matter of picking a big enough sample size, the subject of this article.

We could get a better bound on sample size by using a more detailed version of the Chernoff bound that better accounts for small sample sizes. Or, as we will do here, we can say bounds are only useful if they are simple (so they give us usable intuition) and move on to an exact calculation.

The exact sample size needed is determined by a simple use of the binomial distribution (used to calculate how often a sequence of coin flips exhibits a given range of averages). The R code to find the exact sample size is given in the function binomsize(a,d) below:

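library('gtools')   # provides binsearch()
# Hoeffding bound on the needed sample size
estimate = function(a,d) { -log(d/2)/(2*a^2) }
# chance that n fair coin flips come up short of n/2 heads by at least n*a
sig = function(a,n) { pbinom(floor(n*0.5)-floor(n*a),size=floor(n),prob=0.5) }
# binary search for the sample size where sig(a,n) crosses d
binomsize = function(a,d) {
  r=c(1,2*estimate(a,d))
  v=binsearch(function(n) {sig(a,n) - d},range=r,lower=min(r),upper=max(r))
  v$where[[length(v$where)]]
}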
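As a sanity check on what sig computes, the sketch below compares pbinom against a simulation of fair coin flips. The seed, the number of simulated polls, and the choice of n = 80 and a = 0.1 are illustrative assumptions, not part of the original code.

```r
# same definition as sig() above: chance that a fair coin's head count
# falls at or below floor(n/2) - floor(n*a) in n flips
sig = function(a,n) { pbinom(floor(n*0.5)-floor(n*a),size=floor(n),prob=0.5) }

set.seed(2013)   # arbitrary seed, only for repeatability
n = 80
a = 0.1
heads = rbinom(100000,size=n,prob=0.5)   # 100000 simulated polls of n flips each
cutoff = floor(n*0.5)-floor(n*a)         # head count at or below which we are off by a
empirical = mean(heads <= cutoff)        # fraction of simulated polls that miss

print(c(exact=sig(a,n),simulated=empirical))
```

The two numbers should agree up to simulation noise, which is a quick way to confirm that sig is measuring the one-sided chance of a misleading poll.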

For example: binomsize(0.1,0.05) = 80 tells us that a sample size of 80 is enough to measure a difference in rates as small as 0.1 with a chance of mis-measurement of no more than 0.05. That is, if you want to measure the popularity of a single candidate to within ±10% with no more than a 0.05 chance of being wrong, you need a sample size of at least 80 respondents. In a poll of 80 people, if your candidate is marked as favorable by more than 60% of respondents then with 19 chances out of 20 they are in fact the more popular candidate (assuming your sample of 80 was truly representative). On the other hand, estimate(0.1,0.05) is 184.4, which is more than twice the minimum necessary size (though it is safe to use).
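The numbers in this example can be checked directly with base R, without the binary search. The sketch below only re-evaluates the article's sig and estimate functions at n = 80 and n = 79; it does not claim anything about other sample sizes.

```r
estimate = function(a,d) { -log(d/2)/(2*a^2) }
sig = function(a,n) { pbinom(floor(n*0.5)-floor(n*a),size=floor(n),prob=0.5) }

a = 0.1
d = 0.05
print(sig(a,80) <= d)     # is a sample of 80 within the error budget?
print(sig(a,79) <= d)     # is one respondent fewer still within it?
print(estimate(a,d)/80)   # how many times larger the bound is than 80
```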

Our estimate was designed to always be at least the true value (so it is a valid bound), but it is often much larger than the needed value. We illustrate this with the commands below, which yield the plot that follows.

library('ggplot2')
library('reshape2')
d = data.frame(accuracy=10^seq(from=-0.9,to=-7,by=-0.1))
d$estimateSize = estimate(d$accuracy,0.05)
d$binomialSize = sapply(d$accuracy,function(x){ binomsize(x,0.05)})
dmelt = melt(d,id.vars=c('accuracy'))
ggplot(data=dmelt,aes(x=1/accuracy,y=value,color=variable)) +
   geom_line()

[Figure: estimateSize and binomialSize versus 1/accuracy]

The difference doesn’t look so bad if we plot on a log-log scale “on which anything looks like a straight line” (F. J. Yndurain, 1996):

ggplot(data=dmelt,aes(x=1/accuracy,y=value,color=variable)) +
   geom_line() + scale_y_log10() + scale_x_log10()

[Figure: estimateSize and binomialSize versus 1/accuracy, log-log axes]

In fact we can see the ratio of the estimate over the actual needed sample size is approaching e:

ggplot(data=d,aes(x=1/accuracy,y=estimateSize/binomialSize)) +
   geom_line() + scale_x_log10()

[Figure: estimateSize/binomialSize versus 1/accuracy, log-scaled x axis]
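One way to sanity-check the limiting ratio is the normal approximation to the binomial, which is an assumption of this sketch rather than something the exact code uses. Under it, sig(a,n) is about pnorm(-2*a*sqrt(n)), so the needed size is about qnorm(1-d)^2/(4*a^2), and dividing estimate(a,d) by that gives a ratio that no longer depends on a. The helper name limitRatio is hypothetical.

```r
# ratio of the Hoeffding estimate to the normal-approximation sample size;
# the a^2 terms cancel, leaving a function of d alone
limitRatio = function(d) { -2*log(d/2)/qnorm(1-d)^2 }

print(limitRatio(0.05))   # the d used in the plotting code
print(exp(1))             # e, for comparison
print(limitRatio(0.01))   # a different d, to see how much the limit moves
```

Under this approximation the limit for d = 0.05 lands within about 0.01 of e, but it shifts for other values of d, so the match with e is best read as specific to d = 0.05.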

So we can use the following formula as a “good rule of thumb” (but not as a bound, as it is never quite a large enough sample!): you should always have a sample size larger than:

$$n > \frac{-\ln(d/2)}{2 e a^2}$$
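As a sketch, assuming the rule of thumb is the bound divided by e (which is what the ratio of the estimate to the exact size suggests), it can be written as a one-line function. The name ruleOfThumb is hypothetical and not from the original code.

```r
estimate = function(a,d) { -log(d/2)/(2*a^2) }
sig = function(a,n) { pbinom(floor(n*0.5)-floor(n*a),size=floor(n),prob=0.5) }

# assumed rule of thumb: the Hoeffding estimate scaled down by e
ruleOfThumb = function(a,d) { estimate(a,d)/exp(1) }

n = ceiling(ruleOfThumb(0.1,0.05))
print(n)                    # planning size for a = 0.1, d = 0.05
print(sig(0.1,n) <= 0.05)   # does that size meet the error budget exactly?
```

For this a and d the rule-of-thumb size falls short of the exact requirement, which is why it is useful for planning and not as a guarantee.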

Our advice is: use our bound or the rule of thumb to plan. But when the time comes to run your test, quickly compute the exact size with the binomial calculation given above.