We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling

The Endeavour 2018-01-18

Aki points out that in cases of antithetical sampling, our effective sample size calculations were unduly truncated above at the number of iterations. It turns out the effective sample size can be greater than the number of iterations if the draws are anticorrelated. And all we really care about for speed is effective sample size per unit time.

NUTS can be antithetical

The desideratum for a sampler Andrew laid out to Matt was to maximze expected squared transition distance. Why? Because that’s going to maximize effective sample size. (I still hadn’t wrapped my head around this when Andrew was laying it out.) Matt figured out how to achieve this goal by building an algorithm that simulated the Hamiltonian forward and backward in time at random, doubling the time at each iteration, and then sampling from the path with a preference for the points visited in the final doubling. This tends to push iterations away from their previous values. In some cases, it can lead to anticorrelated chains.

Removing this preference for the second half of the chains drastically reduces NUTS’s effectiveness. Figuring out how to include it and satisfy detailed balance was one of the really nice contributions in the original NUTS paper (and implementation).

Have you ever seen 4000 as the estimated n_eff in a default Stan run? That’s probably because the true value is greater than 4000 and we truncated it.

The fix is in

What’s even cooler is that the fix is already in the pipeline and it just happens to be Aki’s first C++ contribution. Here it is on GitHub:

Aki’s also done simulations, so the new version is actually better calibrated as far as MCMC standard error goes (posterior standard deviation divided by the square root of the effective sample size).

A simple example

Consider three Markov processes for drawing a binary sequence y[1], y[2], y[3], …, where each y[n] is in { 0, 1 }. Our target is a uniform stationary distribution, for which each sequence element is marginally uniformly distributed,

Pr[y[n] = 0] = 0.5     Pr[y[n] = 1] = 0.5

Process 1: Independent. This Markov process draws each y[n] independently. Whether the previous state is 0 or 1, the next state has a 50-50 chance of being either 0 or 1.

Here are the transition probabilities:

Pr[0 | 1] = 0.5   Pr[1 | 1] = 0.5
Pr[0 | 0] = 0.5   Pr[1 | 0] = 0.5

More formally, these should be written in the form

Pr[y[n + 1] = 0 | y[n] = 1] = 0.5

For this Markov chain, the stationary distribution is uniform. That is, some number of steps after initialization, there’s a probability of 0.5 of being in state 0 and a probability of 0.5 of being in state 1. More formally, there exists an m such that for all n > m,

Pr[y[n] = 1] = 0.5

The process will have an effective sample size exactly equal to the number of iterations because each state in a chain is independent.

Process 2: Correlated. This one makes correlated draws and is more likely to emit sequences of the same symbol.

Pr[0 | 1] = 0.01   Pr[1 | 1] = 0.99
Pr[0 | 0] = 0.99   Pr[1 | 0] = 0.01

Nevertheless, the stationary distribution remains uniform. Chains drawn according to this process will be slow to mix in the sense that they will have long sequences of zeroes and long sequences of ones.

The effective sample size will be much smaller than the number of iterations when drawing chains from this process.

Process 3: Anticorrelated. The final process makes anticorrelated draws. It’s more likely to switch back and forth after every output, so that there will be very few repeating sequences of digits.

Pr[0 | 1] = 0.99   Pr[1 | 1] = 0.01
Pr[0 | 0] = 0.01   Pr[1 | 0] = 0.99

The stationary distribution is still uniform. Chains drawn according to this process will mix very quickly.

With an anticorrelated process, the effective sample size will be greater than the number of iterations.

Visualization

If I had more time, I’d simulate, draw some traceplots, and also show correlation plots at various lags and the rate at which the estimated mean converges. This example’s totally going in the Coursera course I’m doing on MCMC, so I’ll have to work out the visualizations soon.

The post We were measuring the speed of Stan incorrectly—it’s faster than we thought in some cases due to antithetical sampling appeared first on Statistical Modeling, Causal Inference, and Social Science.