“Close but no cigar” unit tests and bias in MCMC

Statistical Modeling, Causal Inference, and Social Science 2024-04-17

I’m coding up a new adaptive sampler in Python, which is super exciting (the basic methodology is due to Nawaf Bou-Rabee and Tore Kleppe). Luckily for me, another great colleague, Edward Roualdes, has been keeping me on the straight and narrow by suggesting stronger tests and pointing out actual bugs in the repository (we’ll make it publicly available when we put the arXiv paper up, hopefully by the end of the month).

There are a huge number of potential fencepost (off by one), log-vs-exponential, positive-vs-negative, numerator-vs-denominator, and related errors to make in this kind of thing. For example, here’s a snippet of the transition code.

L = self.uturn(theta, rho)                    # steps to one past the U-turn point
LB = self.lower_step_bound(L)                 # lower bound on the number of steps
N = self._rng.integers(LB, L)                 # uniform draw from {LB, ..., L - 1}
theta_star, rho_star = self.leapfrog(theta, rho, N)
rho_star = -rho_star                          # flip momentum
Lstar = self.uturn(theta_star, rho_star)      # steps to one past the U-turn coming back
LBstar = self.lower_step_bound(Lstar)
if not(LBstar <= N and N < Lstar):            # N must be a possible draw in reverse
    ... reject ...

Looks easy, right? Not quite. The uturn function returns the number of steps to get to a point that is one step past the U-turn point. That is, if I take L steps from (theta, rho), I wind up closer to where I started than if I take L - 1 steps. The rng.integers function samples uniformly, but it’s Python, so it excludes the upper bound and samples from {LB, LB + 1, ..., L - 1}. That’s correct, because I want to choose a number of steps that is at least the lower bound and strictly less than the number of steps that takes me past the U-turn point. Let’s just say I got this wrong the first time around.
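As a quick sanity check on that exclusive upper bound, here’s a minimal sketch (the values of LB and L are just illustrative, not taken from the sampler):

import numpy as np

rng = np.random.default_rng(0)
LB, L = 3, 8
draws = rng.integers(LB, L, size=100_000)
print(draws.min(), draws.max())  # prints 3 7: the upper bound L is excluded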

Because it’s MCMC and I want a simple proof of correctness, I have to make sure the chain’s reversible. So I see how many steps it takes to get one past a U-turn coming back (after the momentum flip), which is Lstar. Now I have to grab its lower bound and make sure that the number of steps I took falls between the lower bound (inclusive) and the upper bound (exclusive). Yup, had this wrong at one point. But the off-by-one error shows up in a position that is relatively rare given how I was sampling.

For more fun, we have to compute the acceptance probability. In theory, it’s just p(theta_star, rho_star, N) / p(theta, rho, N) in this algorithm, which looks as follows on the log scale.

log_accept = (
    self.log_joint(theta_star, rho_star) - np.log(Lstar - LBstar)
    - (log_joint_theta_rho - np.log(L - LB))
)

That’s because p(N | theta_star, rho_star) = 1 / (Lstar - LBstar) given the uniform sampling with Lstar excluded and LBstar included. But then I generalized the code to support a binomial number of steps in place of the uniform, and made the following mistake.

log_accept = (
  self.log_joint(theta_star, rho_star) - self.length_log_prob(N, Lstar)
  - (log_joint_theta_rho - self.length_log_prob(N, L))
)

I only had the negation in -np.log(L - LB) because it was equivalent to np.log(1 / (L - LB)) with a subtraction instead of a division. Luckily Edward caught this one in the code review. I should’ve just coded the log density and added it rather than subtracted it. Now you’d think this would lead to an immediate and glaring bug in the results, because MCMC is a delicate algorithm. In this case, the issue is that (N - L) and (N - Lstar) are identically distributed and only range over values of roughly 5 to 7. That’s a minor difference in a stochastic acceptance probability that’s already high. How hard was this to detect? With 100K iterations, everything looked fine. With 1M iterations, the error in the parameter estimates kept shrinking at the expected 1 / sqrt(iterations) rate, but the error in the estimates of the squared parameters leveled off at a residual value after about 100K iterations. That is, it took 1M iterations and an evaluation of the means of the squared parameters to detect this bug.
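For the record, here’s a sketch of the corrected version, with the length distribution’s log density added on both sides rather than subtracted (same method names as in the snippets above):

log_accept = (
    self.log_joint(theta_star, rho_star) + self.length_log_prob(N, Lstar)
    - (log_joint_theta_rho + self.length_log_prob(N, L))
)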

I then introduced a similar error when I went to a binomial number-of-steps selection. I was using sp.stats.binom.logpmf(N, L, self._success_prob) when I should have been using sp.stats.binom.logpmf(N, L - 1, self._success_prob). As an aside, I like SciPy’s clear naming here vs. R’s dbinom(..., log = TRUE). What I don’t like about Python is that the discrete uniform doesn’t include its upper endpoint. The binomial, of course, includes its upper endpoint as a possible outcome, so the two versions have to be coded off by one from each other. And of course, I missed the L - 1. This only introduced a bug because I didn’t make the matching adjustment in the test for reversibility. That’s if not(1 <= N and N < Lstar) to match the Lstar - 1 in the logpmf() call. If I had run the binomial all the way to L, I would’ve needed N <= Lstar. This is another subtle difference that only shows up after more than 100K iterations.
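Here’s a minimal sketch of the two pieces that have to stay in sync; the standalone functions and names are mine, but the SciPy call and the off-by-one convention follow the discussion above:

import scipy.stats as stats

def length_log_prob(N, L, success_prob):
    # log pmf of the number of steps N under a binomial with support {0, ..., L - 1};
    # the L - 1 size argument mirrors the exclusive upper bound of the uniform version
    return stats.binom.logpmf(N, L - 1, success_prob)

def reverse_ok(N, Lstar):
    # matching reversibility check: N must be a possible draw in the reverse
    # direction, so N < Lstar rather than N <= Lstar
    return 1 <= N and N < Lstar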

We introduced a similar problem into Stan in 2016 when we revised NUTS to do multinomial sampling rather than slice sampling. It was an off-by-one error on trajectory length. All of our unit tests of roughly 10K iterations passed. A user spotted the bug by fitting a 2D correlated normal with known correlation for 1M iterations and noticing that the estimates were off by 0.01 when they should’ve had smaller error. We reported this on the blog back when it happened, culminating in the post “Michael found the bug in Stan’s new sampler.”
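The general recipe for that kind of test is to run long and compare moment estimates against known truth at the scale of the expected Monte Carlo error. Here’s a rough sketch of such a check (the function name and tolerance are mine; it assumes draws for a parameter with known mean 0 and variance 1, and the iid-scale standard errors are optimistic for correlated MCMC draws, so in practice you’d scale by effective sample size):

import numpy as np

def check_known_moments(draws, tol=4.0):
    # draws: 1D array of MCMC draws for a parameter whose true mean is 0
    # and true variance is 1 (e.g., a margin of a standard normal target)
    n = len(draws)
    assert abs(np.mean(draws)) < tol / np.sqrt(n)
    # mean of squares should be 1; var(X^2) = 2 for a standard normal
    assert abs(np.mean(draws**2) - 1.0) < tol * np.sqrt(2.0 / n)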

I was already skeptical of empirical results in papers and this is making me even more skeptical!

P.S. In case you don't know the English idiom "close but no cigar", here's the dictionary definition from Cambridge (not Oxford!).