Bias remaining after adjusting for pre-treatment variables. Also the challenges of learning through experimentation.

Statistical Modeling, Causal Inference, and Social Science 2024-12-10

Hey, this is the kind of post that everyone absolutely loves. No John Updike, no Joe Biden, no authors or politicians at all. Nothing about p-values or scientific misconduct. It’s not even Bayesian! Just statistics, teaching, and code.

So sit back and enjoy . . .

We had the following problem on the practice exam:

An observational study is simulated using the following code for pre-test x, treatment z, and post-test y:

n <- 100
x <- runif(n, -1, 1)
z <- rbinom(n, 1, invlogit(x))
y <- 0.2 + 0.3*x + 0.5*z + rnorm(n, 0, 0.4)
fake <- data.frame(x, y, z)
fit <- stan_glm(y ~ x + z, data=fake, refresh=0)
estimate <- coef(fit)["z"]

In this simulation, the true treatment effect is 0.5. Which of the following statements is correct?

(a) The estimate will probably be less than 0.5 because the model also adjusts for x, which is positively correlated with z, and this adjustment for x will suck up some of the explanatory power of z.

(b) The estimate will probably be greater than 0.5 because there is imbalance in the treatment assignment: the treatment is more likely to be assigned to people with higher pre-test scores, which will artificially make the treatment look more effective.

(c) The estimate will probably be greater than 0.5 because, given the finite sample size, the adjustment for x is noisy and is likely to undercorrect for imbalance in the design.

(d) The estimate is unbiased because the model correctly adjusts for differences in pre-test between treatment and control groups.

OK, first take a moment to figure this one out, and then we will go on. Ready?

The correct answer is (d), which we can check using a simple simulation in R:

library("arm")
n_loop <- 1000
estimate <- rep(NA, n_loop)
for (loop in 1:n_loop) {
  n <- 100
  x <- runif(n, -1, 1)
  z <- rbinom(n, 1, invlogit(x))
  y <- 0.2 + 0.3*x + 0.5*z + rnorm(n, 0, 0.4)
  fake <- data.frame(x, y, z)
  fit <- glm(y ~ x + z, data=fake)   # ordinary least squares via glm() here, rather than stan_glm() as in the exam question
  estimate[loop] <- coef(fit)["z"]
}
print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=2)

Here's the result I got:

[1] 0.4994 0.0027

If you run this at home, you'll get a different result, because I have not set the random seed.
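
If you do want to be able to reproduce the numbers exactly, you can add a line like this at the top of the script (the particular seed value is arbitrary; this is just a suggestion, not part of the original code):

set.seed(123)   # fix the random-number seed so repeated runs give identical results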

Just for laffs, I'll run it again. Here's what pops out:

[1] 0.5028 0.0026

As theory predicts, the simulation results are consistent with zero bias, a result which is mathematically obvious once you realize that you're fitting unregularized least squares to data simulated from the class of models that you're fitting.
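
To spell out why it's mathematically obvious (this is my gloss, in standard matrix notation): write the design matrix as X, with columns for the intercept, x, and z, so the simulation is y = X*beta + e. Then the least-squares estimate is

beta_hat = (X'X)^{-1} X'y = beta + (X'X)^{-1} X'e,

and because the errors e are drawn independently of x and z, E(beta_hat | X) = beta. The fact that the treatment assignment depends on x doesn't break this, because x is included in the regression; all that's needed is that the regression function is correctly specified and the errors are independent of the predictors.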

After doing this simulation, we discussed the problem in class. One question that arose was, why did we do this weird treatment assignment with invlogit(x)? My response was that I wanted to demonstrate what would happen when there's an imbalance between treatment and control groups. The way the code is set up above, the probability of getting the treatment is higher when the pre-test measurement is higher. An example could be if x is a standardized pre-test score and z is some accelerated education program that better-scoring students are more likely to receive.
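
You can see the imbalance directly in any one simulated dataset. Here's a quick balance check (my addition, not part of the class demonstration), comparing the average pre-test score in the treatment and control groups:

library("arm")
n <- 100
x <- runif(n, -1, 1)
z <- rbinom(n, 1, invlogit(x))   # higher pre-test, higher chance of treatment
print(c(mean(x[z==1]), mean(x[z==0])), digits=2)

The treated group's average pre-test will typically come out around 0.15 above zero and the control group's around 0.15 below, though with n = 100 there's a fair amount of noise from run to run.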

For the purpose of teaching regression and causal inference, the point of this example is that in the regression analysis you should adjust for all variables that are predictive of both the treatment and the outcome. The treatment is unbalanced over x, so you should adjust for x in your model. Also in general it would be a good idea to include the interaction of x and z, but that's another story.
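
Here's a sketch of what that interaction model could look like (my addition; the variable names are just illustrative). With an interaction, the coefficient on z is the treatment effect at x = 0, so it's convenient to center x first; then the coefficient on z from the centered fit estimates the average treatment effect:

library("arm")
n <- 100
x <- runif(n, -1, 1)
z <- rbinom(n, 1, invlogit(x))
y <- 0.2 + 0.3*x + 0.5*z + rnorm(n, 0, 0.4)
fake <- data.frame(x, y, z)
fake$x_c <- fake$x - mean(fake$x)            # center the pre-test at its sample mean
fit_2 <- glm(y ~ x_c + z + x_c:z, data=fake) # allow the treatment effect to vary with x
print(coef(fit_2)["z"], digits=2)            # treatment effect at the average pre-test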

What happens if you ignore the imbalance and don't adjust for x? Then you'll get a bias, as you can see from a simulation:

library("arm")
n_loop <- 1000
estimate <- rep(NA, n_loop)
for (loop in 1:n_loop) {
  n <- 100
  x <- runif(n, -1, 1)
  z <- rbinom(n, 1, invlogit(x))
  y <- 0.2 + 0.3*x + 0.5*z + rnorm(n, 0, 0.4)
  fake <- data.frame(x, y, z)
  fit <- glm(y ~ z, data=fake)  # Regression does not include x
  estimate[loop] <- coef(fit)["z"]
}
print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=2)

The result:

[1] 0.5994 0.0027

There's a big bias here--the true value of the coefficient is 0.5. In this case, the treated units are more likely to have high values of x, and x is positively correlated with y, so if you don't adjust, you'll get that positive bias.
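
To see that the size of that bias makes sense, here's a back-of-the-envelope check (again my addition): in the unadjusted regression, the coefficient on z absorbs the coefficient on x, 0.3, times the difference in average pre-test between treated and control groups. Simulating one large dataset to pin down that difference:

library("arm")
n_big <- 1e6
x <- runif(n_big, -1, 1)
z <- rbinom(n_big, 1, invlogit(x))
imbalance <- mean(x[z==1]) - mean(x[z==0])   # average pre-test gap, treated minus control
print(c(imbalance, 0.3*imbalance), digits=2)

The imbalance comes out at roughly 0.3, so the omitted-variable bias should be roughly 0.3 times 0.3, or about 0.1, which matches the gap between the simulation estimate of about 0.60 and the true effect of 0.50.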

But, I explained to the students, once you do this adjustment, your inference becomes sensitive to the form of the adjustment. In the above simulation, the true model of y given x and z is linear (with independent errors), and then we fit a linear model, so all is good. But in real life the data won't come from an underlying linear model, so some bias will remain after adjusting for x.

One of the students didn't believe me--he had understood that, once you adjust for x in least squares regression, there'd be no bias. I said, no, bias will remain. Once you have imbalance, your adjustment is sensitive to the model.

He still didn't believe me, so I said, Hey, I'll show you in the code! I'll just take the original bit of code above and change x to x^2 in the simulation of y. Now the underlying model is no longer linear, but we'll still do a linear adjustment, and there should be a bias.

Here goes:

library("arm")
n_loop <- 1000
estimate <- rep(NA, n_loop)
for (loop in 1:n_loop) {
  n <- 100
  x <- runif(n, -1, 1)
  z <- rbinom(n, 1, invlogit(x))
  y <- 0.2 + 0.3*x^2 + 0.5*z + rnorm(n, 0, 0.4)   # Replaced x with x^2
  fake <- data.frame(x, y, z)
  fit <- glm(y ~ x + z, data=fake)
  estimate[loop] <- coef(fit)["z"]
}
print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=2)

And here's what happens:

[1] 0.4984 0.0028

Damn! There's no bias. Or, at least, no detectable bias. If there is a bias, it's so small as to not matter in any realistic applied context.

What happened?

A student suggested that the problem is that x is centered at zero, so the quadratic term will be estimated as a zero linear slope, and the errors would cancel out. I didn't follow all the details of that reasoning, but I could believe it was possible.
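
Here's a quick check of that reasoning (my addition, not something we did in class), using a large simulated dataset so the correlations are well estimated:

library("arm")
n_big <- 1e6
x <- runif(n_big, -1, 1)
z <- rbinom(n_big, 1, invlogit(x))
z_resid <- resid(lm(z ~ x))                  # the part of z not explained linearly by x
print(c(cor(x, x^2), cor(z_resid, x^2)), digits=2)

Both correlations come out essentially zero: with x uniform on (-1, 1), the quadratic term is orthogonal to a linear function of x, and by symmetry it's also essentially unrelated to the leftover variation in z. There's not much left for the linear adjustment to get wrong.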

So let's center the quadratic not at zero:

library("arm")
n_loop <- 1000
estimate <- rep(NA, n_loop)
for (loop in 1:n_loop) {
  n <- 100
  x <- runif(n, -1, 1)
  z <- rbinom(n, 1, invlogit(x))
  y <- 0.2 + 0.3*(x+1)^2 + 0.5*z + rnorm(n, 0, 0.4)   # Replaced x with (x+1)^2
  fake <- data.frame(x, y, z)
  fit <- glm(y ~ x + z, data=fake)
  estimate[loop] <- coef(fit)["z"]
}
print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=2)

Which yields:

[1] 0.5034 0.0028

Nope.

What about a cubic? I'm flailing here, gotta get an underlying model that's far enough away from linear to show that bias:

library("arm")
n_loop <- 1000
estimate <- rep(NA, n_loop)
for (loop in 1:n_loop) {
  n <- 100
  x <- runif(n, -1, 1)
  z <- rbinom(n, 1, invlogit(x))
  y <- 0.2 + 0.3*x^3 + 0.5*z + rnorm(n, 0, 0.4)   # Replaced x with x^3
  fake <- data.frame(x, y, z)
  fit <- glm(y ~ x + z, data=fake)
  estimate[loop] <- coef(fit)["z"]
}
print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=2)

Which yields:

[1] 0.5003 0.0026

Snake eyes again!

Wait--here's a thought! That x^3 term has a coefficient of just 0.3, so the nonlinearity is small and so is any resulting bias. What if we make it larger?

library("arm")
n_loop <- 1000
estimate <- rep(NA, n_loop)
for (loop in 1:n_loop) {
  n <- 100
  x <- runif(n, -1, 1)
  z <- rbinom(n, 1, invlogit(x))
  y <- 0.2 + 10*x^3 + 0.5*z + rnorm(n, 0, 0.4)   # Replaced 0.3*x with 10*x^3
  fake <- data.frame(x, y, z)
  fit <- glm(y ~ x + z, data=fake)
  estimate[loop] <- coef(fit)["z"]
}
print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=2)

Here we get:

[1] 0.47 0.01

Hmmm, this is looking better. Let's give it another decimal place (by the way, I have no idea why R gave my output to 4 digits earlier but now it's only giving 2. I was setting digits=2 the whole time!). Here goes:

print(c(mean(estimate), sd(estimate)/sqrt(n_loop)), digits=3)

And here's the result:

[1] 0.4734 0.0102

Oh, it's that annoying auto-rounding thing that R does. Anyway, yeah, this looks like a real bias.

Just to be sure, I'll redo it, setting n_loop <- 10^4, which takes a few seconds but then satisfyingly produces this clean result:

[1] 0.4749 0.0032

That's about 8 standard errors below the true value of 0.5 . . . yeah, that's a real bias!
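
For anyone who wants that check spelled out (my addition), the estimated bias and its z-score can be computed directly from the saved estimates:

bias_hat <- mean(estimate) - 0.5                    # estimated bias relative to the true effect of 0.5
z_score <- bias_hat / (sd(estimate)/sqrt(n_loop))   # bias in units of its Monte Carlo standard error
print(c(bias_hat, z_score), digits=2)

With the numbers reported above, that's a bias of about -0.025, roughly 8 Monte Carlo standard errors below zero.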

Some lessons from this example

1. If you have imbalance on pre-treatment variables and you adjust for them, that's fine, but your estimate will now be sensitive to your adjustment model.

2. But the sensitivity isn't as strong as I thought! It took a lot of work to produce a noticeable bias. I'm sure that with a little bit of thought I could tweak the example to make the bias much bigger, but I think it counts for something that my first ideas didn't work.

3. You can learn a lot from simulation.

4. There's a reason I like to prepare computer demonstrations ahead of time (as in our Active Statistics book). If you try to do it live, you can get into all sorts of tangles. That said, the occasional tangle can be informative to students. You need a balance.

5. The fact that computer demonstrations can be so brittle is itself an interesting lesson. When we teach statistics with clean-data examples, there's a risk that students will miss the importance of data preparation. Similarly, when we use prepared, well-oiled demonstrations, we're not fully conveying the challenges of learning through experimentation.