Thinking about covariates in an analysis of an RCT
R-bloggers 2025-01-28
I was recently discussing the analytic plan for a randomized controlled trial (RCT) with a clinical collaborator when she asked whether it's appropriate to adjust for pre-specified baseline covariates. This question is so interesting because it touches on fundamental issues of inference—both causal and statistical. What is the target estimand in an RCT—that is, what effect are we actually measuring? What do we hope to learn from the specific sample recruited for the trial (i.e., how can the findings be analyzed in a way that enhances generalizability)? What underlying assumptions about replicability, resampling, and uncertainty inform the arguments for and against covariate adjustment? These are big questions that won't necessarily be answered here, but they need to be kept in mind when thinking about the merits of covariate adjustment.
Some researchers resist covariate adjustment in the primary analysis, concerned that it might complicate interpretability or limit transparency. Others might like the straightforward clarity and simplicity of the randomized comparison. But perhaps the biggest issue that people have with covariate adjustment is a longstanding concern that flexible modeling could turn into a fishing expedition—searching for covariates that yield the most favorable effect estimate.
After that conversation with my colleague, I revisited a 1994 paper by Stephen Senn, which argues that rather than checking for chance covariate imbalances before making adjustments, “the practical statistician will do well to establish beforehand a limited list of covariates deemed useful and fit them regardless. Such a strategy will usually lead to a gain in power, has no adverse effect on unconditional size and controls conditional size with respect to the covariates identified.” A subsequent paper by Pocock et al. reinforces this perspective. Although they note that “experience shows that for most clinical trials, analyses which adjust for baseline covariates are in close agreement with the simpler unadjusted treatment comparisons”, they argue adjusting for covariates can be justified if it helps: (1) achieve the most appropriate p-value for treatment differences, (2) provide unbiased estimates, and (3) improve precision.
Motivated by Pocock et al., I created some simulations to explore the operating characteristics of covariate adjustment, which I'm sharing here. I've been distracted more recently with paper writing and manuscript editing, so I am happy to get back to a little R coding.
Simulations
To get things started, here are the R packages used in the simulations.
library(simstudy)
library(data.table)
library(stargazer)
library(parallel)
Data definitions
I am using two sets of data definitions here, splitting up the creation of baseline covariates (\(x_1\) and \(x_2\)) and group assignment from the outcome \(y\). We are assuming that \(y\) has a Gaussian (normal) distribution.
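In terms of a model, the definitions below amount to

\[
x_{1i} \sim N(0, s_{x_1}^2), \qquad x_{2i} \sim N(0, s_{x_2}^2), \qquad A_i \in \{0, 1\} \text{ with 1:1 randomization,}
\]

\[
y_i = 5 + \delta A_i + b_1 x_{1i} + b_2 x_{2i} + e_i, \qquad e_i \sim N(0, s_y^2).
\]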
def_c <- defData(varname = "x1", formula = 0, variance = "..s_x1^2", dist = "normal") |>
  defData(varname = "x2", formula = 0, variance = "..s_x2^2", dist = "normal") |>
  defData(varname = "A", formula = "1;1", dist = "trtAssign")

def_y <- defDataAdd(
  varname = "y",
  formula = "5 + ..delta * A + ..b1 * x1 + ..b2 * x2",
  variance = "..s_y^2",
  dist = "normal"
)
Initial parameters
Here are the parameters used in the data generation. \(x_1\) is highly correlated with the outcome \(y\), whereas \(x_2\) is not. In the first set of simulations, we are assuming a true effect size \(\delta = 5\).
s_x1 <- 8
s_x2 <- 9
s_y  <- 12

b1 <- 1.50
b2 <- 0.0

delta <- 5
Single data set generation
L’Ecuyer’s Combined Multiple Recursive Generator (CMRG) random number generator is used here because the replication of the data sets (and the analyses) is done in parallel to speed things up a bit.
RNGkind("L'Ecuyer-CMRG")
set.seed(55)

dc <- genData(250, def_c)
dd <- addColumns(def_y, dc)

head(dd)
## Key: <id>
##       id         x1          x2     A          y
##    <int>      <num>       <num> <int>      <num>
## 1:     1  3.4075117 -1.66327988     0  8.1461927
## 2:     2 -0.3040474 -5.60073657     0 -1.7577859
## 3:     3 -4.4460516  1.20189340     1 -1.6324336
## 4:     4  0.6834332  0.09974478     1  9.5938736
## 5:     5 -4.6324773  7.85745373     0 -0.4782144
## 6:     6 -6.5650815  0.49812462     0 -3.4082489
For this single data set, we can see that the means for \(x_1\) within each group are slightly different, while the means for \(x_2\) are more similar.
dd[, .(mu_x1 = mean(x1), mu_x2 = mean(x2)), keyby = A]
## Key: <A>
##        A      mu_x1       mu_x2
##    <int>      <num>       <num>
## 1:     0  0.3960784 -0.36510184
## 2:     1 -0.4821337  0.08564994
These differences are confirmed by calculating the standardized imbalance \(Z_x\) and the standardized difference \(d\). (The difference between \(Z_x\) and \(d\) is that \(Z_x\) has an adjustment for the group sample sizes.)
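Written out explicitly (mirroring the calc_diff function below), with \(\bar{x}_1\) and \(\bar{x}_0\) the covariate means in the treatment and control groups, \(s_p\) the pooled standard deviation, and \(n_1\) and \(n_0\) the group sizes:

\[
Z_x = \frac{\bar{x}_1 - \bar{x}_0}{s_p \sqrt{1/n_0 + 1/n_1}}, \qquad d = \frac{\bar{x}_1 - \bar{x}_0}{s_p}
\]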
calc_diff <- function(dx, rx, v) {
  
  mean_diff <- dx[get(rx) == 1, mean(get(v))] - dx[get(rx) == 0, mean(get(v))]
  
  s_pooled <- sqrt(
    (dx[get(rx) == 1, (.N - 1) * var(get(v))] + 
       dx[get(rx) == 0, (.N - 1) * var(get(v))]) / dx[, .N - 2]
  )
  
  Z_x <- mean_diff / (s_pooled * sqrt(1/dx[get(rx) == 0, .N] + 1/dx[get(rx) == 1, .N]))
  d <- mean_diff / s_pooled
  
  return(list(Z_x = Z_x, d = d))
}

calc_diff(dd, "A", "x1")
## $Z_x
## [1] -0.9424664
## 
## $d
## [1] -0.1192136

calc_diff(dd, "A", "x2")
## $Z_x
## [1] 0.3862489
## 
## $d
## [1] 0.04885705
As designed, \(x_1\) is strongly correlated with the outcome \(y\), whereas \(x_2\) is not.
dd[, .(rho_x1.y = cor(x1, y), rho_x2.y = cor(x2, y))]
##     rho_x1.y     rho_x2.y
##        <num>        <num>
## 1: 0.6807215 -0.003174757
Model estimation
We fit four models to this data: (1) no adjustment for the covariates, (2) adjusting for \(x_1\) alone, (3) adjusting for \(x_2\) alone, and (4) adjusting for both covariates.
model_1 <- lm(data = dd, formula = y ~ A)
model_2 <- lm(data = dd, formula = y ~ A + x1)
model_3 <- lm(data = dd, formula = y ~ A + x2)
model_4 <- lm(data = dd, formula = y ~ A + x1 + x2)
Two key takeaways from this single data set are that (1) since \(x_1\) is an (albeit weak) confounder in this particular sample, failing to adjust for it leads to an underestimate of the treatment effect (due to the small negative correlation between \(x_1\) and \(A\)), and (2) since \(x_1\) is so highly correlated with \(y\), the models that adjust for \(x_1\) have lower standard errors for the treatment effect estimate (around 2.0 for models 1 and 3, and closer to 1.5 for models 2 and 4).
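The side-by-side comparison below is consistent with output from the stargazer package loaded earlier; the exact formatting options used for the original table are not shown, but a minimal call along these lines produces a similar layout:

# A minimal stargazer call (the specific formatting options are assumptions);
# type = "text" prints a plain-text table comparing the four fitted models.
stargazer(model_1, model_2, model_3, model_4, type = "text")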
## 
## ============================================
##               (1)        (2)       (3)       (4)    
## --------------------------------------------
## A            3.040     4.367***   3.044    4.367*** 
##             (2.039)    (1.480)   (2.044)   (1.484)  
##                                                     
## x1                     1.512***            1.512*** 
##                        (0.101)             (0.101)  
##                                                     
## x2                               -0.010     0.001   
##                                  (0.111)   (0.081)  
##                                                     
## Constant    6.237***   5.638***  6.234***  5.639*** 
##             (1.442)    (1.046)   (1.446)   (1.048)  
##                                                     
## ============================================
Operating characteristics (based on replicated data sets)
In order to understand the relative merits of the different modeling approaches, we need to replicate multiple data sets under the same set of assumptions used to generate the single data set. We will generate 2000 data sets and estimate all four models for each data set. For each replication, we use the function est_ancova to calculate a one-sided p-value. We will keep track of the point estimate, the standard error estimate, and the p-value for each iteration.
est_ancova <- function(dx, vars) {
  
  formula <- as.formula(paste("y ~", paste(vars, collapse = " + ")))
  model <- lm(data = dx, formula = formula)
  
  coef_summary <- summary(model)$coefficients["A", ]
  t_stat <- coef_summary["t value"]
  p_value <- pt(t_stat, df = model$df.residual, lower.tail = FALSE)
  
  ests <- data.table(t(coef_summary[1:2]), p_value)
  setnames(ests, c("est", "se", "pval"))
  
  return(ests)
}

replicate <- function() {
  
  dc <- genData(250, def_c)
  dd <- addColumns(def_y, dc)
  
  est_1 <- est_ancova(dd, vars = "A")
  est_2 <- est_ancova(dd, vars = c("A", "x1"))
  est_3 <- est_ancova(dd, vars = c("A", "x2"))
  est_4 <- est_ancova(dd, vars = c("A", "x1", "x2"))
  
  return(list(est_1 = est_1, est_2 = est_2, est_3 = est_3, est_4 = est_4))
}

res <- mclapply(1:2000, function(x) replicate())
All four models yield relatively unbiased estimates, though the models that adjust for \(x_1\) (the potential confounder) result in reduced bias relative to those that do not. However, the clear advantage of models 2 and 4 (those that adjust for \(x_1\)) is the reduced variance of the treatment effect estimator:
get.field <- function(x, field) {
  data.table(t(sapply(x, function(x) x[[field]])))
}

ests <- rbindlist(lapply(res, function(x) get.field(x, "est")))
sapply(ests, function(x) c(bias = mean(x) - delta, var = var(x)))
##            est_1       est_2       est_3         est_4
## bias -0.05513569 -0.00157402 -0.05263476 -0.0002470811
## var   4.51844704  2.33883113  4.55788040  2.3507163302
The reduction in variance translates directly to increased power for the models that adjust for \(x_1\), from about 63% to 90%. This seems like a pretty good reason to adjust for a baseline covariate that (a) you collect, and (b) is highly correlated with the outcome.
pvals <- rbindlist(lapply(res, function(x) get.field(x, "pval")))
sapply(pvals, function(x) c(mean(x < 0.025)))
##  est_1  est_2  est_3  est_4 
## 0.6255 0.9055 0.6275 0.9035
Exploring Type 1 error rates
The flip side of statistical power is the Type 1 error rate: the probability of concluding that there is a treatment effect when in fact there is none. We can assess this by setting \(\delta = 0\) and running another large set of replications. If we do this 2,000 times, generating a completely new data set each time, we see that the observed error rates are close to 0.025 for all the models, though the models that adjust for \(x_1\) are closer to the theoretical value.
delta <- 0

res <- mclapply(1:2000, function(x) replicate())

pvals <- rbindlist(lapply(res, function(x) get.field(x, "pval")))
sapply(pvals, function(x) c(mean(x < 0.025)))
##  est_1  est_2  est_3  est_4 
## 0.0180 0.0265 0.0185 0.0285
Both Senn and Pocock et al. suggest that a key advantage of adjusting for baseline covariates is that it helps achieve the desired error rates, particularly for one-sided tests. Assuming that all possible RCTs are conducted with the same level of covariate imbalance, models that include baseline covariate adjustments will yield accurate error rates. In contrast, models that do not adjust for important (highly correlated) covariates will produce deflated error rates. This occurs primarily because the standard errors of the effect estimates are systematically overestimated, reducing the likelihood of ever rejecting the null hypothesis.
To mimic the requirement that the dataset is sampled conditional on a fixed level of covariate imbalance, we generate the baseline covariates and treatment assignment only once, while the outcome is generated anew for each dataset. Under this approach, covariates and treatment assignment are fixed and only the outcome for a particular unit varies across iterations. An alternative approach would be to generate a large number of datasets using the full randomization process—creating new covariate values, treatment assignments, and outcomes for each iteration. Data sets would only be analyzed if they match the pre-specified covariate imbalance level. Although this approach yields the same results as our chosen method (I confirmed with simulations not shown here), the sampling process appears overly artificial, further complicating the interpretation of the p-value.
replicate_2 <- function() {
  
  dd <- addColumns(def_y, dc)
  
  est_1 <- est_ancova(dd, vars = "A")
  est_2 <- est_ancova(dd, vars = c("A", "x1"))
  est_3 <- est_ancova(dd, vars = c("A", "x2"))
  est_4 <- est_ancova(dd, vars = c("A", "x1", "x2"))
  
  return(list(est_1 = est_1, est_2 = est_2, est_3 = est_3, est_4 = est_4))
}

dc <- genData(250, def_c)
res <- mclapply(1:2000, function(x) replicate_2())

pvals <- rbindlist(lapply(res, function(x) get.field(x, "pval")))
sapply(pvals, function(x) c(mean(x < 0.025)))
##  est_1  est_2  est_3  est_4 
## 0.0075 0.0285 0.0070 0.0280
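For reference, here is a rough sketch of what the alternative rejection-sampling scheme described above could look like. The replicate_3 function, the target_Z value, and the 0.05 tolerance are my own illustrative assumptions, not the unshown simulations:

# Hypothetical sketch of the rejection-sampling approach: regenerate covariates,
# treatment assignment, and outcomes each time, but keep a data set only if its
# standardized imbalance in x1 is close to the value observed in the fixed dc.
# The tolerance of 0.05 is arbitrary.
target_Z <- calc_diff(dc, "A", "x1")$Z_x

replicate_3 <- function(target, tol = 0.05) {
  repeat {
    dc_new <- genData(250, def_c)
    if (abs(calc_diff(dc_new, "A", "x1")$Z_x - target) < tol) break
  }
  dd <- addColumns(def_y, dc_new)
  est_ancova(dd, vars = "A")  # unadjusted model only, for brevity
}

# res <- mclapply(1:2000, function(x) replicate_3(target_Z))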
Causal inference methods for balancing
To me, the strongest argument against adjusting for baseline covariates in the analysis is the risk that investigators may appear overly eager to demonstrate the intervention's success. Pre-specifying the analysis plan goes a long way toward alleviating such concerns. Additionally, alternative approaches from causal inference can further reduce reliance on outcome model assumptions. In particular, balancing methods such as inverse probability weighting (IPW) and overlap weights (OW) can address covariate imbalances while preserving the original estimand. These techniques re-weight the sample to create balanced pseudo-populations without directly modifying the outcome model, offering a viable alternative to regression-based adjustment. They have the advantage of separating the design model from the outcome model (since the exposure and outcome models are two distinct steps), and the balancing can be done before looking at the outcome data, so there is no risk of fishing for results. I plan on sharing simulations using these approaches sometime in the future.
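As a preview of those balancing approaches, here is a minimal IPW sketch applied to the single simulated data set, assuming a logistic propensity model for A based on both covariates. This is purely illustrative, not the planned simulation study, and the standard errors from lm() would need a robust (sandwich) or bootstrap correction to be valid:

# Illustrative IPW sketch (an assumption, not the planned analysis): estimate
# the propensity of treatment from the baseline covariates, form inverse
# probability weights, and fit a weighted outcome model for the marginal effect.
ps_model <- glm(A ~ x1 + x2, data = dd, family = binomial)
dd[, ps := predict(ps_model, type = "response")]
dd[, ipw := A / ps + (1 - A) / (1 - ps)]

lm(y ~ A, data = dd, weights = ipw)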
References:
Senn, Stephen. “Testing for baseline balance in clinical trials.” Statistics in Medicine 13, no. 17 (1994): 1715-1726.

Pocock, Stuart J., Susan E. Assmann, Laura E. Enos, and Linda E. Kasten. “Subgroup analysis, covariate adjustment and baseline comparisons in clinical trial reporting: current practice and problems.” Statistics in Medicine 21, no. 19 (2002): 2917-2930.