Testing Super Learner’s Coverage – A Note To Myself
2026-01-04
Testing Super Learner with TMLE showed some interesting patterns. XGBoost + random forest only hit ~54% coverage, but tuned XGBoost + GLM reached ~90%. Pairing flexible learners with stable (even misspecified) models seems to help; I need to explore this more with different setups.

Motivations:
We previously looked at TMLE without Super Learner and observed the poor coverage of TMLE with an untuned xgboost compared to a correctly specified logistic regression. Now that we've learnt about NNLS, the default meta-learning method behind Super Learner, why don't we test it out and take a look at the coverage? Let's take a look at our previous coverage plot.
We saw that with xgboost and TMLE, our coverage was about 71.1-87.6%, and the lower and upper tails were asymmetrical. What if we use Super Learner? Can we improve this coverage?
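As a quick refresher before we start, here is a minimal sketch of what that default NNLS meta-learning step does conceptually. The data below are made up and this is not SuperLearner's internal code; it just shows the non-negative least squares fit (via the nnls package) on cross-validated predictions, followed by rescaling the weights to sum to 1.

```r
# Minimal sketch of the idea behind SuperLearner's default meta-learner
# (method.NNLS): regress the observed outcome on each candidate learner's
# cross-validated predictions under a non-negativity constraint, then
# rescale the weights so they sum to 1.
library(nnls)

set.seed(1)
y <- rbinom(200, 1, 0.4)           # toy binary outcome
Z <- cbind(                        # toy cross-validated predictions from 3 learners
  learner_a = plogis(rnorm(200, qlogis(0.4), 0.5)),
  learner_b = plogis(rnorm(200, qlogis(0.4), 0.5)),
  learner_c = rep(mean(y), 200)
)

fit <- nnls(A = Z, b = y)          # non-negative least squares
weights <- fit$x / sum(fit$x)      # normalized ensemble weights
setNames(round(weights, 3), colnames(Z))
```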
Objectives
Test whether using Super Learner to estimate the propensity score and outcome regression improves TMLE's 95% CI coverage of the true ATE, and compare the coverage and runtime of several learner combinations.
Let’s Code
We will be using the same data generating mechanism as before, but this time we will use Super Learner to estimate the propensity score and the outcome regression. Let's write code that uses multicore processing, runs on a cheap second-hand Lenovo with a quad-core CPU and Ubuntu, and stores the results. Part of my learning goal for 2026 is to experiment more with parallel computing and see if we can explore mathematical platonic space more using simulation! Stay tuned, more to come on the simple parallel computing cluster setup!
```r
library(dplyr)
library(SuperLearner)
library(future)
library(future.apply)

# Set up parallel processing
plan(multicore, workers = 4)

set.seed(1)
n <- 10000
W1 <- rnorm(n)
W2 <- rnorm(n)
W3 <- rbinom(n, 1, 0.5)
W4 <- rnorm(n)

# TRUE propensity score model
A <- rbinom(n, 1, plogis(-0.5 + 0.8*W1 + 0.5*W2^2 + 0.3*W3 - 0.4*W1*W2 + 0.2*W4))

# TRUE outcome model
Y <- rbinom(n, 1, plogis(-1 + 0.2*A + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2))

# Calculate TRUE ATE
logit_Y1 <- -1 + 0.2 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
logit_Y0 <- -1 + 0 + 0.6*W1 - 0.4*W2^2 + 0.5*W3 + 0.3*W1*W3 + 0.2*W4^2
Y1_true <- plogis(logit_Y1)
Y0_true <- plogis(logit_Y0)
true_ATE <- mean(Y1_true - Y0_true)

df <- tibble(W1 = W1, W2 = W2, W3 = W3, W4 = W4, A = A, Y = Y)

# Tuning grid for xgboost
tune <- list(
  ntrees = c(500, 1000),      # More trees for better performance
  max_depth = c(3, 5),        # Deeper trees capture interactions
  shrinkage = c(0.001, 0.01)  # Finer learning rate control
)
learners <- create.Learner("SL.xgboost", tune = tune,
                           detailed_names = TRUE, name_prefix = "xgb")

# Super Learner libraries to compare
SL_library <- list(
  c("SL.xgboost", "SL.ranger", "SL.glm", "SL.mean"),
  c("SL.xgboost", "SL.ranger"),
  c("SL.xgboost", "SL.glm"),
  list("SL.ranger", c("SL.xgboost", "screen.glmnet")),
  c(learners$names, "SL.glm")
)

# Simulation settings: number of iterations and per-iteration sample size
n_sample <- 1000
n_i <- 6000

# Function to run one TMLE iteration
run_tmle_iteration <- function(seed_val, df, n_i, SL_library) {
  set.seed(seed_val)
  data <- slice_sample(df, n = n_i, replace = TRUE) |> select(Y, A, W1:W4)

  # Prepare data
  X_outcome <- data |> select(A, W1:W4) |> as.data.frame()
  X_treatment <- data |> select(W1:W4) |> as.data.frame()
  Y_vec <- data$Y
  A_vec <- data$A

  # Outcome model
  SL_outcome <- SuperLearner(
    Y = Y_vec, X = X_outcome, family = binomial(),
    SL.library = SL_library, cvControl = list(V = 5)
  )

  # Initial predictions
  outcome <- predict(SL_outcome, newdata = X_outcome)$pred

  # Predict under treatment A = 1
  X_outcome_1 <- X_outcome |> mutate(A = 1)
  outcome_1 <- predict(SL_outcome, newdata = X_outcome_1)$pred

  # Predict under treatment A = 0
  X_outcome_0 <- X_outcome |> mutate(A = 0)
  outcome_0 <- predict(SL_outcome, newdata = X_outcome_0)$pred

  # Bound outcome predictions to avoid qlogis issues
  outcome <- pmax(pmin(outcome, 0.9999), 0.0001)
  outcome_1 <- pmax(pmin(outcome_1, 0.9999), 0.0001)
  outcome_0 <- pmax(pmin(outcome_0, 0.9999), 0.0001)

  # Treatment model
  SL_treatment <- SuperLearner(
    Y = A_vec, X = X_treatment, family = binomial(),
    SL.library = SL_library, cvControl = list(V = 5)
  )

  # Propensity scores
  ps <- predict(SL_treatment, newdata = X_treatment)$pred

  # Truncate propensity scores
  ps_final <- pmax(pmin(ps, 0.95), 0.05)

  # Calculate clever covariates
  a_1 <- 1 / ps_final
  a_0 <- -1 / (1 - ps_final)
  clever_covariate <- ifelse(A_vec == 1, 1 / ps_final, -1 / (1 - ps_final))

  # Fluctuation step
  epsilon_model <- glm(Y_vec ~ -1 + offset(qlogis(outcome)) + clever_covariate,
                       family = "binomial")
  epsilon <- coef(epsilon_model)

  updated_outcome_1 <- plogis(qlogis(outcome_1) + epsilon * a_1)
  updated_outcome_0 <- plogis(qlogis(outcome_0) + epsilon * a_0)

  # Calc ATE
  ate <- mean(updated_outcome_1 - updated_outcome_0)

  # Calc SE from the influence curve
  updated_outcome <- ifelse(A_vec == 1, updated_outcome_1, updated_outcome_0)
  se <- sqrt(var((Y_vec - updated_outcome) * clever_covariate +
                   updated_outcome_1 - updated_outcome_0 - ate) / n_i)

  return(list(ate = ate, se = se))
}

# Run iterations in parallel, once per Super Learner library
for (num in 1:length(SL_library)) {
  cat("TMLE iterations in parallel with 4 workers (multicore)...\n")
  start_time <- Sys.time()

  results_list <- future_lapply(1:n_sample, function(i) {
    result <- run_tmle_iteration(i, df, n_i, SL_library[[num]])
    if (i %% 100 == 0) cat("Completed iteration:", i, "\n")
    return(result)
  }, future.seed = TRUE)

  end_time <- Sys.time()
  run_time <- end_time - start_time

  # Extract results
  predicted_ate <- sapply(results_list, function(x) x$ate)
  pred_se <- sapply(results_list, function(x) x$se)

  # Results
  results <- tibble(
    iteration = 1:n_sample,
    ate = predicted_ate,
    se = pred_se,
    ci_lower = ate - 1.96 * se,
    ci_upper = ate + 1.96 * se,
    covers_truth = true_ATE >= ci_lower & true_ATE <= ci_upper
  )

  # Summary stats
  summary_stats <- tibble(
    metric = c("true_ATE", "mean_estimated_ATE", "median_estimated_ATE",
               "sd_estimates", "mean_SE", "coverage_probability", "bias"),
    value = c(
      true_ATE,
      mean(predicted_ate),
      median(predicted_ate),
      sd(predicted_ate),
      mean(pred_se),
      mean(results$covers_truth),
      mean(predicted_ate) - true_ATE
    )
  )

  # Create output directory if it doesn't exist
  if (!dir.exists("tmle_results")) {
    dir.create("tmle_results")
  }

  # Save detailed results (all iterations)
  write.csv(results, paste0("tmle_results/tmle_iterations", num, ".csv"),
            row.names = FALSE)

  # Save summary statistics
  write.csv(summary_stats, paste0("tmle_results/tmle_summary", num, ".csv"),
            row.names = FALSE)

  # Save simulation parameters
  sim_params <- tibble(
    parameter = c("n_population", "n_sample_iterations", "n_bootstrap_size",
                  "SL_library", "n_workers", "runtime_seconds"),
    value = c(n, n_sample, n_i,
              paste(SL_library[[num]], collapse = ", "),
              4, as.numeric(run_time, units = "secs"))
  )
  write.csv(sim_params, paste0("tmle_results/simulation_parameters", num, ".csv"),
            row.names = FALSE)

  # Save as RData for easy loading
  save(results, summary_stats, sim_params, true_ATE,
       file = paste0("tmle_results/tmle_results", num, ".RData"))
}  # end of loop over Super Learner libraries
```

As before, we set up our true ATE and the simulation: each iteration samples 60% of the N = 10,000 population (n = 6,000, sampled with replacement), and we run 1,000 such bootstrap iterations per learner library. This time, instead of a solo ML learner, we use Super Learner to ensemble several models, estimate the ATE, and check whether the 95% CI covers the true ATE. Below are the combinations we will investigate (a small sketch for pulling the saved coverage summaries back into one table follows the list):
- xgboost + randomforest + glm + mean
- xgboost + randomforest
- xgboost + glm
- randomforest + (xgboost + screen.glmnet) + glm
- xgboost_tuned + glm
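Once the loop above has finished and written its tmle_summary<num>.csv files, something like the following sketch can collect the coverage and bias of each combination into one table for comparison. This is just a convenience snippet, not part of the simulation itself.

```r
# Collect coverage_probability and bias from each saved summary file
# (assumes the tmle_results/ directory produced by the loop above exists).
library(dplyr)

summary_files <- list.files("tmle_results",
                            pattern = "^tmle_summary[0-9]+\\.csv$",
                            full.names = TRUE)

coverage_tbl <- bind_rows(lapply(summary_files, function(f) {
  s <- read.csv(f)
  tibble(
    file     = basename(f),
    coverage = s$value[s$metric == "coverage_probability"],
    bias     = s$value[s$metric == "bias"]
  )
}))

coverage_tbl
```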
What’s The Result

Wow, these are very interesting results! Looking at 1), where we have xgboost, randomforest, glm, and mean, our coverage was pretty poor at ~50%. For 3), xgboost and glm, coverage increased to ~80%, very similar to our solo tuned xgboost previously, but at least with symmetrical non-coverage, which is very interesting. But when we combined xgboost and randomforest in 2), the coverage dropped back to ~50%. The same thing happened in 4), where we used glmnet screening before xgboost and coupled it with randomforest: coverage was only ~50%. But when we 5) tuned xgboost and combined it with glm, coverage increased to ~90%! I was not expecting this result, but it is quite amazing! Here is how long each combination took on a quad-core processor with the multicore plan:
| method | time_hours |
|---|---|
| SL.xgboost, SL.ranger, SL.glm, SL.mean | 4.02 |
| SL.xgboost, SL.ranger | 4.00 |
| SL.xgboost, SL.glm | 0.47 |
| SL.ranger, SL.xgboost + screen.glmnet | 4.23 |
| tuned SL.xgboost, SL.glm | 3.29 |

Opportunities for improvement
- attempt using optimized hyperparameters for both xgboost and randomforest
- write a function for splitting multicore to several nodes
- try using NNloglik, or maximizing on AUC, instead of NNLS; would it work better? (a rough sketch of swapping the method follows this list)
- need to test whether `set.seed` works differently with multicore. Does `future` automatically choose the L'Ecuyer-CMRG algorithm? Does our current method of `set.seed(i)` per iteration work OK?
- need to try `mcSuperLearner`, the built-in multicore option (also sketched below)
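For the first and last items, here is a rough sketch of what I have in mind. It was not part of the run above; `df` from the simulation code is assumed to be in scope, and I'm using just the xgboost + glm pair to keep it short. The L'Ecuyer-CMRG seeding is what I understand the SuperLearner documentation recommends for `mcSuperLearner`, but that is exactly one of the things I still want to verify.

```r
library(SuperLearner)
library(dplyr)

# Rebuild a small design matrix from the simulated data frame `df`
X_outcome <- df |> select(A, W1:W4) |> as.data.frame()
Y_vec <- df$Y

# 1) Swap the meta-learner: NNloglik (or method.AUC) instead of the default NNLS
sl_nnloglik <- SuperLearner(
  Y = Y_vec, X = X_outcome, family = binomial(),
  SL.library = c("SL.xgboost", "SL.glm"),
  method = "method.NNloglik",
  cvControl = list(V = 5)
)

# 2) Built-in multicore via mcSuperLearner; L'Ecuyer-CMRG RNG for
#    reproducibility across forked workers (to be double-checked)
set.seed(1, kind = "L'Ecuyer-CMRG")
sl_mc <- mcSuperLearner(
  Y = Y_vec, X = X_outcome, family = binomial(),
  SL.library = c("SL.xgboost", "SL.glm"),
  cvControl = list(V = 5)
)
```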
Lessons Learnt
- learnt `future_lapply`
- learnt `multicore` is so much faster!!! Though when we `future_lapply` the function, it appears to take about the same time as `multisession`
- tested Super Learner; it's pretty cool! It already has options for parallel computing
- found out that mixing a flexible model like xgboost (tuned) with logistic regression regularizes the ensembled model (a quick check of the ensemble weights is sketched after this list)
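On that last point, one quick way to see the regularization at work is to look at the cross-validated risks and NNLS weights of a fitted ensemble. This is just a sketch on the simulated `df` from above with the xgboost + glm library, not output from the actual simulation runs.

```r
library(SuperLearner)
library(dplyr)

fit <- SuperLearner(
  Y = df$Y,
  X = df |> select(A, W1:W4) |> as.data.frame(),
  family = binomial(),
  SL.library = c("SL.xgboost", "SL.glm"),
  cvControl = list(V = 5)
)

# cvRisk: cross-validated risk of each candidate learner
# coef:   NNLS weight each learner gets in the ensemble; if SL.glm keeps a
#         non-trivial weight, the flexible xgboost fit is being pulled toward
#         the simpler, more stable model
data.frame(cv_risk = fit$cvRisk, weight = fit$coef)
```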
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me