My Messy Notes on Building a Super Learner: Peeking Under The Hood of NNLS
R-bloggers 2025-12-22
Tried building Super Learner from scratch to understand what's happening under the hood. Walked through the NNLS algorithm step by step; turns out ensembling models may beat solo models! Our homegrown version? Surprisingly close to the nnls package results.
But, does it really work in real life?

Motivations
Previously we learnt the workflow of TMLE, and most people would say to use it with Super Learner. But what is Super Learner? The name sounds fancy and cool. Let's take a look under the hood and see just how super this Super Learner is. In this blog, we will see what non-negative least squares is and walk through the algorithm behind the method that fuels Super Learner. We'll look at the mathematical procedure, then code it from scratch and see if we can reproduce the result. Let's do this!
Objectives:
- What is Super Learner?
- What is the algorithm behind Super Learner?
- Let's Put Them All Together
- Opportunities for improvement
- Lessons learnt
What is Super Learner?
Super Learner is an ensemble machine learning algorithm that optimally combines predictions from multiple candidate algorithms to create a single ensembled model. Rather than selecting a single "best" model through traditional model selection methods, Super Learner leverages the strengths of various algorithms by creating a weighted average of their predictions. The fundamental insight is elegant: why choose between a random forest, generalized linear model, or gradient boosting machine when you can let the data determine the optimal combination of all three? This approach was introduced by Mark van der Laan and colleagues and has become particularly popular in causal inference and epidemiology, often paired with Targeted Maximum Likelihood Estimation (TMLE) to obtain robust, efficient estimates of causal effects.
What is the algorithm behind Super Learner?
The beauty of Super Learner lies in its theoretical guarantee: it will perform at least as well as the best single algorithm in your library of candidate learners, and often performs substantially better. This property, known as the oracle inequality, means that Super Learner asymptotically achieves the lowest possible prediction error among the combinations of the candidate algorithms.
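For the mathematically curious, here is a rough paraphrase of that result; this is my own loose summary of the flavor of the finite-sample bound in van der Laan, Polley, and Hubbard, not a statement taken from the papers, so treat the constants as schematic. For a library of K candidate learners,
$$\begin{gather}\mathbb{E}\left[\text{Risk}\left(\hat{\psi}_{\text{SL}}\right)\right] \leq (1+\delta)\,\mathbb{E}\left[\min_{k \leq K} \text{Risk}\left(\hat{\psi}_{k}\right)\right] + C(\delta)\,\frac{\log K}{n}\end{gather}$$
The log(K)/n term vanishes as n grows, which is why Super Learner asymptotically matches the best candidate (the "oracle").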
To be transparent, I don't really understand all of this. But let's move on. The engine behind this is Non-Negative Least Squares (NNLS), an elegant constrained optimization method that finds the optimal weights for combining your candidate algorithms.
NNLS
At its core, NNLS solves a seemingly simple problem: given a matrix X of predictions from your candidate algorithms and an outcome vector y, find weights β that minimize the squared prediction error ||y - Xβ||² subject to one crucial constraint:
- all weights must be non-negative (β ≥ 0).

Strictly speaking, non-negativity is the only constraint NNLS itself imposes; Super Learner then rescales the fitted weights so they sum to one, turning them into a proper convex combination (see the quick sketch below).
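To make this concrete before we build it ourselves, here is a minimal sketch using the nnls package on made-up data; the X and y here are illustrative only, not the example we use later:

# toy NNLS: 10 observations, 3 "candidate model" prediction columns
set.seed(1)
X <- matrix(rnorm(30), nrow = 10, ncol = 3)
y <- as.vector(X %*% c(0.7, 0.3, 0)) + rnorm(10, sd = 0.1)

fit <- nnls::nnls(A = X, b = y)
fit$x               # non-negative weights
fit$x / sum(fit$x)  # rescaled to sum to one: the convex combination Super Learner uses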
Lawson-Hanson Algorithm
The most commonly used algorithm for solving NNLS is the active set method developed by Lawson and Hanson in 1974. This iterative algorithm is remarkably intuitive: it maintains two sets of variables, an "active set" of variables currently in the model with positive weights, and a "passive set" of variables currently excluded (with zero weights). (A caution on naming: in the standard active-set literature the labels are reversed, since the constraints are "active" on the variables pinned at zero; this post sticks with P = in-model, as flagged again below.) The algorithm begins with all variables in the passive set, then iteratively identifies which passive variable, if added to the active set, would most improve the fit. Once a variable enters the active set, the algorithm solves an unconstrained least squares problem using only the active variables. If any weights become negative during this step, the algorithm removes the most negative variable from the active set and repeats the process. This addition-and-removal dance continues until no passive variable would improve the fit and all active variables have positive weights, at which point we've found our optimal solution.
OK, too many words above. Not a fan.
Lots of procedure above; let's break it down into steps and go through the process with a simple coded example.
X <- rbind(c(1.5, 3, 4),
           c(0.5, 2, 3),
           c(4.5, 6, 6))
y <- c(2, 1, 5)
$$\begin{gather}\text{X} =\begin{bmatrix}1.5 & 3 & 4 \\0.5 & 2 & 3 \\4.5 & 6 & 6\end{bmatrix};\text{y} =\begin{bmatrix}2 \\ 1 \\ 5\end{bmatrix}\end{gather}$$
Let's take a quick look at the matrices above. Just glancing at them, you might guess that the weights for the models (columns) should concentrate on columns 1 and 2. Let's go through the Lawson-Hanson algorithm procedure.
Step 0: Initialize Your Sets
Start with all variables in the passive set (R) and none in the active set (P). Like so:
$$\text{P} = \emptyset$$
$$R = \{1, 2, 3\}$$

$$\beta =\begin{bmatrix}0 \\ 0 \\ 0\end{bmatrix}$$
- P: Active Set (take note, I used P as active, not passive; also these are indices)
- R: Passive Set (take note, these are indices)
- β: weights
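In code, the initialization is just the top of the full loop we'll run shortly:

P <- c()                   # active set: starts empty
R <- c(1:dim(X)[2])        # passive set: all column indices of X
beta <- rep(0, dim(X)[2])  # all weights start at zero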
We will go through the iterative procedure below, moving indices one by one from the passive set (R) to the active set (P) until no passive variables remain.
Step 1 Find Gradient
# step 1: find gradient
gradient <- t(X) %*% (y - X %*% beta)
if (sum(gradient <= 0) == dim(X)[2]) {
  stop("all gradients are zero or negative, we have achieved optimality")
}
if (length(R) == 0) {
  stop("R is empty")
}

$$\begin{gather}\text{Gradient} = \text{X}^{\text{T}} \cdot (\text{y} - \text{X}\beta)\end{gather}$$

The above is how we find the gradient for ||y - Xβ||² (strictly, it is the negative gradient with the factor of 2 dropped; positive entries flag variables whose inclusion would reduce the error). Let's put in the numbers and calculate.
$$\begin{gather}\text{Gradient} = \text{X}^{\text{T}} \cdot (\text{y} - \text{X}\beta) \\= \begin{bmatrix}1.5 & 3 & 4 \\0.5 & 2 & 3 \\4.5 & 6 & 6\end{bmatrix}^\text{T} \cdot \left(\begin{bmatrix}2 \\ 1 \\ 5\end{bmatrix} -\begin{bmatrix}1.5 & 3 & 4 \\0.5 & 2 & 3 \\4.5 & 6 & 6\end{bmatrix}\begin{bmatrix}0 \\ 0 \\ 0\end{bmatrix}\right) \\= \begin{bmatrix}1.5 & 0.5 & 4.5 \\3 & 2 & 6 \\4 & 3 & 6\end{bmatrix} \cdot\begin{bmatrix}2 \\ 1 \\ 5\end{bmatrix} \\= \begin{bmatrix}26 \\ 38 \\ 41\end{bmatrix}\end{gather}$$
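We can sanity-check the hand calculation in R; this is a quick check of my own, and since β is still all zeros, the gradient reduces to Xᵀy:

t(X) %*% y             # 26, 38, 41
which.max(t(X) %*% y)  # index 3 carries the largest gradient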
Step 2 Check Optimality & Find Next Variable to Add To P
If all gradients are non-positive, we have achieved optimality. If not, proceed to the next step: find the index in R with the maximum gradient. In this case, the max of 26, 38, 41 is 41, which is the third column of X. (Interesting: this goes against our earlier guess about columns 1 and 2; as we'll see, the algorithm walks column 3's weight back to zero in a later iteration.)
$$\begin{gather}\text{Next Variable} = \text{argmax}_{j \in R}\, \text{Gradient}_j \\= 3\end{gather}$$

We then move 3 from R (passive set) to P (active set) like so:
$$\text{P} = \{3\}$$

$$R = \{1, 2\}$$
Step 3 Solve the Unconstrained Least Squares Problem for Active Set P.
$$\begin{gather}\beta_P = (\text{X}_P^{\text{T}} \cdot \text{X}_P)^{-1} \cdot \text{X}_P^{\text{T}} \cdot \text{y} \\= \left(\begin{bmatrix}4 \\3 \\6\end{bmatrix}^{\text{T}} \cdot\begin{bmatrix}4 \\3 \\6\end{bmatrix}\right)^{-1} \cdot\begin{bmatrix}4 \\3 \\6\end{bmatrix}^{\text{T}} \cdot\begin{bmatrix}2 \\1 \\5\end{bmatrix} \\= (16 + 9 + 36)^{-1} \cdot\begin{bmatrix}4 & 3 & 6\end{bmatrix} \cdot\begin{bmatrix}2 \\1 \\5\end{bmatrix} \\= 61^{-1} \cdot\begin{bmatrix}4 \cdot 2 + 3 \cdot 1 + 6 \cdot 5\end{bmatrix} \\= 61^{-1} \cdot\begin{bmatrix}41\end{bmatrix} \\= \begin{bmatrix}0.6721311\end{bmatrix}\end{gather}$$
Where X_P is the sub-matrix of X containing only the columns in the active set P. In our case, P = {3}, so β is:

$$\begin{gather}\beta =\begin{bmatrix}0 \\ 0 \\ 0.6721311\end{bmatrix}\end{gather}$$

Still with me? We went from initializing zero weights (β) for all 3 models to the third model now carrying a weight of 0.6721311.
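One more quick check of mine: with a single active column, the least-squares solve collapses to a ratio of dot products.

sum(X[, 3] * y) / sum(X[, 3]^2)
## [1] 0.6721311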
Step 4 Check For Negative Weights
$$\begin{gather}\text{If any } \beta_P \leq 0, \text{ calculate } \alpha \text{; else go back to step 1}\\\alpha = \min_{j \,:\, \beta_{P,j} \leq 0} \frac{\beta_{old,j}}{\beta_{old,j} - \beta_{P,j}} \\\text{If } \alpha < 1, \text{ update } \beta = \beta_{old} + \alpha (\beta_P - \beta_{old}) \\\text{Remove any variables from P where } \beta \leq 0 \text{ and return them to R}\end{gather}$$

Since our weights $\beta$ cannot be negative, if the unconstrained solve produces a negative weight we move $\beta$ only an $\alpha$ proportion of the way toward the new solution: just far enough that the offending weight lands exactly at zero, while the other weights are adjusted proportionally along the same line.
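Here is a toy numeric illustration of the α step; the numbers are made up for illustration and are not from our running example:

# hypothetical weights to illustrate the alpha step
beta_old <- c(0.87, 0.00, 0.18)    # current feasible weights
beta_new <- c(0.67, 0.33, -0.05)   # unconstrained solve pushed coordinate 3 negative
neg <- which(beta_new < 0)
alpha <- min(beta_old[neg] / (beta_old[neg] - beta_new[neg]))  # 0.18 / 0.23
beta_old + alpha * (beta_new - beta_old)
## [1] 0.7134783 0.2582609 0.0000000

The interpolated weights stay non-negative, and the offending coordinate lands exactly on zero, after which it would be returned to R.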
After the above, we iterate until the R set is empty. You get the point; instead of writing out the entire calculation in LaTeX, let's use code to get to our answers.
P <- c()
R <- c(1:dim(X)[2])
beta <- rep(0, dim(X)[2])

while (T) {
  # step 1: find gradient
  gradient <- t(X) %*% (y - X %*% beta)
  if (sum(gradient <= 0) == dim(X)[2]) {
    print("all gradients are zero or negative, we have achieved optimality") ; break
  }
  if (length(R) == 0) {
    print("R is empty") ; break
  }

  # step 2: check optimality, pick the passive variable with the largest gradient
  gradient_not_active <- gradient
  gradient_not_active[P] <- -Inf
  P_x <- which(gradient_not_active == max(gradient_not_active))
  P <- c(P, P_x) |> unique() |> sort()
  R <- setdiff(R, P_x)

  # step 3: solve unconstrained least squares on the active set P
  beta_i <- beta
  beta_i[P] <- solve(t(X[, P]) %*% X[, P]) %*% t(X[, P]) %*% y

  # step 4: if any weight went negative, interpolate back to the boundary
  # (note: this simplified version does not return zeroed variables from P to R,
  #  unlike full Lawson-Hanson)
  if (any(beta_i < 0)) {
    print(paste0("negative weights: ", paste(beta_i, collapse = " ")))
    idx <- which(beta_i < 0)
    beta_old <- beta[idx]
    beta_new <- beta_i[idx]
    alpha <- beta_old / -(beta_new - beta_old)
    beta_i_new <- beta - alpha * (beta - beta_i)
    beta <- beta_i_new |> round(digits = 4)
    print(paste0("new weights after setting negative weight as zero: ", paste(beta, collapse = " ")))
  } else {
    beta <- beta_i ; print(beta)
  }
}
## [1] 0.0000000 0.0000000 0.6721311
## [1] 0.8683544 0.0000000 0.1810127
## [1] "negative weights: 0.666666666666675 0.333333333333364 -7.105427357601e-15"
## [1] "new weights after setting negative weight as zero: 0.6667 0.3333 0"
## [1] "R is empty"

beta
## [1] 0.6667 0.3333 0.0000

X %*% beta
##         [,1]
## [1,] 1.99995
## [2,] 0.99995
## [3,] 4.99995

Wow, it worked! Look at our weights (beta) and our final results! As suspected, columns 1 and 2 carry the weights (more on column 1), and when combined, our final numbers are quite close to our y, which is 2, 1, 5. Awesome!
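As a quick cross-check of my own (not in the original run), we can hand the same toy X and y to the nnls package. Since β = (2/3, 1/3, 0) reproduces y exactly (zero residual), we'd expect essentially the same weights back:

nnls::nnls(A = X, b = y)  # expect x estimates of roughly 0.6667 0.3333 0

Now, let's simulate more data, see if our code still works, and compare it with the nnls package!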
Let's Put Them All Together
Simulate Data
# labels/outcome/y
num_labels <- 1000
label_range <- 1:5
y <- sample(label_range, num_labels, replace = T)

# X matrix
num_models <- 5
X <- matrix(nrow = num_labels, ncol = num_models)
for (i in 1:num_models) {
  sd <- sample(c(0.01, 1, 10), 1)  # each "model" gets one randomly chosen noise level
  for (j in 1:num_labels) {
    X[j, i] <- rnorm(1, mean = y[j], sd = sd)
  }
}

Alright, what we did above is basically simulate y and X: each column of X plays the role of one candidate model's predictions, equal to the true label plus noise, with some models noisier than others.
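Before fitting anything, here is a quick check of my own (not in the original post, and its output varies with the random draws) to eyeball how good each simulated "model" is on its own:

# per-model RMSE against y: columns drawn with sd = 0.01 should score near 0.01,
# the noisier ones near 1 or 10
sqrt(colMeans((X - y)^2))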
P <- c()
R <- c(1:dim(X)[2])
beta <- rep(0, dim(X)[2])

while (T) {
  # step 1: find gradient
  gradient <- t(X) %*% (y - X %*% beta)
  if (sum(gradient <= 0) == dim(X)[2]) {
    print("all gradients are zero or negative, we have achieved optimality") ; break
  }
  if (length(R) == 0) {
    print("R is empty") ; break
  }

  # step 2: check optimality
  gradient_not_active <- gradient
  gradient_not_active[P] <- -Inf
  P_x <- which(gradient_not_active == max(gradient_not_active))
  P <- c(P, P_x) |> unique() |> sort()
  R <- setdiff(R, P_x)

  # solve P
  beta_i <- beta
  beta_i[P] <- solve(t(X[, P]) %*% X[, P]) %*% t(X[, P]) %*% y
  if (any(beta_i < 0)) {
    print(paste0("negative weights: ", paste(beta_i, collapse = " ")))
    idx <- which(beta_i < 0)
    beta_old <- beta[idx]
    beta_new <- beta_i[idx]
    alpha <- beta_old / -(beta_new - beta_old)
    beta_i_new <- beta - alpha * (beta - beta_i)
    beta <- beta_i_new |> round(digits = 4)
    print(paste0("new weights after setting negative weight as zero: ", paste(beta, collapse = " ")))
  } else {
    beta <- beta_i ; print(beta)
  }
}
## [1] 0.9998029 0.0000000 0.0000000 0.0000000 0.0000000
## [1] 0.5145682 0.0000000 0.0000000 0.4853314 0.0000000
## [1] 0.3242697 0.3571370 0.0000000 0.3184922 0.0000000
## [1] 0.2346782 0.2762402 0.0000000 0.2480602 0.2409658
## [1] 0.1884031 0.2342831 0.1677427 0.2059978 0.2035208
## [1] "R is empty"

Let's look at our weights and RMSE.
beta
## [1] 0.1884031 0.2342831 0.1677427 0.2059978 0.2035208

sqrt(mean((y - X %*% beta)^2))
## [1] 0.004614622
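A small aside of mine (not in the original post): because every simulated column is an unbiased prediction of y, the NNLS weights naturally land very close to summing to one; Super Learner would rescale them to sum to exactly one:

sum(beta)          # ~0.99995 for the run shown above
beta / sum(beta)   # the convex-combination weights Super Learner would use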
Let's look at the nnls package and see if we get the same result.
model <- nnls::nnls(A = X, b = y)
model
## Nonnegative least squares model
## x estimates: 0.1884031 0.2342831 0.1677427 0.2059978 0.2035208
## residual sum-of-squares: 0.02129
## reason terminated: The solution has been computed sucessfully.

sqrt(mean((y - X %*% model$x)^2))
## [1] 0.004614622
Wow! Awesome!!! Looks the same, or at least very similar. Alright, now we're able to reproduce the NNLS portion from scratch. Let's simulate some non-linear data, train several different models, and see how the end result looks!
Let's Super Learn this thing
The code below contains the entire procedure. We basically ran 3 different models in tidymodels (linear regression, xgboost, and random forest) with the recipe (y ~ .), not specifying any interaction/polynomial relationships, for the simulated data below.
n <- 1000
x <- rnorm(n)
w <- rnorm(n, 0.5*x)
y <- 0.2*x + 0.5*w + 0.2*x*w + 0.05*x^2
We made sure to set a seed for reproducibility and created 5 folds for cross-validation. Then we extract the validation-set predictions from each model and stack them into the X matrix, extract the RMSE from each model and stack those into a metrics matrix, and finally run our NNLS code from above to get the weights and RMSE for the Super Learner. We repeat this for 1000 iterations and log the results.
This cross-validation step is the defining feature of the Super Learner. By fitting each base learner on training folds and generating out-of-fold predictions, we obtain an unbiased prediction matrix that is then used to estimate optimal ensemble weights via NNLS.
library(tidymodels)
library(future)
library(furrr)

# Set up parallel processing
plan(multisession, workers = availableCores() - 2)

# Define the function to run for each iteration
run_iteration <- function(i) {
  set.seed(i)
  n <- 1000
  x <- rnorm(n)
  w <- rnorm(n, 0.5*x)
  y <- 0.2*x + 0.5*w + 0.2*x*w + 0.05*x^2
  df <- tibble(x, w, y)
  split <- initial_split(df)
  train <- training(split)
  test <- testing(split)

  # preprocess
  rec <- recipe(y ~ ., data = train)

  # linear regression
  lr_spec <- linear_reg()
  wf <- workflow() |> add_recipe(rec) |> add_model(lr_spec)
  folds <- vfold_cv(train, 5)
  cv_results <- wf |> fit_resamples(folds, control = control_resamples(save_pred = TRUE))
  cv_metrics <- collect_metrics(cv_results) |> filter(.metric == "rmse") |> pull(mean)
  cv_preds <- collect_predictions(cv_results)

  # xgboost
  xgb_spec <- boost_tree(engine = "xgboost", mode = "regression")
  wf <- workflow() |> add_recipe(rec) |> add_model(xgb_spec)
  cv_results <- wf |> fit_resamples(folds, control = control_resamples(save_pred = T))
  cv_metrics2 <- collect_metrics(cv_results) |> filter(.metric == "rmse") |> pull(mean)
  cv_preds2 <- collect_predictions(cv_results)

  # random forest
  rf_spec <- rand_forest(mode = "regression")
  wf <- workflow() |> add_recipe(rec) |> add_model(rf_spec)
  cv_results <- wf |> fit_resamples(folds, control = control_resamples(save_pred = T))
  cv_metrics3 <- collect_metrics(cv_results) |> filter(.metric == "rmse") |> pull(mean)
  cv_preds3 <- collect_predictions(cv_results) |> mutate(model = "rf")

  # stack the out-of-fold predictions into X and the true outcome into y
  X <- cbind(cv_preds |> select(X1 = .pred),
             cv_preds2 |> select(X2 = .pred),
             cv_preds3 |> select(X3 = .pred)) |> as.matrix()
  y <- cv_preds |> select(y) |> as.matrix()
  metrics <- cbind(cv_metrics, cv_metrics2, cv_metrics3)

  # nnls (same loop as before)
  P <- c()
  R <- c(1:dim(X)[2])
  beta <- rep(0, dim(X)[2])
  while (T) {
    # step 1: find gradient
    gradient <- t(X) %*% (y - X %*% beta)
    if (sum(gradient <= 0) == dim(X)[2]) { print("all gradients are zero or negative, we have achieved optimality") ; break }
    if (length(R) == 0) { print("R is empty") ; break }
    # step 2: check optimality
    gradient_not_active <- gradient
    gradient_not_active[P] <- -Inf
    P_x <- which(gradient_not_active == max(gradient_not_active))
    P <- c(P, P_x) |> unique() |> sort()
    R <- setdiff(R, P_x)
    # solve P
    beta_i <- beta
    beta_i[P] <- solve(t(X[, P]) %*% X[, P]) %*% t(X[, P]) %*% y
    if (any(beta_i < 0)) {
      print(paste0("negative weights: ", paste(beta_i, collapse = " ")))
      idx <- which(beta_i < 0)
      beta_old <- beta[idx]
      beta_new <- beta_i[idx]
      alpha <- beta_old / -(beta_new - beta_old)
      beta_i_new <- beta - alpha * (beta - beta_i)
      beta <- beta_i_new |> round(digits = 4)
      print(paste0("new weights after setting negative weight as zero: ", paste(beta, collapse = " ")))
    } else { beta <- beta_i ; print(beta) }
  }

  rmse_superlearner <- sqrt(mean((y - X %*% beta)^2))
  rmse_result <- if (sum(metrics < rmse_superlearner) >= 1) { "solo_better" } else { "superlearner_better" }
  model <- nnls::nnls(A = X, b = y)
  rmse_ours_nnls <- c(rmse_superlearner, sqrt(mean((y - X %*% model$x)^2)))
  same_weights_result <- if (sum(round(beta, 4) == round(model$x, 4)) == 3) { "same" } else { "not_same" }
  weights_log <- c(model$x, beta)
  return(list(rmse_log = rmse_result,
              same_weights_log = same_weights_result,
              weights_log = weights_log,
              rmse_ours_nnls = rmse_ours_nnls))  # rmse_ours_nnls is read back in the comparison later
}

# Run with future_map
results <- future_map(1:1000, run_iteration, .options = furrr_options(seed = TRUE), .progress = TRUE)

Let's Compare RMSE of Solo Models vs Super Learner Models
library(tidyverse)

# Extract results
rmse_log <- map_chr(results, "rmse_log")
same_weights_log <- map_chr(results, "same_weights_log")
weight_logs <- matrix(NA, ncol = 6, nrow = 1000)
for (i in 1:1000) {
  weight_logs[i, 1:6] <- results[[i]]$weights_log
}

plotrmse_log <- tibble(rmse = rmse_log) |>
  ggplot(aes(x = rmse)) +
  geom_bar() +
  theme_bw()
Wow, look at that! The Super Learner/ensembled model does appear to have better RMSE compared to the solo models! Let's take a look and see if our noob NNLS from scratch is comparable with the nnls package.
Comparing Our NNLS to nnls package
plotsameweights <- tibble(same_weights = same_weights_log) |>
  ggplot(aes(x = same_weights)) +
  geom_bar() +
  theme_bw()

Wow, most of the weights are the same if we round to 4 digits! Let's check the ones that differ: are they REALLY that different?
plotdiff123 <- weight_logs |>
  as_tibble() |>
  mutate(diff1 = round(V4 - V1, 4),
         diff2 = round(V5 - V2, 4),
         diff3 = round(V6 - V3, 4),
         sum_diff = diff1 + diff2 + diff3) |>
  filter(sum_diff != 0) |>
  pivot_longer(cols = c(diff1, diff2, diff3), names_to = "diff", values_to = "values") |>
  ggplot(aes(x = values, fill = diff)) +
  geom_histogram(position = "dodge2") +
  theme_bw()

This makes sense: most of the differences are between xgboost (diff2) and random forest (diff3), as our linear regression model (diff1), without the correct specification, probably doesn't contribute much, hence any difference between our algorithm and nnls there is minimal (centered near zero, in red). It also makes sense that when one of the xgboost or random forest weights differs, we see an offsetting difference in the other model's contribution. Now the question is: with these weight differences, does it make a huge difference in RMSE? I suspect not.
rmse_compare <- matrix(NA, ncol = 2, nrow = 1000)
for (i in 1:1000) {
  rmse_compare[i, 1:2] <- results[[i]]$rmse_ours_nnls
}

plotcompare <- weight_logs |>
  as_tibble() |>
  mutate(row = row_number()) |>
  mutate(diff1 = round(V4 - V1, 4),
         diff2 = round(V5 - V2, 4),
         diff3 = round(V6 - V3, 4),
         sum_diff = diff1 + diff2 + diff3) |>
  filter(sum_diff != 0) |>
  left_join(as_tibble(rmse_compare) |> mutate(row = row_number()), by = "row") |>
  mutate(V1.y = round(V1.y, 4), V2.y = round(V2.y, 4)) |>
  mutate(check = case_when(
    V1.y == V2.y ~ "same",
    V1.y < V2.y ~ "our_nnls_better",
    V1.y > V2.y ~ "nnls_package_better",
    TRUE ~ NA_character_
  )) |>
  ggplot(aes(x = check)) +
  geom_bar() +
  theme_bw()
lol, the nnls package is clearly better than our home-grown algorithm! But by how much?
plotdiffrmse <- weight_logs |>
  as_tibble() |>
  mutate(row = row_number()) |>
  mutate(diff1 = round(V4 - V1, 4),
         diff2 = round(V5 - V2, 4),
         diff3 = round(V6 - V3, 4),
         sum_diff = diff1 + diff2 + diff3) |>
  filter(sum_diff != 0) |>
  left_join(as_tibble(rmse_compare) |> mutate(row = row_number()), by = "row") |>
  mutate(V1.y = round(V1.y, 4), V2.y = round(V2.y, 4)) |>
  mutate(diff_rmse = V1.y - V2.y) |>
  ggplot(aes(diff_rmse)) +
  geom_histogram() +
  theme_bw()

It's really not that much different! Let's find the max.
weight_logs |>
  as_tibble() |>
  mutate(row = row_number()) |>
  mutate(diff1 = round(V4 - V1, 4),
         diff2 = round(V5 - V2, 4),
         diff3 = round(V6 - V3, 4),
         sum_diff = diff1 + diff2 + diff3) |>
  filter(sum_diff != 0) |>
  left_join(as_tibble(rmse_compare) |> mutate(row = row_number()), by = "row") |>
  mutate(V1.y = round(V1.y, 4), V2.y = round(V2.y, 4)) |>
  mutate(diff_rmse = V1.y - V2.y) |>
  pull(diff_rmse) |>
  max()
## [1] 9e-04
Does that mean our home-grown algorithm works just as well? You be the judge. Let me know if this is due to pure luck!
Opportunities for improvement
- will try multicore sometime in the future; is it really faster than multisession?
- need to learn/figure out the fast NNLS algorithm, which I believe the nnls package uses
- need to venture more into parallel computing
- compare with the actual SuperLearner package (a rough sketch of what that could look like follows below)
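For that last item, here is a minimal, untested sketch of what the comparison might look like. The SL.glm, SL.randomForest, and SL.xgboost wrappers and the cvControl argument are my assumptions about the SuperLearner package's interface, and its default meta-learner is, as far as I know, the same NNLS idea we built above:

# untested sketch: feed the same simulated data to the SuperLearner package
library(SuperLearner)

n <- 1000
x <- rnorm(n)
w <- rnorm(n, 0.5*x)
y <- 0.2*x + 0.5*w + 0.2*x*w + 0.05*x^2

sl_fit <- SuperLearner(
  Y = y,
  X = data.frame(x = x, w = w),
  family = gaussian(),
  SL.library = c("SL.glm", "SL.randomForest", "SL.xgboost"),
  cvControl = list(V = 5)  # 5-fold CV, matching our setup
)
sl_fit$coef  # ensemble weights, estimated via NNLS by default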
Lessons learnt
- learnt to build a Super Learner using non-negative least squares
- learnt the Lawson-Hanson algorithm and how it's implemented; compared with nnls, the results were not too shabby!
- learnt some basics of parallel computing
If you like this article:
- please feel free to send me a comment or visit my other blogs
- please feel free to follow me on BlueSky, twitter, GitHub or Mastodon
- if you would like to collaborate, please feel free to contact me