How is it that this problem, with its 21 data points, is so much easier to handle with 1 predictor than with 16 predictors?

Statistical Modeling, Causal Inference, and Social Science 2025-11-14

Here’s a story that appears on pages 309-310 of Active Statistics:

Many years ago we taught a course in statistical consulting. The consulting was done by graduate students in pairs: each pair had open office hours once a week; clients would come in to discuss their problems and were told to return a week later; and each week we would meet with all the students to go over the consulting problems that had come in, which prepared them for their follow-up meetings.

Lots of interesting problems would come in. One week, a pair of students reported that someone had shown up who was studying the efficiency of industrial plants. The researcher had data on 21 factories, and for each of them she had a measure of efficiency and 16 predictors—different variables that might be predictive of that outcome. She wanted to use these data to see which of these factors was most important. We’re sorry, but we have no records from this class, so many details are missing—we’re reconstructing this from memory.

But one thing we do remember is the numbers: 21 data points, 16 predictors.

The first problem here is to ask what can be done with these data. It’s not an easy question. Indeed, it might seem ridiculous to suppose that you could tease out a regression relationship among so many predictors with so few observations. And this is without even getting into potential interactions (16*15/2 = 120 two-way interactions, and so forth) or the difficulties of causal identification from observational data. If students cannot come up with any ideas, the instructor should push them in another way, by asking what decisions they might make based on these data, if they were designing this sort of industrial plant.

When this example came up in our consulting class years ago, one of the other students said that he remembered that researcher from the previous semester: she’d come by with 15 data points and 16 predictors, and he and his partner had told her that, with fewer data points than predictors, they couldn’t help her. In the meantime this researcher had gathered data from 6 more plants and was emboldened to return.

Fine. Laugh all you want. But . . . there are things that can be done even using this small dataset. Think about it this way: suppose the researcher had come in with 21 factories and just one predictor. Then you could do something, right? You could make a scatterplot of the outcome vs. the predictor, or run a regression predicting the outcome from this one variable. You can potentially learn a lot from 21 data points, or even from 15. Even if all you learn is that none of the 16 available predictors is by itself a strong indicator of the outcome, that is still relevant information.
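Just to make that concrete, here’s a minimal sketch in Python. The data are simulated stand-ins (we have no records of the actual consulting numbers, so the predictor, the outcome, and the effect size below are all made up); the only point is that a single-predictor regression with n = 21 is a perfectly routine computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the consulting data: 21 plants, one predictor,
# an efficiency outcome. All numbers here are hypothetical.
n = 21
x = rng.normal(size=n)                        # hypothetical plant characteristic
y = 0.5 * x + rng.normal(scale=1.0, size=n)   # hypothetical efficiency measure

# Simple least-squares regression of y on x, with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma_hat = np.sqrt(resid @ resid / (n - 2))
se_slope = sigma_hat / np.sqrt(np.sum((x - x.mean()) ** 2))

print(f"estimated slope: {beta[1]:.2f} (se {se_slope:.2f})")
```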

This is a problem with a large number of predictors relative to the number of data points, a setting where classical least-squares regression will not work well and some sort of regularization is necessary, as is done in various Bayesian or machine learning approaches to statistics. It is an example where, if we think carefully about our inferential goals, we realize we can learn something useful from our data—just not the “statistically significant” comparison that we’re used to looking for.
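Here’s a hedged illustration of what regularization buys you in the 21-points, 16-predictors setting, again on simulated data, with ridge regression (equivalent to zero-centered normal priors on the coefficients) standing in for a fuller Bayesian treatment; the penalty value is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated stand-in: 21 plants, 16 made-up predictors, only one of which matters.
n, p = 21, 16
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0] = 0.8
y = X @ beta_true + rng.normal(size=n)

# Ridge regression: least squares plus an L2 penalty, equivalent to putting
# independent zero-centered normal priors on the coefficients.
lam = 10.0
XtX = X.T @ X
beta_ridge = np.linalg.solve(XtX + lam * np.eye(p), X.T @ y)

# Unregularized least squares for comparison: computable here (p < n), but with
# only a handful of residual degrees of freedom the estimates are extremely noisy.
beta_ols = np.linalg.solve(XtX, X.T @ y)

print("ridge coefficients:        ", np.round(beta_ridge, 2))
print("least-squares coefficients:", np.round(beta_ols, 2))
```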

Here, though, I want to look at the problem in a slightly different way. Instead of considering how to analyze these data, let’s ask the following question.

How is it that this problem, with its 21 data points, is so much easier to handle with 1 predictor than with 16 predictors?

Sure, 21 data points and 1 predictor is a cleaner problem than 21 data points and 16 predictors. But more data should be better, no? A 21 x 16 matrix of predictors has a lot more information than a vector of length 21—especially given that those 21 numbers exist as a column within that larger matrix.

So, as a statistician, my usual answer to this question is that the dataset with 16 predictors contains more information, and we just have to move beyond simple least-squares and use more advanced methods to analyze the data.

But then I was thinking more about the problem, and I realized something.

Suppose we think of the 16 predictors as ordered, so that the two options are: (a) predictor #1, or (b) predictors #1,2,…,16. In that case, option (b) is clearly better: it includes additional information, and in a regression context you can get as close as you want to any model based on option (a) simply by putting very strong zero-centered priors on the coefficients for predictors #2,…,16.
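Here’s a little numerical check of that claim, using simulated data and penalized least squares as a stand-in for regression with normal priors (a per-coefficient penalty lambda_j corresponds, up to scaling, to a zero-centered normal prior with scale 1/sqrt(lambda_j)); the helper function and penalty values are made up for the illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data again: 21 plants, 16 predictors.
n, p = 21, 16
X = rng.normal(size=(n, p))
y = 0.8 * X[:, 0] + rng.normal(size=n)

# Penalized least squares with a separate penalty lambda_j per coefficient,
# corresponding (up to scaling) to independent zero-centered normal priors.
def ridge_with_penalties(X, y, lams):
    return np.linalg.solve(X.T @ X + np.diag(lams), X.T @ y)

# Option (a): regress on predictor #1 alone.
x1 = X[:, [0]]
beta_a = np.linalg.solve(x1.T @ x1, x1.T @ y)

# Option (b), with very strong zero-centered priors on predictors #2,...,16:
# those coefficients are shrunk to essentially zero, and the coefficient on
# predictor #1 is essentially the option-(a) estimate.
lams = np.full(p, 1e8)
lams[0] = 1e-8
beta_b = ridge_with_penalties(X, y, lams)

print("option (a) slope:                ", round(float(beta_a[0]), 4))
print("option (b) slope on predictor #1:", round(float(beta_b[0]), 4))
```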

But now suppose the 16 predictors are unordered. In this case option (b) could be better than option (a)—or it could be worse! Option (b) contains the additional information from the 15 other predictors, so in that way it’s better. But, in this unordered-predictor scenario, option (b) lacks an important piece of information that option (a) has: the label identifying which one of the predictors is the single predictor that would be included in that simpler scenario.

We don’t usually think of the order of predictors as containing information, because our usual models for regression are invariant to the indexes of the predictors. That is, standard methods are exchangeable—that’s the term for models or procedures that are invariant to indexing, as discussed in chapters 5 and 8 of Bayesian Data Analysis.
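A quick way to see that invariance: take an exchangeable procedure such as equal-penalty ridge regression, permute the columns of the predictor matrix (that is, relabel the predictors), refit, and check that the fitted values don’t change. Simulated data again; the specific numbers and penalty don’t matter:

```python
import numpy as np

rng = np.random.default_rng(3)

# Any dataset will do for this check; simulate one with 21 points and 16 predictors.
n, p = 21, 16
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# An exchangeable procedure: ridge regression with the same penalty on every coefficient.
def ridge_fitted_values(X, y, lam):
    beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return X @ beta

# Relabel the predictors by permuting the columns, then refit.
perm = rng.permutation(p)
fit_original = ridge_fitted_values(X, y, lam=5.0)
fit_permuted = ridge_fitted_values(X[:, perm], y, lam=5.0)

# The fitted values are unchanged: the procedure does not care which predictor
# is called #1 and which is called #16.
print(np.allclose(fit_original, fit_permuted))   # True
```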

In real life, though, we often assemble information sequentially, and in a setting with 16 predictors there might well be an ordering, not strictly from most to least important, but something like that. It makes sense to incorporate this implicit ordering information into our statistical procedures and models, and it’s a flaw in our default procedures (including in my own textbooks!) that we don’t.
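To be concrete about what using the ordering could look like, here’s one toy version, not a recommendation: zero-centered priors whose scales shrink as the predictor index increases, so that later predictors get shrunk harder. The geometric decay rate below, and the simulated data, are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated data: 21 plants, 16 predictors, with the early predictors mattering more.
n, p = 21, 16
X = rng.normal(size=(n, p))
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(size=n)

# Encode "earlier predictors are probably more important" as zero-centered priors
# whose scales tau_j shrink with the index j, i.e. penalties 1 / tau_j^2 that grow
# with j. The decay rate of 0.7 is arbitrary.
tau = 0.7 ** np.arange(p)
lams = 1.0 / tau ** 2
beta = np.linalg.solve(X.T @ X + np.diag(lams), X.T @ y)

print(np.round(beta, 2))
```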