Ecologists’ endless quest for automatic inference

Statistical Modeling, Causal Inference, and Social Science 2025-05-28

This post is by Lizzie.

At the end of a recent course I taught on Bayesian approaches (which reminds me I should blog an update on that), a student asked ‘so when do we divide up our data into test and training?’ This stopped me a little, as the whole course was on a workflow approach to science and stats that I hoped hammered home how to gain mechanistic insights from simulated data, preparing you for more insights using retrodictive checks on a model fit to your empirical data, etc. I was on the spot, suddenly realizing some gaps and failures in my course content. I also should not have been surprised, as ecologists are going big on machine learning (are there other uses for test/training data? Yes, but that’s the dominant place this language is in use in my field now, IMHO), and we (I) don’t step back and teach the different approaches.
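
To make the contrast concrete, here is roughly what that simulate-fit-check loop looks like in code: a minimal sketch in plain Python/numpy with a made-up phenology example. The names and numbers are purely illustrative, and a real analysis would use a full probabilistic model and richer retrodictive checks; least squares just stands in here.

```python
import numpy as np

rng = np.random.default_rng(42)

# Step 1: simulate data from known ("true") parameters.
# Hypothetical example: day of leafout as a linear function of spring temperature.
n = 100
true_intercept, true_slope, true_sigma = 60.0, -2.5, 4.0
temp = rng.uniform(5, 20, size=n)  # spring temperature (C)
leafout = true_intercept + true_slope * temp + rng.normal(0, true_sigma, n)

# Step 2: fit the model and check that the known parameters are recovered.
X = np.column_stack([np.ones(n), temp])
beta_hat, *_ = np.linalg.lstsq(X, leafout, rcond=None)
sigma_hat = (leafout - X @ beta_hat).std(ddof=2)
print("true:     ", true_intercept, true_slope, true_sigma)
print("estimated:", beta_hat[0].round(1), beta_hat[1].round(2), sigma_hat.round(2))

# Step 3: a retrodictive check. Simulate replicated datasets from the fitted
# model and ask whether a summary of the observed data looks plausible under them.
rep_mins = np.array([(X @ beta_hat + rng.normal(0, sigma_hat, n)).min()
                     for _ in range(1000)])
print("observed min leafout:", round(leafout.min(), 1))
print("95% of replicated mins:", np.quantile(rep_mins, [0.025, 0.975]).round(1))
```

If the fitted model can recover parameters you set yourself, and replicated data from the fit look like your real data, you have learned something about both the model and the system; no test/training split required for that kind of insight.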

When I was discussing this with a stats colleague recently, he mentioned the endless search for automatic inference: ‘Feed in data, pull crank, get scientific inference.’ It’s the opposite of the workflow to me. I also think it’s not going to work well, but it’s clearly the dream, and an alarming share of ecology is devoted to it, without even knowing it.

Machine learning is the new best hope of automatic inference for ecology (and a lot of other fields) without anyone seeming to notice what they’re not getting. It’s amazing to me how many students seem blithely unaware of what machine learning is going to give you — (good) predictions for out-of-sample data, but a difficult time finding interpretable parameters and all the science that can go with them. (And, yes, I know some of the machine learning approaches are working on changing this.) So they see it as the inference approach.

The previous best hope of automatic inference was model comparison (LOO is the new magic; AIC was a big — BIG — hit; before that it was stepwise regression, with an alarming number of ecologists never learning about the potential problems with stepwise regression, but I digress) and it’s still going strong in some circles. Fit 6 or 600 or so models and compare them to see which is best. In my area, the models balloon since we have no idea which climatic driver to include. For example, I think water matters to trees growing outside, so for a precipitation variable, should I use total precipitation? Or maybe just precipitation during the growing season? Or, wait, maybe divide it up into growing and non-growing season. But then for the non-growing season, should I use snow depth? Snow water equivalent (SWE)? This is so hard, and there’s no clear answer.

Automatic inference to the rescue! You can put them all in with model comparison, including a suite of possible interactions, and see which ones really matter. Yay!
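
Here is a hedged sketch of what that looks like in practice, again in plain Python/numpy with invented variable names and simulated data (only SWE actually drives growth in this toy, by construction, and AIC is computed for Gaussian least-squares fits up to an additive constant). The point is not the code but what tends to come out the other end.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 80

# Hypothetical, partly redundant climate predictors (names invented for illustration).
climate = {
    "total_precip": rng.normal(size=n),
    "gs_precip":    rng.normal(size=n),
    "snow_depth":   rng.normal(size=n),
    "swe":          rng.normal(size=n),
    "aug_heat":     rng.normal(size=n),
}
# Make some of them correlated, as real climate variables are.
climate["gs_precip"] = 0.8 * climate["total_precip"] + 0.2 * rng.normal(size=n)
climate["swe"] = 0.9 * climate["snow_depth"] + 0.1 * rng.normal(size=n)

# Simulated ring width where, by construction, only SWE actually matters.
growth = 1.0 + 0.5 * climate["swe"] + rng.normal(0, 1.0, n)

def gaussian_aic(y, X):
    """AIC of a Gaussian least-squares fit, up to an additive constant."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = ((y - X @ beta) ** 2).sum()
    k = X.shape[1] + 1  # regression coefficients plus the error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

# Fit every subset of predictors and rank them by AIC.
names = list(climate)
results = []
for size in range(1, len(names) + 1):
    for subset in itertools.combinations(names, size):
        X = np.column_stack([np.ones(n)] + [climate[v] for v in subset])
        results.append((gaussian_aic(growth, X), subset))

results.sort()
best_aic = results[0][0]
for aic, subset in results[:5]:
    print(f"dAIC = {aic - best_aic:5.2f}  {subset}")
```

When I run this sort of toy, several structurally different models usually land within a unit or two of the winner (swe, snow_depth, or both, often with a hitchhiking precipitation term), which is exactly the ambiguity I mean in the next paragraph.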

Did this work? Not at all, if you ask me. I recently saw a tree ring talk that did this, but you could tell the best-fitting model actually made no biological sense once they thought about it more, so they presented the ‘second-best model.’ And I am quite sure the second- and third-best models were pretty similar on any comparison metric you wanted to throw at them, yet they might have given really different answers about how the world works. (Ecologists have tried one way around this — model averaging, which I don’t think offers much either.) I am not sure why everyone is doing this other than that (1) we have all tacitly agreed it’s okay and (2) the other option seems harder and more uncertain, and maybe we have not all tacitly agreed it’s okay.

What we have never gotten out of this, as best I can tell: (a) We start to see new patterns in what matters in these model comparisons and say, ‘hey — all this work together really shows we should focus on SWE in this context. Thank goodness we did model comparison, as there is no other way we would have figured this out.’ (b) We use something we learned in model comparison to design an experiment that teaches us something new. Like, ‘wow, I never thought extreme heat in August would be so important; I will now set up an experiment to test the role of extreme heat in August. I am so glad I put that predictor — and extreme heat in every other month and in 3-month windows — in my model so I could find this out.’ (c) The feeling of joy at saying, ‘look at my minimum adequate model! This is great and so helpful.’ We never get these things because the results are almost always a mess. We all know this, as best I can tell, so we don’t even look closely at the results as reviewers any more.

What’s the other option?

The other option, to me, is that you pick your few best-guess damn variables, the ones you can make predictions about and whose functional relationship to your response variable(s) you can describe, and you put those in your model. Maybe you fit a few models, but not endless models. In my experience, the first step in this process alone (picking those variables) gains me way more insights than any model comparison ever has. Why? Because it’s the opposite of automatic inference. It requires me to think.
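
For contrast, the same kind of toy done the other way: pick the two variables you can actually defend, write down the signs and rough sizes you expect before fitting, fit one model, and then check. Again this is just a hedged numpy sketch with invented names, standing in for a real model plus retrodictive checks.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 80

# Hypothetical: I believe growing-season precipitation and August extreme heat
# are the drivers, and I can state expectations up front: more water should
# mean more growth (modest positive slope), extreme heat should reduce growth
# (negative slope). Names and numbers are invented for illustration.
gs_precip = rng.normal(size=n)
aug_heat = rng.normal(size=n)
growth = 1.0 + 0.4 * gs_precip - 0.3 * aug_heat + rng.normal(0, 1.0, n)

# One model, chosen for biological reasons, rather than 600 picked by a search.
X = np.column_stack([np.ones(n), gs_precip, aug_heat])
beta, *_ = np.linalg.lstsq(X, growth, rcond=None)
print("intercept, gs_precip, aug_heat:", np.round(beta, 2))

# The check is whether the estimated signs and rough sizes match what was
# written down before fitting (and what retrodictive checks look like), not
# which of many models happens to win on a comparison metric.
```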

What’s the downside of this other option? One would be that we pick the wrong predictors and never see that amazing predictor we would have just tossed in with model comparison. But given where 20+ years of model comparison has gotten us, I am discounting this possibility. The other — and this is what students in my classes are really worried about — is that we don’t all tacitly agree this is okay. Many students I suggest this to don’t think it’s okay. They see how widespread model comparison and its ilk are and worry they cannot get published without it. They aren’t even trained in how to pick those variables.

We’re so over the top on automatic inference we don’t even train our students to be prepared for anything else. And worse yet, we tell them they’re doing (good) science.

With machine learning* we’re slipping even further away from science and our training is getting even worse as best I can tell. Students at UBC in data science learn to ‘tidy’ data as though there is no domain expertise in this process. ‘Tidy’ means removing outliers, gap filling and other things that horrify me to see students learn in their first term. How on earth do they know what an outlier is when they don’t even know what the data are? After this they learn random forests and some simple neural nets. Science done.

What’s the solution? I desperately hope people smarter than me are working on this question. One answer is obviously raising our standards and discounting work that doesn’t really give us much from whatever model comparison it used. Another is better training — I think we all need to admit that training has got to change with machine learning on the rise. A lot of students I work with now only take data science — they learn only machine learning and don’t know what a regression is, or don’t think it’s something they would ever use. They need to see how interconnected all the inference methods are and which aims each one works well for right now (and which it doesn’t), and be prepared for that to change. This seems tractable. What seems less tractable is better training in science — training students to know there’s no automatic inference for science and that getting useful insights is actually messier, harder, and involves more uncertainty than most people tell you (but, if you ask me, it’s also a lot more fun).

*We’re somehow also now calling most of machine learning ‘AI’ in ecology. Are other fields doing this? Why (I mean, other than wanting to sound like you’re doing the absolute coolest, most cutting-edge thing)?