“Why don’t machine learning and large language model evaluations report uncertainty?”
Statistical Modeling, Causal Inference, and Social Science 2025-02-22
Ilan Strauss and Tim O’Reilly ask:
Why don’t ML and LLM model evaluations report uncertainty? Rarely see an interval of some kind.
– Because the models are too big (LLMs)?
– Or because their ML metrics (accuracy, recall, precision) are assumed to be sufficient for taking into account uncertainty in predictions stemming from the model’s training data, i.e., can the model generalize / predict new data points?
They continue:
LLM behavior is also inherently uncertain. The model’s responses are highly sensitive to factors like the query, hyperparameters, and context, all of which introduce variability in a model’s outputs. . . LLMs seem to be computationally deterministic in their outputs (even if practical stuff complicates this): Given the same input and conditions, the model should generate the same probabilities for the next token. The variability we see in outputs stems largely from the sampling methods applied on top of these probabilities, such as top-k sampling or temperature sampling. These techniques introduce randomness, producing different outputs for the same input. But even without this sampling layer, uncertainty should persist in LLM evaluation results because it’s impractical to test all possible model input-output combinations. . . .
Calculating and showing model uncertainty usually comes by providing an interval . . . So, why the omission by OpenAI of uncertainty from most of its model evaluations? Maybe computer scientists aren’t always familiar with common statistical practice . . . the leading AI textbook by Stuart Russell and Peter Norvig (4th edition) [has] an entire chapter on “Quantifying Uncertainty”, but devoted largely to uncertainty facing AI in the external environment. . . .
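To make the quoted point about sampling concrete, here is a minimal sketch of how temperature and top-k sampling turn a deterministic vector of next-token logits into variable outputs. This is not OpenAI’s or any particular library’s implementation; the logits and parameter values are made up for illustration.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50, rng=None):
    """Sample a next-token index from raw logits using temperature and top-k.

    As temperature -> 0 this approaches greedy (deterministic) decoding;
    larger temperatures and larger k spread probability over more tokens,
    which is the source of run-to-run variability in the sampled outputs.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=float)

    # Temperature scaling: divide the logits before the softmax.
    scaled = logits / max(temperature, 1e-8)

    # Top-k filtering: keep only the k largest logits, mask the rest.
    if top_k is not None and top_k < len(scaled):
        cutoff = np.sort(scaled)[-top_k]
        scaled = np.where(scaled >= cutoff, scaled, -np.inf)

    # Softmax over the filtered, scaled logits.
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    return rng.choice(len(probs), p=probs)

# The model's forward pass is deterministic: same input, same logits.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.5])

# The randomness enters only at this sampling step.
samples = [sample_next_token(logits, temperature=0.8, top_k=3) for _ in range(10)]
print(samples)
```

Even if you strip this layer out (greedy decoding), the quoted point stands: the evaluation itself only covers a finite sample of inputs, so the reported metric still carries sampling uncertainty.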
I don’t have any answer of my own to the question, “Why don’t machine learning and large language model evaluations report uncertainty?”, for the simple reason that I don’t know enough about machine learning and large language model evaluations in the first place. I imagine they do report uncertainty in some settings.
My general recommendation to people running machine learning models is to replicate using different starting points. This won’t capture all uncertainty, not even all statistical uncertainty. You can see this by considering a simple example such as linear least squares, which converges to the same solution no matter where you start it, so in that case the try-different-starting-values trick won’t get you anywhere at all. Rather, I think of this as a way to get an approximate lower bound on the Monte Carlo uncertainty of your output. To get more than that, I think you need some explicit or implicit modeling of the data collection process. An explicit model goes into a likelihood function, which goes into a Bayesian analysis, which produces uncertainty. An implicit model could be instantiated by repeating the computation on different datasets obtained by cross validation or bootstrapping.
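As a rough illustration of the two ideas in that last paragraph, here is a hedged sketch: a percentile bootstrap over the test items to get an interval for an evaluation metric, and a placeholder for re-running the whole pipeline with different seeds. The function `train_and_eval` is hypothetical, standing in for whatever training-plus-evaluation code you actually have, and the data are fake.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_interval(per_example_scores, n_boot=2000, level=0.95, rng=rng):
    """Percentile bootstrap interval for a mean eval metric (e.g. accuracy),
    treating the test items as a sample from a larger population of items."""
    scores = np.asarray(per_example_scores, dtype=float)
    n = len(scores)
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample test items with replacement
        boot_means[b] = scores[idx].mean()
    alpha = (1 - level) / 2
    return scores.mean(), np.quantile(boot_means, [alpha, 1 - alpha])

# Fake data: 500 test items, each scored 0/1 for "correct."
fake_correct = rng.binomial(1, 0.72, size=500)
acc, (lo, hi) = bootstrap_interval(fake_correct)
print(f"accuracy {acc:.3f}, 95% bootstrap interval ({lo:.3f}, {hi:.3f})")

# Seed-to-seed variation: an approximate lower bound on Monte Carlo uncertainty.
# `train_and_eval(seed)` is a placeholder for your actual training + eval run;
# refitting with different seeds varies initialization, data shuffling, etc.
# accs = [train_and_eval(seed=s) for s in range(10)]
# print(np.mean(accs), np.std(accs))
```

The bootstrap here only captures variation from the finite evaluation set; the seed-replication step only captures computational randomness. Neither gets at uncertainty arising from how the data were collected, which is where the explicit or implicit modeling comes in.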