Delicate language for talking about statistical guarantees
Statistical Modeling, Causal Inference, and Social Science 2024-12-20
This is Jessica. As I’ve been reading papers on uncertainty quantification in machine learning (related to topics like conformal prediction or calibration), I’ve been reflecting on the language choices that authors make.
It started when I was reading a paper about an approach that uses set-aside data labeled with human reports to learn a threshold on a function that predicts the “alignment” (e.g., similarity) of predictions with those reports. They do this to ensure that new predictions are sufficiently aligned with the human reports. There are many new methods for quantifying prediction uncertainty or controlling prediction risk in ML that have a similar flavor — they involve learning either thresholds (e.g., conformal prediction) or adjustments (e.g., posthoc calibration algorithms) on held-out calibration data that are then applied to predictions. Some of them come with “statistical guarantees”: If the data used for calibration is a good approximation of future data (e.g., is i.i.d. or at least exchangeable), then we can expect calibration in the future.
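To make the flavor of these methods concrete, here is a minimal sketch of split conformal prediction for regression, the simplest version of the learn-a-threshold-on-calibration-data recipe. All names and the toy data-generating process are illustrative, not from any particular paper; the key point is that the threshold is just an empirical quantile of held-out scores, and the coverage statement leans entirely on the calibration data being exchangeable with future data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem (purely illustrative).
x_train = rng.uniform(0, 10, 200)
y_train = 2 * x_train + rng.normal(0, 1, 200)
x_cal = rng.uniform(0, 10, 100)
y_cal = 2 * x_cal + rng.normal(0, 1, 100)

# "Model" fit on the training split: here, just a least-squares line.
a, b = np.polyfit(x_train, y_train, 1)
predict = lambda x: a * x + b

# Nonconformity scores on the held-out calibration split: absolute residuals.
scores = np.abs(y_cal - predict(x_cal))

# The data-driven threshold is the ceil((n+1)(1-alpha))/n empirical
# quantile of the calibration scores.
alpha = 0.1
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
threshold = np.quantile(scores, q_level, method="higher")

# Prediction interval for a new point: predict(x) +/- threshold.
# IF future data is exchangeable with the calibration data, this interval
# covers the true y with probability at least 1 - alpha -- marginally,
# over draws of the calibration set, not for any single realized dataset.
x_new = 5.0
interval = (predict(x_new) - threshold, predict(x_new) + threshold)
```

Note that the “guarantee” lives entirely in the comment near the end: nothing in the code enforces exchangeability, which is exactly the assumption that can quietly fail in practice.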
Anyway, I paused when I got to this sentence:
new reports are selected if their predicted alignment scores exceed a data-driven threshold, which is delicately set with Dcal such that the FDR is strictly controlled at the desired level
because I found the choice of the phrase “delicately set” interesting. It seems like a good way to describe this kind of procedure: naturally there are assumptions on which the expectation of controlling the FDR depends, like statistical exchangeability between the calibration data and new instances, that might not hold. By using a term like “delicately” the authors signal some fragility.
It makes me think of all of the other words that could be used to signal tentativeness or epistemic uncertainty when discussing such methods: “tenuously,” “gingerly,” “warily,” or, if you want to sound like you just studied for the GRE, “frangibly.” Maybe a bit of poetic license would not be a bad thing when describing our expectations about such methods. It would certainly make for a more entertaining reading experience.
However, the fact that the above statement continues “… such that the FDR is strictly controlled at the desired level” makes it sound a lot less fallible. This phrasing is more typical of the unconditional way in which such methods are often described.
Here’s another example, where the guarantee is about human behavior:
we develop an efficient and near-optimal search method to find the conformal predictor under which the expert is guaranteed to achieve the greatest accuracy with high probability.
But can we ever really guarantee that future human behavior will remain consistent with the past?
As I was writing this post, I noticed that Andrew has previously blogged on his dislike of the term “guarantees” because it hides assumptions. I agree that it’s problematic to talk about guarantees unconditionally, as if they cannot be violated even when the methods are applied in practice. This is why the unqualified statements about calibration being sufficient for good decisions bug me so much. Andrew’s post concludes by saying that it’s not so bad to talk about guarantees if you’re writing to a statistics and ML audience, since they may expect that.
But the more time I spend reading papers involving uncertainty or calibration guarantees, the more I find myself wishing there was a little more care taken in how their statistical properties are discussed, particularly when talking about practically-oriented methods that assume future data is like past data. Having poked around recent papers in domains outside of CS on topics like calibration and conformal prediction, it seems that the nuance is not necessarily obvious when people go to apply the methods in practice. So these days I’m trying to be more careful with my own word choices. For example, beyond trying to avoid flagrant use of the word guarantee, I try to add the word “expected” more often. E.g., “under which the expert is expected to achieve the greatest accuracy…” or “delicately selected so that the FDR is expected to be strictly controlled at the desired level.” It’s minor but I think it better emphasizes that our assumptions may not hold, whether because they were unrealistic and/or because in practice we’re dealing with singular events rather than hypothetical replications.
Another idea would be to deliberately choose weaker synonyms for guarantees, like “this method promises that the expected FDR will be less than…” Sounds a little odd. But maybe it’s a good thing for descriptions of methods to take the reader back to childhood when their so-called friend “promised” not to tell on them or their parent “promised” not to snoop in their room! Or as Paul Alper suggests, say “statistical oomph” instead of guarantees, which I think captures the same vibe of excitement and clout but sounds so much sillier.