Validating language models as study participants: How it’s being done, why it fails, and what works instead

Statistical Modeling, Causal Inference, and Social Science 2025-12-19

This is Jessica. Earlier this year, I started paying attention to proposals to use LLMs to simulate participants in surveys and behavioral experiments. The idea is that LLMs can be prompted with experiment or survey instructions and a participant persona (e.g., a demographic description), making it possible to simulate target human samples without the cost and headache of recruiting real people. A number of papers have pointed to promising results, such as LLM responses that are moderately to highly correlated with human study results, to argue that LLMs could transform behavioral science: by increasing sample sizes, generating missing counterfactuals, letting us learn about hard-to-reach populations or ethically fraught situations, etc.

The obvious elephant in the room is validation: how do you establish that conclusions drawn about human behavior from analyses that substitute or augment human data with LLM outputs are valid, in the sense that using LLM outputs doesn’t systematically bias your ability to estimate the target human parameter (mean effects, regression coefficients, etc.)? Many papers deal with this question in a loose, heuristic way. For example, authors will demonstrate partial replication of some human results with LLMs, then go on to argue that LLMs could be used to approximate human behavior more broadly in that domain. Some attempt to codify this kind of heuristic validation.

So we decided to write something specifically about validating LLM study participants: what the landscape of approaches people are taking looks like, and of these, which meet minimum requirements for getting valid downstream parameter estimates. David Broska, Huaman Sun, Aaron Shaw, and I write:

A growing literature presents large language model systems (LLMs) as a transformative data source for simulating human behavior in experiments. However, arriving at credible conclusions when substituting these AI surrogates for human participants requires showing that LLMs can approximate the target human responses or parameters. We characterize approaches to validation in the literature. A heuristic approach argues for generalization based on strong resemblance between humans and AI surrogates on a subset of tasks, often in combination with ex-ante “repair strategies” designed to reduce LLM-contributed biases through prompt engineering or model fine-tuning. However, the lack of accounting for remaining bias precludes the researcher from attaining basic validity guarantees customary for confirmatory research. In contrast, a statistical calibration approach uses auxiliary human data and statistical adjustment to account for discrepancies between observed and predicted responses. Calibration approaches help ensure that use of AI surrogates does not mislead researchers who claim to ultimately target human behavior. They are not, however, a panacea; even when assumptions hold, benefits may be modest as a result of high variability in behavioral targets. Restricting LLM use to predicting effects in discovery-oriented research avoids validation challenges, but requires caution in interpreting effect sizes. We propose ways that LLMs could help researchers address pervasive blind spots in design and analysis if used instead to challenge researchers’ expectations about effect size and analysis.


Heuristic validation

We first characterize ways that authors are using a “validate-then-simulate” pattern to demonstrate face validity: by showing that the direction or significance of effects is preserved, or that LLM results are highly correlated with human results, or that it’s hard to statistically distinguish LLM responses from human ones, etc. The problem is that arguments based on face validity can’t provide the kind of guarantees we typically expect of inferential methods. We don’t expect to be able to present study results as causal estimates if we haven’t attempted to meet basic independence conditions, or to interpret OLS coefficients if we’re not willing to make assumptions about linearity and residuals. Similarly, we should not be content to trust behavioral estimates that involve LLMs unless either a) sufficient conditions for valid inference have been demonstrated, or b) authors have explicitly taken steps to account for bias in downstream analyses.

Why heuristic validation doesn’t work

We summarize what conditions would have to hold for direct substitution, based on recent theoretical treatments. For example, Ludwig et al. (2025) use an econometric model to define two conditions that must hold for directly substituting LLM labels for human labels to serve as a general method. In a nutshell, unless you’re willing to make strong assumptions about how the prompts (e.g., experimental scenarios or survey instruments) that you’re studying are sampled from the space of all relevant prompts, you need to ensure 1) that there is no leakage between the experimental scenarios or survey instruments you’re studying and the model training data, and 2) that the relevant conditions (i.e., moment criteria) that need to hold for your analysis to identify the target human parameters still hold when you plug in LLM responses instead. Essentially, the potential for LLM errors to be correlated with covariates you care about means that even if the LLM shows very small bias in predicting the human responses, your estimates of population means, regression coefficients, or other target parameters could still be pretty far off.
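To make that last point concrete, here’s a toy simulation of my own (made-up numbers, not from any of the papers): the simulated LLM’s responses are essentially unbiased on average, but its errors are correlated with a covariate, and the regression coefficient you’d estimate from its responses is badly attenuated.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A covariate (say, standardized age) and a human response with true slope 0.30.
x = rng.normal(size=n)
y_human = 0.30 * x + rng.normal(scale=1.0, size=n)

# Hypothetical LLM responses: near-zero bias on average, but the error
# is correlated with x (the LLM is a bit "flatter" than people are).
y_llm = y_human - 0.10 * x + rng.normal(scale=0.2, size=n)

def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x)

print("average bias of LLM responses:", y_llm.mean() - y_human.mean())  # ~0
print("slope from human responses:", ols_slope(x, y_human))             # ~0.30
print("slope from LLM responses:", ols_slope(x, y_llm))                 # ~0.20
```

Averaged over everyone the LLM looks fine; conditional on the covariate it isn’t, and the coefficient inherits the problem.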

Calibration as alternative

Consequently, showing that an LLM adequately replicates target human behavior on some subset of tasks isn’t sufficient evidence for expecting generalization to related tasks. But that’s okay: a number of recently proposed approaches demonstrate calibration, where a small amount of jointly (human and LLM) labeled data is used to adjust an estimator so that LLM responses can be integrated without biasing downstream parameter estimates. For example, previously we discussed my co-author David Broska’s work on Mixed Subjects Design, which uses a prediction-powered inference approach to define a hybrid estimator. The PPI estimator is centered on the same parameter targeted by the human subjects estimator (e.g., a population mean, a regression coefficient, etc.) but, when properly tuned, is designed to be at least as precise thanks to the inclusion of the LLM responses. Other approaches to augmented estimation, like design-based supervised learning, make slightly stronger assumptions about how the human-labeled instances are sampled.
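For intuition, here is a minimal sketch of a prediction-powered estimate of a population mean (my own variable names and simplifications; see the PPI/PPI++ papers and Broska et al. for the full treatment, including regression coefficients and confidence intervals):

```python
import numpy as np

def ppi_mean(y_human, f_human, f_llm_only, lam=1.0):
    """Prediction-powered estimate of a population mean.

    y_human    : human responses on the small jointly labeled sample
    f_human    : LLM predictions for those same participants/items
    f_llm_only : LLM predictions for a (much larger) LLM-only sample
    lam        : weight on the LLM predictions; lam=0 recovers the
                 classical human-only estimator, and PPI++ tunes lam
    """
    y = np.asarray(y_human, dtype=float)
    f = np.asarray(f_human, dtype=float)
    g = np.asarray(f_llm_only, dtype=float)
    n, N = len(y), len(g)

    # The "rectifier" re-centers the estimate on the human target,
    # regardless of how biased the LLM predictions are.
    rectifier = np.mean(y - lam * f)
    estimate = lam * g.mean() + rectifier

    # Uncertainty combines both samples (treated as independent).
    se = np.sqrt(lam**2 * g.var(ddof=1) / N + np.var(y - lam * f, ddof=1) / n)
    return estimate, se
```

The point is that the correction term keeps the estimator centered on the human parameter no matter how wrong the LLM is; the LLM responses can only buy (or fail to buy) precision.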

Of course, the existence of these approaches doesn’t automatically mean that it’s worth your time to figure out how to work with LLM subjects in your studies. That’s a harder call. There are lots of proposals around what we call “repair strategies” (tips on prompt engineering, model choice, model fine-tuning, etc.) to improve the fidelity of LLM predictions to human responses. But we should also keep in mind that how much more we can learn about humans by incorporating LLM subjects will depend in part on how noisy the target human behavior is. In fields like psychology, the noise ceiling (i.e., the maximum predictive performance any model could reach) is low, and so we shouldn’t be too surprised if adding large numbers of LLM responses yields only modest gains. This tracks with what we see in the handful of existing demonstrations of calibration approaches like PPI on human subjects data: gains in effective sample size are only up to about 15%, even when the amount of LLM-simulated data is much, much larger than the human sample.
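A quick back-of-the-envelope calculation (mine, under idealized assumptions: unlimited LLM-only data and an optimally tuned weight) suggests why. In that limit, the variance of the prediction-powered mean estimate shrinks by roughly a factor of (1 - rho^2), where rho is the correlation between LLM predictions and human responses on the jointly labeled sample, so the effective sample size multiplier is about 1 / (1 - rho^2):

```python
# Effective-sample-size multiplier under idealized assumptions
# (unlimited LLM-only data, optimally tuned weight): 1 / (1 - rho^2),
# where rho = corr(LLM predictions, human responses) on the labeled sample.
for rho in [0.2, 0.35, 0.5, 0.8]:
    print(f"rho = {rho:.2f} -> effective sample size multiplier = {1 / (1 - rho**2):.2f}")
```

If the noise ceiling caps rho somewhere around 0.3 to 0.4, a gain of 10 to 15% is roughly what you’d expect, no matter how much simulated data you pile on.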

Reserving LLMs for discovery of effects, or as a design adversary

Another proposal that has been circulating is to rely on LLM simulations purely to aid discovery of effects: simulate a bunch of experimental scenarios to figure out where there’s an association, or stress test your study instrument to improve question wording or other components. There’s nothing wrong with this, but we should keep in mind that many behavioral studies are targeting small effects. Again, just a little bit of bias can make effect estimates from LLM simulations misleading if the bias correlates with covariates you care about. And this is not at all implausible: it happens, for example, when LLMs are systematically more accurate for certain types of participants, or for certain scenarios that are closer to what’s in the training data.
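Here’s a toy illustration (made-up numbers) of how that can bite in a screening exercise: twenty candidate scenarios with small true effects, and a simulated LLM whose error grows for scenarios that are less well covered by its training data. The scenarios the simulation flags as most promising need not be the ones with the largest true effects, and some estimated effects can come out with the wrong sign.

```python
import numpy as np

rng = np.random.default_rng(1)
n_scenarios = 20

# Small true effects across candidate scenarios.
true_effect = rng.normal(loc=0.05, scale=0.05, size=n_scenarios)

# LLM bias is larger for scenarios farther from its training data
# (proximity = 1 means well covered, 0 means poorly covered).
proximity = rng.uniform(size=n_scenarios)
bias = 0.10 * (1 - proximity) * rng.choice([-1, 1], size=n_scenarios)
llm_effect = true_effect + bias

top_true = set(np.argsort(true_effect)[-5:])
top_llm = set(np.argsort(llm_effect)[-5:])
print("overlap between true and LLM top-5 scenarios:", len(top_true & top_llm), "of 5")
print("scenarios whose estimated effect has the wrong sign:",
      int(np.sum(np.sign(llm_effect) != np.sign(true_effect))))
```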

Something we propose that I haven’t seen come up is the idea of using the LLM to challenge or interrogate your expectations during design and interpretation of study results. There’s a lot of focus on using LLM subjects to make predictions about what’s probable, but they could also be used simply to explore possibilities, and to help you better understand the kinds of implicit commitments you’re making in design or interpretation of results. For example, fake data simulation is a powerful practice for designing better experiments, but I get the sense that many human subjects researchers don’t do it by default. As a natural language programming interface, LLMs could make this a lot easier: you ask them to simulate data under different expectations about effect magnitude and variance, to help you reason about sample size and other aspects of study design. Inspired by Causal Quartets, you could employ them to help you think through implications of heterogeneity for the estimate you’re using to power your study or that you observed when you ran the study with humans. Similarly, they could help by generating deviant but plausible data scenarios for stress testing your planned modeling approach. It’s a slight shift in framing to see them as tools for stimulating imagination and interrogating assumptions instead of as soothsayers, but it could be a productive one.
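To make the fake-data idea concrete, here is the sort of power check I have in mind, written out by hand (an LLM could just as easily generate and run something like it, under whatever assumptions about effect magnitude and variability you ask it to entertain; the specific numbers below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulated_power(n_per_arm, effect, sd, reps=2000):
    """Power of a two-arm comparison, via a simple z-test on the difference in means."""
    hits = 0
    for _ in range(reps):
        control = rng.normal(0.0, sd, n_per_arm)
        treated = rng.normal(effect, sd, n_per_arm)
        diff = treated.mean() - control.mean()
        se = np.sqrt(treated.var(ddof=1) / n_per_arm + control.var(ddof=1) / n_per_arm)
        hits += abs(diff / se) > 1.96
    return hits / reps

# Vary the assumptions you would otherwise leave implicit.
for effect in [0.1, 0.2, 0.4]:
    for sd in [0.8, 1.2]:
        print(f"effect = {effect}, sd = {sd}: power ~ {simulated_power(100, effect, sd):.2f}")
```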