LLMs as behavioral study participants

Statistical Modeling, Causal Inference, and Social Science 2025-05-29

This is Jessica. There is lots of talk these days about how generative models will transform social science: think using LLMs to simulate human behavior for purposes like designing and conducting social science studies, doing marketing research, or testing social systems, persuasive messaging, interfaces, and so on. Much of this is still contentious, but some consensus is emerging around how LLMs can help versus hinder progress in social science research.

Here I’m mostly going to consider using generative models in experimental studies of human behavior. I initially started paying attention to this because of the potential trainwreck vibes from a statistical validation perspective… social scientists start relying on simulations with models that even the computer scientists don’t fully understand to learn about people, what could go wrong? There is lots of room for overinterpreting noise. But given the value (economic, epistemic, etc.) of being able to predict what people will do, I think it’s worth considering what a rigorous methodology for this kind of simulation science would look like. As I write this, I’m on my way to a workshop on LLM-based behavioral simulations, which will hopefully give me more food for thought on what this does and does not look like.

At a high level, the emerging consensus is that LLMs may never make good substitutes for human study participants in the sense of letting us learn new things about human behavior without having to deal with humans. How can you discover new facts about the cognitive or social world before you’ve established how well generative models align with human behavior on those research questions? Even if your LLM-based simulation ends up aligning well with human results, you will ultimately have to collect enough human results to validate that you have a good simulation. It’s like trying to figure out how to generate synthetic data from real survey results before those results have been analyzed: you won’t know what is most important to preserve. At a more basic level, how do you expect to accurately simulate conditional distributions you’ve never observed? LLMs may do well on imputation-style problems (like filling in missing answers to some survey questions, or inferring what responses would have been to questions that hadn’t been asked yet), but we should not necessarily expect good performance when we try to extrapolate to new scenarios.

One problem is that there will be biases that can compound across simulations. Similar to how a huge sample size doesn’t necessarily give you better estimates in behavioral science when selection bias is operating, the fact that LLM simulation results often differ systematically from human results (in particular, they tend to be more “extreme”) creates a risk that we mislead ourselves. For example, LLM simulations often produce lower variance and diversity and more pronounced stereotypes than human results. Many of the studies I’ve seen people try to replicate with LLMs so far focus on replicating average treatment effects that are assumed to be constant, usually with representative U.S. samples. There’s some evidence that in such cases, effect sizes identified using generative models can be highly correlated with those based on humans, including for studies that could not be in the models’ training data. But when the focus is on heterogeneous effects or particular groups, things may get more distorted. In general, bias is challenging given that when we run experiments we are typically trying to understand the “edges” of some effect, e.g., by controlling for confounders while imposing interventions that we think will maximize the target difference. This isn’t to say that some progress can’t be made, e.g., by doing a bunch of generalization tests to identify scenarios where generative agents provide a useful impressionistic summary of human behavior, and being careful not to take them too far out of that neighborhood. And bigger/newer models appear to reduce distortions. But for brand new scenarios we’re stuck with heuristic approaches.
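To make the correlation-versus-distortion point concrete, here’s a minimal sketch (mine, not from any of the papers I’m alluding to) of the kind of check you could run if you had paired human and LLM effect estimates from matched studies. The numbers are simulated placeholders; the point is that a high correlation can coexist with systematic exaggeration, which a simple calibration regression flags.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical standardized effect sizes for 20 matched studies: one estimate
# from the human experiment, one from an LLM simulation of the same design.
human = rng.normal(loc=0.3, scale=0.2, size=20)
llm = 0.3 + 1.5 * (human - 0.3) + rng.normal(scale=0.05, size=20)  # exaggerated effects

# High correlation can coexist with systematic distortion.
corr = np.corrcoef(human, llm)[0, 1]

# Simple calibration check: regress LLM estimates on human estimates.
# A slope well above 1 means the simulation exaggerates effects even
# though it preserves their ordering.
slope, intercept = np.polyfit(human, llm, 1)

print(f"correlation: {corr:.2f}, calibration slope: {slope:.2f}")
```

In this made-up example the correlation is near 1 while the slope is around 1.5, which is exactly the pattern you’d want to catch before treating the simulation as a stand-in for human data.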

There is some interest in using LLMs as a first-pass tool to prioritize existing results in need of replication. I’m not very enthusiastic about this either, given the shaky foundations of replication as a hallmark of good science. The LLM-based replications of social science studies I’ve seen so far are a mixed bag, with the same kinds of arbitrariness you see in human-based replications when it comes to determining how well a behavior has replicated (e.g., asking whether the direction of the effect is the same while ignoring big differences in magnitude). Some interesting questions come up about what it means to perform a valid replication with LLMs given their propensity for memorization, and about how to validate that an agent’s actions are a meaningful simulation of human behavior. Do we need to establish that an agent’s actions are self-consistent or “intentional” in the same way that human actions can be before we can trust them? For example, I’ve seen authors ask LLM agents to explain their behavior as evidence that a replication is trustworthy, similar to how you might elicit a human participant’s reasoning. This kind of thing is a big fraught pile of worms. There are also many additional degrees of freedom relative to human replications, because you often have to adjust the experimental procedure to get reasonably robust results from LLMs, due to sensitivity to prompt variations and strategies. There is some advice emerging on how to identify more consistent prompts, prompt with demographic attributes, aggregate simulation results, etc. See e.g., here.
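As a rough sketch of the kind of aggregation that advice points toward: average responses over prompt paraphrases, sampled demographic personas, and repeated draws, rather than trusting a single prompt. The `query_llm` function below is a hypothetical placeholder for whatever model call you’d actually use, and the paraphrases and personas are made up for illustration.

```python
import statistics

def query_llm(prompt: str) -> float:
    """Hypothetical placeholder: call your model and parse a numeric rating."""
    raise NotImplementedError("replace with a real model call")

# Made-up paraphrases of one survey item and made-up demographic personas.
PARAPHRASES = [
    "How much do you agree that {statement}? Answer 1-7.",
    "On a scale from 1 to 7, rate your agreement: {statement}.",
    "From 1 (not at all) to 7 (completely), how much do you agree? {statement}",
]
PERSONAS = [
    "You are a 34-year-old teacher from Ohio.",
    "You are a 62-year-old retired engineer from Texas.",
    "You are a 25-year-old graduate student from California.",
]

def simulate_item(statement: str, n_draws: int = 3) -> float:
    """Average over paraphrases, personas, and repeated samples instead of
    trusting any single prompt."""
    responses = []
    for persona in PERSONAS:
        for template in PARAPHRASES:
            prompt = persona + " " + template.format(statement=statement)
            responses.extend(query_llm(prompt) for _ in range(n_draws))
    return statistics.mean(responses)
```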

Where LLM simulations have clearer promise is in exploratory theory building. John Horton’s (now old) 2023 paper describes how LLMs can play the role of economic models, where before conducting experiments you use them to understand the space of possible effects under particular assumptions about behavior. For example, maybe you want to understand how agents that are myopic in a particular way solve a negotiation problem, to help you brainstorm what you might see in a study with humans. Or maybe you want help thinking through the range of effects you might see from different types of participants in a study you’re designing, to help you with sample size calculations. If we agree that thorough piloting is generally a good thing for behavioral research, then LLMs can be helpful in this regard, and potentially even more so when they are also used more directly to assist brainstorming, e.g., by generating hypotheses about important covariates.
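For instance, here’s a minimal sketch (my own construction, not from Horton’s paper) of how effect sizes from LLM pilot runs might feed a sample size calculation: take the range of effects the simulated participants produce under different behavioral assumptions and power the human study for the conservative end of that range. The pilot effect sizes are hypothetical.

```python
from scipy.stats import norm

# Hypothetical standardized effect sizes (Cohen's d) from LLM pilot runs under
# different assumptions about how the simulated participants behave.
pilot_effects = [0.45, 0.30, 0.22]
d = min(pilot_effects)  # power the study for the conservative end of the range

alpha, power = 0.05, 0.80
z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
# Normal-approximation sample size for a two-arm comparison of means.
n_per_arm = 2 * (z_a + z_b) ** 2 / d ** 2
print(f"~{n_per_arm:.0f} participants per arm to detect d = {d} with 80% power")
```

The pilot would only ever give you a rough sense of plausible effect magnitudes; the human study is still what tells you whether any of them are real.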

There’s a tension in some of this literature between focusing on how well LLMs capture idealized behavior (e.g., economic rationality, predicting future events) versus how well they can be used to simulate human behavior with all of its biases. There are downstream use cases for both. If I’m having a generative agent negotiate deals or manage my wealth on my behalf, I want it to be more strategic and rational than I am, whereas if I am a company trying to understand what my customers value most in my products, I will prefer realism. The LLM-as-idealized-human approach is cleaner to study, as we have a better sense of what we’re looking for, but also potentially more limited (since there is a lot we can do with non-generative, e.g., “rational,” models in this regard).