Let’s analyze how we analyze!

Statistical Modeling, Causal Inference, and Social Science 2025-03-25

This post is by Lizzie. I also thought of calling this post ‘Lynx, hares and the utility of many-analyst studies,’ but I thought that schmeared things more than I meant to. The photo is one I took in Paris last fall. Why a photo of Paris? Why not.

At an evening discussion event recently I made a passing comment about how I wish ecology knew more assuredly what causes the lynx-hare population dynamics (including their magical Lotka-Volterra cycles). Almost immediately several folks swept in to ask how I could think this when we *do* know. And then they proceeded to each share a different mechanistic (if you will) model, including:

– Maternal effects such that scared bunnies (aka hares) keep hiding out from lynx well after lynx population numbers have plummeted.
– Something to do with willow (the trophic level below the hares).
– Disease.
– Someone mentioned the moon (but then someone emailed me later about sunspots, maybe the moon was the sunspots misremembered?).
– And (I think) more hypotheses that I cannot now recall.
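For reference, the ‘magical’ cycles I mentioned above are the ones produced by the textbook Lotka-Volterra predator-prey equations, written here in my own shorthand (H for hare density, L for lynx density; r, a, b, m are positive rates for hare growth, predation, conversion of eaten hares into lynx, and lynx mortality):

```latex
\begin{aligned}
\frac{dH}{dt} &= rH - aHL \\
\frac{dL}{dt} &= baHL - mL
\end{aligned}
```

The hypotheses above are, roughly, competing stories about which real-world mechanisms actually generate dynamics that these two simple equations can only mimic.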

I was not surprised by this profusion of answers. One reason is that I have tried this before and got a similar set of assured and wildly diverging answers. The other is that ecology seems to me an inchoate field where we’re still struggling to find general theories and to sort out what is going on. I like to think we’re making progress, but I am rather sure that it is currently slow.

This all brings me vaguely around to a recent paper that a reader pointed Andrew to, and that Andrew then pointed me to: a new many-analyst study by Gould and colleagues (lots of colleagues!), ‘Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology’.

As the title suggests, the paper gives the same data and research questions to a set of teams (who signed up to do this, and to be co-authors for their work) and then sees how different the ‘answers’ are. I say ‘answers’ because obviously a pain point in this sort of study is deciding what to extract from any analysis and deem an answer. The authors tried to think about this in advance, as they pre-registered their study, but I was amazed at how painful I found both the presence of the pre-registration (or, to be more specific, the continual sidebars on ‘deviations from pre-registration’) and sifting out what the authors themselves were trying to tell me. Before I get to the latter, though, here’s my favorite deviation:

Some analysts had difficulty implementing our instructions to derive the out-of-sample predictions, and in some cases (especially for the Eucalyptus data), they submitted predictions with implausibly extreme values. We believed these values were incorrect and thus made the conservative decision to exclude out-of-sample predictions where the estimates were > 3 standard deviations from the mean value from the full dataset provided to teams for analysis.
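For concreteness, the exclusion they describe amounts to something like the sketch below; the function name and the NumPy framing are mine, not theirs.

```python
import numpy as np

def exclude_extreme_predictions(predictions, full_data_values, n_sd=3.0):
    """Drop out-of-sample predictions more than n_sd standard deviations
    from the mean of the response in the full dataset given to the teams."""
    preds = np.asarray(predictions, dtype=float)
    vals = np.asarray(full_data_values, dtype=float)
    mu, sd = vals.mean(), vals.std()
    keep = np.abs(preds - mu) <= n_sd * sd
    return preds[keep]

# Hypothetical usage:
# kept = exclude_extreme_predictions(team_predictions, eucalyptus_response)
```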

I only skimmed the paper, but I think the abstract captures some of what confused me:

For both datasets, we found substantial variation in the variable selection and random effects structures among analyses, as well as in the ratings of the analytical methods by peer reviewers, but we found no strong relationship between any of these and deviation from the meta-analytic mean. In other words, analyses with results that were far from the mean were no more or less likely to have dissimilar variable sets, use random effects in their models, or receive poor peer reviews than those analyses that found results that were close to the mean.
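To make ‘deviation from the meta-analytic mean’ concrete, here is a minimal sketch under my own assumptions: a plain inverse-variance-weighted mean standing in for whatever meta-analytic model they actually fit, with each team’s analysis scored by its absolute distance from that mean.

```python
import numpy as np

def deviation_from_meta_mean(effects, std_errors):
    """Inverse-variance-weighted mean of the teams' effect sizes, plus each
    team's absolute deviation from it (a stand-in, not the paper's model)."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    meta_mean = np.sum(weights * effects) / np.sum(weights)
    return meta_mean, np.abs(effects - meta_mean)

# Hypothetical usage with three teams' standardized effects and standard errors:
# meta_mean, deviations = deviation_from_meta_mean([0.1, -0.3, 0.5], [0.05, 0.10, 0.20])
```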

This led me down a brief path of skimming some other many-analyst studies or viewpoints (and the religion one cited within). At the end of the path I realized that these studies are not simply trying to point out a potentially concerning level of variation in the answers obtained from different teams using the same data for the same question, but something more.

Some suggest this should be a new way to do science, as if doing this will give us more confidence in the answers, while others (including Gould et al.) seem to be doing something else: trying to find out which types of analyses give ‘better’ answers. The authors don’t clearly say they’re doing this (at least not on a quick skim), but why else would so much of the paper be devoted to peer reviews of the analyses, or to dissections of whether the presence of ‘random effects’ tends to give more similar answers? (I felt there was someone behind this particular analysis who either thinks hierarchical models are ‘better’ or has heard that claim and thinks otherwise.)

Either one of these aims makes me want much more than any of these studies currently offers. For the former (‘let’s add this to how we do science’), I wanted more information on how exactly the authors think science advances; effectively, if you’re telling me this will improve science, I think you first owe me a good model of how science works so I can better assess your claim. For the latter (‘which way is better?’), I obviously wanted simulated data, where we could find out which methods got closer to the truth, because we would actually know the truth.

This got me musing about what my colleagues from the other night would think of the challenge of simulating data for a many-analyst ecology study. I wonder if some would bristle that we don’t know enough to simulate such data, but if that’s true, I think we have a real problem. And asking ecologists to simulate data to then hand out for a many-analyst study seems to me perhaps a better place to focus our efforts, if we want to improve ecology, than asking more people to do many-analyst studies on different datasets with different questions (further, I’d be more interested in a many-analyst study on simulated data).
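To be concrete about what I mean by simulated data with a known truth, here is a toy sketch of my own (nothing from Gould et al., and every number in it is made up): simulate noisy predator-prey counts from a discrete-time Lotka-Volterra-style model with parameters we choose, hand the counts out, and score each team’s estimates against those parameters.

```python
import numpy as np

rng = np.random.default_rng(42)

# "True" parameters we know because we chose them (all values are toy choices).
r, a, b, m = 0.6, 0.025, 0.2, 0.4   # hare growth, predation, conversion, lynx mortality
dt, n_steps = 0.1, 500

H, L = 40.0, 9.0                     # initial hare and lynx densities
hares, lynx = [], []
for _ in range(n_steps):
    # Euler step of the Lotka-Volterra equations with multiplicative process noise.
    dH = (r * H - a * H * L) * dt
    dL = (b * a * H * L - m * L) * dt
    H = max((H + dH) * rng.lognormal(0.0, 0.02), 1e-3)
    L = max((L + dL) * rng.lognormal(0.0, 0.02), 1e-3)
    hares.append(H)
    lynx.append(L)

# Observation error on top: these counts are what the analysis teams would receive.
obs_hares = rng.poisson(hares)
obs_lynx = rng.poisson(lynx)

def score(estimates, truth=(r, a, b, m)):
    """Absolute error of a team's estimates of (r, a, b, m) against the known truth."""
    return np.abs(np.asarray(estimates, dtype=float) - np.asarray(truth))
```

The point is not this particular toy model; it is that with simulated data the scoring is no longer up for debate, because we set the truth ourselves.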

I think the value of many-analyst studies lies elsewhere: first, to show the variation (in which case, we do not need so many of them), and then, perhaps, to get better models for specific applications. This is where it occurred to me that the cherry blossom competition I run with Jonathan Auerbach and David Kepplinger is a many-analyst study of sorts, but in a very different spirit. We’re looking for more predictive models! We’re not disturbed by the variation in the way that I think I was supposed to be by some of the many-analyst studies I read (skimmed).

Indeed, the most interesting part of Gould et al. to me was the discussion, where issues were raised about whether the research questions given to the teams were too vague and whether readers should even be surprised by this variation. They write:

We recognize that some researchers have long maintained a healthy level of skepticism of individual studies as part of sound and practical scientific practice, and it is possible that those researchers will be neither surprised nor concerned by our results. However, we doubt that many researchers are sufficiently aware of the potential problems of analytical flexibility to be appropriately skeptical. We hope that our work leads to conversations in ecology, evolutionary biology, and other disciplines about how best to contend with heterogeneity in results that is attributable to analytical decisions.

I see their point, but then I wonder how well they added to the conversation when I have no idea why they did so many tests or what exactly their question(s) were. I also think any such conversation should be framed with a solid grounding in both how science works (who knows, but there are theories and ideas, none of which were really mentioned) and how statistics works. Both of these areas should give all scientists a good dose of skepticism, so do we really need many-analyst studies for that? I would hope they’re offering something more.

Three side notes:

1) This work reminds me of the debate over whether the bird Parus major was declining due to mistiming with its caterpillar food resource under anthropogenic warming. The Dutch team said it was happening in their birds, the British team said it was not happening in their woods, but whenever I worked on them together in a hierarchical model they looked the same to me (see interactions 221 for the Dutch and 180 for the UK in Fig 1B here). And then this paper came out after the Dutch team had more data.

2) Another big take-home for me from the discussion of Gould et al. was the search for a simple answer to how to fix this mess, as opposed to acknowledging that there is no single or easy thing that would fix it, as often mentioned on this blog.

3) I was also sort of disturbed by how these papers seemed to think about model averaging and model comparison, but I will save that for another post.