Concerns about the z-curve method

Statistical Modeling, Causal Inference, and Social Science 2026-01-10

This is Erik: A few weeks ago, Andrew blogged about a paper by Richard Morey and Clint Davis-Stober entitled “On the poor statistical properties of the P-curve meta-analytic procedure”. Andrew quoted Morey:

We make the point that many of these techniques were never vetted by experts, and often are just “verified” by a few simulations. For tests, this is not good enough, but nevertheless these methods can get popular because (in my opinion) they tell people what they want to hear.

I believe that another meta-analytic method called z-curve (Brunner and Schimmack, 2020; Bartos and Schimmack, 2022; Schimmack and Bartos, 2023) has similar problems.

Recall that the signal-to-noise ratio (SNR) in statistics is the ratio of the true effect to the standard error of its estimator. If we make the “usual assumptions”, then the z-statistic (the estimator divided by its standard error) has a normal distribution with mean SNR and standard deviation 1.

If we have a collection of studies, then each z-statistic equals the study’s SNR plus standard normal noise, so the distribution of the z-statistics is the convolution of the distribution of the SNRs with the standard normal distribution. If we’ve estimated the distribution of the z-statistics, we can get the distribution of the SNRs by deconvolution. Deconvolution is known to be very unstable. That means that we need very many data points (studies) or very strong assumptions – preferably both – to get an accurate result.
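For concreteness, here is a minimal R sketch of that convolution, using the same two-point SNR distribution as in my simulation further below:

```r
# Minimal sketch: z = SNR + standard normal noise, so the distribution of z
# is the SNR distribution convolved with N(0, 1).
set.seed(1)
n   <- 1e5
# Two-point SNR distribution (the same one used in the simulation below):
# 25% null effects (SNR = 0), 75% with SNR = 4
snr <- sample(c(0, 4), n, replace = TRUE, prob = c(0.25, 0.75))
z   <- snr + rnorm(n)                       # z | SNR ~ N(SNR, 1)

hist(z, breaks = 100, freq = FALSE, main = "Distribution of z-statistics")
curve(0.25 * dnorm(x, 0) + 0.75 * dnorm(x, 4), add = TRUE, lwd = 2)
```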

The z-curve method is based on the assumption that the absolute values of the SNRs have a discrete distribution supported on 0, 1, 2, …, 6. Note that SNR=0 corresponds to “null effects”. To circumvent the effects of selection on statistical significance, z-curve uses only the absolute values of the z-statistics that exceed 1.96 to estimate the 7 probabilities. Deconvolution is bad enough, but it gets much worse when only such a small part of the data is used. This makes uncertainty quantification especially important.
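Concretely, the model says that a significant |z| comes from a mixture of folded normals with means 0 through 6 (sd 1), truncated at 1.96. A minimal sketch of that density (the weights below are hypothetical and this is not the package’s internal code):

```r
# Sketch of the implied model: significant |z| values follow a mixture of
# folded normals with means 0, 1, ..., 6 and sd 1, truncated at 1.96.
crit <- qnorm(0.975)                                    # ~ 1.96

d_abs <- function(x, mu) dnorm(x, mu) + dnorm(x, -mu)   # density of |z| when SNR = mu
power <- function(mu) pnorm(-crit, mu) + 1 - pnorm(crit, mu)   # P(|z| > 1.96)
d_sig <- function(x, mu) ifelse(x > crit, d_abs(x, mu) / power(mu), 0)

# Hypothetical weights on the seven truncated components (for illustration only)
w <- c(0.25, 0, 0, 0, 0.75, 0, 0)
d_mix <- function(x) Reduce(`+`, lapply(0:6, function(mu) w[mu + 1] * d_sig(x, mu)))

curve(d_mix(x), from = crit, to = 8, lwd = 2,
      xlab = "significant |z|", ylab = "density")
```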

The z-curve method as implemented in the R package zcurve provides (among other things) estimates and confidence intervals of the expected discovery rate (EDR) and the expected replicability rate (ERR). I believe these are defined in my terminology as

  • EDR = P(|z| > 1.96)
  • ERR = P(|z_repl| > 1.96 and z_repl × z > 0 | |z| > 1.96)

The zcurve package also provides an estimate of “Soric’s FDR” but that is just a simple (monotone) transformation of the EDR.
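To make these definitions concrete, here is a short sketch that computes the true EDR, ERR and Sorić’s FDR bound for a known SNR distribution. I use the two-point mixture from my simulation below; the formulas are just my reading of the definitions above, with alpha = 0.05 for the Sorić bound.

```r
# True EDR, ERR and Soric's FDR bound for a known two-point SNR distribution
# (25% null effects with SNR = 0, 75% with SNR = 4); my own reading of the
# definitions above, not output of the zcurve package.
crit <- qnorm(0.975)                  # ~ 1.96
snr  <- c(0, 4)
w    <- c(0.25, 0.75)

p_hi <- 1 - pnorm(crit, snr)          # P(z >  1.96) for each component
p_lo <- pnorm(-crit, snr)             # P(z < -1.96) for each component
pow  <- p_hi + p_lo                   # P(|z| > 1.96) for each component

EDR <- sum(w * pow)                   # P(|z| > 1.96)

# P(replication significant in the same direction and original significant),
# divided by P(original significant)
ERR <- sum(w * (p_hi^2 + p_lo^2)) / EDR

# Soric's bound on the FDR, a monotone transformation of the EDR
soric <- (1 / EDR - 1) * 0.05 / 0.95

round(c(EDR = EDR, ERR = ERR, soric = soric), 3)
# roughly: EDR 0.747, ERR 0.963, soric 0.018
```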

It should be clear that z-curve’s estimate of P(SNR=0) (i.e., the proportion of “null effects”) will be especially noisy, because studies with SNR=0 contribute relatively little to the significant z-statistics. Consequently, the estimate of the EDR will be very noisy too. To quantify this uncertainty, the authors use the bootstrap. By default, the zcurve function provides “robust” intervals by adding 5 percentage points to the confidence interval of the EDR and 3 percentage points to the confidence interval of the ERR. This approach is “verified” by a few simulations. Unfortunately, even the adjusted intervals do not provide correct coverage.
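If I read this adjustment correctly, it amounts to widening the bootstrap percentile interval by a fixed margin. A tiny sketch (robust_ci is a hypothetical helper, not a function in the zcurve package):

```r
# Hypothetical helper illustrating the fixed widening described above
# (not part of the zcurve package): stretch a bootstrap percentile interval
# by a fixed margin and clip it to [0, 1].
robust_ci <- function(ci, margin) {
  c(max(0, ci[1] - margin), min(1, ci[2] + margin))
}

robust_ci(c(0.60, 0.80), 0.05)   # a made-up EDR interval, widened by 5 points
robust_ci(c(0.90, 0.97), 0.03)   # a made-up ERR interval, widened by 3 points
```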

To illustrate the problem, I’ve done a small simulation. I generate samples of size n=100 from the two-component mixture 0.25×N(0,1) + 0.75×N(4,1). In 40 out of 100 simulations, the null component is missed entirely; in other words, P(SNR=0) is estimated to be zero. The problem is easy to see from a typical example (see the figure below): the null component is essentially “invisible” in the observations that exceed 1.96.
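Here is a rough sketch of a single replication (a minimal version, not my exact simulation code), assuming the zcurve() and summary() interface of the zcurve R package. Repeating this many times and checking whether the reported intervals contain the true EDR and ERR gives the coverage figures below.

```r
# Minimal sketch of one simulation replication (not the exact code behind the
# numbers reported below). Requires the zcurve package.
library(zcurve)
set.seed(123)

n   <- 100
# z-statistics from the two-component mixture 0.25 N(0,1) + 0.75 N(4,1)
snr <- sample(c(0, 4), n, replace = TRUE, prob = c(0.25, 0.75))
z   <- snr + rnorm(n)

fit <- zcurve(z)   # by default only the z-statistics with |z| > 1.96 are used
summary(fit)       # compare the reported EDR/ERR intervals with the true
                   # values for this mixture (EDR ~ 0.75, ERR ~ 0.96)
```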

The consequence is that across 100 simulations, the coverage of the 95% “robust” confidence intervals is incorrect. In particular,

  • The coverage of the EDR is 65% (CI: 55%-74%).
  • The coverage of the ERR is 100% (CI: 96%-100%).
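(The uncertainty intervals attached to these coverage percentages are presumably binomial; for what it’s worth, exact binomial intervals for 65/100 and 100/100 successes reproduce them.)

```r
# Quick check: exact binomial (Clopper-Pearson) intervals for 65/100 and
# 100/100 successes match the coverage intervals quoted above.
binom.test(65, 100)$conf.int    # approx. 0.55 to 0.74
binom.test(100, 100)$conf.int   # approx. 0.96 to 1.00
```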

I shared my concerns with the authors Ulrich Schimmack, Jerry Brunner and Frantisek Bartos. Bartos responded that he generally agrees with the simulation, but notes that the coverage does come close to nominal when the sample size is increased from n=100 to n=1000. I responded that the zcurve function accepts as few as 10 significant z-statistics, and that most meta-analyses don’t have 1000 studies. Bartos wrote:

To be fair, I agree that we should’ve been explicit about the recommended sample size in the original article (and probably add a warning to the method if used with less than XXX estimates). I didn’t anticipate that people would apply z-curve to small meta-analyses. In my mind, the purpose of the tool (including our examples) is larger-scale meta-epidemiological projects.

Bartos also noted:

With respect to the simulations – although apparently imperfect – I still think that we did actually a much better job than most published methods. (…) The commonly used alternatives for the same purpose at the time were p-curve (for ERR and EDR) and Jager and Leek’s mixture model (for FDR) which both have much worse properties in my opinion. As such, I view this development as a step forward.

In my opinion, statistical methods should be reliable when their assumptions are met. I don’t think unreliable methods should be used because no better methods are available.