Interrogating p-values
Statistical Modeling, Causal Inference, and Social Science 2013-06-04

This article is a discussion of a paper by Greg Francis for a special issue of the Journal of Mathematical Psychology, edited by E. J. Wagenmakers. Here’s what I wrote:
Much of statistical practice is an effort to reduce or deny variation and uncertainty. The reduction is done through standardization, replication, and other practices of experimental design, with the idea being to isolate and stabilize the quantity being estimated and then average over many cases. Even so, however, uncertainty persists, and statistical hypothesis testing is in many ways an endeavor to deny this, by reporting binary accept/reject decisions.
Classical statistical methods produce binary statements, but there is no reason to assume that the world works that way. Expressions such as Type 1 error, Type 2 error, false positive, and so on, are based on a model in which the world is divided into real and non-real effects. To put it another way, I understand the general scientific distinction of real vs. non-real effects but I do not think this maps well into the mathematical distinction of θ=0 vs. θ≠0. Yes, there are some unambiguously true effects and some that are arguably zero, but I would guess that the challenge in most current research in psychology is not that effects are zero but that they vary from person to person and in different contexts.
But if we do not want to characterize science as the search for true positives, how should we statistically model the process of scientific publication and discovery? An empirical approach is to identify scientific truth with replicability; hence, the goal of an experimental or observational scientist is to discover effects that replicate in future studies.
The replicability standard seems to be reasonable. Unfortunately, as Francis (in press) and Simmons, Nelson, and Simonsohn (2011) have pointed out, researchers in psychology (and, presumably, in other fields as well) seem to have no problem replicating and getting statistical significance, over and over again, even in the absence of any real effects of the size claimed by the researchers.
. . .
As a student many years ago, I heard about opportunistic stopping rules, the file drawer problem, and other reasons why nominal p-values do not actually represent the true probability that observed data are more extreme than what would be expected by chance. My impression was that these problems represented a minor adjustment and not a major reappraisal of the scientific process. After all, given what we know about scientists’ desire to communicate their efforts, it was hard to imagine that there were file drawers bulging with unpublished results.
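[A quick aside that is not part of the journal discussion: the damage done by an opportunistic stopping rule is easy to see in a small simulation. In the sketch below the data are pure noise, but the analyst runs a t-test after every batch of ten observations and stops at the first p < 0.05; the realized false-positive rate comes out well above the nominal 5%. The batch size, sample-size cap, and number of simulations are arbitrary.]

# Simulation of "opportunistic stopping" under a true null: peek at the
# p-value after every batch and stop as soon as it dips below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, batch, max_n = 2000, 10, 200
false_positives = 0

for _ in range(n_sims):
    data = np.empty(0)
    while data.size < max_n:
        data = np.concatenate([data, rng.normal(0.0, 1.0, batch)])  # null is true: mean 0
        if stats.ttest_1samp(data, 0.0).pvalue < 0.05:  # "significant" -- stop and report
            false_positives += 1
            break

print(f"nominal alpha: 0.05, realized false-positive rate: {false_positives / n_sims:.3f}")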
More recently, though, there has been a growing sense that psychology, biomedicine, and other fields are being overwhelmed with errors (consider, for example, the generally positive reaction to the paper of Ioannidis, 2005). In two recent series of papers, Gregory Francis and Uri Simonsohn and collaborators have demonstrated too-good-to-be-true patterns of p-values in published papers, indicating that these results should not be taken at face value.
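[Another aside: as I understand it, Francis's check formalizes the "too good to be true" intuition roughly as follows: estimate the power of each reported experiment and ask how probable it is that every single one of them would have reached statistical significance. If that joint probability is tiny, the published record looks cleaner than the underlying experiments could plausibly have produced. A stripped-down sketch of the arithmetic, with invented power values that are not taken from any of the papers under discussion:]

# "Excess significance" sketch: if each of k independent studies has the
# (hypothetical) power listed below, the chance that all k reach p < .05
# is just the product of the powers.
estimated_powers = [0.55, 0.48, 0.62, 0.50, 0.45]  # made-up post-hoc power estimates

p_all_significant = 1.0
for power in estimated_powers:
    p_all_significant *= power

print(f"P(all {len(estimated_powers)} studies significant) = {p_all_significant:.3f}")
# Here about 0.04 -- small enough to suggest unreported failed studies or
# flexible analysis choices rather than a string of clean successes.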
. . .
Although I do not know how useful Francis’s particular method is, overall I am supportive of his work as it draws attention to a serious problem in published research.
Finally, this is not the main point of the present discussion, but I think my anti-hypothesis-testing stance is stronger than that of Francis (in press). I disagree with the following statement from that article:
For both confirmatory and exploratory research, a hypothesis test is appropriate if the outcome drives a specific course of action. Hypothesis tests provide a way to make a decision based on data, and such decisions are useful for choosing an action. If a doctor has to determine whether to treat a patient with drugs or surgery, a hypothesis test might provide useful information to guide the action. Likewise, if an interface designer has to decide whether to replace a blue notification light with a green notification light in a cockpit, a hypothesis test can provide guidance on whether an observed difference in reaction time is different from chance and thereby influence the designer’s choice.
I have no expertise on drugs, surgery, or human factors design and so cannot address these particular examples—but, speaking in general terms, I think Francis is getting things backward here. When making a decision, I think it is necessary to consider effect sizes (not merely the possible existence of a nonzero effect) as well as costs. Here I speak not of the cost of hypothetical false positives or false negatives but of the direct costs and benefits of the decision. An observed difference can be relevant to a decision whether or not that difference is statistically significant.
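To make this concrete, here is a sketch of that kind of calculation for an invented version of the cockpit example (the effect estimate, its standard error, and the dollar figures are all hypothetical): the observed difference and its uncertainty feed directly into an expected-net-benefit comparison, and no significance threshold appears anywhere.

# Toy decision analysis with made-up numbers: an estimated 8 ms reduction in
# reaction time with a 12 ms standard error (not "statistically significant")
# can still clearly favor switching once costs and benefits are spelled out.
import numpy as np

rng = np.random.default_rng(1)
effect_draws_ms = rng.normal(8.0, 12.0, 100_000)  # uncertainty about the true effect

value_per_ms = 200.0   # assumed value of shaving 1 ms off reaction time
switch_cost = 1000.0   # assumed one-time cost of changing the light

expected_net_benefit = value_per_ms * effect_draws_ms.mean() - switch_cost
prob_switch_pays_off = (value_per_ms * effect_draws_ms > switch_cost).mean()

print(f"expected net benefit of switching: {expected_net_benefit:,.0f}")
print(f"P(switching comes out ahead): {prob_switch_pays_off:.2f}")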