Worry about correctness and repeatability, not p-values

Win-Vector Blog 2013-04-05

In data science work you often run into cryptic sentences like the following:

Age adjusted death rates per 10,000 person years across incremental thirds of muscular strength were 38.9, 25.9, and 26.6 for all causes; 12.1, 7.6, and 6.6 for cardiovascular disease; and 6.1, 4.9, and 4.2 for cancer (all P < 0.01 for linear trend).

(From “Association between muscular strength and mortality in men: prospective cohort study,” Ruiz et. al. BMJ 2008;337:a439.)

The accepted procedure is to recognize “p” or “p-value” as shorthand for “significance,” keep your mouth shut and hope the paper explains what is actually claimed somewhere later on. I am going to go out on a limb here and say these type of sentences are gibberish and nobody actually understands them. From experience we know generally what to expect, but it isn’t until we read further we can precisely pin down what is actually being claimed. This isn’t the authors’ fault, they are likely good scientists, good statisticians, and good writers; but this incantation is required by publishing tradition and reviewers.

We argue you should worry about the correctness of your results (how likely a bad result could look like yours, the subject of frequentist significance) and repeatability (how much variance is in your estimation procedure, as measured by procedures like the bootstrap). p-values and significance are important in how they help structure the above questions.

The legitimate purpose of technical jargon is to make conversations quicker and more precise. However, saying “p” is not much shorter than saying “significance” and there are many different procedures that return p-values (so saying “p” does not limit you down to exactly one procedure like a good acronym might). At best the savings in time would be from having to spend 10 minutes thinking which interpretation of significance is most approbate to the actual problem at hand versus needing a mere 30 seconds to read about the “p.” However, if you don’t have 10 minutes to consider if the entire result a paper is likely an observation artifact due to chance or noise (the subject of significance) then you really don’t care much about the paper.

In our opinion “p-values” have degenerated from a useful jargon into a secretive argot. We are going to discuss thinking about significance as “worrying about correctness” (a fundamental concern) instead of as a cut and dried statistical procedure you should automate out of view (uncritically copying reported p’s from fitters). Yes “p”s are significances, but there is no reason to not just say what sort of error you are claiming is unlikely.

We started with bad writing on significance, let’s share an example of good writing:

Suppose that in an agricultural experiment four different chemical treatments of soil produced mean wheat yields of 28, 22, 18, and 24 bushels per acre, respectively. Is there a significant difference in these means, or is the observed spread due simply to chance?

(From Schaum’s Outlines, “Statistics” Fourth Edition, Murray R. Spiegel and Larry J. Stephens, McGraw-Hill, 2008.)

From the above paragraph you have some idea of what is going on and why you should care. Imagine you were asked to choose one of these soil treatments for your farm. You would want the one that actually is the best, not one that managed to fool you once. You care about actual correctness. The mathematician Gian-Carlo Rota called out an earlier version of this text as being the one that finally explained to him what the purpose of analysis of variance was. Things like this are why this book has been in continuous print since 1961.

From this same book:

When the first edition was introduced in 1961, the p-value was not as widely used as it is today, because it is often difficult to determine without the aid of computer software. Today p-values are routinely provided by statistical software packages since the computer software computation of p-values is often a trivial matter.

People knew how to do statistics properly before 1961, so they probably had interesting methods that work around the need for explicit p-values. In this article we will demonstrate R code to take all of the small steps to organize our data and produce summaries demonstrating the nature, correctness and repeatability of experimental results. There are things you should look for (small counts, large error bars and overlapping distributions) that give diagnostic clues long before you make it to the p-values.

There is nothing intrinsically wrong with p-values, but I hold that the slavish copying of them from computer results into reports has distracted us away from thinking about important issues of correctness, repeatability and significance.

What is significance? Significance is usually an estimate of the probability of some event you don’t want to happen looking like a favorable event you saw in your experimental data. It is a frequentist idea and you introduce a straw-man explanation of the data (that you hope to falsify or knock down) called the “null hypothesis.” For the soil treatment example it could be the probability that what we are identifying as the best soil treatment is actually an inferior soil treatment that “got lucky” during the measurements. We wish to show that the odds of this kind of error are low, and such low odds of error are called “high significance.” High significance does not guarantee you are not making a mistake (for one your modeling assumptions could be wrong). However, low significance is usually very bad. It says even assuming everything is the way you hoped, there is still a significant chance you are wrong. That should matter to you.

Even if you think significance doesn’t matter to you, it will matter to your clients, managers and peers. If you promote work that you has low significance (or worse yet, you haven’t checked some form of significance) you are promoting work that not only may fail, you are promoting work that may have an obvious large chance of failure. That can go over poorly in a project post-mortem. You should always work with the feeling that someday “the truth will out.” Not only will more data be collected in the future, but it will be obvious if you had enough evidence to justify the decisions you made earlier in a project. For example in Bob Shaw’s short story “Burden of Proof” a crime is committed in front of a device that will play the crime back years in the future. In the story there is no way to get the playback sooner, but knowing that someday the truth will out intensifies the detective dilemma: they can’t just make a convincing case they must make the right case. In the real world you can’t always be right, so you should measure and share your level of uncertainty. This why you calculate significance and why you want to effectively communicate what significance means to your possibly non-technical partners.

Let’s work through the significance claim from the paper relating muscular strength to mortality rates that we started with. The claim of the paper is that in a study of 10,000 people over an average follow-up period of 19 years that a statistically significant predictor of mortality rate was weakness in certain muscular strength tests. The paper goes on to claim that this relation is significant even when accounting for age and disease conditions. We can check what the paper claims on the relation between strength and mortality (as we can see the raw numbers in their reported tables), but we can’t check if the effect remains when controlling for other conditions as we don’t have enough of their data to reproduce that second analysis. So let’s end this article on a concrete note by exploring the relation between strength and mortality using the statistical package R.

From the paper’s tables 1 and 2 we can find the number of people in each of the muscular strength groups (called “lower”, “middle”, and “upper) and the number of deaths in each of these groups. The raw numbers are (typed by hand into R):

> # from tables 1 and 2 of http://www.bmj.com/content/337/bmj.a439> d = data.frame(MuscularStrength=c('lower','middle','upper'),      count=c(2920,2919,2923),deaths=c(214,143,146))> # make upper strength the reference level> d$MuscularStrength = relevel(as.factor(d$MuscularStrength),    ref='upper')> print(d)  MuscularStrength count deaths1            lower  2920    2142           middle  2919    1433            upper  2923    146

The obvious thing to look at is the death rates:

> # quickly look at rates and typical deviations (sqrt of variance)> #  http://en.wikipedia.org/wiki/Binomial_distribution> d$MortalityRate = d$deaths/d$count> d$MortalityRateDevEst = sqrt(d$MortalityRate*(1-d$MortalityRate)/d$count)> d$MortalityRateDevBound = sqrt(0.25/d$count)> print(d)  MuscularStrength count deaths MortalityRate MortalityRateDevEst MortalityRateDevBound1            lower  2920    214    0.07328767         0.004822769           0.0092529152           middle  2919    143    0.04898938         0.003995090           0.0092545003            upper  2923    146    0.04994868         0.004029222           0.009248166

It looks like the lower strength group has a mortality rate of about 7% over the 19 year interval which is above the 5% rate we see for the other two groups. This result is likely significant because we see each of these estimates has a standard error of around 0.5%, much lower than the 2% difference between groups we are seeing. In relative terms it looks like you can cut your mortality rate by 40% by not being in the lower muscle performance group. Notice also that these rates are radically different from the 3.89%, 2.59% and 2.66% reported in the summary. This is because we are using different units (our case deaths per study individual and theirs deaths per year) and the actual study is breaking deaths up into different causes.

From a business perspective at this point, using the rule of thumb that at least 3 standard errors is significant, we are done. The table above is exactly the right thing to show the client. They can see the important things: the size of the study, the general mortality rates, the size of the effect, and the likely error in measurement. The likely errors in measurement we have added to the table are the likely errors we would see in re-running a study such as this one. That is: it describes the distribution of results a second researcher trying to reproduce our result might see even when our estimates were exactly right. If there estimated deviations are large then we know even if we were right our work is unlikely to be reproduced with sample sizes similar to what we used (which is bad). If the estimated deviations are small then assuming we are right others should be able to reproduce our work (which is good).

We can produce a graphical version of this table as follows.

# plot likely observed distributions of deaths, assuming # rates are exactly as measured> library('ggplot2') # plotting> plotFrame = c()> deaths=0:sum(d$deaths)> for(i in 1:(dim(d)[[1]])) {  row = as.list(d[i,])  plotData = data.frame(deaths=deaths,     DeathRate=deaths/row$count,     MuscularStrength=row$MuscularStrength,     density=dbinom(deaths,size=row$count,prob=row$deaths/row$count))  plotFrame = rbind(plotFrame,plotData)}> ggplot() +    geom_line(data=plotFrame,      aes(x=DeathRate,y=density,color=MuscularStrength,      linetype=MuscularStrength)) +   geom_vline(data=d,aes(xintercept=deaths/count,color=MuscularStrength,      linetype=MuscularStrength))

This produces the following plot which shows, assuming we have measure the mortality rates for the three groups correctly, how follow-up studies (using the same data-set size we used) would likely look. The important thing to notice is how little the 3 distribution groups overlap with each other (and how little of each of them crosses the center line of the others).

DeathRate

At this point we have addressed the variance of our estimation procedure (an issue the bootstrap method also works on) which speaks to the repeatability of our work. We have not yet touched on the correctness (such as measuring the variance of a null hypothesis would help with) or specific-ness of our result (such as forming Bayesian estimates of the posterior distributions of the three mortality rates would help with).

While we haven’t quite gotten to the traditional frequentist significance interpretation yet, we are very close. In the frequentist notion of statistics instead of assuming our measurements are correct (which is a dangerous habit to get into) we pick something we don’t want to happen and try to show our data is very unlikely under such an assumption. For example we could assume that lower physical strength group has the same mortality rate (around 5%) as the other groups (which is bad as implies our 7% measurement is then wrong). The frequentist significance then calculates how often a 5% mortality rate group would return a sample with a 7% mortality rate. This fact is already represented on our graph has how much of the middle and lower distributions cross the lower-groups center line (in this case almost none of the density does this). So we do in fact have strong evidence of both a reproducible and significant result already in our graph and table.

In this case it is okay to leave significance un-calculated and informal. The important follow up questions are ones of practicality and causation: can we change people’s strength group and would changing their strength group change their mortality rate? These are the important questions, but they would have to be answered by new experiments as they can’t be addressed with just the data at hand. And that is why I don’t suggest spending too long on significance with this data, observational error (the topic of significance) in this particular case it is only one worry among many.

Of course not all studies work out this well. Many experiments generate results that are near the boundary of significance and non-significance. So we very much need to know how to precisely estimate significance. The easiest way to do this is to apply the appropriate model (in this case logistic regression) and copy the significances from the model’s supplied summary report. Before we can fit a model we must re-shape our data into an appropriate format. We could use the melt and cast operators from Hadley Wickham‘s reshape2 package (as illustrated in Your Data is Never the Right Shape), but we will instead use the join operator in his plyr package. We feel in this case the notion of joining more accurately expresses the fluid transformations you need to be able to perform on data. The commands to create the new data format are as follows:

# convert data into a longer format and get at same facts as in a model> library('plyr')    # joining data> outcomes = data.frame(outcome=c('survived','died'))> outcomes$dummy = 'a'> d$dummy='a'> joined = join(d,outcomes,by=c('dummy'),type='full',match='all')> joined$count = ifelse(joined$outcome=='survived',                      joined$count-joined$deaths,                      joined$deaths)> data = subset(joined,select=c(MuscularStrength,count,outcome))> print(data)  MuscularStrength count  outcome1            lower  2706 survived2            lower   214     died3           middle  2776 survived4           middle   143     died5            upper  2777 survived6            upper   146     died

And now that we have prepared, the step of interest is essentially a one-liner:

> model = glm(outcome=='died'~MuscularStrength,   weights=data$count,family=binomial(link='logit'),data=data)> summary(model)Call:glm(formula = outcome == "died" ~ MuscularStrength, family = binomial(link = "logit"),     data = data, weights = data$count)Deviance Residuals:      1       2       3       4       5       6  -20.30   33.44  -16.70   29.37  -16.87   29.58  Coefficients:                       Estimate Std. Error z value Pr(>|z|)    (Intercept)            -2.94552    0.08491 -34.691  

What we are looking for is the value in the column named "Pr(>|z|)" for the row MuscularStrengthlower. This is the so-called p-value (notice not even the fitting software is so rude as to use just "p") and in this case it is 0.000226 which says there is about a one in four thousand chance of seeing a relation this strong between muscular strength and mortality if there were in fact no correlation (the so-called null hypothesis). The result is in fact statistically significant.

As a word of warning look at the deviance and null deviance reported in this model (3831.6 and 3851.3) respectively. Treating deviance as an analogy for variance we see our model explains only 0.5% of the variation outcomes (who dies and who lives). This is why to not rely too much on global variance style measures when evaluating models: they behave too much like accuracy and miss a lot of what is going on. Notice instead this model identifies a group of around 1/3rd of the subjects that if muscle weakness was in fact causing mortality (it might not be) perhaps exercises could be used to reduce the mortality rate of this group by 40%. The hope is this would be an opportunity to cut down the mortality rate of the overall population by as much as 13%, which would be huge. The original authors know this, this is why they made it the title of their study. And this is the kind of thing you miss if you just look at modeling summaries instead of getting your hands in the data.

There are a few additional points to share here. In principal a statistician would see the estimates and standard errors as mere details of parameterization to be integrated out along the path to computing significance. To a business the actual values of the estimates, relative rates in different groups and sizes of expected errors are all of vital interest. Significance is a important check if these values are right. But we also want the actual values available for discussion and possible use. We feel working more with the data in small fluid steps is of more benefit to the data scientist and client that submitting data hopefully (in the right format) to a monolithic tool that quickly returns a single answer without exposing the intermediate tables, graphs and calculations. You lose a lot of opportunities to notice an anomaly when you don't look at the intermediate results.

In conclusion: you don't directly care about p-values; you care about correctness, repeatability and significance.