Baking priors

Win-Vector Blog 2015-10-13

There remains a bit of two-way snobbery: frequentist statistics is what we teach (as so-called objective statistics remain the same no matter who works with them), and Bayesian statistics is what we do (as it tends to directly estimate the posterior probabilities we are actually interested in). Nina Zumel hit the nail on the head when she wrote an article explaining that the appropriate type of statistical theory depends on the type of question you are trying to answer, not on your personal prejudices.

We will discuss a few more examples that have been on our minds, including one I am calling “baking priors.” This final example will demonstrate some of the advantages of allowing researchers to document their priors.

Figure 1: two loaves of bread.

Terminology

Statistical terminology tends to be technical, and also tends to have some deliberately evocative overlaps with standard English. For this writeup we are going to try to use a few words in a fairly narrow and consistent manner that we will define here.

  • Confidence. We will always use this in the technical frequentist statistical sense, and not in the common English sense of “degree of belief.” The statistical meaning is how frequently an outcome occurs under many repetitions of an experiment. A “confidence interval” is an interval that repeated experiments hit with at least a stated confidence. Confidence intervals are about the frequency of repeated experiments hitting near a given value. Confidence intervals are considered “objective,” but depend on analysis technique. Example use: “90% confidence interval.”
  • Credible interval. This is a Bayesian tool. A credible interval is an interval the quantity to be estimated is thought likely to lie in. For example, the interval 100 to 200 pounds could be a “90% credible interval” for my weight. This differs from a confidence interval in that a credible interval is about a single instance and depends on the estimator’s subjective priors.
  • Plausibility. This will be used to mean how believable data is given a hypothesis. Written as P[evidence | hypothesis]. Notice this is not the same as how likely a hypothesis is given some data.
  • Posteriors. A Bayesian term. These are the probabilities derived from calculation: what is believed after we have combined our priors with data. Written as P[hypothesis | evidence].
  • Priors. A Bayesian term. These are the estimated or believed probabilities that are used in calculation. Written as P[hypothesis]. Bayesian calculation requires them. The point of frequentist calculation is to not need them.
  • Probability. This one doesn’t have a single meaning. In a frequentist framework it always means the limiting frequency of an occurrence as an experiment is repeated again and again (also called an objective probability). In a Bayesian framework it means the estimated chance of a single event happening (also called a subjective belief). In mathematics it is anything that obeys the Kolmogorov probability axioms (independent of interpretation or meaning).
  • Significance. In standard English it roughly means important (what a statistician might call “effect size” or “clinical significance”). In frequentist statistics it is roughly the estimated probability of being wrong (results being called significant when this estimate is small).

The problem

The difference between frequentist statistics and Bayesian statistics isn’t the math or calculations (both fields borrow techniques from each other). For example “Bayes’ Law” is a theorem both in frequentist and in Bayesian statistics. This means both theories allow the equation:

   P[hypothesis | evidence] = P[evidence | hypothesis] * P[hypothesis] / P[evidence]

Frequentists feel it is reaching to think you will always have serviceable estimates of P[hypothesis] (the “priors”), while Bayesian methods depend on having such an estimate (the P[evidence] term can be assumed away in either theory).
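To make the mechanics concrete, here is a minimal sketch of the arithmetic in Python (the numbers and the bayes_posterior helper are purely illustrative, not from any example in this article):

   # Bayes' Law: P[hypothesis | evidence] =
   #   P[evidence | hypothesis] * P[hypothesis] / P[evidence]
   def bayes_posterior(p_evidence_given_h, p_h, p_evidence):
       return p_evidence_given_h * p_h / p_evidence

   # Illustrative numbers: plausibility 0.9, prior 0.1, evidence rate 0.2.
   print(bayes_posterior(0.9, 0.1, 0.2))  # 0.45

The calculation is trivial; the controversy is entirely over where the value of p_h (the prior) is supposed to come from.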

Frequentist statistics versus Bayesian statistics, in law

Legal examples

“Reference guide on statistics.” 2nd ed.

The advertised advantage of frequentist or objective statistics is that they manage to compute things like confidence intervals without requiring any analyst-supplied priors. We want to avoid needing priors as we may not have them, and they may allow the analyst to put their thumb on the scale and tilt the evidence. Without the common frequentist indoctrination you would at first think inference without priors is impossible. But such inference is made possible through some fairly clever and subtle definitions (including so-called sampling distributions). The cleverness comes back to bite in that confidence intervals tend not to be statements about the experiment in hand, but statements about imaginary possible replications of the experiment (a subtly different situation).

Some of the nuance can be found in the excellent D.H. Kaye and D.A. Freedman. “Reference guide on statistics.” 2nd ed. Federal Judicial Center, Washington, D.C. (2000) [pdf]. In this we find:

[Title VII of the Civil Rights Act, exclusionary rules example] Posterior probability. Given the observed disparity of 20 percentage points in the sample, what is the probability that—in the population as a whole—men and women have equal pass rates? This question is of direct interest to the courts. For a subjectivist statistician, posterior probabilities may be computed using “Bayes’ rule.” Within the framework of classical statistical theory, however, such a posterior probability has no meaning.

This is really interesting. The example is cleverly chosen to be examining the plausibility that repeated applications of a rule have been non-exclusionary. It is saying that in the frequentist framework you can’t estimate P[rule non-exclusionary | observed evidence], even though this is “of direct interest to the courts.” You can, however, compute P[observed evidence | rule non-exclusionary] (the plausibility of the observed outcomes under a null hypothesis of “equal pass rates”).

This example is extremely clever. The critique of a repeatedly applied rule is a perfect frequentist question. The frequentist statement “it is implausible that this data came from a fair rule” is an acceptable substitute for the desired answer to “what are the odds the rule was fair” (though the two are not equivalent). There is a footnote that appears to further endorse the frequentist framework:

This classical framework is also called “objectivist” or “frequentist,” by contrast with the “subjectivist” or “Bayesian” framework. In brief, objectivist statisticians view probabilities as objective properties of the system being studied. Subjectivists view probabilities as measuring subjective degrees of belief. Section IV.B.1 explains why posterior probabilities are excluded from the classical calculus, and section IV.C briefly discusses the subjectivist position. The procedure for computing posterior probabilities is presented infra Appendix. For more discussion, see David Freedman, Some Issues in the Foundation of Statistics, 1 Found. Sci. 19 (1995), reprinted in Topics in the Foundation of Statistics 19 (Bas C. van Fraasen ed., 1997).

“Reference Manual on Scientific Evidence”, Third Edition.

But when the above guide was revised into the “Reference Manual on Scientific Evidence”, Third Edition, [pdf] the example changes to:

Computing posterior probabilities. Given the sample data, what is the probability of the null hypothesis? The question might be of direct interest to the courts, especially when translated into English; for example, the null hypothesis might be the innocence of the defendant in a criminal case. Posterior probabilities can be computed using a formula called Bayes’ rule. However, the computation often depends on prior beliefs about the statistical model and its parameters; such prior beliefs almost necessarily require subjective judgment. According to the frequentist theory of statistics, prior probabilities rarely have meaning and neither do posterior probabilities.

This is not nearly as clean an example as the original. The new question (whether a given single defendant is innocent or guilty) is not naturally a frequentist question (there is no natural repetition of experiment). The “innocence of the defendant” is about a single defendant, so it is sophistry to pretend we want a “significance” (which depends on re-sampling or comparing many similar defendants). The question is in fact asking for something like the probability the defendant is guilty or innocent, and in this case sampling confidences and probabilities are not acceptable substitutes. The question is naturally Bayesian.

The question being “Bayesian” doesn’t imply we have a safe source of priors (which is what is needed to make Bayesian reasoning safe). The frequentists are right to worry about the priors and try to exclude them as much as possible. For example: priors are yet another place to get the “multiple comparison” effect wrong (using the prior probability of a given person matching, instead of the larger prior probability of any one of a large set of people matching, example).

The compromise

It looks like the informal legal practice is to prefer frequentist evidence. The judge or jury are presented with frequentist supported plausibilities (statements of the form P[ evidence | hypothesis ]) instead of probabilities of conclusions (statements of the form P[ hypothesis | evidence ]). The application of Bayes’ Law is then left to the intuition of the judge or jury (who likely supply their own priors, good or bad).

Frequentist statistics versus Bayesian statistics, in practice

One of the strengths of the frequentist paradigm is it applies even when you don’t have priors. This is sometimes incorrectly read to include situations where “you don’t have a Bayesian interpretation.” The recent paper Morey, R., Hoekstra, R., Rouder, J., Lee, M., & Wagenmakers, E.-J. (2015). “The fallacy of placing confidence in confidence intervals”, Psychonomic Bulletin & Review, 1–21. doi [pdf] makes this point strongly.

Superficially the paper appears to be a fairly neutral standard attack on “p-value hacking” (where experimenters repeat useless variations of useless experiments until one “fails to fail” and then claim “success”).

But the paper is in fact much more. It works some great examples to help justify the following very strong statements:

Unless an interpretation of the [frequentist confidence] interval can be specifically justified by some other theory of inference, confidence intervals must remain uninterpreted, lest one make arbitrary inferences or inferences that are contradicted by the data.

The “other theory of inference” plausibly being Bayesian. Later we find:

One must first choose which confidence procedure, of many, to use; if one is committed to the procedure that allows a Bayesian interpretation, then one’s time is much better spent simply applying Bayesian theory.

This is definitely taking a side, and is similar to Wald’s “every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures” (Wikipedia).

Frequentist statistics versus Bayesian statistics, in the kitchen

The experiment

My example is baking bread. Figure 1 shows two loaves of bread I baked in Dutch ovens (a technique advocated in The Tartine Bread Book and Flour Water Salt Yeast). All that differed between the loaves is that the right loaf used half the dough (a change I tested in earlier experiments that did not greatly affect surface texture) and was baked without a lid (the effect to be tested, and not what is recommended for the Dutch oven technique). Notice the right loaf has less of the (very desirable) caramelizing, expansion, and splitting; and in fact has a dehydrated, cracker-like appearance in the cuts.

So what we want to test is: does omitting the lid change the bread?

The question

My question is: have I run enough experiments?

Bad theory

The rote knee-jerk answer (under a frequentist regime) is: no.

Under the null hypothesis that removing the lid has no effect we can take the estimated odds of having an undesirable crust as 50/50 (what we empirically observed), unconditioned on the lid. So we would expect the two loaves to differ this much at least half the time. We haven’t learned much, so we didn’t run enough experiments.

This is ignoring my experience baking many loaves with the lid on. Let’s try to improve the analysis by incorporating more experience. I never (in about 20 prior lid-on trials) saw this undesirable cracker-like crust. Since it is (deliberately) hard to incorporate priors into a frequentist analysis, I bring this information in by pooling my data: I pretend I baked 21 loaves with the lid on and 1 with the lid off.

The appropriate unconditioned null hypothesis is now that 1 in 22 loaves spontaneously forms an undesirable cracker crust. So the probability of seeing such a crust in the single lid-off experiment is about 1/22 under the null hypothesis that the lid has no effect. This is a significance of p = 1/22 ≈ 0.045, and I can (weakly) reject the null hypothesis (as all the other scientists use p = 0.05, so I am just delaying publication if I insist on stricter standards). Note the last part is sarcasm, using the common false assumption that the desirable goal of research is publication, not correct results: one of the faults of “significance based science” is the ritualistic use of p = 0.05 without any appropriate pricing of different error types.
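The pooled-data arithmetic above is simple enough to check in a few lines of Python (a sketch of this specific reasoning, not a general-purpose test):

   # Pooled data: 21 lid-on loaves (no cracker crust observed) plus
   # 1 lid-off loaf (cracker crust observed), treated as 22 trials.
   n_trials = 21 + 1

   # Null hypothesis: the lid has no effect, so a cracker crust
   # spontaneously appears at the empirical rate of 1 in 22.
   p_cracker = 1 / n_trials

   # Chance of seeing such a crust in the single lid-off trial under
   # the null hypothesis: the significance of the observation.
   print(round(p_cracker, 3))  # 0.045, (weakly) under the ritual 0.05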

Good theory

I use the fact that Bayesian theory lets me introduce a prior of my own choosing. I then use physical reasoning to derive my prior. I know my oven is reliable (achieves the same temperature on a given setting from use to use) and bread is a large collective of microscopic structures (so any observed macroscopic structure is an aggregate of very many little events and thus subject to strong concentration inequalities).

So my prior for what happens with the lid off is going to encode a lot of uncertainty about what happens when you remove the lid, and a lot of certainty in the repeatability of measurement. I model the probability of the uncovered bread looking significantly different from the covered bread as Beta(a,b) distributed, with a,b picked from a distribution such that:

   variance(Beta(a,b)) = a*b / ((a+b)^2 (a+b+1)) = 1/1000000
   mean(Beta(a,b)) = a/(a+b) = one of {1/1000, 999/1000} with odds 50/50

The first line expresses my physics-based belief that bread baking in my oven is very repeatable. The second line is a convenient way of expressing my uninformedness about what happens when I take the lid off (while again insisting the effect is almost deterministic, and reliably observable). a/(a+b) near zero encodes “no lid-off effect” and a/(a+b) near 1 encodes a “lid-off effect.”
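For concreteness, the two constraints pin down a and b through the standard Beta identities mean = a/(a+b) and variance = mean*(1-mean)/(a+b+1). A small Python sketch (beta_params is my own illustrative helper) backs them out:

   # Solve for Beta(a, b) given a target mean and variance, using
   # mean = a/(a+b) and variance = mean*(1-mean)/(a+b+1).
   def beta_params(mean, variance):
       s = mean * (1 - mean) / variance - 1  # s = a + b
       return mean * s, (1 - mean) * s

   # "No lid-off effect" mode: mean 1/1000, variance 1/1000000.
   print(beta_params(1/1000, 1e-6))    # roughly a = 1, b = 997

   # "Lid-off effect" mode: mean 999/1000, same variance (by symmetry).
   print(beta_params(999/1000, 1e-6))  # roughly a = 997, b = 1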

Aside: the prior, good and bad

This prior does express one more thing, which I did not believe: it says the odds of something weird happening with the lid off are 0.5 (by symmetry). I didn’t believe that. My references say to bake with the lid on. But I am intentionally not going to improve my prior to say there are good odds something weird is going to happen with the lid off. The Bayesian inference is so strong that there is no reason to further speed it up with a “better” prior (though obviously with only one or two experiments we cannot get this by appeal to the Bernstein–von Mises theorem). And it is unsatisfying to encode the result we expect in our prior.

Back to our calculation.

Schematically we have:

   posterior(lid off effect | weird bread) =
      P[weird bread | lid off effect] * prior(lid off effect) / P[weird bread]
   posterior(no lid off effect | weird bread) =
      P[weird bread | no lid off effect] * prior(no lid off effect) / P[weird bread]

Substituting prior(lid off effect) = prior(no lid off effect) = 1/2 and some standard algebra gives us:

   posterior(lid off effect | weird bread) =
      P[weird bread | lid off effect] /
      ( P[weird bread | lid off effect] + P[weird bread | no lid off effect] )
   posterior(no lid off effect | weird bread) =
      P[weird bread | no lid off effect] /
      ( P[weird bread | lid off effect] + P[weird bread | no lid off effect] )

Exactly calculating P[weird bread | lid off effect] and P[weird bread | no lid off effect] is a bit involved, but can be automated. The important thing is we have:

   P[weird bread | no lid off effect] is about 0.01
   P[weird bread | lid off effect] is about 0.99

Yielding

   posterior(lid off effect | weird bread) is about 0.99
   posterior(no lid off effect | weird bread) is about 0.01
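The final update is just the symmetric-prior formula above; a few lines of Python (plugging in the stated approximate plausibilities) reproduce it:

   # Final Bayes update with the symmetric 50/50 prior and the
   # approximate plausibilities from the calculation above.
   p_weird_given_effect = 0.99
   p_weird_given_no_effect = 0.01

   total = p_weird_given_effect + p_weird_given_no_effect
   print(p_weird_given_effect / total)     # about 0.99
   print(p_weird_given_no_effect / total)  # about 0.01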

Under this style of analysis one experiment lets us say with some statistical certainty: leaving the lid off makes for weird bread. Obviously documenting and justifying the calculations is now critical, as so much depends on them. However, the steps taken are all standard, so fully documenting them is possible.

Another thing to check is that the conclusion is not especially sensitive to the exact choice of the distribution of a, b expressing the low variance (as our choice was somewhat arbitrary). Note: “not sensitive” in this case means quantities near zero stay near zero and quantities near 1 stay near 1; that is, 1/1000 and 1/1000000 are both “small” (even though they have a huge relative difference). The conclusion is very sensitive to the bimodal nature of our prior, so that is in fact what is driving the analysis.
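One way to automate such a check is a Monte Carlo sketch of a simplified single-bake model (each mode’s per-bake probability of weird bread drawn from its Beta distribution); this is my own illustrative simplification, not the exact calculation above:

   import random

   def p_weird_given_mode(mean, variance, n_draws=100000):
       # Back out Beta(a, b) for the mode, then average draws of the
       # per-bake probability p ~ Beta(a, b) over many simulations.
       s = mean * (1 - mean) / variance - 1  # s = a + b
       a, b = mean * s, (1 - mean) * s
       return sum(random.betavariate(a, b) for _ in range(n_draws)) / n_draws

   # Vary the "low variance" choice; the posterior stays near 1.
   for variance in (1e-6, 1e-5, 1e-4):
       p_no = p_weird_given_mode(1/1000, variance)
       p_yes = p_weird_given_mode(999/1000, variance)
       print(variance, p_yes / (p_yes + p_no))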

So under the assumption that our stated model of bread physics is correct (an assumption the frequentist analysis did not need, but also did not benefit from) we have definitely run enough experiments to find out whether a/(a+b) is large or small. The results are definitive under fairly plausible assumptions. The assumptions had a huge impact, but are not the most likely source of experimental error (so much else can go wrong).

This is part of the benefit of “journals reject p-values.” Researchers have to state (and open to criticism) what they believe (by way of priors). This can speed up well-designed experiments and cut down on types of research that are thinly veiled p-hacking frauds. It becomes harder to support useless “I have no hypothesis, but I have an instrument and a grant” fishing expeditions.

Conclusion

One of the triumphs of frequentist statistics is the lack of need for priors (which are often not actually available). However, in some situations Bayesian methods can greatly speed up inference, given approximate notional priors.

Repetition of experiment is not the only way to achieve statistical significance. Rote insistence on repetition in the case of modelable outcomes is wasteful and silly (especially for procedures that are assuredly expensive or harmful).