"Bayes and Big Data" (Next Week at the Statistics Seminar)

Three-Toed Sloth 2013-09-15

Attention conservation notice: Only of interest if you care a lot about computational statistics.

For our first seminar of the year, we are very pleased to have a talk which will combine two themes close to the heart of the statistics department:

Steve Scott, "Bayes and Big Data"
Abstract: A useful definition of "big data" is data that is too big to fit on a single machine, either because of processor, memory, or disk bottlenecks. Graphics processing units can alleviate the processor bottleneck, but memory or disk bottlenecks can only be alleviated by splitting "big data" across multiple machines. Communication between a large number of machines is expensive (regardless of the amount of data being communicated), so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication. Consensus Monte Carlo operates by running a separate Monte Carlo algorithm on each machine, and then averaging the individual Monte Carlo draws. Depending on the model, the resulting draws can be nearly indistinguishable from the draws that would have been obtained by running a single-machine algorithm for a very long time. Examples of consensus Monte Carlo will be shown for simple models where single-machine solutions are available, for large single-layer hierarchical models, and for Bayesian additive regression trees (BART).
Time and place: 4–5 pm on Monday, 16 September 2013, in 1212 Doherty Hall

As always, the talk is free and open to the public.
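
To make the "averaging the individual Monte Carlo draws" step in the abstract concrete, here is a minimal sketch of consensus Monte Carlo in Python, for a toy conjugate Gaussian model where each shard's posterior can be sampled exactly. The precision-weighted average is one standard combining rule for roughly Gaussian posteriors; the model, sizes, and variable names are my own illustration, not anything from the talk.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy model: x_i ~ N(theta, sigma^2) with sigma known, prior theta ~ N(0, tau^2).
    sigma, tau = 1.0, 10.0
    x = rng.normal(2.5, sigma, size=100_000)

    S = 10                                   # number of "machines"
    shards = np.array_split(x, S)
    ndraw = 5_000

    def shard_draws(xs):
        # Exact posterior draws on one shard, under the fractional prior
        # N(0, S * tau^2), so the S shard priors multiply back to the full prior.
        prec = len(xs) / sigma**2 + 1.0 / (S * tau**2)   # shard posterior precision
        mean = (xs.sum() / sigma**2) / prec              # shard posterior mean
        return rng.normal(mean, prec ** -0.5, size=ndraw), prec

    draws, precs = zip(*(shard_draws(s) for s in shards))
    w = np.asarray(precs) / sum(precs)       # precision weights
    consensus = w @ np.asarray(draws)        # weighted average, draw by draw

    # Single-machine answer for comparison (exact, since the model is conjugate).
    prec_full = len(x) / sigma**2 + 1.0 / tau**2
    mean_full = (x.sum() / sigma**2) / prec_full
    print(f"consensus: mean {consensus.mean():.4f}, sd {consensus.std():.4f}")
    print(f"exact:     mean {mean_full:.4f}, sd {prec_full ** -0.5:.4f}")

For this conjugate model the weighted average of shard draws reproduces the single-machine posterior exactly; the interesting question, which the talk takes up, is how close the approximation stays for models like hierarchical regressions and BART.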

— A slightly cynical historical-materialist take on the rise of Bayesian statistics is that it reflects a phase in the development of the means of computation, namely the PC era. The theoretical or ideological case for Bayesianism was pretty set by the early 1960s, say with Birnbaum's argument for the likelihood principle [1]. It nonetheless took a generation or more for Bayesian statistics to actually become common. This is because, under the material conditions of the early 1960s, such ideas could only be defended, not applied. What changed this was not better theory, or better models, or a sudden awakening to the importance of shrinkage and partial pooling. Rather, it became possible to actually calculate posterior distributions. Specifically, Monte Carlo methods developed in statistical mechanics permitted stochastic approximations to non-trivial posteriors. These Monte Carlo techniques quickly became (pardon the expression) hegemonic within Bayesian statistics, to the point where I have met younger statisticians who thought Monte Carlo was a Bayesian invention [2]. One of the ironies of applied Bayesianism, in fact, is that nobody actually knows the posterior distribution which supposedly represents their beliefs, but rather (nearly [3]) everyone works out that distribution by purely frequentist inference from Monte Carlo samples. ("How do I know what I think until I see what the dice say?", as it were.)
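
That last point is easy to make concrete: every posterior summary reported from Monte Carlo output is a frequentist point estimate computed from the simulation draws, with its own (frequentist!) standard error. A tiny illustration, using stand-in draws rather than real MCMC output:

    import numpy as np

    rng = np.random.default_rng(1)
    theta = rng.normal(2.5, 0.3, size=10_000)   # stand-in for MCMC draws from a posterior

    # The posterior mean is estimated by a sample average, with a Monte Carlo
    # standard error; with real (autocorrelated) MCMC output you would divide
    # by the effective sample size rather than the raw number of draws.
    post_mean = theta.mean()
    mcse = theta.std(ddof=1) / np.sqrt(len(theta))
    lo, hi = np.quantile(theta, [0.025, 0.975])  # 95% credible interval, itself an estimate

    print(f"posterior mean ~ {post_mean:.3f} (MCSE {mcse:.4f})")
    print(f"95% credible interval ~ ({lo:.3f}, {hi:.3f})")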

So: if you could do Monte Carlo, you could work out (approximately) a posterior distribution, and actually do Bayesian statistics, instead of talking about it. To do Monte Carlo, you needed enough computing power to be able to calculate priors and likelihoods, and to do random sampling, in a reasonable amount of time. You needed a certain minimum amount of memory, and you needed clock speed. Moreover, to try out new models, to tweak specifications, etc., you needed to have this computing power under your control, rather than being something expensive and difficult to access. You needed, in other words, a personal computer, or something very like it.
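
For concreteness, here is about the smallest sampler meeting that description: a random-walk Metropolis algorithm for a one-parameter toy model. Each step costs one pass over the data to evaluate the log prior plus log likelihood; the model and the tuning constants are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(1.0, 1.0, size=500)               # toy data: x_i ~ N(theta, 1)

    def log_post(theta):
        # log prior N(0, 10^2) plus Gaussian log likelihood, up to constants
        return -theta**2 / 200.0 - 0.5 * np.sum((x - theta) ** 2)

    draws = []
    theta = 0.0
    lp = log_post(theta)
    for _ in range(20_000):
        prop = theta + 0.1 * rng.standard_normal()   # random-walk proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:     # Metropolis accept/reject
            theta, lp = prop, lp_prop
        draws.append(theta)

    print(f"posterior mean of theta: about {np.mean(draws[5_000:]):.3f}")  # after burn-in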

The problem now is that while our computers keep getting faster, and their internal memory keeps expanding, our capacity to generate, store, and access data is increasing even more rapidly. This is a problem if your method requires you to touch every data point, and especially a problem if you not only have to touch every data point but also to make all possible pairwise comparisons, because, say, your model says all observations are dependent. This raises the possibility that Bayesian inference will become computationally infeasible again in the near future, not because our computers will have regressed but because the size and complexity of interesting data sets will have outgrown Monte Carlo. Bayesian data analysis would then have been a transient historical episode, belonging to the period when a desktop machine could hold a typical data set in memory and thrash through it a million times in a weekend.
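
Back-of-the-envelope arithmetic shows how fast the pairwise case bites. Take fully dependent observations in, say, the Gaussian-process sense (my example, not necessarily the kind of model meant above): one likelihood evaluation needs the n-by-n covariance matrix, which is O(n^2) memory before you even pay the O(n^3) cost of factoring it.

    # Memory for the n x n covariance matrix alone, in 8-byte floats.
    for n in (10_000, 100_000, 1_000_000):
        mem_gb = 8 * n**2 / 1e9
        print(f"n = {n:>9,}: covariance matrix alone needs {mem_gb:,.1f} GB")

At a million observations the covariance matrix alone is eight terabytes, which is exactly the "too big to fit on a single machine" regime of the abstract.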

Of course, I don't know...

Link:

http://bactra.org/weblog/1046.html
