A tale of two discussion papers

Statistical Modeling, Causal Inference, and Social Science 2013-05-09

Over the years I’ve written a dozen or so journal articles that have appeared with discussions, and I’ve participated in many published discussions of others’ articles as well. I get a lot out of these article-discussion-rejoinder packages, in all three of my roles as reader, writer, and discussant.

Part 1: The story of an unsuccessful discussion

The first time I had a discussion article was the result of an unfortunate circumstance. I had a research idea that resulted in an article with Don Rubin on monitoring the mixing of Markov chain simulations. I knew the idea was great, but back then we worked pretty slowly, so it was a while before we had a final version to submit to a journal. (In retrospect I wish I’d just submitted the draft version as it was.) In the meantime I presented the paper at a conference. Our idea was very well received (I had a sheet of paper so people could write their names and addresses to get preprints, and we got either 50 or 150 (I can’t remember which, I guess it must have been 50) requests), but there was one person who came up later and tried to shoot down our idea. The shooter-down, Charlie Geyer, has done some great work, but in this case he was confused, I think in retrospect because we did not have a clear discussion of the different inferential goals that arose in the sorts of calculations he was doing (inference for normalizing constants of distributions) and those I was doing (inference for parameters in fitted models). In any case, the result was that our new and exciting method was surrounded by an air of controversy. In some ways that was a good thing: I became well known in the field right away, perhaps more than I deserved at the time (in the sense that most of my papers up to then and for the next few years were on applied topics; it was a while before I published other major papers on statistical theory, methods, and computation). But overall I’d rather have been moderately known for an excellent piece of research than very well known for being part of a controversy. I didn’t seek out controversy; it arose because someone else criticized our work without seeing the big picture, and at the time neither he nor I nor my collaborator had the correct synthesis of my work and his criticism.

(Again, the synthesis was that he was trying to get precise answers for hard problems and was in a position where he needed to have a good understanding of the complex distributions he was simulating from, whereas I was working on a method to apply routinely in relatively easy (but nontrivial!) settings. For Charlie’s problems, my method would not suffice because he wouldn’t be satisfied until he was directly convinced that the Markov chain was exploring all the space. For my problems, Charlie’s approach (to run a million simulations and work really hard to understand the computation for a particular model) wasn’t a practical solution. His approach to applied statistics was to handcraft big battleships to solve large problems, one at a time. I wanted to fit lots of small and medium-sized models (along with the occasional big one), fast.)

Anyway, this “different methods for different goals” conversation never occurred, so I left that meeting with an unpleasant feeling that our method was controversial, not fully accepted, and not fully understood. I got it into my head that our article should be published as a discussion paper, so that Geyer and others could comment and we could respond.

But we never had that discussion, not in those words. Neither Charlie nor I nor Don Rubin was aware enough of the sociological context, as it were, so we ended up talking past each other.

In retrospect, that particular discussion did not work so well.

Here’s another example from about the same time: the Ising model. The first plot shows one chain from the Gibbs sampler. After 2000 iterations, it looks like it’s settled down to convergence (we’re plotting the log probability density, a commonly used summary for this sort of distribution).

But then look at the second plot: the first 500 iterations. If we’d only seen these, we might have been tempted to declare victory too early!

At this point, the naive take-home message might be that 500 iterations was not enough but we’re safe with 2000. But no! Even though the last bit of those 2000 looks as stationary and clean as can be, if we start from a different point and run for 2000 iterations, we get something different:

This one looks stationary too! But a careful comparison with the graphs above (even clearer when I displayed these on transparency sheets and overlaid them on the projector) reveals that the two “stationary” distributions are different. The chains haven’t mixed, the process hasn’t converged. R-hat reveals this right away (without even having to look at the graphs, but you can look at the graphs if you want).
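For readers who want to see the mechanics, here is a minimal sketch (not the code from our article) of the two ingredients in this example: a Gibbs sampler for the two-dimensional Ising model, tracking the log probability density as the scalar summary, and the between/within-chain R-hat computation applied to two chains started from different random points. The lattice size, coupling, and number of iterations are illustrative choices, not the settings from the original example.

```python
# Illustrative sketch only: lattice size, coupling, and iteration counts are
# arbitrary choices, not the settings from the original Ising example.
import numpy as np

def gibbs_sweep(spins, beta, rng):
    """One systematic-scan Gibbs sweep over an n x n lattice of +/-1 spins (free boundaries)."""
    n = spins.shape[0]
    for i in range(n):
        for j in range(n):
            nb = 0
            if i > 0:
                nb += spins[i - 1, j]
            if i < n - 1:
                nb += spins[i + 1, j]
            if j > 0:
                nb += spins[i, j - 1]
            if j < n - 1:
                nb += spins[i, j + 1]
            # conditional probability that this spin is +1, given its neighbors
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nb))
            spins[i, j] = 1 if rng.random() < p_up else -1

def log_density(spins, beta):
    """Unnormalized log probability: beta times the sum of neighbor agreements."""
    return beta * (np.sum(spins[:-1, :] * spins[1:, :]) +
                   np.sum(spins[:, :-1] * spins[:, 1:]))

def run_chain(n, beta, iters, seed):
    """Run one chain from a random (overdispersed) starting state; return the log-density trace."""
    rng = np.random.default_rng(seed)
    spins = rng.choice(np.array([-1, 1]), size=(n, n))
    trace = np.empty(iters)
    for t in range(iters):
        gibbs_sweep(spins, beta, rng)
        trace[t] = log_density(spins, beta)
    return trace

def rhat(chains):
    """Potential scale reduction factor for an (m, n_iter) array of a scalar summary,
    one row per chain. Values well above 1 indicate the chains have not mixed."""
    m, n_iter = chains.shape
    W = chains.var(axis=1, ddof=1).mean()         # average within-chain variance
    B = n_iter * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    var_hat = (n_iter - 1) / n_iter * W + B / n_iter
    return np.sqrt(var_hat / W)

if __name__ == "__main__":
    beta, iters = 0.5, 2000   # coupling past the critical point, where the Gibbs sampler mixes very slowly
    chains = np.array([run_chain(32, beta, iters, seed) for seed in (1, 2)])
    # Each trace can look flat and "converged" on its own; the two-chain comparison is what catches it.
    print("R-hat of the log density:", round(float(rhat(chains)), 2))
```

The point of the sketch is the last three lines: each chain’s trace, viewed alone, can look perfectly stationary, but pooling two chains started from different places and comparing between-chain to within-chain variation exposes the lack of mixing.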

As I wrote in our article in Bayesian Statistics 4,

This example shows that the Gibbs sampler can stay in a small subset of its space for a long time, without any evidence of this problematic behavior being provided by one simulated series of finite length. The simplest way to run into trouble is with a two-chambered space, in which the probability of switching chambers is very low, but the above graphs are especially disturbing because the probability density in the Ising model has a unimodal (in the sense that this means anything in a discrete distribution) and approximately Gaussian marginal distribution on the gross scale of interest. That is, the example is not pathological; the Gibbs sampler is just very slow. Rather than being a worst-case example, the Ising model is typical of the probability distributions for which iterative simulation methods were designed, and may be typical of many posterior distributions to which the Gibbs sampler is being applied.

So that was my perspective: start from one point and the chain looks fine; start from two points and you see the problem. But Charlie had a different attitude toward the Ising example. His take on it was: the Ising model is known to be difficult, no one but a fool would try to simulate it with 2000 iterations of a Gibbs sampler. There’s a huge literature on the Ising model already!

Charlie was interested in methods for solving large, well-understood problems one at a time. I was interested in methods that would be used for all sorts of problems by statisticians such as myself who, for applied reasons, bite off more in model than we can chew in computation and understanding. For Charlie with the Ising model, multiple sequences missed the point entirely, as he knew already that 2000 iterations of Gibbs wouldn’t do it. For me, though . . . as an applied guy I was just the kind of knucklehead who might apply Gibbs to this sort of problem (in my defense, Geman and Geman made a similar mistake in 1984, I’ve been told), so it was good to have a practical convergence check.

Again, I think that in our discussion and rejoinder, Don and I presented our method well, in the context of our applied purposes. But I think it would’ve worked better as a straight statistics article. Nothing much useful came out of the discussion because none of us cut through to the key difference in the sorts of problems we were working on.

Part 2: A successful discussion

In the years since then, I’ve realized that communication is more than being right (or, should I say, thinking that one is right). Statistical ideas (and, for that matter, mathematical and scientific ideas in general) are sometimes best understood through their limitations. It’s Lakatos’s old “proofs and refutations” story all over again.

Recently I was involved in a discussion that worked out well. It started a few years ago with a post of mine on the differences between the sorts of data visualizations that go viral on the web (using some examples that were celebrated by statistician/designer Nathan Yau) and the statistical graphics we are trained to make. It seemed to me that many visualizations that are successful with general audiences feature unique or striking designs and puzzle-like configurations, whereas the most successful statistical graphics have more transparent formats that foreground data comparisons. Somewhere in between are the visualizations created by lab scientists, who generally try to follow statistical principles but usually (in my view) try too hard to display too much information on a single plot.

My posts, and various follow-ups, were disliked by many in the visualization community. They didn’t ever quite disagree with my claim that many successful visualizations involve puzzles, but they didn’t like what they perceived as my negative tone.

In an attempt to engage the fields of statistics and visualization more directly, I wrote an article (with Antony Unwin) on the different goals and different looks of these two sorts of graphics. Like many of my favorite papers, this one took a lot of effort to get into a journal. But finally it was accepted in the Journal of Computational and Graphical Statistics, with discussion.

The discussants (Stephen Few, Robert Kosara, Paul Murrell, and Hadley Wickham; links to all four discussions are here on Kosara’s blog) politely agreed with us on some points and disagreed with us on others. And then it was time for us to write our rejoinder.

In composing the rejoinder I finally came upon a good framing of the problem. Before, we’d spoken of statistical graphics and information visualization as having different goals and looking different. But that didn’t work. No matter how often I said that it could be a good thing that an infovis is puzzle-like, or no matter how often I said that as a statistician I would prefer graphing the data like This but I can understand how graphing it like That could attract more viewers . . . no matter how much I said this sort of thing, it was interpreted as a value judgment (and it didn’t help when I said that something “sucked,” even if I later modified that statement).

Anyway, my new framing, which I really like, is in terms of tradeoffs. Not “two cultures,” not “different goals, different looks,” but tradeoffs. So it’s not stat versus infographics; instead it’s any of us trying to construct a graph (or, better still, a grid of graphs) and recognizing that it’s not generally possible to satisfy all goals at once, so we have to think about which goals are most important in any given situation:

In the internet age, we should not have to choose between attractive graphs and informational graphs: it should be possible to display both, via interactive displays. But to follow this suggestion, one must first accept that not every beautiful graph is informative, and not every informative graph is beautiful.

Yes, it can sometimes be possible for a graph to be both beautiful and informative, as in Minard’s famous Napoleon-in-Russia map, or more recently the Baby Name Wizard, which we featured in our article. But such synergy is not always possible, and we believe that an approach to data graphics that focuses on celebrating such wonderful examples can mislead people by obscuring the tradeoffs between the goals of visual appeal to outsiders and statistical communication to experts.

So it’s not Us versus Them, it’s each of us choosing a different point along the efficient frontier for each problem we care about.

And I think the framing worked well. At least, it helped us communicate with Robert Kosara, one of our discussants. Here’s what Kosara wrote, after seeing our article, the discussions (including his), and our rejoinder:

There are many, many statements in that article [by Gelman and Unwin] that just ask to be debunked . . . I [Kosara] ended up writing a short response that mostly points to the big picture of what InfoVis really is, and that gives some examples of the many things they missed.

While the original article is rather infuriating, the rejoinder is a great example of why this kind of conversation is so valuable. Gelman and Unwin respond very thoughtfully to the comments, seem to have a much more accurate view of information visualization than they used to, and make some good points in response.

Great! A discussion that worked! This is how it’s supposed to go: not a point-scoring debate, not people talking past each other, but an honest and open discussion.

Reflections

Perhaps my extremely, extremely frustrating experience early in my career (detailed in Part 1 above) motivated me to think seriously about the Lakatosian attitude toward understanding and explaining ideas. If you compare Bayesian Data Analysis to other statistics books of that era, for example, I think we did a pretty good job (although maybe not good enough) of understanding the methods through their limitations. But even with all my experience and all my efforts, this can be difficult, as revealed by the years it took for us to finally process our ideas on graphics and visualization to the extent that we could communicate with experts in these fields.
