How effective visualization brings data alive

Junk Charts 2014-05-08

Back in 2009, I wrote about a failed attempt to visualize regional dialects in the U.S. (link). The raw data came from Bert Vaux's surveys. I recently came across some fantastic maps based on the same data. Here's one:

These maps are very pleasing to look at, and also very effective at showing the data. We learn that Americans use three major words to describe what others might call "soft drinks". The regional contrast is the point of the raw data, and Joshua Katz, who created these maps while a grad student at North Carolina State, did wonders with the data. (Looks like Katz has been hired by the New York Times.)

The entire set of maps can be found here.

***

What more evidence do we need that effective data visualization brings data alive... the corollary being bad data visualization takes the life out of data!

Look at the side by side comparisons of two ways to visualize the same data. This is the "soft drinks" question:

And this is the "caramel" question:

The set of maps referred to in the 2009 post can be found here.

***

Now, the maps on the left is more truthful to the data (at the zip code level) while Katz applies smoothing liberally to achieve the pleasing effect.

Katz has a poster describing the methodology -- at each location on the map, he averages the closest data. This is why the white areas on the left-side maps disappear from Katz's maps.

The dot notation on the left-side maps has a major deficiency, in that it is a binary element: the dot is either present or absent. We lost the granularity of how strongly the responses are biased toward that answer. This may be the reason why in both examples, several of the heaviest patches on Katz's maps correspond to relatively sparse regions on the left-side maps.

Katz also tells us that his maps use only part of the data. For each point on his maps, he only uses the most frequent answer; in reality, there are proportions of respondents for each of the available choices. Dropping the other responses is not a big deal if the responses are highly concentrated on the top choice but if the responses are evenly split, or well-balanced say among the top two choices, then using only the top choice presents a problem.