Numbersense and true lies

Numbers Rule Your World 2014-02-24

Long before I came up with "numbersense," I wrote about "true lies" in data analysis. (link)

The nature of data, especially Big (as in multidimensional) Data, is that one can come up with an infinite number of statistical computations, all of which are "true" in the sense that one would obtain such statistics were one to plug the data into textbook formulas. Inevitably, some of these statistics lead to contradictions.

An example I give in the Prologue of Numbersense (link) is a case of Simpson's Paradox. There are two ways to compare two airlines' delay rates during a given window of time at a common set of arrival airports. One can aggregate the flights across all airports, then compare the overall delay rates. Alternatively, one can compute a pair of delay rates for each airport, then compare the rates airport by airport. In the example given in the book, airline A came out ahead in the aggregate measure but was the more delayed at each of the individual airports. This is an instance of "true lies". Airline A is either better or worse, not both. Since only one answer can be right, one of the two methods leads to the wrong interpretation. But one cannot complain that there is anything wrong with the data or with either formula used to compute the average.
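To make the reversal concrete, here is a minimal Python sketch. The counts are invented for illustration and are not the figures from the book; the airport labels X and Y and the function names are likewise hypothetical.

```python
# Hypothetical counts, invented for illustration only (not the book's data).
# flights[airline][airport] = (delayed_flights, total_flights)
flights = {
    "A": {"X": (90, 900), "Y": (40, 100)},
    "B": {"X": (5, 100), "Y": (315, 900)},
}


def rate_by_airport(airline):
    """Delay rate computed separately for each airport."""
    return {
        airport: delayed / total
        for airport, (delayed, total) in flights[airline].items()
    }


def aggregate_rate(airline):
    """Delay rate after pooling all airports together."""
    delayed = sum(d for d, _ in flights[airline].values())
    total = sum(t for _, t in flights[airline].values())
    return delayed / total


for airline in flights:
    per_airport = {k: round(v, 3) for k, v in rate_by_airport(airline).items()}
    print(airline, "aggregate:", round(aggregate_rate(airline), 3),
          "by airport:", per_airport)
```

With these made-up numbers, airline A has the lower aggregate delay rate yet the higher rate at both airports. Both computations are "true"; the reversal comes from A flying mostly into airport X, where delays are rare for both airlines.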

***

I was thinking about "true lies" while reading the exchange between Alberto Cairo and Andy Kirk about the following chart, which prints a "truth": that half of US economic activity occurs in major urban areas that constitute a tiny proportion of US territory.

Cairo complains that this chart is silly because about half the US population lives in those orange urban areas. In reality, anyone who accepts the meme that this map offers "incredible" insight is just surprised that half the US population lives in major urban areas.

Kirk, who said he retweeted this map, wants us to stop whining:

I get that GDP is essentially a proxy indicator for where people are living yet I still have a novel interest in learning about the dynamics of the US. I *know* that there is not a uniform distribution of where people live (nowhere on earth has this) but it is still revealing for me to see anything that represents a proxy of this skewed population. I don’t think the map claims to be doing anything different to this so, in that sense, it doesn’t mislead or make false claims.

I will be writing on my other blog about the educational aspect of a chart like this, which is the other prong of Kirk's argument. That last sentence, which I bolded, strikes me as the argument that the true lie is true and is therefore beyond reproach. This is a crucial difference between doing statistics and doing pure math. In statistics, you can't win arguments by invoking the truth... if the truth were knowable, statisticians would all be unemployed.

The map does not make false claims but it leads readers to the conclusion that the orange areas are much more important than the blue region (equal economic activity but much smaller area). The first problem is that the types of economic activities are vastly different between those regions, and this significant factor is ignored.

The second problem is that the designer over-aggregated the data. All counties (or zip codes) are classified into two groups ("split in half") when in fact, the level of economic activity at the level of counties (or zip codes) is a gradient. Imagine plotting the economic activity index by county, ordered from the highest to the lowest. Do we see a dramatic drop-off after counting out half the counties (i.e., the pattern shown on the left chart below)? Or are we more likely to see the pattern shown on the right? If you see a distribution like the one shown on the right, would you summarize that with just two segments?
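One way to check whether a two-way split is a fair summary is to rank the units and look at what happens at the cutoff. The Python sketch below does this for two invented lists of activity values (one with a sharp step, one with a gradual decline); the data and the function name are hypothetical and are not taken from the map.

```python
def half_split(activity):
    """Rank units from highest to lowest activity and find the smallest
    top group that accounts for at least half of the total.

    Returns (size_of_top_group, drop_ratio), where drop_ratio is the
    activity of the last unit inside the group divided by the activity
    of the first unit outside it (None if no unit is left outside).
    """
    ranked = sorted(activity, reverse=True)
    total = sum(ranked)
    running = 0
    for count, value in enumerate(ranked, start=1):
        running += value
        if 2 * running >= total:
            break
    if count == len(ranked):
        return count, None
    return count, ranked[count - 1] / ranked[count]


# Invented data: five very active units followed by many small ones.
step_like = [100] * 5 + [5] * 100
# Invented data: activity declines smoothly from 100 down to 1.
gradient = [100 - i for i in range(100)]

print("step-like:", half_split(step_like))
print("gradient:", half_split(gradient))
```

Both lists can be "split in half," but only the step-like one shows a large drop at the boundary. In the gradient case, the unit just inside the top group is nearly identical to the one just outside it, so the two-segment summary draws a line where the data has none.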

***

Cairo's general point is that good data visualizations require good data analyses. In turn, good data analysis requires numbersense.

***

Chapter 3 of Numbers Rule Your World (link) explores the question of aggregating data, which is central to statistical thinking. Aggregation features throughout Numbersense (link), particularly in Chapter 1 (school rankings), the chapters on economic statistics, and the chapter on fantasy football.

Also, you can learn statistical concepts from me at NYU. New course starting first week of March. More information here.