On the cool maps about baseball fandom

Junk Charts 2014-05-20

Josh Katz, who did the dialect maps I featured recently, is at it again. He's one of the co-authors of a series of maps (link) published by the New York Times about the fan territories of Major League Baseball teams.

Like the dialect maps, these are very pleasing to look at, and also statistically interesting. The authors correctly point out that the primary points of interest are at the boundaries, and provide fourteen insets on particular regions. This small gesture represents a major shift from years past, when designers would have just published an interactive map and left readers to figure out where the interesting stuff is.

The other interesting areas are the "no-man's lands", the areas in which there are no local teams. The map appears to use the same kind of spatial averaging as the dialect maps to blend the colors. The challenge here would be the larger number of colors.

I'd have preferred that they had given distinct colors to teams like the Yankees and the Red Sox that have broader appeal. Maybe the Yankees are the only national team they discovered, since they do have a unique gray color, which is very subtle.

I also think it is smart to hide the political boundaries of state, zip, etc. in the maps (unless you click on them).

I'd like to see a separate series of maps: small multiples by team, showing the geographical extent of each team's fan base. This would be one solution to the domination issue discussed below.

***

The issue of co-dominant groups I discussed in the dialect maps also shows up here. Notably, in New York, the Mets are invisible, and in the Bay Area, the Oakland A's similarly do not appear on the map.

Recall that each zip code is represented by the team with the highest absolute proportion of fans. It may be true that the Mets are the perennial #2 in all relevant zip codes. Zooming into the Yankee territory, I didn't see any zip code in which Mets fans are more numerous. So this may be the perfect example of what falls through the cracks when the algorithm just drops everything but the top level.
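To make the co-dominance problem concrete, here is a minimal Python sketch. The zip codes, team shares, and function names are all made up for illustration; they are not the Times' data or method. It only shows how a team that is a close second everywhere vanishes once each zip code is reduced to its top team.

```python
from collections import Counter

# Hypothetical Like shares per zip code (invented numbers, for illustration only).
shares = {
    "10001": {"Yankees": 0.52, "Mets": 0.41, "Red Sox": 0.07},
    "11201": {"Yankees": 0.49, "Mets": 0.45, "Red Sox": 0.06},
    "11375": {"Yankees": 0.47, "Mets": 0.46, "Red Sox": 0.07},
}

def rank_teams(team_shares):
    """Return team names sorted from largest share to smallest."""
    return sorted(team_shares, key=team_shares.get, reverse=True)

def tally_rank(shares_by_zip, position):
    """Count how often each team lands at a given rank (0 = top) across zip codes."""
    return Counter(rank_teams(s)[position] for s in shares_by_zip.values())

top = tally_rank(shares, 0)        # what a winner-take-all map would color
runner_up = tally_rank(shares, 1)  # what the map silently drops

print("Top team by zip:", dict(top))
print("Runner-up by zip:", dict(runner_up))
```

In this toy example the Mets hold over 40 percent of Likes in every zip code yet never appear in the top-team tally, which is why a per-team small-multiples view would recover them.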

***

Now, in the Trifecta checkup, we want to understand what the data is saying. I have to say this is a bit challenging. The core dataset contains Facebook Likes (aggregated to the zip-code level). It is not even clear what the base of those proportions is. Is it the total population in a zip code? The total Facebook users? The total potential baseball fans?
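The choice of base matters a great deal. As a rough sketch, with invented counts for a single zip code and hypothetical names throughout, the same number of Likes produces very different "proportions" depending on the denominator:

```python
# Hypothetical counts for one zip code (invented for illustration only).
team_likes = 1200

# Three candidate denominators; which one the maps use is not stated.
bases = {
    "total population": 30000,
    "Facebook users": 18000,
    "users who Like any MLB team": 4000,
}

def proportion_by_base(likes, candidate_bases):
    """Return the share of Likes under each candidate denominator."""
    return {name: likes / base for name, base in candidate_bases.items()}

props = proportion_by_base(team_likes, bases)
for name, p in props.items():
    print(f"{name}: {p:.1%}")
```

With these made-up numbers, the same team ranges from 4 percent to 30 percent of the zip code, so a reader cannot interpret the map's proportions without knowing the base.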

As I have said elsewhere, Facebook data is often taken to be "N=All". This is an assumption, not a fact of the data. Different baseball teams may have different social-media/Facebook strategies. Different teams may have different types of fans, who are more/less likely to be on Facebook. This is particularly true of cross-town rivals.

Apart from the obvious problem with brands buying or otherwise managing Likes, "Like" is a binary metric that doesn't measure fan fervor. It is also a static measure, as I don't believe Facebook users manage their list of Likes actively (please correct me if I am wrong about this behavior).

We are not provided any real numbers, and none of the maps have scales. Unless we see some absolute counts, it is hard to know if the data make sense relative to other measures of fandom, like merchandise and ticket sales. With Facebook data, it is sometimes possible to have too much--in other words, you might find there are more team fans than potential baseball fans or even population in a specific zip code.
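If absolute counts were available, one simple sanity check would be to flag zip codes where a team's Likes exceed the resident population. The sketch below assumes two hypothetical lookup tables keyed by zip code; the numbers are invented, and nothing here reflects the actual Facebook data.

```python
def implausible_zips(likes_by_zip, population_by_zip):
    """Return zip codes where Like counts exceed the known population.

    Zip codes with no population figure are skipped rather than guessed at.
    """
    return sorted(
        z for z, likes in likes_by_zip.items()
        if z in population_by_zip and likes > population_by_zip[z]
    )

# Hypothetical inputs (invented numbers, for illustration only).
likes_by_zip = {"10001": 2500, "10002": 90000, "10003": 400}
population_by_zip = {"10001": 21000, "10002": 81000}

flagged = implausible_zips(likes_by_zip, population_by_zip)
print(flagged)
```

The same comparison could be run against any other base, such as estimated Facebook users or potential baseball fans, if those figures were published.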

It is very likely that Facebook, which is the source of the aggregated data, did not want to have raw counts published. This is par for the course for the Internet giants, and also something I find completely baffling. These are the evangelists of "privacy is dead": they stockpile our data, and yet they lock it up in their data centers, away from our reach. Does that make any sense?