When Rankings Are Just Data Porn
eagereyes 2017-01-11
Rankings are a common way of talking about data: who made the most money, who won the most medals, etc. But they hide issues in the underlying data. Is the difference between first and second meaningful or just noise? Here is a data video that nicely demonstrates the problem.
Watch the first few minutes of this video about baby names in the U.S. over time. I find it fascinating, though not for the reason most people probably do.
I couldn’t make it past the first part about girls’ names, and not because the map was so enthralling. I kept staring at that bar chart on the right. That was way more interesting and revealing to me.
Let’s start in 1910: Mary leads, with somewhere between 4.5% and 5% of girls born that year being named Mary. Five percent! That means that over 95% of newborn girls were not named Mary. How is something popular when it’s less than 5%?
But it gets better. Watch the video again and keep an eye on the scale of the bar chart. It keeps getting smaller, until in 2014, the most popular name, Emma, is barely above 1%. Almost 99% of newborn girls are not named Emma! Again, how is that a popular name?
What I find interesting about this is how it shows how names are becoming more diverse. There are many more names now, and parents no longer feel bound to give only common or well-established names.
The data behind this is not easy to find. The Social Security Administration lists the top five girls’ and boys’ names by state for the years 1980 to 2015. Bafflingly, though, they report the number of kids rather than a percentage of births, and the differences in population between states (and thus, presumably, in the number of births) are anything but trivial, so the raw counts are hard to compare across states.
But the absolute numbers do show how small the margins are. In 2014, Vermont had the same number of Emmas as Olivias (40 each); in South Dakota, 60 Harpers outnumbered 59 Avas. A single decision here would have flipped the ranking. In many states, the difference between the top names is a handful of births or a few dozen. These are not sweeping favorites; they’re flukes.
Even the populous states (or, more precisely, the ones with many births) have narrow margins, though. In 2014, California had 502,879 births according to the National Center for Health Statistics. Ranked first among girls’ names was Sophia, with 3,172 births, over Isabella, with 2,717. That’s a difference of 455, or less than 0.2% of newborn girls (assuming 50% girls, since I can’t find a gender breakdown). In Florida, Sophia lost to Isabella that year, 1,461 over 1,237, or about 0.2%. Texas boasts 2,183 Emmas over 2,153 Sophias, another margin in the hundredths of a percent.
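To make the arithmetic concrete, here is the California calculation as a tiny Python sketch. It uses only the numbers quoted above and the same rough assumption that half of all births are girls; nothing else about the data is implied.

    # California, 2014: the numbers quoted above
    births_total = 502_879            # all births, per the National Center for Health Statistics
    sophia, isabella = 3_172, 2_717   # births for the top two girls' names

    girls = births_total / 2          # rough assumption: 50% of births are girls
    margin = sophia - isabella        # 455 births separate first from second

    print(f"{margin} births, or {100 * margin / girls:.2f}% of newborn girls")
    # prints: 455 births, or 0.18% of newborn girls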
These rankings are meaningless. The differences they are based on are so tiny that they are of no consequence. Rankings have an air of authority and precision, but they hide all of that uncertainty. Maybe that’s why they’re so popular.
To be fair, the person who made this video tried to take this into account a bit by encoding the difference between first and second in the color’s saturation, but I don’t see how anybody would be able to keep track of that. If you really pay attention, you can see the map get fainter over time, though.
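For what it’s worth, here is a minimal sketch (in Python with Matplotlib, and certainly not the video author’s actual code) of what such an encoding amounts to: hue stands for the winning name, and saturation is scaled by the gap between first and second place. The names, shares, and the scaling factor are invented for illustration.

    import colorsys
    import matplotlib.pyplot as plt

    # (label, hue for the winning name, share of 1st name, share of 2nd name) -- invented values
    examples = [
        ("Mary, 1910", 0.60, 0.048, 0.035),
        ("Emma, 2014", 0.08, 0.011, 0.010),
    ]

    fig, ax = plt.subplots(figsize=(4, 1.5))
    for i, (label, hue, first, second) in enumerate(examples):
        sat = min(1.0, (first - second) / 0.02)   # full saturation at a 2-point margin (arbitrary)
        ax.barh(i, 1, color=colorsys.hsv_to_rgb(hue, sat, 0.9))
        ax.text(0.02, i, label, va="center")
    ax.set_axis_off()
    plt.show()

With a 1.3-point margin in 1910 versus a 0.1-point margin in 2014, the second bar comes out almost gray, which is exactly the kind of faintness that’s easy to miss.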
But really, the entire idea of a single most popular name per state is nonsense, especially in the last 20 years or so. It makes for a pretty animated map, sure. But in the end, it’s just data porn.