A big loser of the Iowa debacle is data science

Numbers Rule Your World 2020-02-05

Democratic Party officials offered #datascience as a sacrifice yesterday, and then CNN dutifully stabbed it.

As most of you may know, the Party made the inexplicable decision to release 62% of the results, which they knew would unleash hours of irresponsible programming in which pundits feasted on the partial data and spun endless stories out of it. I just warned about "story time" a few days ago (see here), and we're here already.

****

I had three major problems with the CNN coverage, and I had to turn it off. (I assume other channels were equally bad. Online sources like FiveThirtyEight and Vox are much better.)

Firstly, they showed a map of the Iowa counties. Every county is given a solid color - the color of the presumptive winner, which is defined by who's leading in that county based on the 62% of precincts reporting.

This is irresponsible because (a) some counties will have lots of precincts reported while others will have few, and (b) no one has said that the precincts that have reported were randomly selected from the precincts within each county.

So their conclusion rests on the assumption that, within each county of Iowa, someone randomly selected 62% of the precincts to report results. Obviously wrong.
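To see why the random-selection assumption matters, here is a minimal sketch with made-up numbers. The county, its 100 precincts, the candidates "A" and "B", and every vote count are invented for illustration; none of it is Iowa data. The small precincts lean toward B and happen to be released first, so the "leader" at 62% reporting is not the eventual winner, while a genuinely random 62% almost always names the right one.

```python
import random

# Hypothetical county with 100 precincts. All numbers are invented.
small = [{"A": 40, "B": 60} for _ in range(50)]    # small precincts, lean B
large = [{"A": 130, "B": 70} for _ in range(50)]   # large precincts, lean A
precincts = small + large

def tally(reported):
    """Sum the votes per candidate over the reported precincts."""
    totals = {"A": 0, "B": 0}
    for precinct in reported:
        for candidate, votes in precinct.items():
            totals[candidate] += votes
    return totals

def leader(totals):
    """Return the candidate with the most votes so far."""
    return max(totals, key=totals.get)

# Non-random release: the first 62 precincts in list order,
# i.e. all 50 small precincts plus 12 of the large ones.
partial = tally(precincts[:62])
full = tally(precincts)

# Random release: draw 62 precincts at random, many times over,
# and check how often the partial leader matches the final winner.
rng = random.Random(0)
draws = [leader(tally(rng.sample(precincts, 62))) for _ in range(1000)]
random_agrees = sum(d == leader(full) for d in draws) / len(draws)

print("Leader at 62%, non-random release:", leader(partial), partial)
print("Final winner:", leader(full), full)
print("Share of random 62% draws naming the final winner:", random_agrees)
```

The solid color on the map corresponds to `leader(partial)`. Whether it means anything depends entirely on how the 62% were chosen, and that is exactly what viewers were not told.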

Any notion of "leading" is thus meaningless, misleading, and conducive to irresponsible spinning. That the Democratic Party published these partial results with no timetable for the remaining data is yet another terrible decision.

***

Then, the analysts/reporters opined on topics like "did Bernie draw out the young vote?" or "did Pete show he could appeal to all types across the state?" With a straight face, they said they only had 62% of the data so anything could happen; then came the "but", followed by whatever conclusion they drew from just 62% of the data, without any adjustments.

Such conclusions require (a) assuming that the remaining 38% of the data look exactly like the 62%, so that we don't actually need any more data, and (b) assuming that county-level data provide insights on demographic breakouts like the young vote, moderates, etc. Both are obviously false.

***

Finally, these analyses constantly contradict themselves.

On the one hand, we are to believe that performance in Iowa does not prove that you will win or lose in states with more racial diversity (Sanders, Biden) but on the other hand, performance in Iowa proves that you can build a diverse coalition (Buttigieg).

On the one hand, some candidates will get the black vote because that's what the polls say (Biden) but on the other hand, they can get the black vote despite what the polls say (Buttigieg).

And of course, on the one hand, Iowa is a great predictor of the national election (this must be true given all the generalizations flying around tonight), but on the other hand, Iowa is a horrible indicator for the Democratic primaries, since the state's demographics are so different from the national average.

***

RIP data science. RIP statistics.

P.S. I listened to another 5 minutes of coverage late at night before again turning it off. The CNN host doubled down: now that results from another 10% of the precincts were available, he told viewers that in a close race, even this 10% is highly meaningful. So now, he is spinning even more elaborate stories from 10% of the data. Ouch! In the meantime, they still don't have a map showing the proportion of precincts that have published results by county. Hours and hours of story time, and still they haven't produced the one useful analysis that helps viewers understand the 62%!
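The missing analysis is not hard to produce. As a minimal sketch, assuming all you have is a total precinct count and a reported precinct count per county, the table behind such a map looks like this. The county names and counts below are placeholders, not real Iowa figures, and `County` and `reporting_table` are names I made up for the example.

```python
from dataclasses import dataclass

@dataclass
class County:
    name: str
    precincts_total: int
    precincts_reported: int

    @property
    def share_reported(self) -> float:
        """Fraction of this county's precincts with published results."""
        if self.precincts_total == 0:
            return 0.0
        return self.precincts_reported / self.precincts_total

# Placeholder counties with invented counts, for illustration only.
counties = [
    County("County X", precincts_total=40, precincts_reported=36),
    County("County Y", precincts_total=25, precincts_reported=5),
    County("County Z", precincts_total=10, precincts_reported=0),
]

def reporting_table(counties):
    """Rows of (name, reported, total, percent), least complete county first."""
    rows = sorted(counties, key=lambda c: c.share_reported)
    return [
        (c.name, c.precincts_reported, c.precincts_total,
         round(100 * c.share_reported))
        for c in rows
    ]

for name, reported, total, pct in reporting_table(counties):
    print(f"{name:10s} {reported:3d}/{total:<3d} {pct:3d}% reporting")
```

Shade each county by that percentage instead of by the "leader", and viewers can see at a glance where the 62% actually comes from.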

There is a fourth howler that these pundits keep lobbing at us. They talk as if results from precincts are being reported as they come in, as on a normal election night. In fact, the Democratic Party is selectively releasing partial results. They have all of the reports already. By not explaining how they decided which results to release first, they are hurting the legitimacy of this nomination process. [Nate Silver said as much in this post, although his concerns arose from how the delay in reporting will affect the accuracy of his predictive model.]