Of scripts, scraping and quizzes: how data journalism creates scoops and audiences
Data & Society / saved 2014-01-28
Summary:
As last year drew to a close, Scott Klein, a senior editor of news applications at ProPublica, made a slam-dunk prediction: “in 2014, you will be scooped by a reporter who knows how to program.”
While the veracity of his statement had already been shown in numerous examples, including the ones linked in his post, two fascinating stories published in the month since his post demonstrate just how creatively a good idea and a clever script can be applied — and a third highlights why the New York Times is investing in data-driven journalism and journalists in the year ahead.
Tweaking Twitter data
One of those stories went online just two days after Klein’s post was published, weeks before the new year began. Jon Bruner, a former colleague and data journalist turned conference co-chair at O’Reilly Media, decided to apply his programming skills to Twitter, randomly sampling about 400,000 accounts over time. The evidence he gathered showed that amongst the active Twitter accounts he measured, the median account has 61 followers and follows 177 users.
“If you’ve got a thousand followers, you’re at the 96th percentile of active Twitter users,” he noted at Radar. This data also enabled Bruner to make a widely cited (and tweeted!) conclusion: Twitter is “more a consumption medium than a conversational one–an only-somewhat-democratized successor to broadcast television, in which a handful of people wield enormous influence and everyone else chatters with a few friends on living-room couches.”
How did he do it? Python, R and MySQL.
“Every few minutes, a Python script that I wrote generated a fresh list of 300 random numbers between zero and 1.9 billion and asked Twitter’s API to return basic information for the corresponding accounts,” wrote Bruner. “I logged the results–including empty results when an ID number didn’t correspond to any account–in a MySQL table and let the script run on a cronjob for 32 days. I’ve only included accounts created before September 2013 in my analysis in order to avoid under-sampling accounts that were created during the period of data collection.”
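For readers who want to picture the loop Bruner describes, here is a minimal sketch of one scheduled run. It is not his script: the `fake_lookup` function is a hypothetical stand-in for the Twitter API call, the follower numbers it returns are invented, and Python's built-in `sqlite3` is swapped in for MySQL so the example runs anywhere. The only details taken from his description are the batch of 300 IDs, the 1.9 billion ceiling and the logging of empty results.

```python
import random
import sqlite3

ID_SPACE = 1_900_000_000  # upper bound on account IDs, from Bruner's description
BATCH_SIZE = 300          # IDs requested per run, from Bruner's description


def draw_ids(rng, batch_size=BATCH_SIZE, id_space=ID_SPACE):
    """Return a fresh batch of distinct random candidate account IDs."""
    return rng.sample(range(id_space), batch_size)


def fake_lookup(ids):
    """Hypothetical stand-in for the API call: pretends only even IDs
    belong to real accounts and returns made-up counts for them."""
    return {i: {"followers": i % 500, "following": i % 900}
            for i in ids if i % 2 == 0}


def log_batch(conn, ids, found):
    """Store one row per requested ID; misses are kept with NULL counts."""
    rows = []
    for i in ids:
        info = found.get(i)
        rows.append((i,
                     info["followers"] if info else None,
                     info["following"] if info else None))
    conn.executemany(
        "INSERT OR IGNORE INTO samples (account_id, followers, following) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


def run_once(conn, rng, lookup=fake_lookup):
    """One scheduled run: draw IDs, look them up, log hits and misses."""
    ids = draw_ids(rng)
    log_batch(conn, ids, lookup(ids))
    return ids


# In-memory database for the demo; a real run would use a persistent store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples ("
    "account_id INTEGER PRIMARY KEY, followers INTEGER, following INTEGER)"
)
ids = run_once(conn, random.Random(0))
```

Logging the misses matters: the share of random IDs that map to no account is what lets a sample like this be scaled up to an estimate of the whole population. Scheduling `run_once` every few minutes, as Bruner did with a cronjob, is left out of the sketch.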
A reporter who didn’t approach researching the dynamics of Twitter this way would be left to attempt the Herculean task of clicking through and logging attributes for 400,000 accounts.
That’s a heavy lift that even the best-staffed media intern departments on the planet would strain to deliver in a summer. Bruner, by contrast, told us something we didn’t know and backed it up with evidence he gathered. If you contrast his approach with that of commentators who make observations about Twitter without data or much experience, it’s easy to score one for the data journalist.
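Figures like a median follower count or a percentile rank fall out of such a sample with very little code. The sketch below assumes a tiny list of invented follower counts, not Bruner's data, and uses one common definition of percentile rank (the share of values strictly below a given number); his analysis, which he did in R, may have defined it differently.

```python
from statistics import median


def percentile_rank(values, x):
    """Percentage of values that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100.0 * below / len(values)


# Made-up follower counts for ten hypothetical accounts (not Bruner's data).
followers = [0, 3, 12, 40, 61, 61, 150, 400, 1000, 25000]

med = median(followers)                  # middle of the sorted counts
rank = percentile_rank(followers, 1000)  # where a 1,000-follower account sits
```

The median is the natural summary here because a handful of accounts with huge followings would drag an average far above what a typical user sees.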
Reverse engineering how Netflix reverse engineered Hollywood
Alexis Madrigal showed the accuracy of Klein’s prediction right out of the gate when he published a fascinating story on how Netflix reverse engineered Hollywood on January 2.
If you’ve ever browsed through Netflix’s immense catalog, you have probably noticed the remarkable number of personalized genres that exist there. Curious sorts might wonder how many genres there are, how Netflix classifies them and how those recommendations that come sliding in are computed.
One approach to that would be to watch a lot of movies and television shows and track how the experience changes, a narrative style familiar to many newspaper column readers. Another would be for a reporter to ask Netflix for an interview about these genres and consult industry experts on “big data.” Whatever choice the journalist made, it would need to advance the story.
As Madrigal observed in his post, assembling a comprehensive list of Netflix microgenres “seemed like a fun story, though one that would require some fresh thinking, as many other people had done versions of it.”
Madrigal’s initial exploration of Netflix’s database of genres, as evidenced by sequential numbering in the uniform resource locators (URLs) in his Web browser, taught him three things: there were a LOT of them, organized in a way he didn’t understand, and manually exploring them wasn’t going to work.
You can probably guess what came next: Madrigal figured out a way to scrape the data he needed.
“I’d been playing with an expensive piece of software called UBot Studio that lets you easily write scripts for automating things on the web,” he wrote. “Mostly, it seems to be deployed by low-level spammers and scammers, but I decided to use it to incrementally go through each of the Netflix genres and copy them to a file. After some troubleshooting and help from [Georgia Tech Professor Ian] Bogost, the bot got up and running and simply copied and pasted from URL after URL, essentially replicating a human doing the work. It took nearly a day of constantly running a little Asus laptop in the corner of our kitchen to grab it all.”
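Madrigal used UBot Studio rather than writing code, but the same idea of walking sequentially numbered URLs and saving what each one shows can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the `URL_TEMPLATE` is a placeholder rather than a real Netflix address, and `fake_fetch` is a hypothetical stand-in for loading a page and reading the genre name off it.

```python
import csv
import io

# Placeholder address pattern, not a real Netflix URL.
URL_TEMPLATE = "https://example.com/genre/{genre_id}"


def fake_fetch(url):
    """Hypothetical stand-in for loading a page: returns a genre name,
    or None when the numbered page does not exist."""
    genre_id = int(url.rsplit("/", 1)[1])
    if genre_id % 3 == 0:
        return None  # pretend every third ID has no genre behind it
    return f"Made-up Genre {genre_id}"


def scrape_genres(start, stop, out, fetch=fake_fetch):
    """Walk IDs from start up to (not including) stop, writing an
    (id, name) row for each page that exists. Returns the row count."""
    writer = csv.writer(out)
    found = 0
    for genre_id in range(start, stop):
        name = fetch(URL_TEMPLATE.format(genre_id=genre_id))
        if name is None:
            continue  # gap in the numbering; move on to the next ID
        writer.writerow([genre_id, name])
        found += 1
    return found


# Demo run over nine IDs, writing to an in-memory buffer instead of a file.
buffer = io.StringIO()
count = scrape_genres(1, 10, buffer)
```

Passing the fetch step in as a function keeps the loop testable without touching the network. A real version would swap in an HTTP request and pause between pages, which is part of why a job like Madrigal's takes the better part of a day.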
What he found was staggering: 76,897 genres. Then, Madrigal did two other things that were really interesting.
First, he and Bogost built the automatic genre generator that now sits atop his article in The Atlantic, giving users something to p