My Career as a Bulk Downloader
The Laboratorium 2013-03-30
The core of the case against Aaron Swartz was that he downloaded millions of academic articles from JSTOR without permission. He did so by sneaking into an MIT wiring closet and evading MIT’s and JSTOR’s attempts to detect and block him. But the heart of the case, the conduct without which there would have been no point and no problem, was the downloading.
To put this in perspective, I, too, am a bulk downloader. James has downloaded his thousands, and Aaron his ten thousands. And there but for the grace of the Assistant United States Attorneys (who wield god-like prosecutorial power), go I.
In law school, during my time at the Yale ISP, I wrote for and ran LawMeme, a blog about law and technology. (Here’s one of its greatest hits, Ernie Miller’s classic “Top Ten New Copyright Crimes”.) It was a Slashclone based on PHP-Nuke, and it ran from roughly 2001 to 2006 before succumbing to script kiddie penetration attacks, a lack of new content, and administrative neglect. The domain names expired, the content-management engine was hacked beyond repair, and the powers that be ultimately made the sensible decision to pull the plug and not to try reviving it.
But this meant losing an archive of about fifteen hundred posts. I had a strong personal attachment to some, like the post that would ultimately become Accidental Privacy Spills. Others, like my posts on the Search King lawsuit, were the first draft of history. Ernie’s posts on the copyright disputes of the early aughts were memorable, vivid pieces of writing that deserved to be saved.
So I took on the task of making a static archive of what could be salvaged from LawMeme. LawMeme itself had been dynamically generated: each page was assembled from various chunks of content thrown together by the server on the fly. The archive would consist simply of fixed, unchanging webpages. There’s no good index to them, but if you search for “LawMeme” and any of the topics we wrote about, you’ll see articles that look more or less as they did back in the site’s heyday.
But to create the archive, I couldn’t just go back to the long-defunct LawMeme site itself. Instead, I had to turn to the Internet Archive’s Wayback Machine, which keeps snapshots of webpages from over the years. And with well over a thousand posts to retrieve, I didn’t want to sit there copying each one by hand.
And so I became a bulk downloader. I wrote a Perl script: a simple, 70-line program that exhaustively went through the Wayback Machine, looking for a copy of each LawMeme article. Just like Aaron’s script, mine “discovered the URLs” of articles and then downloaded them. And just to show how mainstream this is, I’ll add that I built my script around an elementary one that Paul Ohm published in “Computer Programming and the Law: A New Research Agenda,” his manifesto for why more law professors should write code. Paul’s script downloaded and analyzed the comment counts on posts from the popular legal blog The Volokh Conspiracy.
I think this was completely legal. But in today’s environment of fear and prosecutorial intimidation, who can be sure? I own the copyright in my own posts, I had the permission of the ISP to create the archive, and the implied license that all of the contributors gave to LawMeme would almost certainly cover this backup. But “almost certainly” is not “absolutely certainly.” Maybe some AUSA wants to build a career taking down professors, putting me in the crosshairs.
Or take the Internet Archive’s terms of service. By using the site, I supposedly promised not “to copy offsite any part of the Collections without written permission.” The site’s FAQ qualifies this statement a bit, adding, “However, you may use the Internet Archive Wayback Machine to locate and access archived versions of a site to which you own the rights.” Again, I was confident that this covered me. But confidence is not certainty. I assumed that no one would care to press the question. After Aaron, is that such a safe assumption?
I can’t imagine that the Internet Archive would have a problem with what I did. Recreating lost websites for the sake of the public and posterity is completely consistent with Brewster Kahle’s expansive humanist vision of digital archiving. But JSTOR quickly made its peace with Aaron, and that didn’t save him. Would Brewster’s blessing save me from the wrath of the feds?
Indeed, my script waited a second between each download. I didn’t want to put too much of a load on the Archive’s servers. But a cyber-Javert could describe it as an attempt to evade detection. Then, to get the webpages to display right in the LawMeme archive, I wrote another script to delete the bits of HTML added by the Internet Archive to the pages in its archive. Was that an effort to hide my tracks?
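For readers who want to see how mundane these two steps are, here is a minimal sketch of them in Python. It is not the original Perl script, which is not reproduced in this post. The function names, the `fetch` callback, and the toolbar comment markers are all assumptions made for illustration; the markers reflect how the Wayback Machine has commonly wrapped its injected toolbar, but they may not match every snapshot.

```python
import re
import time
from typing import Callable, Dict, Iterable, Optional

# Assumption: the Wayback Machine wraps its injected toolbar in these
# HTML comments. Other injected markup (scripts, rewritten links) is
# not handled by this sketch.
WAYBACK_TOOLBAR = re.compile(
    r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
    re.DOTALL,
)


def strip_wayback_toolbar(html: str) -> str:
    """Remove the archive's toolbar block so the page displays as it once did."""
    return WAYBACK_TOOLBAR.sub("", html)


def polite_download(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing between requests to spare the server.

    `fetch` is a hypothetical callback that returns the page's HTML,
    or None if no archived copy exists.
    """
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request after the first
        html = fetch(url)
        if html is not None:
            pages[url] = strip_wayback_toolbar(html)
    return pages
```

Passing `fetch` and `sleep` in as parameters keeps the sketch runnable without touching the network: a real run would supply a function that requests each snapshot, while a test can supply a dictionary lookup. The pause is a courtesy to the server and the cleanup is cosmetic.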
Another one of Paul’s papers presciently predicted the way our computer misuse statutes were vindictively turned against Aaron. In The Myth of the Superuser, Paul describes how these laws are written to protect against a mythic bogeyman, the all-powerful demented superhacker, capable of breaking into and destroying any computer system, bent on sowing chaos and devastation online. But the laws are used to punish minor misdeeds by unthreatening defendants. Imagine Mr. McGregor training a howitzer on Peter Rabbit and you have the idea.
Aaron’s Law is a start, but the problems with our computer crime laws, and with criminal law in general, run much, much deeper. The Department of Justice thinks millions of parents who made Facebook accounts for their children are federal criminals. Read the majority opinion in United States v. Nosal and ask yourself whether you’ve fudged your age on a dating site, or let someone else use your account, or used a workplace computer to check the baseball scores. Judge Kozinski noted, skeptically, “The government assures us that, whatever the scope of the CFAA, it won’t prosecute minor violations.” Tell that to Aaron’s family.
I am Aaron Swartz-icus, and so are you.