Paper Machines
metaLAB (at) Harvard 2012-07-17
I have had the good fortune to work at metaLAB this summer on an open-source tool for text analysis and visualization in the digital humanities. This effort, funded through the Google Summer of Code, is taking place under the tutelage of metaLAB’s own Matthew Battles and the historian and Harvard Junior Fellow Jo Guldi, who will be joining Brown University’s faculty in the fall.
Jo’s project is one of remarkable scope: to chart the history of land reform across the globe, making use of texts and archival data spanning more than a century. The spatial, temporal, and intellectual diffusion of land reform can already be traced in outline, thanks in large part to the scholars and archivists of prior generations who have assembled numerous bibliographies, archives, monographs and glossaries in their attempts to come to grips with the myriad outputs of “paper machines”: colonial administrations, government ministries, NGOs, utopian social movements, academic institutions, and other producers of texts dealing with land and its (re)distribution. But to look both more broadly at and more deeply into the data we have, to find the subtle patterns at unfathomable scales that are the digital humanities’ raison d’être, it is necessary to build new tools that can leverage the best extant algorithms in service of our human powers of perception and intuition.
As I discovered, much of the groundwork for this effort has already been laid, in some cases several times over. One widely used algorithm for analyzing topics in unstructured text, latent Dirichlet allocation, already has a variety of open-source implementations. Projects from computer scientists at Brigham Young University and the University of Iowa have sought to make this and other so-called topic models more accessible to humanities researchers. Rather than duplicating these efforts, I have focused on blending the best features of these applications and extending them for the purposes of Jo’s work and my own.
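To make the idea concrete, here is a minimal topic-model sketch using gensim, one such open-source implementation; the toy documents and topic count are my own invention, and Paper Machines’ actual pipeline may differ.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of tokens (stopwords already removed).
texts = [
    ["estate", "tenant", "law", "property", "rent"],
    ["land", "reform", "peasant", "redistribution", "ministry"],
    ["property", "lease", "heir", "inclosure", "right"],
    ["colonial", "administration", "land", "survey", "ministry"],
]

dictionary = corpora.Dictionary(texts)               # word <-> id mapping
bow_corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words vectors

# Infer 2 topics from word co-occurrence alone; `passes` controls training sweeps.
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)

# Per-document topic proportions, e.g. [(0, 0.93), (1, 0.07)]
print(lda[bow_corpus[0]])
```

Each inferred topic is just a weighted list of words; deciding whether a given list amounts to a coherent theme remains the researcher’s job.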
Jo initially presented me with a curated corpus of documents stored in the excellent open-source bibliographic manager Zotero, for which I created an adapter to interface easily with the Django web framework used by the aforementioned topic modeling tools. I have also augmented her collection with a dataset from JSTOR’s Data for Research, an initiative made public in 2009 that provides n-gram and citation data from JSTOR’s database for up to 1,000 documents at a time. Equipped with this fortified corpus, I have used topic models, geoparsers, and citation extractors – our 21st-century paper machines, now at the command of any computer user – to construct new configurations and cross-sections of its data, revealing the transformation of its thematic content, geographical distribution, and social/institutional networks over time.
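For the curious, here is a minimal sketch of what pulling a corpus out of Zotero can look like, written against Zotero’s public web API; the user ID and API key are placeholders, and the real adapter’s interface is not shown here.

```python
import requests

# Placeholder credentials; substitute your own Zotero user ID and API key.
ZOTERO_USER_ID = "123456"
ZOTERO_API_KEY = "your-api-key"
BASE = f"https://api.zotero.org/users/{ZOTERO_USER_ID}"

def fetch_items(limit=100):
    """Page through a Zotero library, yielding item metadata as dicts."""
    start = 0
    while True:
        resp = requests.get(
            f"{BASE}/items",
            params={"format": "json", "limit": limit, "start": start},
            headers={"Zotero-API-Key": ZOTERO_API_KEY},
        )
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        for item in batch:
            yield item["data"]
        start += limit

for item in fetch_items():
    print(item.get("title", "(untitled)"), item.get("date", ""))
```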
Below is a screenshot from the “topic” view of our application, showing the output of a topic model analysis. Such an analysis assumes that each document is composed of a mixture of “topics,” where each topic is a probability distribution over words; the algorithm infers which words make up which topics, and which topics make up which documents, based solely on word co-occurrence, without prior semantic knowledge. These topics are machine-generated, but often reveal a surprising coherence; for example, the “estate, tenant, law” topic above also includes “property, right, rent, lease, heir, inclosure,” and so on. The topics below give a sense of some of the changing concerns in two subsets of the corpus:
At present, the program supports queries by time period, document title, or location, displaying the proportion of different topics in the corpus over time. Full-text search, tighter integration of topic labels, and the ability to compare topic proportions within a query against the baseline of the whole corpus are all forthcoming.
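Behind that display is a simple aggregation, sketched here with invented field names: average each year’s document-topic mixtures to obtain the corpus-wide proportion of every topic per year.

```python
from collections import defaultdict

def topic_proportions_by_year(docs, num_topics):
    """docs: iterable of (year, {topic_id: proportion}) pairs.
    Returns {year: [mean proportion of each topic]}."""
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for year, mixture in docs:
        for topic_id, prop in mixture.items():
            sums[year][topic_id] += prop
        counts[year] += 1
    return {year: [s / counts[year] for s in totals]
            for year, totals in sums.items()}

# Example with two invented documents from the same year:
docs = [(1901, {0: 0.8, 1: 0.2}), (1901, {0: 0.4, 1: 0.6})]
print(topic_proportions_by_year(docs, num_topics=2))
# {1901: [0.6, 0.4]}
```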
I look forward to developing Paper Machines further, especially by incorporating confirmatory statistics that will validate model output against the expert human classification of texts, creating functions to suggest interesting juxtapositions of search terms and topics, and developing narrative visualizations that may, for example, dramatize the spread of land reform from continent to continent as a series of tectonic shifts.
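As a sketch of what such confirmatory statistics might look like (my own illustration, not a committed design), one could measure cluster “purity”: how often documents grouped under the same dominant topic share the same expert-assigned label.

```python
from collections import Counter

def purity(dominant_topics, human_labels):
    """Sum, over each dominant topic, of its most common human label's
    count, divided by the number of documents; 1.0 is perfect agreement."""
    by_topic = {}
    for topic, label in zip(dominant_topics, human_labels):
        by_topic.setdefault(topic, []).append(label)
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in by_topic.values())
    return correct / len(human_labels)

# Invented example: four documents, two topics, two expert classes.
print(purity([0, 0, 1, 1], ["tenure", "tenure", "reform", "tenure"]))  # 0.75
```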
My own research as a Ph.D. student at Brown focuses on the intertwining of technology with race, gender, and sexuality in contemporary African-American music and in online discourse on sites like Twitter, Tumblr, Okayplayer, and Rapgenius. While my subject is superficially quite different from Jo’s, I believe this tool will help me condense the massive corpora of these online fora into more manageable forms. I am excited to see what other uses may emerge from this project once it is released to the broader community.