Announcing the Princeton-Leuven Longitudinal Corpus of Privacy Policies

Freedom to Tinker 2020-03-06

We are releasing a reference dataset of over 1 million privacy policy snapshots from more than 100,000 websites, spanning over two decades.

By Ryan Amos, Elena Lucherini, Gunes Acar, Jonathan Mayer, Arvind Narayanan and Mihir Kshirsagar.

Automated analysis of privacy policies has proved useful in several research efforts, leading to results such as interactive, deep-learning-based policy summaries and compliance detection. These studies have highlighted the need for more sophisticated methods and data.

Analyses so far have been limited to a single point in time or to short spans of time, because researchers have not had access to a large-scale longitudinal dataset for studying how privacy policies have changed over time.

To address this gap, we are releasing a dataset of over 1 million privacy policies collected from the Internet Archive’s Wayback Machine. To build this dataset, we developed a custom crawler that detects and downloads privacy policies from archived web pages. We processed the downloaded policies to clean up error pages, extract the text of the privacy policies, and filter out non-policy documents using machine learning.
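As a rough illustration of the "detect and download" step, the sketch below finds candidate privacy policy links in a page's HTML and builds a Wayback Machine snapshot URL for a given capture. It is a minimal sketch under stated assumptions, not our crawler: the keyword heuristic and the names `PrivacyLinkFinder`, `find_policy_links`, and `wayback_url` are hypothetical and chosen only for this example. The snapshot URL form (`https://web.archive.org/web/<timestamp>/<original URL>`) is the Wayback Machine's public one.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

WAYBACK_PREFIX = "https://web.archive.org/web/"


class PrivacyLinkFinder(HTMLParser):
    """Collects hrefs of <a> tags whose link text or href mentions "privacy".

    Hypothetical heuristic for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).lower()
            if "privacy" in text or "privacy" in self._href.lower():
                self.links.append(self._href)
            self._href = None


def find_policy_links(html, base_url):
    """Return absolute URLs of links that look like privacy policy links."""
    finder = PrivacyLinkFinder()
    finder.feed(html)
    return [urljoin(base_url, href) for href in finder.links]


def wayback_url(timestamp, original_url):
    """Build the archived-snapshot URL for a 14-digit capture timestamp."""
    return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"
```

A keyword match like this produces false positives, such as news articles about privacy, which is one reason a later filtering step for non-policy documents is needed.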

Data Overview

This dataset contains over 1 million English-language privacy policy snapshots from more than 100,000 distinct websites chosen from the Alexa Top 100K lists for 2009-2019. In addition to sanitized privacy policy text and raw webpage HTML, the dataset includes metadata such as the archival time and the URL of the website that each policy belongs to. Although the dataset contains policies from as early as the late 1990s, more than 90% of the policies are from 2007 or later.
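To make the metadata concrete, here is a minimal sketch of how a snapshot record might be represented and how a year-based share like the one above could be computed. The schema is not final, so every name here (`PolicySnapshot`, `site_url`, `archived_at`, `policy_text`, `share_from_year`) is a hypothetical placeholder rather than the released format. The sketch also assumes archival times arrive as 14-digit Wayback Machine timestamps (`YYYYMMDDhhmmss`).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class PolicySnapshot:
    """Hypothetical record layout; the released schema may differ."""
    site_url: str          # website the policy belongs to
    archived_at: datetime  # archival time of the snapshot
    policy_text: str       # sanitized policy text


def parse_wayback_timestamp(ts):
    """Parse a 14-digit Wayback Machine timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def share_from_year(snapshots, year):
    """Fraction of snapshots archived in `year` or later (0.0 if empty)."""
    if not snapshots:
        return 0.0
    recent = sum(1 for s in snapshots if s.archived_at.year >= year)
    return recent / len(snapshots)
```

For example, `share_from_year(snapshots, 2007)` applied to the full set of records would be the kind of calculation behind the "more than 90%" figure.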

Obtaining access

Please send an email to privacy-policy-data@lists.cs.princeton.edu stating your name and affiliation.

Because we are still finalizing the data schema, format, and metadata, we would like to hear about any specific requirements you have.