Debunking the Myth of “Anonymous” Data

Deeplinks 2023-11-10

Summary:

Today, almost everything about our lives is digitally recorded and stored somewhere. Each credit card purchase, personal medical diagnosis, and preference about music and books is recorded and then used to predict what we like and dislike, and—ultimately—who we are.

This often happens without our knowledge or consent. Personal information that corporations collect from our online behaviors sells for astonishing profits and incentivizes online actors to collect as much as possible. Every mouse click and screen swipe can be tracked and then sold to ad-tech companies and the data brokers that service them.

In an attempt to justify this pervasive surveillance ecosystem, corporations often claim to de-identify our data. This supposedly removes all personal information (such as a person’s name) from the data point (such as the fact that an unnamed person bought a particular medicine at a particular time and place). Personal data can also be aggregated, whereby data about multiple people is combined with the intention of removing personal identifying information and thereby protecting user privacy.

Sometimes companies say our personal data is “anonymized,” implying a one-way ratched where it can never be dis-aggregated and re-identified. But this is not possible—anonymous data rarely stays this way. As Professor Matt Blaze, an expert in the field of cryptography and data privacy, succinctly summarized: “something that seems anonymous, more often than not, is not anonymous, even if it’s designed with the best intentions.”

Anonymization…and Re-Identification?

Personal data can be considered on a spectrum of identifiability. At the top is data that can directly identify people, such as a name or state identity number, which can be referred to as “direct identifiers.” Next is information indirectly linked to individuals, like personal phone numbers and email addresses, which some call “indirect identifiers.” After this comes data connected to multiple people, such as a favorite restaurant or movie. The other end of this spectrum is information that cannot be linked to any specific person—such as aggregated census data, and data that is not directly related to individuals at all like weather reports.

Data anonymization is often undertaken in two ways. First, some personal identifiers like our names and social security numbers might be deleted. Second, other categories of personal information might be modified—such as obscuring our bank account numbers. For example, the Safe Harbor provision contained with the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that only the first three digits of a zip code can be reported in scrubbed data.

However, in practice, any attempt at de-identification requires removal not only of your identifiable information, but also of information that can identify you when considered in combination with other information known about you. Here's an example:

First, think about the number of people that share your specific ZIP or postal code.
Next, think about how many of those people also share your birthday.
Now, think about how many people share your exact birthday, ZIP code, and gender.

According to one landmark study, these three characteristics are enough to uniquely identify 87% of the U.S. population. A different study showed that 63% of the U.S. population can be uniquely identified from these three facts.

We cannot trust corporations to self-regulate. The financial benefit and business usefulness of our personal data often outweighs our privacy and anonymity. In re-obtaining the real identity of the person involved (direct identifier) alongside a person’s preferences (indirect identifier), corporations are able to continue profiting from our most sensitive information. For instance, a website that a

Authors:

Paige Collings

Date tagged:

11/10/2023, 11:08

Date published:

11/10/2023, 08:49