Another note on differential privacy
Win-Vector Blog 2016-07-01
I want to recommend an excellent article on the recently claimed use of differential privacy to actually preserve user privacy: “A Few Thoughts on Cryptographic Engineering” by Matthew Green.
After reading the article we have a few follow-up thoughts on the topic.
Our group has written on the use of differential privacy to improve machine learning algorithms (by slowing down the exhaustion of novelty in your data):
- A Simpler Explanation of Differential Privacy: Quick explanation of epsilon-differential privacy, and an introduction to an algorithm for safely reusing holdout data, recently published in Science (Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth, “The reusable holdout: Preserving validity in adaptive data analysis”, Science, vol 349, no. 6248, pp. 636-638, August 2015). Note that Cynthia Dwork, one of the inventors of differential privacy, originally used it in the analysis of sensitive information.
- Using differential privacy to reuse training data: Specifically, how differential privacy helps you build efficient encodings of categorical variables with many levels from your training data without introducing undue bias into downstream modeling (see the sketch just after this list).
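To make the second item concrete, here is a minimal Python sketch of the idea (not the exact recipe from that post): encode a high-cardinality categorical variable by its per-level outcome rate, but add Laplace noise to the underlying counts so the encoding does not simply memorize the training outcomes. The function name dp_impact_code, the epsilon value, and the smoothing toward a prior rate are illustrative choices, not part of the original procedure.

```python
# Hypothetical sketch: differentially private encoding of a
# high-cardinality categorical variable. The noise keeps the encoding
# from memorizing individual training outcomes.
import numpy as np
from collections import defaultdict

def dp_impact_code(levels, y, epsilon=1.0, prior=0.5):
    """Map each categorical level to a noised estimate of P(y=1 | level).

    levels  : list of category labels (one per training row)
    y       : list of 0/1 outcomes
    epsilon : privacy parameter; smaller epsilon -> more noise
    prior   : fallback rate used to smooth rare levels
    """
    counts = defaultdict(lambda: [0.0, 0.0])  # level -> [n_rows, n_positive]
    for lv, yi in zip(levels, y):
        counts[lv][0] += 1.0
        counts[lv][1] += float(yi)

    rng = np.random.default_rng(2016)
    scale = 1.0 / epsilon  # Laplace scale for sensitivity-1 counts
    code = {}
    for lv, (n, npos) in counts.items():
        n_noised = max(n + rng.laplace(0.0, scale), 1.0)
        npos_noised = npos + rng.laplace(0.0, scale)
        # clamp to a legal probability and smooth toward the prior
        code[lv] = min(max((npos_noised + prior) / (n_noised + 1.0), 0.0), 1.0)
    return code

# Example: many levels, few rows per level -- exactly where a naive
# encoding over-fits and where the noise earns its keep.
levels = ["a", "a", "b", "c", "c", "c", "d"]
y      = [ 1,   0,   1,   0,   1,   1,   0 ]
print(dp_impact_code(levels, y, epsilon=1.0))
```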
However, these are situations without competing interests: we are just trying to build a better model. What about the original application of differential privacy: trading modeling effectiveness against protecting the people one has collected data on? Is un-audited differential privacy an effective protection, or is it a fig leaf that merely checks a data-privacy-regulation box?
A few points to ponder:
- How do you know a large vendor (such as Apple or Google) is actually using differential privacy in a meaningful way? Because they say so?
- Differential privacy has a built-in control parameter (the epsilon mentioned above) that trades privacy protection against modeling quality. Where do you think an un-audited entity will set that trade-off?
- Differential privacy is typically achieved by noising the data or by noising query results. Noising data is better for the user (though some of the common realizations can be broken if the same entity observes the database at two different times). Noising queries is better for the data consumer, but it is incredibly vulnerable to repeated-query attacks, coordinated-query attacks, and simple subpoenas (see the first sketch after this list).
- Noising data during collection (in my opinion the only currently plausible effective realization of such privacy; we are still waiting to see how far, and how practical, homomorphic encryption turns out to be) looks a lot like classic randomized response procedures (the second sketch after this list shows the classic version). Randomized response procedures are, for very good reason, sold as being under respondent control: respondents are essentially asked to answer yes to the question “are you in this unpopular category and/or did a coin you privately flipped come up heads?” The industrial version is: your mobile phone is going to collect everything about you, and there will be a click-through license claiming some of this may be noised up on your behalf.
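The repeated-query weakness mentioned above is easy to demonstrate. The following toy Python sketch (the epsilon and the count are made up for illustration) answers the same sensitivity-1 counting query many times under the Laplace mechanism; simply averaging the answers recovers the supposedly protected count, which is why a real deployment must track and enforce a cumulative privacy budget.

```python
# Toy illustration of why noising query results is fragile: answering
# the same counting query repeatedly with fresh Laplace noise lets an
# attacker average the noise away.
import numpy as np

rng = np.random.default_rng(0)
true_count = 1234          # the sensitive answer the noise is protecting
epsilon_per_query = 0.5    # privacy spent on each individual answer

def noised_answer():
    # Laplace mechanism for a sensitivity-1 count query
    return true_count + rng.laplace(0.0, 1.0 / epsilon_per_query)

for k in (1, 10, 100, 10000):
    estimate = np.mean([noised_answer() for _ in range(k)])
    print(f"{k:6d} repeated queries -> estimate {estimate:10.2f}")
# With enough repeats the estimate converges to the true count; a
# correct curator must refuse (or degrade) further answers once the
# cumulative privacy budget is spent.
```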
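And here is a minimal sketch of classic randomized response, the survey technique that data-noising collection resembles: each respondent reports yes if they are in the category or their private coin came up heads. The coin bias p_heads plays the role of the privacy knob from the earlier bullet: a more biased coin gives each respondent more deniability, at the cost of a noisier population estimate. The function names and the 10% true rate are illustrative, not from any vendor's scheme.

```python
# Classic randomized response: report yes if (in category) OR (coin
# came up heads). Individual answers carry plausible deniability, yet
# the curator can still recover the population rate.
import random

def randomized_response(truth, p_heads=0.5, rng=random):
    """Return the reported answer for one respondent."""
    return True if rng.random() < p_heads else truth

def estimate_rate(reports, p_heads=0.5):
    """Invert the noise: E[report] = p_heads + (1 - p_heads) * rate."""
    observed = sum(reports) / len(reports)
    return (observed - p_heads) / (1.0 - p_heads)

random.seed(7)
true_rate = 0.10
population = [random.random() < true_rate for _ in range(100_000)]
reports = [randomized_response(t) for t in population]
print("estimated rate:", round(estimate_rate(reports), 3))
```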
We’ll end with this: we think applications of differential privacy techniques to improving machine learning performance are still the most promising, as they don’t face the difficulty of trying to serve competing interests (modeling effectiveness versus privacy). A great example is the fascinating paper “The Ladder: A Reliable Leaderboard for Machine Learning Competitions” by Avrim Blum and Moritz Hardt. I’d like to think it is clever applications such as these that drive the current interest in differential privacy (post 2015). But it looks like all anyone cares about is Apple’s announcement.
[Google Trends: “differential privacy.” The spike is likely the Apple announcement.]

What I am trying to say: claiming the use of differential privacy should not be a “get out of regulation free” card. At best it is a tool that can be part of implementing privacy protection, and one that definitely requires ongoing detailed oversight and auditing.