A simple differentially private procedure
Win-Vector Blog 2015-10-13
Authors: John Mount and Nina Zumel
Nina and I were noodling with some variations of differentially private machine learning, and think we have found a variation of a standard practice that is actually fairly efficient in establishing differential privacy.
Read on for the idea and a rough analysis.
The idea
A commonly discussed step in establishing differential privacy is to add some Laplace distributed noise to queries. It works (when used in conjunction with other steps), but it can seem mysterious. We think bootstrap resampling should be considered more seriously as a component in differential privacy (despite some references claiming it is not enough on its own).
The method
Consider the following alternate technique to defend the test set in a differential privacy scheme.
Let n
be the number of examples in the set you are using “as test.” Pick a positive real number Z < n
. Think of Z
as taking on a value like 10.
When it comes time to compute a noisy test-set score instead score a bootstrap re-sample of the test set of size ceiling(n/Z)
where n
is the size of your actual test set. Each time a score is asked for: re-do the Bootstrap. This re-sampling introduces empirical sampling noise (a standard trick) and by varying the size of Z
we can vary the amount of noise (similar to varying the Laplace distribution parameter in Laplace noise addition).
The question is: does this establish differential privacy, and does it do it while introducing only a reasonable amount of variance? Unless we got our calculations wrong we think it does establish differential privacy and at a cost that is competitive with the Laplace noise techniques.
We work only one example, but the calculation gives the idea.
The argument
Suppose in a differential privacy proof the adversary submits two sets: one that is n
zeros, and one that is n-1
zeros and one one. The query is what fraction of the rows are non-zero. We claim a bootstrap re-sample of size ceiling(n/Z)
roughly establishes differential privacy with epsilon = 1/Z
and a variance of Z/(epsilon n^2)
for Z,epsilon,n
in an appropriate range.
Consider what happens when sampling from the set with the 1-row. The number of times the unique 1-row is copied into the bootstrap sampling is Poisson distributed with expected value of 1/Z
. This follows from the linearity of expectation: we have a sum of n/Z
rows in the bootstrap sample each of which has an expected value of 1/n
.
Estimating single-query differential privacy
By Markov’s inequality we know P[count ≥ 1] ≤ E[count]
. So it follows the 1-row shows up at all in the bootstrap set with probability no more than 1/Z
. As the presence of this row is the only way to tell the sets apart we have epsilon
differential privacy with epsilon = 1/Z
(pretty much by definition). Or: with Z=1/epsilon
we can hope for epsilon
differential privacy.
Estimating the variance
Now consider the variance of the frequency estimate we are returning. Because the 1-row count is a Poisson process we know it has variance equal to its mean. So the 1-row count is a random variable with mean 1/Z
and variance 1/Z
. Frequency is count/setSize
which is count/(n/Z)
. So the frequency estimate is a random variable with mean (1/Z)/(n/Z) = 1/n
and variance (1/Z)/(n/Z)^2 = Z/n^2
. Substituting in Z = 1/epsilon
we get variance of the frequency estimate of 1/(epsilon n^2)
.
The result
So we should expect this re-sized bootstrap scheme with Z=1/epsilon
to achieve epsilon
differential privacy at a cost of introducing 1/(epsilon n^2)
units of variance. From reading references this seems like a favorable privacy/variance trade.
Conclusion
It may be possible to replace Laplace noise methods with re-sized Bootstrap resampling in various differential privacy establishing algorithms. Obviously bootstrapping is a fairly standard technique and others have noted its relation to differential privacy (examples: Ji Z, Elkan C. Differential privacy based on importance weighting. Machine learning. 2013;93(1):163-183. doi:10.1007/s10994-013-5396-x link; “A Bootstrap Mechanism for Response Masking in Remote Analysis Systems”, Krishnamurty Muralidhar, Christine M. O’Keefe, and Rathindra Sarathy, Article first published online: 14 SEP 2015, Decision Sciences DOI: 10.1111/deci.12168 link). But perhaps the bootstrapping method (especially with the change of set size) deserves to be used more often.