To customize their products and services and offer ever better features, companies need information about their users. The more they know, the better for them and (allegedly) the better for their users. But of course, much of this information is sensitive or confidential, which poses a serious threat to users’ privacy.
So, how can a company know everything about its customers and at the same time not know anything about any particular customer? How can their products provide great features and great privacy at the same time?
The answer to this paradox lies in ‘differential privacy’: learning as much as possible about a group while learning as little as possible about any individual within it. Differential privacy makes it possible to extract knowledge from large data sets, with a mathematical guarantee that no one can learn anything about any single individual in the set. Thanks to differential privacy, you can know your users without violating their privacy. But first, let’s look at the threat that large data sets pose to privacy.
Neither Anonymity nor Great Queries Ensure Privacy
Imagine that a hospital keeps records of its patients and hands them to a company for statistical analysis. Of course, the hospital deletes personally identifiable information (name, surname, ID, address and so on) and keeps only each patient’s birth date, sex and zip code. What could go wrong?
In 2015, the researcher Latanya Sweeney carried out a re-identification attack on a set of anonymized hospital records. Brace yourself: using nothing but newspaper stories, she was able to personally identify (with names and surnames) 43% of the patients in the anonymized database. In earlier work, she had already shown that 87% of the US population is uniquely identified by birth date, gender and zip code.
As you can see, naive database anonymization techniques fail miserably. Worse still, the more anonymized a database is (the more personally identifiable information has been deleted), the less useful it becomes.
What if we only allow queries over large groups of records, never over specific individuals? The ‘distinguishing attack’ defeats this defense too. Imagine it is known that Mr. X appears in a given medical database. We issue the following two queries: ‘How many people suffer from sickle cell anemia?’ and ‘How many people not named X suffer from sickle cell anemia?’ Together, the answers to the two queries reveal the sickle cell status of Mr. X.
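To make the attack concrete, here is a minimal sketch in Python. The records and names are hypothetical; the point is that two innocent-looking aggregate queries, subtracted, pin down one person’s diagnosis.

```python
# Each record: (name, has_sickle_cell_anemia). Toy data, invented for the demo.
database = [
    ("Mr. X", True),
    ("Ms. A", False),
    ("Mr. B", True),
    ("Ms. C", False),
]

def count_where(db, predicate):
    """An aggregate query: how many records satisfy the predicate?"""
    return sum(1 for record in db if predicate(record))

# Query 1: how many people suffer from sickle cell anemia?
q1 = count_where(database, lambda r: r[1])

# Query 2: how many people *not named Mr. X* suffer from it?
q2 = count_where(database, lambda r: r[1] and r[0] != "Mr. X")

# Each query alone looks harmless, but their difference
# reveals Mr. X's diagnosis exactly:
mr_x_has_condition = (q1 - q2) == 1
print(mr_x_has_condition)  # True
```

No query ever returned information about a single row, yet the attacker learns Mr. X’s status with certainty.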
According to the Fundamental Law of Information Recovery:
‘Overly accurate answers to too many questions will destroy privacy in a spectacular way.’
And do not think that banning this pair of questions prevents distinguishing attacks: the mere fact of rejecting a paired query leaks information by itself. Something more is required to ensure privacy while still doing something useful with databases. There are several proposals for achieving differential privacy. Let’s start with a very simple technique that psychologists have used for over 50 years.
Do You Want Privacy? Add Noise
Imagine that I want the answer to an embarrassing question: have you ever scarfed down a can of dog food? As it is a delicate matter, I propose that you answer as follows:
- Flip a fair coin.
- If it’s heads, flip the coin again and, whatever you get, tell the truth.
- If it’s tails, then flip it again and say ‘yes’ if it’s heads and ‘no’ if it’s tails.
Now your confidentiality is safe, because no one can know whether you answered truthfully or reported a random result. Thanks to this randomization mechanism you have plausible deniability: even if your answer is observed, you can deny it and no one can prove otherwise. And if you are wondering why the coin is flipped an extra time in the first case when its result is ignored, it is to protect you in situations where someone may be watching you flip the coin.
And what about the accuracy of the study? Is it still useful with all that random data mixed in? It turns out it is. Since the statistical distribution of coin flips is perfectly known, its effect can be removed from the data without any problem.
|Be careful! Math! Don’t keep reading if you can’t stand equations. People who do not eat their dog’s food answer ‘yes’ with probability 1/4, and people who do answer ‘yes’ with probability 3/4. Therefore, if p represents the ratio of people who scarf cans of dog food, we expect to get (1/4)(1-p)+(3/4)p positive responses. Consequently, it is possible to estimate p. And the more people are asked, the closer the calculated value of p will be to the real value.|
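The coin-flip protocol and the estimation formula above can be simulated in a few lines of Python. The true rate p_true = 0.30 is a value invented for the demo; the point is that the estimator 2·(yes fraction − 1/4) recovers it from purely deniable answers.

```python
import random

def randomized_response(truth: bool) -> bool:
    """The coin-flip protocol from the bullet list above."""
    if random.random() < 0.5:       # first flip: heads
        random.random()             # second flip, result ignored: answer truthfully
        return truth
    # first flip: tails -> a second flip decides the answer
    return random.random() < 0.5    # heads -> 'yes', tails -> 'no'

# Simulate a survey over a population with a known true rate p
# (p_true = 0.30 is hypothetical, chosen just for this demo).
random.seed(42)
p_true = 0.30
n = 100_000
truths = [random.random() < p_true for _ in range(n)]
answers = [randomized_response(t) for t in truths]

# Expected 'yes' fraction is (1/4)(1 - p) + (3/4)p = 1/4 + p/2,
# so inverting that expression gives an estimator for p:
yes_fraction = sum(answers) / n
p_estimate = 2 * (yes_fraction - 0.25)
print(round(p_estimate, 3))  # lands close to 0.30
```

Notice that no individual answer can be traced back to a truthful confession, yet the aggregate estimate converges to the real rate as n grows.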
As it happens, this idea (with some added sophistication) was adopted by Google in 2014 for its RAPPOR (Randomized Aggregatable Privacy-Preserving Ordinal Response) project. According to Google, “RAPPOR provides a new and modern way to learn software statistics that we can use to better safeguard our users’ safety, find bugs and improve the overall user experience”.
All while protecting users’ privacy, of course. Or so they say. The good news is that the RAPPOR code is public, so you can examine it yourself to verify it.
Differential Privacy Beyond Randomized Responses
Randomized response is a simplified way to achieve differential privacy. More powerful algorithms use the Laplace distribution to spread noise across all the data and thus increase the level of privacy. And there are many others, collected in the freely downloadable book The Algorithmic Foundations of Differential Privacy. What they all have in common, though, is the need to introduce randomness in one way or another, typically measured by a parameter ε, which can be made as small as desired.
The smaller ε is, the greater the privacy of the analysis and the lower the accuracy of the results: the more information you try to extract from your database, the more noise you need to inject to limit the privacy leakage. You inevitably face a fundamental trade-off between accuracy and privacy, which can be a serious issue when training complex Machine Learning models.
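A minimal sketch of the Laplace mechanism mentioned above, assuming a simple count query (whose sensitivity is 1, since adding or removing one person changes the count by at most 1). The helper name and the numbers are invented for illustration; a Laplace sample is built here as the difference of two independent exponential samples.

```python
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Return the true answer plus Laplace noise of scale sensitivity/epsilon.
    The difference of two independent Exponential(1/scale) samples
    is distributed as Laplace(0, scale)."""
    scale = sensitivity / epsilon
    return true_value + random.expovariate(1 / scale) - random.expovariate(1 / scale)

# A count query over a database: sensitivity 1.
random.seed(0)
true_count = 1000
for epsilon in (1.0, 0.1, 0.01):
    noisy = laplace_mechanism(true_count, sensitivity=1.0, epsilon=epsilon)
    print(f"epsilon={epsilon}: noisy answer = {noisy:.1f}")
```

Running this makes the trade-off tangible: at ε = 1.0 the answer stays near 1000, while at ε = 0.01 the noise scale is 100 and the answer can be wildly off.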
And what’s even worse: no matter how small ε is, every query leaks information, and with each new query the leak grows. Once you exhaust the privacy budget you set in advance, you cannot continue without leaking personal information. At that point, the best solution may be to simply destroy the database and start over, which is hardly ever feasible. The price to pay for privacy, then, is that the result of a differentially private analysis will never be exact, but an approximation with an expiration date. You cannot have it all!
Or maybe you can? Fully homomorphic encryption and secure multi-party computation allow analyses that are 100% private and 100% accurate. Unfortunately, these techniques are still far too inefficient for real applications at the scale of Google’s or Apple’s.
Too Pretty to Be True: Where Is the Trick?
Since Apple announced in 2016 that iOS 10 would include differential privacy, the concept has moved from cryptographers’ whiteboards to users’ pockets. Unlike Google, Apple has not released its code, so we cannot know exactly which algorithm it uses or whether it is applied with real guarantees.
In any case, it seems a positive sign that giants like Google and Apple are taking steps, however timid, in the right direction. Thanks to cryptography, you have resources at your fingertips to know your users and safeguard their privacy at the same time. Let us hope the use of these algorithms becomes widespread and that other giants, such as Amazon or Facebook, start implementing them too.