When you sign up for a new social network, such as WhatsApp, you are often asked if you want to find out who among your contacts is already part of that social network (contact discovery). But if you don’t want to provide your full list of contacts, how do you know if any of them is on the social network without sharing your address book?
Countries around the world are developing apps to track COVID-19 infections down. In Spain, for example, the pilot COVID Radar was launched in the island of La Gomera at the end of June. These apps arouse many misgivings about privacy. Would it be possible to find out if you have been in contact with any infected person without either you or the server knowing exactly who it is?
Or imagine that a laboratory has discovered a drug against COVID-19 that works only with people who have certain genes. How can the laboratory know if you have those genes without you revealing your whole genome or the laboratory revealing what those specific genes are?
What Is Happening?
Big Data and cloud computing are giving rise to a multitude of situations where two parties each have a set of data that they want to share with the other part in order to find the intersection between the two sets. But only by revealing the data in common. This is an old problem of cryptography known as Private Set Intersection (PSI) that is experiencing a strong resurgence. In addition to the three scenarios already mentioned, there are many other use cases:
- A company develops a remote diagnostic app for the COVID-19 with extraordinary accuracy from a list of symptoms provided by the patient. The patient does not wish to reveal his symptoms to the company, nor does the company wish to reveal to anyone what symptoms it uses for diagnosis.
- The same person receives medical care in different locations. Different administrations want to know which patients have visited health centres in other communities without disclosing their list of patients to each other.
- In order to conduct an international operation, several national cybersecurity agencies want to find the intersection between their criminal IP databases without revealing their complete lists of IPs to each other.
- An advertising agency sends an online ad to a group of users. Some of these users subsequently buy the product in a physical shop. The agency wants to find the intersection between the group of users who saw the product ad and those who bought it in physical shops (online-to-offline ad conversions).
- One healthy food company serves meals to many employees of another company, which performs medical tests on them twice a year. The catering company wants to know if employees who have lowered their cholesterol in the last year consumed their food, but the other company does not want to (and should not) disclose its employees’ health data.
The more traction the cloud and Big Data gain, bigger the amount of new use cases arising every day: detection of botnets, identification of cheats in online games, sharing of locations, discovery of compromised credentials, etc
The need to make this intersection of sets in a private way has become crystal clear, but the question is: how do we achieve that? Cryptography offered numerous PSI techniques, from the hash-functioning naive solution to semi-confidential third-party protocols and protocols involving only two parties. Let’s have a quick look at how they work.
The Naïve Solution With Hash Functions
It consists of comparing the hashes of each element of both sets. If two hashes match, then an adjustment has been made. This approach, which was used by WhatsApp at the time, is simple and fast, but it is not safe because with a small data set or low entropy, such as telephone numbers, it is perfectly feasible to perform a brute-force attack, calculating the hashes of all possible elements. In this way, structured with the list of hashes of all the phones, WhastApp would not only know the contacts you share, but the phone numbers of all your contacts!
In the same way, this approach is not suitable for comparing ID cards, simple identifiers, names, etc. It only provides security when the data to be compared is random or have a high entropy.
PSI Based On Semi-Confidential Third Parties
Another, more solid approach is to pass each element of the assemblies through the same HMAC function with a secret key that the two parties, Alice and Bob, have agreed upon in advance. They send their randomised data to the third party, Trent, who returns the intersection set to each of them. Since Alice and Bob have each kept a private table with the outputs of the HMAC function for each data of their respective sets, they can search this table for settings and determine which elements they share.
The maximum information filtered to Trent is the cardinality of the intersection set, this is to say, how many elements Alice and Bob have in common; and the cardinality of Alice and Bob´s sets, since Trent will know whether Alice has more elements in her set than Bob or vice versa. Of course, Trent could turn out to be malicious and try to deceive Alice and Bob, for example, by returning a different intersection set to the real one. Fortunately, there are simple adjustments to this protocol to avoid this type of manipulation by Trent. What you will never discover is Alice and Bob’s data.
PSI Based On Two Parties
What if you don’t want to depend on a third party? No problem. There are many alternative approaches in which only the two parties involved who want to find the intersection between their sets interact.
One of the first and conceptually simpler approaches proposed is based on the cryptographic premises of the famous Diffie-Hellman protocol for agreeing session keys between two parties through an insecure channel. In this case, Alice and Bob apply the DH protocol to share one session key for each data in their respective sets. Any shared key found in both sets indicates that the corresponding element is a member of the original sets of both parties. Let’s see how this works in detail:
- Alice and Bob agree on a large prime number, p, using a public channel
- Alice randomly generates a private key, a.
- Alice calculates the hash, gi, of each of the values of her original set. In fact, this step is a bit more complicated, since the hash must be repeated until g is a primitive root mod p, but we won’t go into the mathematical details.
- For each of these gi values, Alice calculates the gia value mod p.
- Alice sends these values to Bob.
- Bob randomly generates a private key, b.
- Bob repeatedly calculates the hash, hi, of each of the values of his original set, until they are primitive roots mod p.
- For each of these hash values, Bob calcula hib mod p.
- Bob calculates the shared keys corresponding to each element of Alice’s original set, raising the values received from Alice to the power of his private key, this is to say, giab mod p.
- Bob sends his calculated values, hib mod p, to Alice, as well as the calculated shared keys corresponding to the elements of Alice’s original set, giab mod p.
- Alice calculates the shared keys corresponding to each element of Bob’s original set by raising the values received from Bob to the power of his private key, this is to say, hiba = hiab mod p, for each of the values received from Bob.
- Alice compares the shared keys calculated from the elements of her own original set, giab, with the shared keys calculated with the elements of Bob, hiab. The intersection consists of those elements of Alice’s original set whose shared key can also be found in the set of shared keys calculated from the elements of Bob’s original set, giab = hiab.
Since the publication of this protocol, dozens of alternatives using increasingly sophisticated cryptographic primitives have appeared. Highly elaborate protocols based on other public key algorithms have been proposed, such as blind RSA operations; based on Bloom filters; on fully homomorphic encryption; on oblivious transfer (OT) to transmit set data; or on variants of Yao’s Garbled Circuit, capable of simulating any mathematical function with a Boolean circuit using only AND and XOR logic gates.
Security Challenges and PSI Scalability
The security and scalability challenges faced by all these protocols in calculating the private intersection of sets are varied:
- The most efficient protocols work for small sets of a few hundred or thousands of elements. However, in many real applications, sets of billions of data need to be compared, which requires finding faster alternatives.
- In addition to requiring only few operations to function, it is important to minimise communications for data exchange between the parties.
- Not all agents involved will play by the rules. In secure multiparty computing, two types of opponents are considered: semihonest (or passive) and malicious (or active). The semihonest opponent tries to obtain as much information as possible from the execution of a certain protocol, without drifting from the steps of the protocol. More dangerous and realistic is the malicious opponent, because he arbitrarily drifts away from the protocol steps to take advantage and obtain more information than the others. PSI protocols that are resistant to malicious opponents are considerably heavier and less efficient than protocols that are resistant to semihonest opponents.
- The simplest PSI approaches filter information: at a least, the number of elements in each set and the number of elements in the intersection set. In applications where it is not even acceptable to filter this information, more secure protocols are required, which unfortunately require more operations and more bandwidth.
The More the Cloud and Big Data Advance, the Greater the Demand for PSIs
As data protection laws and regulations evolve in an effort to safeguard the private sphere of citizens’ lives, the private intersection of sets will enable public and private organisations to continue to generate knowledge from the Big Data that benefits the citizen, while satisfying privacy regulations.
In June 2019, Google announced a tool to perform operations on the intersection of sets called Private Join and Compute. According to the press release:
Using this cryptographic protocol, two parties can encrypt their identifiers and associated data and then join them in a consultation. They can then make certain types of calculations on the overlapping data set to obtain useful information from both data sets together. All entries (identifiers and their associated data) remain fully encrypted and unreadable throughout the process. None of the parties ever reveals their raw data, but they can still answer the questions raised using the calculation output. This result is the only thing that is decoded and shared in the form of aggregated statistics such as a count, sum or average of the data from both sets.
Private Join and Compute combines the private intersection of sets with full homomorphic encryption to protect individual data. The following video gives an idea of how it works:
PSI represents the intersection between the voracity of data from large organisations and the right to privacy of citizens. Technological giants such as Google, Microsoft or Baidu are investing enormous amounts of money and cryptographic neurons in these technologies. In the coming months we will see where mass data analysis applications turn, whether to favour citizens with better services or to further reduce their battered privacy. After all, as the cryptographer Phil Rogaway said:
“Surveillance that preserves privacy is still surveillance.”