Upgrade 2021: CIS LAB Speakers

September 21, 2021 // Upgrade 2021: CIS LAB Speakers

Recent Developments in Cryptography

Dan Boneh, Professor, Stanford University

Transcript of the presentation Recent Developments in Cryptography, given at the NTT Upgrade 2021 Research Summit, September 21, 2021.

Can we actually learn aggregate information about a user population without learning anything else about user data? Yes, we’d like to collect information in aggregate, but not learn anything about individual user data. And there are lots of examples where this comes up. For example, if you are shipping a product, you might be interested in learning how the population uses the product, but you’re not interested in learning how particular individuals use the product. For example, if you’re shipping a car, I guess in the past the car manufacturer would never get to see the car again once it left the factory. But of course these days cars are connected and they can constantly report back to the manufacturer, and the manufacturer can use that to improve the product. So for example, the manufacturer might be interested in learning how many people drive with their windows down, or how many people drive with their tires under-inflated, and so on.

And once they learn this aggregate information, they can use that to improve the product. And again, the point is that they should learn the aggregate statistic, but they shouldn’t learn who is driving with their windows down or who is driving with under-inflated tires. Similarly, web browsers might be interested in learning how people use the browser. And in fact, Mozilla actually uses the type of system that I’m going to tell you about to collect aggregate information about their users without learning anything specific about a particular user. Yeah, so we can use that to improve products and do it in a way that doesn’t reveal anything about the user. We can also ask more complicated questions like, what are the most common homepages that people set in their web browsers? So we only want to learn about, say, the top 100 common homepages, but we don’t want to learn anything else about anybody else.

And similarly, you can ask questions like, how many people are currently on the Golden Gate Bridge? But we don’t want to know who is on the Golden Gate Bridge. We just want to know aggregate information about the population. Yeah, so these kinds of questions come up all the time. And the question is, how do we do this aggregation privately and efficiently? I have to say that today, the common way this information is collected is basically by having the products send all the information back to the server, and then the server does the aggregation. Unfortunately, this means that the server gets to see a lot more information than it actually needs to. It’s only interested in aggregate data, but in fact, it gets to see everybody’s data in the clear. So we’d like to do something better that actually protects privacy and allows companies to learn information that they might not be able to learn if they were forced to see everything in the clear.

So how do we do it? Okay, so what we’re going to do is I’m going to use Mozilla as the example because they’re supposedly using a system like this. And so the server, like Mozilla, is interested in learning some aggregate information. We’re going to introduce a second party called a helper. And we’re going to assume that this helper is what’s called a non-colluding party. And I’ll talk more about non-colluding parties in just a minute. The way the system is going to work at a very high level is, if you have data on your phone, what you will do is you will actually secret-share this data between the server and the helper. So you produce two random numbers that happen to sum up to your data. So the left share plus the right share is equal to your data. You send the left share to the server and the right share to the helper. Neither party on its own gets to see any useful information about the actual data; they just see kind of random junk.
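
To make this secret-sharing step concrete, here is a minimal sketch in Python, assuming additive sharing modulo a large prime; the modulus and function names are illustrative and not taken from any deployed system.

```python
import secrets

# A large prime modulus for the additive secret-sharing field (illustrative choice).
P = 2**61 - 1

def split(value: int) -> tuple[int, int]:
    """Split `value` into two random shares that sum to `value` mod P."""
    left = secrets.randbelow(P)     # share sent to the server
    right = (value - left) % P      # share sent to the helper
    return left, right

def recombine(left: int, right: int) -> int:
    """Adding the two shares recovers the original value."""
    return (left + right) % P

# Example: a client reporting the bit "1" (say, windows rolled down).
left, right = split(1)
assert recombine(left, right) == 1  # each share alone is just a uniformly random number
```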

And basically every user is going to do this. Everybody splits their data between the server and the helper, in such a way that neither the server nor the helper gets to see the data in the clear. So one thing that’s interesting here is you might be wondering, well, what is this helper for? Why are we sending information to the helper? It turns out the helper can really be set up in such a way that it only receives truly random data. It sees nothing that’s derived from actual user data. And so you ask, well, why is this helper really there? The only purpose of the helper is to make sure that the server is not learning information about smaller sets than are allowed. So it’s kind of auditing the server to make sure that the aggregation happens over a large enough set and the server is not aggregating over a smaller set than it’s allowed to.

And you’ll see that in just a second. Okay, so now once we’ve collected information from all the users, possibly millions or maybe even billions of users that we got information from, the server has the left shares, the helper has the right shares. What they’ll do is basically they’re going to run what we call a very lightweight two-party computation. And the reason this needs to be a very lightweight computation is, again, because we want to scale this to billions of users and require very low computing resources on the server and the helper. And basically they learn the result of the analysis and nothing else about the data. So I want to give you a few examples of things that we can do today, and there are many more things that are coming. And so this is kind of a cool area to work in because there’s real industry demand for it.

Maybe even NTT wants to use these sorts of things. And so there’s a lot more work to do here. So let’s start with the simplest example, where what the server is interested in is basically computing a subset sum over the data that the users send over. For example, maybe we want to know, only for people in California, how many of them drive with their windows rolled down? So this would be a subset sum over a subset of the population. So I imagine many of you can already see a very simple protocol to do it. Basically what the server will do is it will send the description of the set over to the helper. The helper, again, is there to verify that the set is authorized: it’s large enough, and the aggregation is happening over a diverse enough population. If the set is actually sufficiently diverse, then what the helper will send back is the sum of all the right-hand shares in the set that the server requested.

Optionally, the helper can also add differential privacy noise to this, but I won’t talk much about the differential privacy aspect of this. So now the server receives basically a subset sum of the right-hand shares, and it adds the corresponding left-hand shares to this subset sum. And what it learns is basically the sum, again, potentially with differential privacy noise added in. Okay, so that’s a very simple system for learning sums. So for example, if you want to learn how many people are on the Golden Gate Bridge, everybody in California can send a bit: one if they’re currently on the Golden Gate Bridge and zero otherwise. You can sum over the population, and you will learn the number of people on the Golden Gate Bridge, but you will not learn who is on the Golden Gate Bridge.
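
As a rough sketch of this subset-sum exchange, here is one way it could look, continuing the sharing example above; the toy data, the size-check threshold, and the function names are all illustrative, and the optional differential-privacy noise is omitted.

```python
import secrets

P = 2**61 - 1  # same additive-sharing modulus as before (illustrative)

def split(value: int) -> tuple[int, int]:
    a = secrets.randbelow(P)
    return a, (value - a) % P

# Each client i splits its bit x_i into (server_share[i], helper_share[i]).
clients = {i: secrets.randbelow(2) for i in range(10_000)}   # toy data
server_share, helper_share = {}, {}
for i, x in clients.items():
    server_share[i], helper_share[i] = split(x)

def helper_reply(subset: set[int]) -> int:
    """Helper: check that the requested set is large enough (threshold here is
    an illustrative policy), then return the sum of its right-hand shares."""
    assert len(subset) >= 1_000, "refuse to aggregate over a small set"
    return sum(helper_share[i] for i in subset) % P

def server_aggregate(subset: set[int]) -> int:
    """Server: add its own left-hand shares to the helper's partial sum;
    the result is the true subset sum and nothing more."""
    return (sum(server_share[i] for i in subset) + helper_reply(subset)) % P

subset = set(range(5_000))
assert server_aggregate(subset) == sum(clients[i] for i in subset)
```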

Now it turns out there are interesting challenges around a system like this. For example, users might send incorrect shares. One user might send sort of an invalid secret share to both the server and the helper, and that would completely destroy this summation process. And so the challenge is to convince the server and the helper that all the submitted data is well-formed, and that is what is done in the Prio system. Prio enables a client to convince the servers that the data that it sent is well-formed. This ensures that when the servers calculate the aggregate, that aggregate will be correct. Prio is captured in a sequence of works. The first is by Henry Corrigan-Gibbs and myself. We generalized Prio in a follow-up work with Elette Boyle, Niv Gilboa, and Yuval Ishai. I want to also mention another system, called Prio+, by Addanki et al., that uses less communication, but the servers have to work harder to compute the aggregate.

So that is the very simplest example, for just computing sums, but there are much more interesting things we might be interested in computing. For example, we might want to compute what are called heavy hitters. So here, the goal is to produce basically the set of all data elements that were sent in by at least T users. Okay, so for example, maybe we want to learn the most popular URLs, but we only want to learn the popular URLs. We don’t want to learn URLs that only one or two people went to; that would be quite privacy-invasive. Okay, so here again, we’re going to design a very lightweight protocol that allows a server and a helper to construct the heavy hitters. And the server basically learns the T-heavy hitters and how many times each of them appeared, but it learns nothing else about the long tail of the distribution, which potentially is very privacy-sensitive.

So this is a recent paper that we published at the IEEE Security and Privacy conference. I hope this actually matches people’s needs at NTT. And if so, come see us; there’s actually a lot more to say. And I’ll describe how this protocol works in just a second. The third example I want to give you is what’s called a shuffle, where again, here the server and the helper share the data just as we said before. The purpose here is to introduce a very lightweight protocol such that, at the end, the server learns a permutation of the data. So it learns the total set of items that were sent by the clients, but it doesn’t learn which item belongs to whom. And again, this is useful if you want to compute things like histograms, an entire histogram, not just heavy hitters, but an entire histogram of the data.

And so again, you will learn a histogram, but you will not learn who did what. So this is joint work with my student Saba Eskandarian. It’s building on earlier work of Chase, Ghosh, and Poburinnaya. So hopefully I’ll have time and I’ll be able to describe this protocol as well. It’s kind of a cute way to construct these histograms. Okay, so that’s kind of a high level of what I wanted to talk about. Now let’s dive right in. So I’m hoping that I’ll have time to go through both the heavy hitters and the shuffle. I’m going to keep this at a very high level. What I say should be very clear to everybody. So please feel free to ask questions if I say something that’s not clear. Okay, so again, the goal is to basically learn the list of data items that appear more than T times, so more than T people sent this particular data item, like more than T people sent this particular URL. And nothing else. Now there’s a little star here because it turns out the most efficient protocols that we can give actually leak a little bit more information, and we’ll see exactly what that is.

Okay, so let’s start with how the scheme works. And so I’m going to go through a warm-up scheme first. And this scheme actually doesn’t really work, but it’ll explain the idea of where we’re headed. So here, the idea is the following. So we’re going to create a tree where the leaves of the tree are all possible URLs in the universe. Yeah, so let’s say that the URL is at most 32 bytes. So the leaves of this tree would be all possible 32 byte strings. Yeah, so this is a tree that has basically an exponential number of leaves. So don’t worry about that for a second. Let’s just ignore that for now, and we’ll come back and fix it in just a minute.

So we have this huge number of leaves, and then the URL that you, the client, want to report to the server as a popular URL basically corresponds to one of those leaves. Yeah, so here for example, we have 011 as the popular URL that you want to report, maybe your homepage that you want to report. And you realize this leaf here defines a path from the root to this particular leaf. So that’s the encoding of the URL. Now the client is going to do sort of what we saw before. It’s going to take the entire tree, and it’s going to secret-share it between the helper and the server. So it generates two random trees. Again, these are exponentially big trees, but don’t worry about that for now; we’re going to compress them later. It’s going to generate two random trees whose sum is actually the tree that the client wants to send over.

So you can see, for example, that if you sum up all the nodes, they’ll basically sum up to the tree that we want to send. Great, so the helper is going to save its share of the tree, and the server is going to save its share of the tree. So we have one client sending these shares, another client sending its shares, and so on and so forth. Everybody sends their shares. And what the servers do, let’s say there are M clients, what the servers will do is they will locally add up the M trees that they received from the clients. And now the server has the sum of the trees it received, and the helper has the sum of the trees it received. So far, this is just garbage. They learn nothing at all about user data because by themselves, these shares are just random data.
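
Here is a small sketch of this warm-up encoding for short strings (3 bits in the example), with every level of the tree written out explicitly and additively shared. This is purely illustrative: the real scheme never materializes these exponentially large levels and instead compresses them with the distributed point functions described later.

```python
import secrets

P = 2**61 - 1  # additive-sharing modulus (illustrative)

def encode_levels(bits: str) -> list[dict[str, int]]:
    """Warm-up encoding: for each level l of the tree, a vector over all
    length-l prefixes that is 1 at the client's prefix and 0 elsewhere."""
    levels = []
    for l in range(1, len(bits) + 1):
        level = {format(i, f"0{l}b"): 0 for i in range(2 ** l)}  # exponential!
        level[bits[:l]] = 1
        levels.append(level)
    return levels

def share_levels(levels):
    """Split every entry of every level into two additive shares."""
    server_view, helper_view = [], []
    for level in levels:
        s, h = {}, {}
        for prefix, value in level.items():
            s[prefix] = secrets.randbelow(P)
            h[prefix] = (value - s[prefix]) % P
        server_view.append(s)
        helper_view.append(h)
    return server_view, helper_view

def add_views(views):
    """Each server adds up, entry-wise mod P, the views it received from all clients."""
    total = [dict.fromkeys(level, 0) for level in views[0]]
    for view in views:
        for l, level in enumerate(view):
            for prefix, share in level.items():
                total[l][prefix] = (total[l][prefix] + share) % P
    return total

# A client reporting "011" sends one view to the server and one to the helper;
# each view on its own is uniformly random.
server_view, helper_view = share_levels(encode_levels("011"))
```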

Okay, so now what do we do? Now let’s pretend for a minute that the threshold is three. So what we’ll do is the following sort of descent process. Okay, so what the two servers will do is they will reveal the root of the tree. Okay. So these are shares of the root of the tree. If you sum them up, you actually get the total number of submissions that were sent from the users. Yeah, so here let’s imagine there were six submissions sent overall, so that’s the value of the tree at the root. Now they’re going to descend down to the second level of the tree, and they’re going to reveal the values. Now something interesting happens. You notice that the left subtree has only two items under it, which means that none of the elements under the left subtree exceed the threshold. So we can just ignore the entire left subtree, and only continue to descend down the right subtree.

And again, here, the right subtree this time can’t exceed the threshold, so we can ignore it, and we descend all the way down. And lo and behold, we find the heavy hitter, and it turns out 101 is a heavy hitter that corresponds to a very popular URL that happens to have a count of four. Okay, and so if we do this for all the nodes that are bigger than the threshold, the server and the helper will recover all the nodes that happen to have a count that’s bigger than T. Okay, so that’s kind of at a high level how the protocol works. Yeah, so it reveals the heavy hitters and their counts, and also the counts of all the heavy prefixes. That’s what’s revealed here. So this is great. There are no errors. The server learns exactly the set of heavy hitters. If one of the servers is honest, the server learns basically the set of all heavy prefixes.
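
To make the descent concrete, here is a sketch of the threshold pruning, written as if the two servers have already combined their share-sums at each revealed node, so we can work directly with the true per-prefix counts; the 3-bit universe, the names, and the example data are just for illustration.

```python
def heavy_hitters(count, n_bits, threshold):
    """`count(prefix)` returns the number of submissions whose string starts
    with `prefix` (the value the two servers recover when they combine their
    share-sums at that node). Descend the tree level by level, pruning any
    subtree whose count falls below the threshold."""
    frontier = [""]                                   # start at the root
    for _ in range(n_bits):
        next_frontier = []
        for prefix in frontier:
            for child in (prefix + "0", prefix + "1"):
                if count(child) >= threshold:         # only heavy prefixes are revealed
                    next_frontier.append(child)
        frontier = next_frontier
    return {leaf: count(leaf) for leaf in frontier}

# Example matching the talk: six submissions, threshold 3, and "101" sent four times.
submissions = ["101", "101", "101", "101", "010", "011"]
count = lambda prefix: sum(s.startswith(prefix) for s in submissions)
print(heavy_hitters(count, n_bits=3, threshold=3))    # {'101': 4}
```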

In particular, it learns the heavy hitters, plus a little bit more information, but this is basically all that’s revealed by this protocol. So the question is, are we done? And the answer is no, because this protocol doesn’t work: we have to split exponential-size trees between the helper and the server. So we have to do something more. Okay, so what do we do? Well, we’re going to compress these trees. And the way we’re going to compress them is using something called a distributed point function. Okay, so if you don’t know what a distributed point function is, I’m actually not going to try and explain it. It’s actually a super cool primitive, and I really encourage you to look it up and read about it. Distributed point functions are super, super useful tools when it comes to user privacy. And the way we’re going to use distributed point functions is basically to compress these trees. Effectively, what each user will do is send a distributed point function for every layer of the tree that it wants to split between the server and the helper.

So naively, if the tree height is N, this would take N distributed point functions, one for each layer of the tree. But we’re able to actually extend the construction of distributed point functions to what’s called an incremental distributed point function. And this allows us to actually only send one distributed point function, rather than N. Naively, we would have N-squared communication, and this trick allows us to reduce communication to a linear amount only. Okay, so that’s how we compress all the trees. And we enable a client to just send N values, where N, remember, is the length of the URL, so let’s say 32 bytes. So we just allow the client to send N values to the helper and the server, and that represents all the data in the trees. And then the protocol that I just showed you works really well.

So that’s one problem solved; this actually makes this into a workable protocol. But there’s another problem, in that misbehaving clients can send malformed secret shares and destroy the computation. Yeah, because remember the servers are summing over everything that all the clients send. So even if one client sends malformed data, that will destroy the entire computation. So in addition, we design a simple zero-knowledge proof that allows the client to prove that the secret shares actually correspond to a tree where only one leaf is non-zero. Yeah, this is again a very cute zero-knowledge proof, very, very efficient and very, very lightweight to verify. And so that basically solves the data integrity problem as well. Yeah, so the end result here is if you have N clients and each client is sending a URL of length N, basically the server’s work is N times N times N to get a single heavy hitter.

So if there are, let’s say, K heavy hitters, this is the amount of work that the server would have to do for each one of those heavy hitters. And the paper actually has a lot of experiments to show that this is actually very practical for a very large number of users. All we’re doing is symmetric AES operations, so this is all actually quite fast. But can we actually do better? Can we reduce the dependence on N? And the answer is yes, if all you want to compute is a histogram rather than heavy hitters. Again, the difference is that in a heavy-hitter protocol, we don’t reveal the tail of the distribution. We only reveal the part of the distribution that appears many, many times, that has high weight. In a histogram, we would reveal the entire distribution. So the question is, can we do a histogram more efficiently? And the answer is, yes, we can use shuffling protocols to do histograms more efficiently.

So let me explain. Let’s see, I only have a few minutes, so I think I can actually get through it. So let me explain what this does. So the idea is basically to use a mix net, but a non-traditional mix net. So in a traditional mix net, what happens is every client is going to send an encryption of its data to the mix net. The mix net is going to mix up all the ciphertexts, and then it’s going to apply some sort of a decryption operation, so that what’s revealed is a shuffle of the data. So again, the elements are revealed in the clear, but the result of the mix net doesn’t reveal which client sent which data. The problem with this is that it actually requires quite a lot of public-key operations. So it’s fairly heavyweight in terms of public-key work.

But more importantly, a mix net actually has to implement what’s called a verifiable shuffle. Effectively, the mix servers have to prove that what they’re doing is literally just shuffling the data: they’re not, say, changing the data in some way, and they’re not dropping some of the data. It turns out that if you don’t prove this, that nothing has been replaced and nothing has been dropped, all privacy goes out the window. And so really we have to have what’s called a shuffle proof. And, for example, one of these famous shuffle proofs is what’s called the Bayer-Groth shuffle proof. And so for robust mixing, each server has to produce a proof that it correctly shuffled the data. So this is again quite expensive, and we’d like to do something that is cheaper.

Okay, so again, our approach is going to be quite different. We’re not going to be doing an encryption-based mix; we’re going to be doing a secret-sharing-based mix. Okay, so what will happen is, and I’ll come back to this MAC key in just a minute, but effectively, each client is going to secret-share its data between the server and the helper. And then the clients are also going to secret-share a MAC key, and they’re going to secret-share a MAC of the data between the server and the helper. These MACs, these integrity checks, are basically going to replace the shuffle proof that was needed in a traditional mix. Yeah, so we’re relying on very, very simple MACs for robustness. All right. So what happens now is basically the server and the helper collect the contributions from the individual clients. They apply a robust shuffle protocol, which I’ll show you in a second. And this gives them the final histogram.

It turns out if you do it this way, you get a system that’s actually 10 times faster than what you would get from Bayer-Groth. Yeah, so secret sharing again is a very lightweight mechanism that is very amenable to scaling, and in particular it works so well because it only relies on symmetric primitives. So maybe I’ll just say very briefly how the efficient, robust mixing works. Basically again, you can see every client sends the data along with a MAC, with a share of a MAC key and the MAC. And then what happens is, there’s sort of a lightweight two-party computation between the server and the helper where, inside the multi-party computation, they check these very simple linear MACs. They drop everything that’s not properly MACed.
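
Before continuing, here is a rough sketch of that MAC relation, not the actual two-party protocol (which checks these relations on shares inside the lightweight secure computation): each client shares its value, a fresh MAC key, and the tag key times value, and any submission whose tag fails to verify is dropped. The field and the names are illustrative.

```python
import secrets

P = 2**61 - 1   # prime field for shares and MACs (illustrative)

def client_submission(x: int):
    """The client shares its value x, a fresh MAC key k, and the tag t = k * x.
    Each of the three quantities is split into two additive shares mod P."""
    k = secrets.randbelow(P)
    t = (k * x) % P
    def split(v):
        a = secrets.randbelow(P)
        return a, (v - a) % P
    return split(x), split(k), split(t)   # (server share, helper share) for each

def mac_ok(x_shares, k_shares, t_shares) -> bool:
    """The invariant the servers check (in the real protocol, on shares inside
    the two-party computation): the tag must equal key * value."""
    x = sum(x_shares) % P
    k = sum(k_shares) % P
    t = sum(t_shares) % P
    return t == (k * x) % P

xs, ks, ts = client_submission(42)
assert mac_ok(xs, ks, ts)   # a malformed submission would fail this check and be dropped
```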

Then they implement a semi-honest shuffle that’s due to Chase et al. And then finally they abort if the result has any invalid MACs, because that means that the semi-honest shuffle was not done correctly. So these MACs are basically what provides robustness. And again, at the end, you get the histogram. Okay, so that’s I think all I’m going to say about these types of protocols. As you can see, there’s a lot of detail, but I think I’d rather actually stop here. Okay, so again, as I said, this is like 10 times faster than what you would get from a traditional shuffle proof. All right. So what we talked about is basically kind of two approaches for data aggregation. Basically, we’re assuming two non-colluding servers. By making that assumption, we’re able to basically just use symmetric primitives.

We don’t need to rely on public-key primitives. You should be asking, well, where is this non-colluding server going to come from? Who plays the role of the non-colluding server? And it turns out when industry needed to find a solution to it, they found a solution. And in fact, there’s this organization called the ISRG. They also run the Let’s Encrypt CA, and they’re quite happy to participate as a second non-colluding server. Presumably this could be a good business opportunity for them: companies might pay them to act as a second non-colluding server. By definition, they cannot be acquired, because the minute they are acquired, they’re no longer non-colluding. So it’s a nice way to sustain sort of a long-term independent service. And so, I guess, when it’s needed, industry finds a solution. And so we looked at basically constructing heavy hitters using counting along the tree.

We looked at building histograms using shuffles, and there’s a whole world of problems here. There’s a whole world of questions: are there lightweight protocols for other large-scale data analyses that, again, scale to billions of users, and are robust against corrupt clients that try to destroy the computation? Okay, so that’s I think all I wanted to say, and I hope this was clear. And I think I’ll stop here, and I’m happy to take any questions if there’s time for any questions. Thank you very much.

Dan Boneh

Professor, Stanford University

Professor Boneh heads the applied cryptography group and co-directs the Computer Security Lab. Professor Boneh’s research focuses on applications of cryptography to computer security. His work includes cryptosystems with novel properties, web security, security for mobile devices, and cryptanalysis. He is the author of over a hundred publications in the field and is a Packard and Alfred P. Sloan fellow. He is a recipient of the 2014 ACM Prize in Computing and the 2013 Gödel Prize. In 2011 Dr. Boneh received the Ishii Award for industry education innovation. Professor Boneh received his Ph.D. from Princeton University and joined Stanford in 1997.