Dr. Leemon Baird @ CyberEconomics and Security Conference 2017
October 2, 2017
Transcript Leemon @ CESC 2017
Today we’re going to talk about the need that we have for maths. That we really need to be proving the security of our systems. We’re building lots of different consensus systems but we really do need to get down to proving their securityˆ.
Asynchronous Byzantine Fault Tolerant (ABFT) is the highest level of security in a distributed system but there are various less secure level that can also be proved. The analysis is important.
I’m going to talk briefly about what is this byzantine thing. It’s kind of important. Then the question is can you actually achieve that in a fast system? Well let me give you an example. The system we have is Hashgraph. It is extremely fast. 100,000s of transactions per second. Sub-second Latency. But it is ABFT. So I’m going to give that to you as an example of one that you may find interesting. It is possible to have a rigorous math proof that something is ABFT but it’s also practical and you can implement it and it runs very fast.
I don’t know what you find entertaining but I kind of enjoy case studies. If someone developed an algorithm then I’m kind of curious to know what were the attacks they were thinking about while developing the algorithm. I’m going to talk about some of the attacks I had in my mind when developing Hashgraph. And it’s kinda fun to think about attacks. So I’m going to give a case study of that, hopefully it’s entertaining but the main message that I want to give is that what we need to do is get down to doing the maths.
What is Byzantine Fault Tolerance (BFT)?
When we say that something is byzantine fault tolerant (BFT) this goes back decades. We’re saying we’re trying to get consensus on something, either a yes/no question or the ordering of transactions. And typically we’re going to start with talking about just a single shard so it’s a replicated state machine. Every full node needs to get every transaction and we want to come to an agreement on what the order is. Simpler form is we’re going to come to agreement on a yes/no question. That’s the original form we talk about with byzantine. And what does it mean? It means we will come to consensus, guaranteed with probability 1.
The FLP says that we have to be nondeterministic so with probability 1, we will get consensus and we’ll know when we have consensus. It’s not like each confirmation makes me a little more sure and that just keeps continuing. No, with Byzantine there has to be a moment when I can say “Ah, I have consensus” and I have to never be wrong. So this is what byzantine fault tolerant means. It means I know I’ll reach consensus, I know it’s right when I reach it and I’m never wrong. And it also has to not be a trivial solution but that’s a last condition.
How To Achieve Byzantine Fault Tolerance
To achieve BFT we have to assume some things. For example we have to assume that less than a 3rd of our nodes are bad and we assume that an attacker controls the internet. These are reasonable assumptions, they’re nice assumptions. You might want to make that 1/3 be a higher number and there’s a theorem that says you can’t. 1/3 is the best you’ll ever achieve, no system can ever do better. So we have to make those assumptions when we’re doing a BFT proof. But we also have to say, what do we mean when we say that the attacker controls the interment? Do they have any limits on what they can do? Clearly we have to say something like, if Alice keeps sending messages to bob eventually 1 will get through, we have to assume that. But do we have to make any assumptions about the timing? If we don’t make any assumptions about the timing and we still get this proof then we call that Asynchronous Byzantine Fault Tolerant (aBFT).It’s aBFT when we don’t make any assumptions about timing over the internet. Now if we start making assumptions then it’s only partially asynchronous that’s not quite as good.
So what kind of assumptions might you make? A lot of people will prove it’s aBFT by making an assumption about timing like “When Alice sends bob a message it always gets there within 10secs” well that’s a pretty reasonable assumptions right? Or maybe we make the assumption that it gets there within X seconds for some constant X that we don’t know but it exists. That’s not a very terrible assumption, right?
Ah, but wait a second. What if there is a botnet doing a distributed denial of service attack on bob? What if somebody has compromised lots of little computers around the world, DVRs and web cameras and printers and baby monitors? They’ve compromised all these little computers and programmed them all to flood Bob with lots of packets so that Bob is shut down and can’t get Alice’s message?
Even though Alice and Bob are honest. Alice’s message doesn’t get to Bob in 10 seconds or 100 seconds or as long as the attack is going on. We violated the assumptions.
So this is the difference between partially asynchronous and asynchronous. Or to put it another way, if you have a proof of partially aBFT basically your proof would start off on the first line… “Assume there are no botnets in the world. Assume DDOS can never happen” and then you prove that it’s Byzantine.
Whereas in aBFT you start your proof by saying, “I won’t make any assumptions. Maybe the world does have botnets” and then you prove byzantine.
What I would propose is that we really do need to get proofs of how secure our systems are. I think we’re getting to the point where we actually need to start getting proofs. If you can prove aBFT that’s great, you should do that. If we have a system though where we can’t prove that, we have to prove something less, I would suggests let’s not prove partially asynchronous BFT, let’s prove Asynchronous-partially byzantine.
If we have to compromise something let’s compromise on the conclusions not the assumptions. Because what tells us more?
Do we learn more from a theorem that says “I am totally safe in a world that isn’t the world we’re in?”
That doesn’t tell me anything or is it better to have a theorem that says, “I’m somewhat safe and here’s exactly how safe I am in the world we actually live in?”
I think it tells us more. And so what I would propose is that we need to have theorems, actual math theorems worked out and defining all the details that prove aBFT if we can, but if we can’t do that then lets settle for reducing the byzantine half and not reducing the asynchronous half.
So that’s my take away message for today.
So is that possible, can we actually have abft systems that have rigorous math proofs of aBFT that are still practical? Sure. If you’d like to see one I’ll show you one. This is Hashgraph, This is what we have.
So first of all the goal here is to be fast. ABFT is easy, we’ve had it for 30 years. We know how to do it but the systems that we’ve had for 30 years are so incredibly slow that to my knowledge no one has ever deployed one of these algorithms in the wild.
A pure voting based system just isn’t practical so if we want real speed, how do we achieve it? Let’s start by forgetting about consensus entirely. If we’re talking about how fast a single shard is, a single replicated state machine, then all the nodes are going to have to have to get all the transactions. That’s the bare minimum we have to achieve even before we start talking about consensus.
What is the fastest way we can do it is by using a gossip protocol. It is the fast the most resilient way to do it. And it’s the simplest imaginable thing…
If Alice has as transaction that she wants everyone to know about, she takes her transaction and sends it to a completely random person, Dave.
Now instead of one person knowing it, 2 people know it.
Then each of them send it to a completely random person they choose and now 4 people know it.
Then each of them send it to a completely random person and now 8 people know it.
And you keep going this way until every person in the network knows it.
It is exponentially fast you can say it’s an expander graph and it’s even better than that in certain ways. If somebody attacks with that denial of service attack and shuts down one of our computers, we don’t care, we’re just constantly picking people at random. I pick someone that’s down and I just pick somebody else. And my transaction spreads out exponentially fast regardless. It is incredibly resilient, it is incredibly fast, it’s used all through computer science for all sorts of things because of its speed. So if you really want a fast system we should start this way.
Then the question is what happens if two people want to do a transaction at the same time?
So what if Bob wants to do the green transaction and Carol wants to do the blue, what do we do? Well, we could have a leader. They could send it to a leader and the leader posts them in order and sends them to every body but that’s a bottleneck. I want real speed I don’t want them to send them to a leader.
Or they could take turns, Bob sends his out and when he’s done, Carol could send out hers. But that is also slow, now Carol has to wait until Bob is done, well that’s not good.
What would be the fastest way to send it out? Just send it out. Each of them sends it to some random person and now they have two people knowing it and then each of them send it to a random person and each of them have 4 and them each of them has 8.
It explodes out exponentially fast until everyone has it. This is the fastest way. Great. The only problem is, there’s no consensus here. Look at Alice, her top one is green, the top one for Bob is blue. Everybody received them in a different order. Well what order should these be in?
This is the fastest way, this is the limit of speed for the internet but we don’t have consensus. All we did is got our transactions out. That’s no good. Let’s continue to ignore consensus for just a second and I’ll point out something kind of surprising.
If we add a tiny bit to each message like make it 1% bigger or even less. We can get an entire history of how we’ve all talked to each other, almost for free. We can see exactly who talked to whom in what order, exactly. And then the fun thing is after that we can get consensus for free. No further communication is required for consensus and it’s asynchronous Byzantine Fault Tolerant. That’s the Hashgraph algorithm.
How Does Hashgraph Work?
Let’s go back to the beginning. We did this gossip protocol. We could draw a diagram for the history of our gossip protocol. Here’s a diagram
Each of these circles is the event of some body gossiping to somebody. Time goes uphill so that one at the top is Bob being told everything that Carol knows because it has a diagonal line down to Carol. So the top circle is the event of Carol telling Bob everything she knows. By the way, in the protocol we just say you may have noticed sometimes someone actually passes two events when they gossip. The goal here in a gossip protocol is that when I call you up I tell you everything that I know that you don’t know. And it turns out there’s some really cool tricks so that I don’t have to tell you stuff that you already know and we can very quickly figure out what you need to know.
So that top circle means that Carol called Bob and told him everything that she knows and in this diagram you can actually follow exactly how things happened if Alice knew things at the bottom there, she told Bob and a little while later Bob told back to Alice and Carol at the bottom told Bob and he told back to Carol and you can trace back to the bottom and get a complete history of how we all talked to each other.
This is a gossip protocol. They’re used all over the place. We can gossip about transactions, which is what we’ve been talking about. We can gossip about identities, which I think was the original use of the very first gossip protocol. We can gossip about databases, information but what if we were to gossip about gossip? What if the thing that we’re telling each other, is this very picture? The picture of how we’re all talked to each other is the thing that we were all talking to each other about. Or to put it another way. Each of these circles is actually going to be a little data structure in the memory of my computer and when I call you I’m going to give you all the little circles all the events that I know that you don’t know. And then when you call someone else you’re going to give them all the events you know including the ones I just gave you so we’re going to gossip these events. Once we’ve gossiped the events, because they have these little lines connecting them, they actually give us this picture.
We’re all going to have the same picture. We’re all going to know exactly what each of us knew through out history which is so much information that Alice can say “I know what Bob knows, I know when he learned it, I know when he learned every piece of this diagram and his whole history. If we were to run one of those 30 year old byzantine voting algorithms I know exactly how Bob would vote so he doesn’t have to vote, I’ll just pretend he sent me the vote.”
That’s it. So what we’re going to do is Virtual Voting.
We don’t have to send each other any votes at all, nor receipts on our votes. It’s purely in your mind. But because we all have the same picture, we all have the same knowledge about the history of talking to each other, we will all come to the same conclusion. We’re all running the same deterministic algorithm on the same graph. We all get the same answer. And you can mathematically prove, using proof techniques very similar to those used 30 years ago that this is aBFT and we will all come to the same answer.
We will come to a conclusion, consensus and actually what we’ll do is put transactions inside these circles (events) too. When I gossip to you I’ll actually send you transactions too. So what we really want to do is put all of these circles into an order. We want a total order on events. And we’ll use one of the 30yr old algorithms with some tweaks in order to do that. And we can come to consensus on events extremely quickly. With zero communication over the internet.
What’s in those circles? One of those circles is like a Block, it has whatever transactions you want to put into it. It has a timestamp which is just one computers opinion on when they created the event, and it’s digitally signed. All of those things are what we’d have to do and it’s the bare minimum just to get the information out about the events. Then we added just to things, two hashes. Each circle in the diagram has two lines coming down from the previous ones. This is a Directed Acyclic Graph (DAG) to turn into a DAG all you need are two arcs on each vertex. Two edges on each vertex. Those are the two hashes. So we call this a Hashgraph it’s a graph held together with hashes.
The only thing we’re really adding is two hashes in each of these events. Those two hashes I send with the event, they are two objects you already have in memory. You already have a shortlist in memory, you know it’s two of those. So I don’t even need to send you the whole hash I can compress it down to a bit or two because you already have it in memory. I just have to remind you which of the ones in memory I’m referring to. So we can compress even that down to almost nothing.
So what we do is the fastest form of getting the transactions out that’s allowed by the internet and the laws of nature; we add a tiny bit of extra information which then gives us the entire Hashgraph, the entire history of how we all talked to each other; and then we get consensus for free. And it’s aBFT and we have detailed math proofs of all that.
So that’s kinda fun. It is possible to have consensus with mathematical proofs of aBFT, the strongest kind and they’re very fast. And so we’ve seen 100,000s transactions per second, we’ve seen less than a second consensus time. And when we get consensus you know you have consensus, you always reach a moment when you say, I absolutely know we’ve reached consensus and I’ll never be wrong as long as less than a 3rd of us are evil.
Types of Attacks
What are some of the attacks you were thinking about when you were building this? Or what are some of the attacks we as a community can think about?
Part of the reason I think it’s important to actually do math proofs of how secure we are is because there’s an infinite space of attacks and we’re never going to be able to think of all the attacks and make a list of all of them. We actually need to prove that our systems excluded all attacks or excluded huge classes of attacks that we can carefully mathematically define. And so that’s another reason I’m showing the attacks just to say “the space of attacks are really large.”
So when I was working on this I thought about proof of work, there were several reason why it wasn’t interesting:
It’s not byzantine
- Typically with proof of work systems there’s never a moment when you know you have consensus you just get more “sure” with time.
- I’m not particularly a fan of using a huge fraction of humanities electricity and spending lots of money on these mining rigs
- If the mining rigs are expensive, then you might expect that there are economies of scale and then you might expect that the number of miners would go down over time.
- If electricity is a big cost, then you might imagine that you would have geographic concentration where all the miners move to where the electricity is cheap. And if you have miner consolidation and you have geographic consolidation then perhaps that undermines the security the security system. And maybe there’s some self-defeating feedback cycles going on here.
- What happens if everybody moves to one country and then the country nationalizes the mining rig and sets them all to always extending the shortest half of a folk rather than a longest chain, then it just destroys the whole system. And would a state do that to destroy a cryptocurrency? Well probably not but I don’t really want to have to trust anyone actor. I don’t want to have to trust any one state not to destroy it. And so that’s a concern. I want to have distributed trust. So I don’t want geographic consolidation because then one state can nationalize all the mining rigs and cause them to attack our network.
Bitcoin is a 34% vulnerability. It turns out there’s a math theorem that everybody is 1/3 vulnerable. There is no 51% vulnerable system out there.
If we have a bunch of nodes and what happens if we have geographic consolidation and maybe a country builds a firewall around them. What happens if they turn of the firewall? Could we get a different chain on the inside versus the outside? Maybe we could. Oh we’ll build a partition detector. Ok what if they start flickering the firewall? Will that interact badly with our partition detector? I don’t know. Or what if they start just delaying packets enough so we start to have a difference of opinion about which chain is longest. Or maybe they do other malicious things.
For example what if those yellow nodes are compromised. A third of our computers are bad or in this case a third of our hashing power is bad, and we have a firewall and the firewall is colluding with the 1/3. You could imagine on any given yes/no questions, like with a fork, of these two chains should I extend this one, is this the longer one? You could imagine that the honest nodes outside the firewall happened to vote yes. The honest nodes inside the firewall happened to vote no and the dishonest nodes are duplicitous and vote no to the guys inside and yes to the guys outside and they’re going through the firewall because they’re colluding and now we have a real fork.
The people inside think that 2/3 of us are saying no and the people outside think that 2/3 of us are saying yes. This is a problem that causes you to end up with a false consensus with only a 1/3 of us being compromised. Now that attack doesn’t really bother me because every system has that attack. And I’m only putting it up because it was one of the things I was thinking about.
We have a decade old theorem that says every system is vulnerable to 1/3. The proof of the theorem is basically that picture. That’s a math proof. But it’s interesting to know. We just need to be clear as a community what our security is and none of us are more secure than 1/3 of the influence. So if you use hashing then it’s 1/3 of the hashing power. If you count by IP addresses then it’s 1/3 of the computers. If you do proof of stake then it’s 1/3 of the coins or 1/3 of the stake. But nothing can survive more than 1/3 being compromised assuming the bad guys are able to do things like firewalls.
Leader Based Systems
Now leader based is not quite 30 years old although it pretty close so you have Practical Byzantine Fault Tolerance (PBFT), Paxos and Raft. All the leader based systems that you may be familiar with. The idea is really simple. Everybody sends their transactions to the leader, the leader gets to pick the order and tells everybody what the order is and we all agree and then maybe we send signed receipts back. Beautiful. And maybe the leader, eventually crashes and then we have to elect and new leader and then the algorithm are really complicated to do that but then they work it out in a byzantine way. Beautiful, I love it. Now you don’t get fairness. We haven’t talked about fairness and I’m not going to talk about fairness very much. But it’s not fair because the leader can put them in any order they want or even leave out one if they want to. The Hashgraph algorithm by the way we have a math proof that it’s fair.
So what happens with leader based systems? If Alice is the leader, here’s the problem in addition to fairness is the Distributed Denial of Service (DDOS) Attack. What happens if Bob is malicious or has a virus on his computer and he directs a Botnet (a bunch of compromised computers) out on the internet and they all attack Alice, shutting her down? Then we’re all shut down.
And by the way this might not be a traditional Leader Based system, this might be a leader based system where we change leaders every 2 seconds. Wonderful, they only shut us down for 2 seconds, right? Maybe it’s a traditional one but maybe after 15 seconds we all get tired and elect a new leader, great then they only shut us down for 15 seconds. Or maybe this is two-phase commit in a database with locks. This is actually another form of leader based system. Who’s holding the lock is the leader but maybe eventually we decide that who ever was holding the lock crashed and we claw back the lock and give it to someone else.
There’s lots of things we can do, but Bob has an infection or is malicious and he can say, when we elect Carol as our new leader, before she gets to do anything as a new leader, he redirects the Botnet against her. And so he can shut it down forever. So even if we’re doing a round robin where we have a new leader every 2 seconds, a Botnet can shut us down forever, only shutting down one computer at a time. Very fragile. This is why I really didn’t want to go down the route of Leader based, I really wanted to be more that everybody is equal at every moment. And many systems that you wouldn’t think are leader based, like a system where we’re all going to vote but one person has to propose the order, propose the next block…is that proposer a leader now? There are a lot of hidden leaders in a lot of systems. And maybe they’re vulnerable to this and plus they’re always vulnerable to the fairness problem.
Simulated Economies (Proof of Stake)
There’s a lot of ways of doing this, you could talk about a simple system where we all bet on which of two chains is longer. And then the people who bet the same way get money and the people who bet the wrong way lose money. Then we have an economic incentive to all agree with each other. Are there any problems with that? Sure, what if there’s a firewall, do we end up getting partitioned? Do we have firewall detectors? Do they interact badly? Or what if the fire wall is just delaying messages so we have different opinion of what’s the longest thing, is that going to hurt us? I don’t know.
It’s certainly not going to be byzantine in this sort of a situation because you have these two forks that take awhile to resolve themselves. And then what if we’re infected. What if we just have a tiny amount of computers, with a tiny amount of hashing power, and a tiny amount of stake… But half of them send a yes to half of the population and half of them send a no to the other half of the population: Where we’re voting on some yes/ no question and they’re really fast, maybe they use a botnet to talk faster or they have faster internet connections.
You could imagine people being tricked to vote for the wrong thing, we end up having half of us voting for the wrong thing so half of us lose money. If we do this a bunch of times, all the honest people start saying, “I’m not going to vote until I see enough other people voting, I’m not going to go first.” Well if all the honest people say I’m not going to go first, nobody ever goes first and the entire network freezes forever. I was never able to prove true asynchronous byzantine in one of these systems. And if we have one of these systems that really are ABFT then we need to take time to prove it.
It’s great that we’re building systems that have no proof at all for security, but I think we’re getting to the point where we need to start having proofs. So anything is better than no proof.
If you can do partially-asynchronous byzantine that’s better than no proofs. Do that. Take the time to do that proof.
If you can compromise on the byzantine half instead of on the asynchronous half, and do Asynchronous-Partially Byzantine obviously that’s better.
And then of course if you’re able to proof Asynchronous Byzantine Fault Tolerance that would be best so do that.
And it’s not on the slide but if you can prove fairness, that’s even better.
- Why is compromising on the Byzantine part better than compromising on the Asynchronous part?
- I’d rather know that a system is somewhat secure and exactly what ways it’s insecure, in this world, than to know that it’s totally secure in a fantasy world. So if you believe that Botnets exist, I really want your assumptions to say that there are Botnets in the world.
- Could you expand on the modeling of honest versus malicious nodes. Are there any alternative models to the honest versus malicious dichotomy?
- Short answer No. There are theorems that say this is the best you can do. But of course honest here, doesn’t really mean honest and malicious doesn’t really mean malicious. There are lots of caveats on what exactly we’re talking about. What we have is that we want people to be following the protocol, and we can incentivize people to follow it and we can punish people for not following the protocol.
Ultimately we have to engineer a system where more than 2/3 of us are following the protocol. Because if more than a 3rd of us are willing to violate the protocol and do arbitrarily bad things, we are sunk. And that’s a math theorem. And that’s a fact of life.
- Isn’t the whole point of the FLP Theorem that you can’t get byzantine fault tolerance, with both consistency and availability?
- The FLP says that you can’t do it without randomness. So you use the random number generator in your computer and you can do it with probability 1.
FLP is an impossibilities theorem for deterministic algorithms, not non-deterministic. So if you use random numbers you’re fine. And then you don’t get guaranteed, you get guaranteed with probability 1.
That’s a common misconception about FLP, FLP is impossible unless you use entropy.
- In the DAG, if you have a double spend on two different blocks and you can’t order them, do you have some sort of heuristic for deciding ordering for those two different blocks.
- The whole point of the byzantine work, was to put all those circles in a total order. So we get a total order on all those circles, guaranteed with probability 1 as long as less than a 3rd of us are bad.
The whole point was to put it in a total order, I didn’t get to the actual algorithm but it’s the 30 year old algorithm.
So yes it’s a total order, not a partial order. The DAG gives you a partial order but then we extend that to a total order. That’s the whole point of the system.