Famous Paxos distributed protocol automatically determined safe and secure

Two researchers have debunked the common assumption that the famous Paxos consensus protocol is too complex to be proven safe without hours of manual labor.
Blockchain graphic

In the ten years since 2010, the global cloud services industry has grown to a $370B valuation while more than 50 percent of all corporate data has moved to the cloud. Platforms and software as services, blockchain applications, big data and machine learning, and a slew of burgeoning new technologies continue to fundamentally change how organizations and individuals engage with computing, and it’s all powered behind the scenes by networked machines under a constantly growing load. As consequence, the world is more susceptible than ever to widespread fallout from outages, hackers, and buggy network behavior.

This is where airtight distributed protocols are needed to ensure software systems can effectively run on machines spread across the world. University of Michigan researchers have made an important step toward determining that these protocols are safe and secure by automatically proving those traits in the most foundational distributed protocol, called Paxos. Using a technique called automated formal verification, Computer Science and Engineering PhD student Aman Goel and Prof. Karem Sakallah have refuted common beliefs that the famous consensus algorithm and others like it are too complex to be proven safe without hours of tedious manual effort.

Formal verification is a class of techniques used to demonstrate that something is correct and reliable with the elegance of a logical proof. The process is very useful for software and hardware alike, providing a certificate that a certain algorithm, working piece of software, or computer chip will always operate the way its specifications say it should.

In the software space, however, formal verification has only been practical on very small programs for much of the technique’s history. Theoretically, it would enable software to be released with substantially less testing needed than we see with current practices. 

“Having a foolproof system that says: you develop it, you check it automatically, and you get a certificate of correctness,” Sakallah says, “that’s what gives you confidence that you can deploy a program without issue.”

But proving the correctness of a program with many complex behaviors can range from tedious to impossible – making techniques to automate the process extremely powerful. But for a truly large algorithm like Paxos, automating its formal verification was deemed simply too large a job to ever finish successfully.

“There have been many attempts in the past to verify Paxos, including many manual attempts,” says Goel. “Everyone points to a prior theoretical result that says automating it is impossible – it’s beyond the tools of automation to be able to prove it.”

Paxos is what’s called a consensus algorithm, and one of the most important algorithms in the category of distributed protocols. These protocols define how machines in a network can work collaboratively as a single system, and have to account for a variety of constraints: machines have to be in agreement (or “reach consensus”) about the state of the system at all times; the system has to continue on unchanged if any machine or group of machines on the network goes offline; and the system has to operate without a single central server overseeing the correctness of all operations.

“Paxos is one of the first and most celebrated ideas which laid the foundation for how different things come to an agreement asynchronously,” says Goel. “Most (if not all) the consensus algorithms fundamentally derive concepts from Paxos.”

Some variety of consensus is used in nearly all critical distributed systems, including all of the applications supported by cloud computing. Most recently, consensus has garnered widespread attention for enabling blockchain applications like cryptocurrencies. Such protocols form the backbone of a blockchain by helping all nodes in the network verify transactions as they happen.

Verifying the correctness of Paxos automatically has major ramifications for the future. As new consensus protocols are built atop its principles for these ever-changing applications, they’ll need to be proven safe and secure. Thankfully, the solution Goel and Sakallah produced is not tied specifically to Paxos.

The group’s solution came from a common feature of all distributed protocols, and the insight was inspired in part by their prior work with the analysis of computer hardware. That feature is symmetry, or regularity, and it helped Goel and Sakallah demonstrate that distributed protocols aren’t actually as complex as they look.

“We did not use any special feature of Paxos itself, but different structural features of all distributed protocols,” explains Sakallah. 

Regularity in this case can be interpreted in two different ways – regularity between the features and behaviors of nodes in the network, and regularity over time. By their nature these distributed services are very repetitive. All servers working on a particular function will be handling large batches of requests that look fundamentally the same. On top of that, the nature of their tasks will change very little over time.

This regularity enabled Goel and Sakallah to transform what started as an impossibly large task into one that looks small and manageable. They did so quite literally – by verifying the protocol under the assumption that it had a fixed, small number of nodes, and then generalizing the solution to a “theoretically unbounded number” of nodes.

“Human-created artifacts seem to have some structure to them that make them tractable,” Sakallah explains. “People do not create random things, they create very complex, but highly-structured, things: an airliner, chips, computers, software programs.”

The tool the researchers designed for this proof is called IC3PO, a model checking system that looks through every state a program can enter and determines whether each one matches a description of safe behavior. ​​If the protocol is correct, IC3PO produces what’s termed an inductive invariant – a proof by induction that the property holds in all cases. If instead a bug is found in the protocol, it will produce a counter-example and execution trace, showing step by step how the bug manifests.

The inductive invariant IC3PO produced for Paxos identically matches the human-written one previously derived with significant manual effort using a technique called interactive theorem proving.

The use of a model checker like IC3PO has a dual benefit. Besides taking the long work of proof writing out of the hands of people and greatly speeding it up, it also enables humans to work with complex software that’s proven safe without having to understand every minor detail of how it works.

“IC3PO provides a very compact proof for Paxos in less than an hour,” Sakallah says. “Human understanding of intricate details are not needed – once you get the proof from the model checker, that is the finished proof for the protocol.”

The general applicability of Goel and Sakallah’s solution has already borne fruit. In a followup work presented at the 13th NASA Formal Methods Symposium, Goel examined a smart contract used by cryptocurrencies to implement multi-signature wallets. If automated agreements like these are buggy, then the contracts built on them are faulty. Goel automatically verified one such wallet protocol using IC3PO.
This project was presented on October 20 at the 2021 Formal Methods in Computer-Aided Design Conference in the paper “Towards an Automatic Proof of Lamport’s Paxos.”