Recently I've been playing around with a new algorithm known as CASPaxos. In this post I'm going to talk about the algorithm and its potential benefits for distributed databases, particularly key-value stores.
Distributed databases must be reliable and scalable. To achieve reliability, DBs replicate data to other servers. To achieve scalability in terms of total storage capacity, DBs must allow the data to be replicated to only a subset of servers - enough to make the data reasonably reliable but not so much that adding a new server does not increase the total storage capacity of the system or make the system unbearably slow. A typical replication factor is 3: each piece of data is stored on 3 servers. Replications is typically implemented using a consensus algorithm. Well-known algorithms in this family that are used for replication are Raft, Multi-Paxos, and ZAB (which is used in ZooKeeper). Those 3 algorithms make servers agree on the ordering of operations in a log. By executing those operations in order, the database engines on each server can create identical replicas of a database. Logs feature very prominently in distributed/reliable systems (Read The Log: What every software engineer should know about real-time data's unifying abstraction by Jay Kreps).
CASPaxos is a new algorithm in this space and it is significantly simpler than the aforementioned algorithms because it does not use log replication. It is a slight modification of the original Paxos algorithm, which is very simple and typically used as a minimal building block for more complicated algorithms such as Multi-Paxos. Instead of replicating log entries between servers, CASPaxos replicates entire values. Because of this, it is best suited for relatively small values, such as individual entries in a key-value store.
So why is this interesting? In short: it offers us simplicity & performance. Before getting into its benefits, here's a sloppy, inaccurate description of CASPaxos - I recommend you read the paper.
CASPaxos replicates changes to a single register amongst a set of replicas. The register holds a user-defined value which is modified by successive application of some change function (which is a closure). Each of these modifications are protected by version stamps (ballot numbers) which help to ensure that previously committed register values are not clobbered without being first observed by the writer. The protocol facilitates learning previously committed values so that replicas can keep up with one another.
If you are familiar with Raft, you will know that at its core it replicates a log of values. Conceptually, a log-based replicated state machine folds a fixed function over multiple data (the log entries). By contrast, CASPaxos does not use a fixed function and instead folds varying closures over state, with the resulting state itself being replicated to other replicas.
To illustrate, the following expansions show the result of applying
[e0, e1, e2] (log entries) in Raft, versus
[f0, f1, f2] (closures) in CASPaxos:
state = f(e2, f(e1, f(e0, ∅)]))
state = f2(f1(f0(∅)))
Aside from what gets replicated and how the current state of the system is computed, Raft and CASPaxos are vastly different. For example, CASPaxos is leaderless, whereas Raft uses a strong leader. CASPaxos does not specify the use of heartbeats (in the core algorithm), whereas Raft does. Many of these differences are present because Raft is a more batteries included algorithm which covers much of the practical concerns involved in building a replicated database.
Neither approach is strictly better than the other, but since the CASPaxos approach (replicating state values rather than log entries) was fairly novel to me in the context of distributed conensus, I'd like to explore some of the implications, especially as they might apply to the systems I work on.
Read the paper to understand the algorithm in more detail.
To analyse the performance implications of CASPaxos, we need to take a little detour and discuss real-world systems. One great example is CockroachDB, a distributed SQL database. CockroachDB aims to be reliable and scalable. To achieve this, they partition their data and replicate each piece of data to a subset of the servers in the system using an algorithm they call MultiRaft. If they were to use a single Raft consensus group, then adding additional servers would not increase the total capacity of the database. If they use many Raft consensus groups naively, the overhead of each consensus group would have a toll on throughput. For example, Raft requires heartbeat messages while idle to maintain leadership. MultiRaft requires multiplexing each consensus group's log records on disk for performance. That means that log entries for each group might not live near each other on disk, since they are interspersed with many other groups' records. This may take a toll on recovery performance. The alternative is to store each group's log in contiguous disk segments, but this reduces write throughput: spinning disks and SSDs both perform better when operating sequentially. The optimizations required to make Raft scale well are tricky largely because of its log-based nature.
Speaking of storage, let's talk briefly about storage engines. The storage engine is the database component responsible for reading and writing data in a reliable way. Examples include RocksDB, LMDB, ESENT (used in Exchange & Active Directory), WiredTiger, TokuDB, and InnoDB. Two of the most common data structures for implementing a storage engine are B+ Trees and more recently, Log-Structured Merge-Trees (LSM trees). In order to make B+ Trees reliable (any machine may crash at any time), a Write-Ahead Log (WAL) is used. This log is a file containing a sequential list of the database transactions which are being performed. The storage engine eventually applies these transactions to the database image. During crash recovery, the storage engine reads this file and ensures that all of the committed transactions have been applied. This recovery algorithm is called ARIES and it can be found in many reliable storage engines. So B+ Trees split your data into two parts: a log file and a tree. As the name implies, Log-Structured Merge-Trees include a log as a part of the core data structure and again, that log is used for recovery. Since spinning disks and SSDs perform best with sequential reads & writes, log files are a good fit for high-performance, reliable systems.
Raft is built around log replication, so it might make sense to integrate with the storage engine so that a single log can be used for both purposes: local durability as well as replication. Unfortunately, the storage engine's log is generally not visible to the storage engine consumer and is usually considered an implementation detail. This means that Raft implementations which use an off-the-shelf storage engine such as RocksDB must store log records inside the storage engine so that they can be read back later. The result is that each operation needs at least 2 writes (1 on the critical path): one for the log entry and one for the result of applying the log entry once it's committed (eg, updating a value in a key-value store). A B+ Tree engine needs 4 writes (1 critical). By contrast, CASPaxos needs just 1 write: updating the value itself. Log-based algorithms have natural write amplification where as CASPaxos does not.
By removing the need for logs, CASPaxos can achieve high write throughput with off-the-shelf storage engines.
Each key in a key-value store based on CASPaxos is completely independent of all other keys. This means that no cross-key coordination is required when serving operations on individual keys. Compare this with Raft or MultiRaft where all operations within a given consensus group are strictly ordered. This ordering requires coordination which has some overhead. It means that a slow operation on one key can more easily impact operations on other keys. The low level of coordination required by CASPaxos supports high-concurrency systems without added complexity.
Coordination is sometimes required. For example, when implementing multi-object transactions. Multi-object transactions can be implemented as a higher layer on top of a key-value store with linearizable keys using two-phase commit (2PC). For example, this is how we implement ACID transactions in Orleans, supporting any strong consistency key-value store.
So far we've talked about ways in which CASPaxos might be more suitable for building a distributed key-value store than Raft or MultiRaft. CASPaxos is a simple algorithm and there are many system design questions which are not addressed by the paper definition. So here are some potential challenges when building a real-world system on CASPaxos, as well as some thoughts on how to solve them.
When adding a new server to the database system, the server needs to be brought up to speed with the existing servers. This requires adding it to the consensus group as well as copying all data for the keys which it will be replicating. The CASPaxos paper describes this process as a part of membership change. However, a similar process is needed to ensure that data is sufficiently reliable. For example, if a server loses network connectivity for a few seconds then it may miss some updates to some rarely updated keys. The CASPaxos algorithm does not discuss how to ensure that all updates are eventually replicated. In Raft, it is the leader's responsibility to keep followers up to speed. In a system built around CASPaxos, which is leaderless, we will likely need to implement a different solution.
The membership change algorithm in the paper does not offer safety in all cases and it implies a single administrator in the system. Therefore, it is not suitable for use with automated cluster management systems. The proof-of-concept CASPaxos implementation on Orleans, uses a different membership change algorithm. It ought to be suitable for automated systems (such a the cluster membership algorithm used in Orleans). I believe the algorithm will be safe once fully implemented, but that has not been demonstrated yet. The key idea is to leverage the consensus mechanism of the protocol for cluster membership change, similar to how Raft and Multi-Paxos commit configuration changes to the log. It uses a special purpose register to store cluster configuration. Proposers indicate which version of the configuration they are using in all calls to Acceptors and Acceptors reject requests from Proposers running old configurations. This is similar to Raft's notion of neutralizing old leaders. Additionally membership changes are restricted to at-most one server at a time, which is a special case of joint consensus. This the same restriction that Diego Ongaro specified in his Ph. D dissertation for Raft. In a sense, this extension turns CASPaxos into a 2-level store with the cluster configuration register at the top and data registers below, so the ballot vector is
[configuration ballot, data ballot].
Adding additional servers should increase the total storage capacity of the system. CASPaxos specifies only the minimal building block of a key-value store, so this scale-out is not discussed in the paper. The Raft paper also does not specify this, motivating the development of MultiRaft for CockroachDB. The dynamic range-based partitioning scheme used by CockroachDB is a good candidate. Implementing this might involve storing range configurations in registers and extending the membership change modification to include 3 levels,
[cluster ballot, range ballot, data ballot].
CASPaxos is not suitable for replicating large values because each value is sent over the wire every time it is updated. For a replication factor of 3, the entire value is sent 3 times for every update and 6 times if the proposer cannot take advantage of the distinguished leader optimization.
This limitation could be alleviated in several ways, or it can be ignored and argued away, leaving users to tackle the problem themselves if they truly need large values.
One way to alleviate it might be to split keys over several registers. Without going into detail, this might involve extending the membership change modification yet again to include 4 levels, at which point it may make sense to generalize it into a ballot vector,
[...parent ballots, register ballot]. Specifically,
[config ballot, range ballot, file ballot, register ballot]. At this point, the system is structured more like a tree than a flat key-value store.
I hope you've enjoyed the post. If you'd like to discuss any aspects of it, for example some glaring inaccuracies, drop me a line via Twitter (@ReubenBond).
Distributed Systems is a young field with many exciting areas for research and development.