An interactive deep-dive into two landmark distributed consensus algorithms — Raft and Viewstamped Replication — with live simulations you can explore and break.
Distributed systems are hard because machines fail. A server can crash mid-write, a network can partition, a message can vanish. The question at the heart of fault-tolerant systems is: how do a group of servers agree on anything when any of them can fail at any time?
That question is answered by consensus algorithms. Two of the most important are:
- Viewstamped Replication (Oki & Liskov, 1988) — historically the first complete consensus protocol for state machine replication
- Raft (Ongaro & Ousterhout, 2014) — designed from the ground up to be understandable, now the backbone of etcd, CockroachDB, and many cloud systems
Both algorithms solve the same problem with surprisingly similar shapes. Let's explore them interactively.
The visualizations below are live simulations. Click any node to disconnect or reconnect it. Hit 📨 Client write to send a request — watch it travel from the client icon to the leader, then propagate outward as AppendEntries RPCs. The Replication Log panel on the right shows each node's log in real-time: filled squares are committed entries, hollow squares are uncommitted. Use ⚡ Partition 3|2 to split the cluster and observe what each half can do.
Raft
Raft decomposes the consensus problem into three largely independent sub-problems:
- Leader election — at most one server can be the authoritative leader in any given term
- Log replication — the leader accepts writes and replicates them to followers
- Safety — once an entry is committed, it can never be overwritten or lost
Cluster roles
Every Raft server is always in one of three states:
| State | Colour | Description |
|---|---|---|
| Follower | 🔵 Blue | Passive; just applies entries from the leader |
| Candidate | 🟡 Yellow | Campaigning to become leader |
| Leader | 🔴 Red | Handles all client writes; replicates to followers |
Leader election and the election timeout
Each follower has an election timeout — the arc you see growing around each blue node. If that timer expires without receiving a heartbeat from the leader, the follower assumes the leader is dead and promotes itself to Candidate.
The candidate increments its term number (visible inside the node as v1, v2, …), votes for itself, and fires RequestVote (purple circles) at every other server. A server grants its vote only if:
- the candidate's term is at least as large as its own,
- it hasn't already voted for someone else this term, and
- the candidate's log is at least as up-to-date as its own (comparing the term and index of the last entry)
If the candidate collects votes from a majority of the cluster, it wins and becomes the new leader — and immediately starts sending AppendEntries heartbeats (blue circles) to reset everyone's timers and prevent a new election.
The randomised timeout window (each follower picks a different timeout) ensures that in most cases only one candidate starts an election at a time, avoiding perpetual split votes.
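As a rough sketch of the timer mechanics (in Python, with a hypothetical `Follower` class; the 150–300 ms window is illustrative, not taken from any particular implementation):

```python
import random

# Illustrative timeout window in milliseconds; real systems tune this
# to be much larger than one network round-trip.
ELECTION_TIMEOUT_RANGE = (150, 300)

class Follower:
    def __init__(self, node_id):
        self.node_id = node_id
        self.current_term = 0
        self.reset_timer()

    def reset_timer(self):
        # A fresh random timeout on every heartbeat: staggered timers make
        # it unlikely that two followers become candidates simultaneously.
        self.timeout_ms = random.uniform(*ELECTION_TIMEOUT_RANGE)
        self.elapsed_ms = 0.0

    def tick(self, dt_ms):
        """Advance the clock; return True if an election should start."""
        self.elapsed_ms += dt_ms
        return self.elapsed_ms >= self.timeout_ms

    def become_candidate(self):
        self.current_term += 1          # new election, new term
        self.voted_for = self.node_id   # vote for itself first
        return ("RequestVote", self.current_term, self.node_id)
```

A heartbeat arrival maps to `reset_timer()`; a leader failure means `tick` keeps accumulating until the timeout fires on some node first.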
Try it: disconnect the red leader node (click it). You'll see the blue nodes' timer arcs drain. One runs out first, turns yellow, and fires purple RequestVote RPCs. Green VoteGranted circles come back. Once a majority responds, a new red leader appears and immediately sends blue heartbeats.
Reconnect the old leader — it receives a heartbeat from the new leader with a higher term, realises someone else is in charge, and quietly steps back to follower.
Log replication
Once a leader is elected, it handles all client writes. The flow is:
- Client → Leader: a write request arrives at the leader (the pink circle traveling from the client icon on the left)
- Leader appends the entry to its own log
- Leader sends `AppendEntries` to all followers in the same heartbeat round (the blue circles — these double as heartbeats when empty)
- Once a majority of servers acknowledge, the entry is committed — the log panel shows it turn from hollow to filled
- Leader piggybacks the new `commitIndex` in the next `AppendEntries`, so followers advance their commit pointer too
This majority requirement is the key safety property: a committed entry must be on enough servers that any future leader will have it in their log. Raft's Leader Completeness guarantee means a new leader is only elected if its log is at least as up-to-date as any majority — so committed entries are never lost.
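A minimal leader-side sketch of this flow, assuming a hypothetical `Leader` class, 1-based log indices, and the leader counting its own copy toward the quorum:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    term: int
    command: str

@dataclass
class Leader:
    """Minimal leader-side bookkeeping, loosely following the Raft paper."""
    term: int
    cluster_size: int
    log: list = field(default_factory=list)
    commit_index: int = 0                      # entries 1..commit_index are committed
    acks: dict = field(default_factory=dict)   # entry index -> set of acking nodes

    def client_write(self, command):
        self.log.append(Entry(self.term, command))
        index = len(self.log)                  # 1-based index of the new entry
        self.acks[index] = {"leader"}          # the leader's own copy counts
        return index

    def on_append_entries_ok(self, follower_id, index):
        self.acks[index].add(follower_id)
        quorum = self.cluster_size // 2 + 1
        if len(self.acks[index]) >= quorum and index > self.commit_index:
            self.commit_index = index          # safe: a majority holds the entry
```

With `cluster_size=5`, an entry stays in the uncommitted tail until two followers (plus the leader) have acknowledged it.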
Commit index, quorum, and uncommitted entries
Every Raft node tracks two numbers:
- `log.length` — how many entries have been appended (proposed but not necessarily safe)
- `commitIndex` — how many entries are durably committed (a quorum has acknowledged them)
The gap between the two is the uncommitted tail — entries the leader has proposed but hasn't yet heard back from a quorum about. These are shown as hollow squares in the log panel. Uncommitted entries are not safe: if the leader crashes before reaching quorum, a new leader may have a shorter log, and any follower with extra entries must truncate them back.
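The truncation rule can be sketched as the follower-side consistency check (a hypothetical helper, with logs reduced to lists of term numbers and `prev_index` 1-based):

```python
def append_entries(follower_log, prev_index, prev_term, entries):
    """Follower-side handling of AppendEntries, sketched from the Raft paper.
    Returns False when the logs diverge before prev_index."""
    if prev_index > 0:
        if len(follower_log) < prev_index or follower_log[prev_index - 1] != prev_term:
            return False  # mismatch: the leader will retry with an earlier index
    # Drop any conflicting uncommitted tail, then append the leader's entries.
    del follower_log[prev_index:]
    follower_log.extend(entries)
    return True
```

This is exactly the moment the hollow squares disappear in the simulation: a rejected tail is deleted before the leader's entries are appended.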
An example walkthrough:
- Click ⚡ Partition 3|2 to isolate N4 and N5 from the majority.
- Hit 📨 Client write two or three times. Watch the majority partition (N1–N3) commit those entries — their log panel squares fill in solid.
- Now click the leader to disconnect it mid-term.
- A new leader is elected in the majority. If any follower received entries that were never acknowledged by a quorum, those entries show hollow; when the new leader's first `AppendEntries` arrives, you'll see the event log print `✂ Nₓ removes N uncommitted entries` and the hollow squares disappear.
- Click 🔗 Heal partition — N4 and N5 rejoin as followers and the current leader sends them the missing committed entries, catching them up.
Why is this safe? The new leader must have a log at least as up-to-date as a majority (enforced by the vote-granting check), so committed entries are always present in the winner's log. Only entries that never reached quorum can be overwritten — and by definition those were never acknowledged to any client.
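The vote-granting check can be sketched as follows (hypothetical helper names; the log comparison follows the Raft paper's last-term/last-index rule, and the sketch omits the separate step-down-on-higher-term handling):

```python
def log_at_least_as_up_to_date(cand_last_term, cand_last_index,
                               my_last_term, my_last_index):
    # Compare the terms of the last entries; if equal, the longer log wins.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def grant_vote(state, cand_term, cand_id, cand_last_term, cand_last_index):
    """state: dict with current_term, voted_for, last_term, last_index."""
    if cand_term < state["current_term"]:
        return False                      # stale candidate
    if cand_term == state["current_term"] and \
            state["voted_for"] not in (None, cand_id):
        return False                      # already voted this term
    if not log_at_least_as_up_to_date(cand_last_term, cand_last_index,
                                      state["last_term"], state["last_index"]):
        return False                      # candidate could lose committed entries
    state["current_term"] = cand_term
    state["voted_for"] = cand_id
    return True
</antml_code_placeholder>```

The third check is what makes Leader Completeness work: a candidate missing committed entries cannot gather a majority, because at least one member of every majority holds those entries and will refuse it.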
Catch-up: AppendEntries and InstallSnapshot
When a follower falls behind — perhaps it was offline, or sits in the minority partition — the leader uses AppendEntries to replay the missing suffix of the log. The leader decrements its estimate of the follower's position (nextIndex) until the logs agree, then sends all entries from that point forward. In the log panel you can watch a rejoining node's squares fill in one heartbeat round at a time.
For nodes that are very far behind — hours of log entries in a production cluster — replaying the full log would be too slow. Raft handles this with InstallSnapshot: the leader compresses its entire state machine into a snapshot, ships it to the lagging follower in one RPC, and the follower jumps straight to the snapshot point, discarding its old log up to that index. This is how systems like etcd and CockroachDB handle nodes that have been offline for extended periods.
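The `nextIndex` backoff can be sketched in a few lines (a hypothetical helper, with logs reduced to lists of term numbers and the RPC round-trips collapsed into a local loop):

```python
def catch_up(leader_log, follower_log):
    """Walk nextIndex back until the logs agree at prev = nextIndex - 1,
    then ship everything from that point. Indices are 1-based."""
    next_index = len(leader_log) + 1          # optimistic initial guess
    while next_index > 1:
        prev = next_index - 1
        # The AppendEntries consistency check: same term at prev?
        if prev <= len(follower_log) and \
                follower_log[prev - 1] == leader_log[prev - 1]:
            break
        next_index -= 1                       # mismatch: back off and retry
    follower_log[next_index - 1:] = leader_log[next_index - 1:]
    return next_index
```

In a real cluster each decrement costs a round-trip, which is why production implementations add conflict-term hints or fall back to InstallSnapshot for far-behind followers.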
Viewstamped Replication
VSR (first described by Oki & Liskov in 1988 and revised by Liskov & Cowling in 2012) frames the problem as state machine replication: keep replicas of a state machine in sync so that the system continues to work even when some replicas fail.
Terminology
VSR uses different names for the same concepts:
| VSR term | Raft equivalent |
|---|---|
| Primary (red) | Leader |
| Backup (blue) | Follower |
| View | Term |
| View change | Leader election |
| Op-number | Log index |
Normal operation
During normal operation, the primary receives client requests and drives replication:
- Primary sends `Prepare(view, op-number, request)` to all backups (blue circles)
- Each backup increments its op-number, appends the entry, and responds `PrepareOK` (teal circles)
- When the primary has collected `PrepareOK` from a majority, the entry is committed
- Primary broadcasts `Commit` (yellow circles) so backups can advance their commit pointer
The primary also sends periodic Prepare messages even with no new client operations, acting as a heartbeat and keeping backups' timers reset.
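The normal-case flow at the primary might be sketched like this (a hypothetical `Primary` class; real VSR also maintains a client table and per-replica state that this sketch omits):

```python
class Primary:
    """Sketch of VSR normal-case processing from the primary's side."""
    def __init__(self, view, n_replicas):
        self.view = view
        self.n = n_replicas
        self.op_number = 0
        self.commit_number = 0
        self.prepare_oks = {}      # op-number -> ack count (including self)

    def client_request(self, request):
        self.op_number += 1        # assign the next op-number
        self.prepare_oks[self.op_number] = 1
        # Broadcast to all backups.
        return ("Prepare", self.view, self.op_number, request)

    def on_prepare_ok(self, op_number):
        self.prepare_oks[op_number] += 1
        if self.prepare_oks[op_number] == self.n // 2 + 1:
            self.commit_number = op_number
            # Announce the new commit point to the backups.
            return ("Commit", self.view, self.commit_number)
        return None
```

Note the structural parallel to Raft's leader: the same quorum count, but the commit announcement travels as a separate `Commit` message rather than piggybacked on the next append.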
Try it: disconnect the primary. Backups notice the silence, their timers expire, and they kick off a view change. Red StartViewChange messages propagate through the cluster. Purple DoViewChange messages converge on the deterministically chosen new primary (the node at index view mod N). Once it collects a majority, it fires yellow StartView to announce the new view, and normal Prepare replication resumes.
View change protocol
The VSR view change is more structured than Raft's election:
- A backup whose timer expires increments the view number and broadcasts `StartViewChange(v+1)` to all replicas
- Any replica receiving this message also broadcasts it (propagation), and sends `DoViewChange` to the designated new primary — deterministically the node at index `v+1 mod N`
- The new primary waits for `DoViewChange` from a majority, then starts the new view with `StartView`
The key difference from Raft: the new leader is not elected by competition — it is mathematically determined by the view number. If that node is also down, the view number keeps incrementing until an available node lands on its slot.
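The deterministic selection is simple enough to state directly (hypothetical helpers; `down` stands in for whatever failure information the replicas act on via repeated timeouts):

```python
def primary_for_view(view, n_replicas):
    # VSR picks the primary by round-robin: replica index = view mod N.
    return view % n_replicas

def next_live_view(current_view, n_replicas, down):
    """Keep bumping the view until the designated primary is reachable."""
    view = current_view + 1
    while primary_for_view(view, n_replicas) in down:
        view += 1
    return view
```

No votes are cast for a *choice* of primary; the quorum of `DoViewChange` messages only confirms that a majority has moved to the new view and hands over the most up-to-date log.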
Raft vs. Viewstamped Replication
Both algorithms tolerate up to ⌊(N−1)/2⌋ simultaneous failures in a cluster of N servers. A 5-node cluster can lose 2 nodes and still make progress.
| Property | Raft | VSR |
|---|---|---|
| New leader selection | First candidate to win a majority vote | Deterministic: view mod N |
| Heartbeat mechanism | AppendEntries (also carries log entries) | Prepare + separate Commit |
| View/term tracking | currentTerm | view-number |
| Vote deduplication | votedFor per term | Implicit via DoViewChange quorum |
| Historical role | Modern, pedagogical (2014) | Pioneer; predates Paxos (1988) |
Conceptually, the two are duals: Raft handles leader election and log replication in the same RPC (AppendEntries), while VSR separates replication (Prepare/PrepareOK) from commit notification (Commit) and keeps view changes as a distinct three-phase protocol.
Both algorithms require a quorum — a strict majority of servers — to make any progress. This is unavoidable: it is the minimum requirement to prevent split-brain (two leaders simultaneously accepting writes). With N=5, you need 3 servers alive. Try adding nodes in the simulation and then disconnecting several at once to see when the cluster halts.
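The arithmetic is worth making explicit (hypothetical helpers):

```python
def quorum(n):
    # A strict majority: the smallest group guaranteed to overlap with
    # every other majority, which is what rules out split-brain.
    return n // 2 + 1

def max_failures(n):
    # ⌊(N−1)/2⌋ nodes can fail while a quorum still exists.
    return (n - 1) // 2
```

One consequence: even-sized clusters buy nothing, since a 4-node cluster needs 3 alive, the same failure tolerance as a 3-node cluster with one extra machine to pay for.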
What the simulation doesn't show
The visualisation is intentionally simplified to keep it legible:
- Log contents are simplified — entries only carry the term they were created in, not actual data
- Disk writes are omitted (in real systems, entries are `fsync`ed before acknowledging)
- One-directional partitions are not modelled — the partition button disconnects both directions symmetrically
- Membership changes (adding/removing nodes without downtime via joint-consensus) are not modelled
- Pre-vote and leadership transfer extensions (common in production implementations) are omitted
For a deeper treatment of these topics, the original papers are surprisingly readable.
Further reading
- Ongaro & Ousterhout, In Search of an Understandable Consensus Algorithm (Raft paper, 2014)
- Liskov & Cowling, Viewstamped Replication Revisited (2012)
- raft.github.io — Ongaro's own interactive visualisation
- Lamport, Paxos Made Simple (2001) — the third corner of the consensus triangle
- Howard, Flexible Paxos (2016) — a generalisation showing quorums don't have to be majorities