Raft & Viewstamped Replication: Consensus Visualized

An interactive deep-dive into two landmark distributed consensus algorithms — Raft and Viewstamped Replication — with live simulations you can explore and break.

10 min read

Distributed systems are hard because machines fail. A server can crash mid-write, a network can partition, a message can vanish. The question at the heart of fault-tolerant systems is: how do a group of servers agree on anything when any of them can fail at any time?

That question is answered by consensus algorithms. Two of the most important are:

  • Viewstamped Replication (Oki & Liskov, 1988) — historically the first complete consensus protocol for state machine replication
  • Raft (Ongaro & Ousterhout, 2014) — designed from the ground up to be understandable, now the backbone of etcd, CockroachDB, and many cloud systems

Both algorithms solve the same problem with surprisingly similar shapes. Let's explore them interactively.

💡 The visualizations below are live simulations. Click any node to disconnect or reconnect it. Hit 📨 Client write to send a request — watch it travel from the client icon to the leader, then propagate outward as AppendEntries RPCs. The Replication Log panel on the right shows each node's log in real-time: filled squares are committed entries, hollow squares are uncommitted. Use ⚡ Partition 3|2 to split the cluster and observe what each half can do.

Raft

Raft decomposes the consensus problem into three largely independent sub-problems:

  1. Leader election — at most one server can be authoritative in any given term
  2. Log replication — the leader accepts writes and replicates them to followers
  3. Safety — once an entry is committed, it can never be overwritten or lost

Cluster roles

Every Raft server is always in one of three states:

State       Colour       Description
Follower    🔵 Blue      Passive; just applies entries from the leader
Candidate   🟡 Yellow    Campaigning to become leader
Leader      🔴 Red       Handles all client writes; replicates to followers

Leader election and the election timeout

Each follower has an election timeout — the arc you see growing around each blue node. If that timer expires without receiving a heartbeat from the leader, the follower assumes the leader is dead and promotes itself to Candidate.

The candidate increments its term number (visible inside the node as v1, v2, …), votes for itself, and fires RequestVote (purple circles) at every other server. A server grants its vote if:

  • the candidate's term is at least as large as its own,
  • it hasn't already voted for someone else this term, and
  • the candidate's log is at least as up-to-date as its own (comparing the last entry's term, then log length)
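These checks — including the log up-to-dateness comparison from the full RequestVote rules — can be sketched as follows. The field names are hypothetical; only the logic mirrors the protocol:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoteRequest:
    term: int
    candidate_id: int
    last_log_term: int    # term of the candidate's last log entry
    last_log_index: int   # index of the candidate's last log entry

@dataclass
class ServerState:
    current_term: int
    voted_for: Optional[int]  # who we voted for in current_term, if anyone
    last_log_term: int
    last_log_index: int

def grant_vote(state: ServerState, req: VoteRequest) -> bool:
    # Never vote for a candidate from an older term.
    if req.term < state.current_term:
        return False
    # At most one vote per term (re-granting to the same candidate is fine).
    if req.term == state.current_term and state.voted_for not in (None, req.candidate_id):
        return False
    # Up-to-dateness: compare the last entry's term first, then log length.
    if (req.last_log_term, req.last_log_index) < (state.last_log_term, state.last_log_index):
        return False
    return True
```

The tuple comparison on the last line captures Raft's "more up-to-date" ordering in one expression: a higher last term wins outright, and length only breaks ties.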

If the candidate collects votes from a majority of the cluster, it wins and becomes the new leader — and immediately starts sending AppendEntries heartbeats (blue circles) to reset everyone's timers and prevent a new election.

The randomised timeout window (each follower picks a different timeout) ensures that in most cases only one candidate starts an election at a time, avoiding perpetual split votes.
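A sketch of the randomised pick — the 150–300 ms window is the range suggested in the Raft paper; real implementations tune it to their network latency:

```python
import random

def election_timeout(base_ms: float = 150.0, jitter_ms: float = 150.0) -> float:
    # Each follower independently draws a timeout from [base, base + jitter],
    # so in most rounds exactly one timer fires first and only one candidate
    # starts campaigning.
    return base_ms + random.uniform(0.0, jitter_ms)
```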

[Interactive simulation — Raft. Legend: Leader/Primary · Follower/Backup · Candidate · View-change · Disconnected. Messages: Client write, RequestVote, VoteGranted, VoteDenied, AppendEntries, AppendEntriesOK. Includes a speed control and the Replication Log panel.]

Try it: disconnect the red leader node (click it). You'll see the blue nodes' timer arcs drain. One runs out first, turns yellow, and fires purple RequestVote RPCs. Green VoteGranted circles come back. Once a majority responds, a new red leader appears and immediately sends blue heartbeats.

Reconnect the old leader — it receives a heartbeat from the new leader with a higher term, realises someone else is in charge, and quietly steps back to follower.

Log replication

Once a leader is elected, it handles all client writes. The flow is:

  1. Client → Leader: a write request arrives at the leader (the pink circle traveling from the client icon on the left)
  2. Leader appends the entry to its own log
  3. Leader sends AppendEntries to all followers in the same heartbeat round (the blue circles — these double as heartbeats when empty)
  4. Once a majority of servers acknowledge, the entry is committed — the log panel shows it turning from hollow to filled
  5. Leader piggybacks the new commitIndex in the next AppendEntries, so followers advance their commit pointer too
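Step 4 — deciding when an entry is committed — reduces to finding the highest index stored on a majority. A minimal sketch (a real leader additionally requires that entry to carry its current term before committing it):

```python
def advance_commit_index(match_index: dict[int, int], leader_last_index: int) -> int:
    """Highest log index replicated on a majority of the cluster.

    match_index maps each follower id to the highest index known to be
    on that follower; the leader's own log always counts as well.
    """
    acked = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    majority = len(acked) // 2 + 1
    # The majority-th largest acknowledged index is on >= majority servers.
    return acked[majority - 1]
```

With five servers and match indexes {5, 5, 2, 1} plus the leader at 5, three servers hold index 5, so the commit index advances to 5.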

This majority requirement is the key safety property: a committed entry lives on enough servers that any voting majority must include at least one server that has it. Combined with the vote-granting rule — a server only votes for candidates whose logs are at least as up-to-date as its own — this yields Raft's Leader Completeness guarantee: every future leader's log contains every committed entry, so committed entries are never lost.

Commit index, quorum, and uncommitted entries

Every Raft node tracks two numbers:

  • log.length — how many entries have been appended (proposed but not necessarily safe)
  • commitIndex — how many entries are durably committed (a quorum has acknowledged them)

The gap between the two is the uncommitted tail — entries the leader has proposed but hasn't yet heard back from a quorum about. These are shown as hollow squares in the log panel. Uncommitted entries are not safe: if the leader crashes before reaching quorum, a new leader may have a shorter log, and any follower with extra entries must truncate them back.
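The truncation the panel shows can be sketched as the follower's handling of an AppendEntries body. As in the simulation, entries are represented by the term they were created in:

```python
def apply_append_entries(log: list[int], prev_index: int, entries: list[int]) -> list[int]:
    """Follower-side sketch: apply the leader's entries after the agreed
    prefix ending at prev_index (the 0-based length of that prefix)."""
    for i, term in enumerate(entries):
        idx = prev_index + i
        if idx < len(log) and log[idx] != term:
            del log[idx:]  # conflict: truncate the uncommitted tail
        if idx >= len(log):
            log.append(term)  # append the leader's entry
    return log
```

A follower with the log [1, 1, 2, 2] whose last two entries never reached quorum will, on receiving [3, 3] after prefix length 2 from a term-3 leader, truncate and end with [1, 1, 3, 3].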

An example walkthrough:

  1. Click ⚡ Partition 3|2 to isolate N4 and N5 from the majority.
  2. Hit 📨 Client write two or three times. Watch the majority partition (N1–N3) commit those entries — their log panel squares fill in solid.
  3. Now click the leader to disconnect it mid-term.
  4. A new leader is elected in the majority. If any follower received entries that were never acknowledged by a quorum, those entries show hollow; when the new leader's first AppendEntries arrives, you'll see the event log print ✂ Nₓ removes N uncommitted entries and the hollow squares disappear.
  5. Click 🔗 Heal partition — N4 and N5 rejoin as followers and the current leader sends them the missing committed entries, catching them up.

Why is this safe? The new leader must have a log at least as up-to-date as a majority (enforced by the vote-granting check), so committed entries are always present in the winner's log. Only entries that never reached quorum can be overwritten — and by definition those were never acknowledged to any client.

Catch-up: AppendEntries and InstallSnapshot

When a follower falls behind — perhaps it was offline, or sits in the minority partition — the leader uses AppendEntries to replay the missing suffix of the log. The leader decrements its estimate of the follower's position (nextIndex) until the logs agree, then sends all entries from that point forward. In the log panel you can watch a rejoining node's squares fill in one heartbeat round at a time.
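This decrement loop is sound because of Raft's Log Matching property: if two logs hold the same term at the same index, everything before that index is identical, so one comparison per probe suffices. A sketch of finding the agreement point — done here in a single pass over in-memory logs rather than over successive RPC rounds:

```python
def find_agreement(leader_log: list[int], follower_log: list[int]) -> int:
    """Length of the longest shared prefix; the leader resends everything
    after it. Entries are represented by their terms."""
    i = min(len(leader_log), len(follower_log))
    # Walk backwards, as the leader does with nextIndex, until the entry
    # just before position i matches on both sides.
    while i > 0 and leader_log[i - 1] != follower_log[i - 1]:
        i -= 1
    return i
```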

For nodes that are very far behind — hours of log entries in a production cluster — replaying the full log would be too slow. Raft handles this with InstallSnapshot: the leader compresses its entire state machine into a snapshot, ships it to the lagging follower in one RPC, and the follower jumps straight to the snapshot point, discarding its old log up to that index. This is how systems like etcd and CockroachDB handle nodes that have been offline for extended periods.


Viewstamped Replication

VSR (first described by Oki & Liskov in 1988 and revised by Liskov & Cowling in 2012) frames the problem as state machine replication: keep replicas of a state machine in sync so that the system continues to work even when some replicas fail.

Terminology

VSR uses different names for the same concepts:

VSR term         Raft equivalent
Primary (red)    Leader
Backup (blue)    Follower
View             Term
View change      Leader election
Op-number        Log index

Normal operation

During normal operation, the primary receives client requests and drives replication:

  1. Primary sends Prepare(view, op-number, request) to all backups (blue circles)
  2. Each backup increments its op-number, appends the entry, and responds PrepareOK (teal circles)
  3. When the primary has collected PrepareOK from a majority, the entry is committed
  4. Primary broadcasts Commit (yellow circles) so backups can advance their commit pointer

The primary also sends periodic Prepare messages even with no new client operations, acting as a heartbeat and keeping backups' timers reset.

[Interactive simulation — VSR. Legend: Leader/Primary · Follower/Backup · Candidate · View-change · Disconnected. Messages: Client write, Prepare, PrepareOK, Commit, StartViewChange, DoViewChange, StartView. Includes a speed control and the Replication Log panel.]

Try it: disconnect the primary. Backups notice the silence, their timers expire, and they kick off a view change. Red StartViewChange messages propagate through the cluster. Purple DoViewChange messages converge on the deterministically chosen new primary (the node at index view mod N). Once it collects a majority, it fires yellow StartView to announce the new view, and normal Prepare replication resumes.

View change protocol

The VSR view change is more structured than Raft's election:

  1. A backup whose timer expires increments the view number and broadcasts StartViewChange(v+1) to all replicas
  2. Any replica receiving this message also broadcasts it (propagation), and sends DoViewChange to the designated new primary — deterministically the node at index v+1 mod N
  3. The new primary waits for DoViewChange from a majority, then starts the new view with StartView

The key difference from Raft: the new leader is not elected by competition — it is mathematically determined by the view number. If that node is also down, the view number keeps incrementing until an available node lands on its slot.
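That determinism fits in a few lines — the new primary is computed, not campaigned for, and an unreachable designee simply pushes the view number further. Here `alive` is a stand-in for "responds to StartViewChange":

```python
def next_primary(view: int, alive: set[int], n_replicas: int) -> tuple[int, int]:
    """Advance the view until its designated primary (view mod N) is live."""
    view += 1
    while (view % n_replicas) not in alive:
        view += 1  # designee is down: keep incrementing the view
    return view, view % n_replicas
```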


Raft vs. Viewstamped Replication

Both algorithms tolerate up to ⌊(N−1)/2⌋ simultaneous failures in a cluster of N servers. A 5-node cluster can lose 2 nodes and still make progress.
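The quorum arithmetic behind that claim, as a quick sketch:

```python
def quorum(n: int) -> int:
    # The smallest group guaranteed to overlap every other majority:
    # a strict majority of the cluster.
    return n // 2 + 1

def max_failures(n: int) -> int:
    # Everything outside one quorum can fail while the rest makes progress.
    return (n - 1) // 2
```

Note that an even-sized cluster tolerates no more failures than the odd size below it (4 nodes tolerate 1 failure, just like 3), which is why production clusters are usually 3 or 5 nodes.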

Property               Raft                                        VSR
New leader selection   First candidate to win a majority vote      Deterministic: view mod N
Heartbeat mechanism    AppendEntries (also carries log entries)    Prepare + separate Commit
View/term tracking     currentTerm                                 view-number
Vote deduplication     votedFor per term                           Implicit via DoViewChange quorum
Historical role        Modern, pedagogical (2014)                  Pioneer; predates the Paxos paper (1988)

Conceptually, the two are duals: Raft handles leader election and log replication in the same RPC (AppendEntries), while VSR separates replication (Prepare/PrepareOK) from commit notification (Commit) and keeps view changes as a distinct three-phase protocol.

Both algorithms require a quorum — a strict majority of servers — to make any progress. This is unavoidable: it is the minimum requirement to prevent split-brain (two leaders simultaneously accepting writes). With N=5, you need 3 servers alive. Try adding nodes in the simulation and then disconnecting several at once to see when the cluster halts.

What the simulation doesn't show

The visualisation is intentionally simplified to keep it legible:

  • Log contents are simplified — entries only carry the term they were created in, not actual data
  • Disk writes are omitted (in real systems, entries are fsynced before acknowledging)
  • One-directional partitions are not modelled — the partition button disconnects both directions symmetrically
  • Membership changes (adding/removing nodes without downtime via joint-consensus) are not modelled
  • Pre-vote and leadership transfer extensions (common in production implementations) are omitted

For a deeper treatment of these topics, the original papers are surprisingly readable.
