Here’s a protocol-level review focused only on VSR/VR* semantics (not syntax, style, or message plumbing). I’ll go top-down and call out safety first, then liveness & spec deviations, and finish with concrete fixes you can drop in.
* I’m referencing the “Viewstamped Replication Revisited” terminology.
Solid / on-track
- Primary drives the log with
Prepare; replicas accept only atlog_length + 1→ preserves the “no holes” property. - ACK accounting is keyed by
{view, op}so cross-view contamination is avoided. - Commit on the primary requires > N/2
PrepareOk(you pre-seed count with the leader’s own write), and you broadcastCommit. - View change: you gather
DoViewChangeon the primary-of-v, pick the log with max(last_normal_view, op_number)and setcommit_number := max(commits)→ that’s the core VR merge rule. - Receiving
Preparefrom a higher view causes the follower to adopt the new view immediately (good for quick convergence).
Safety bugs / correctness gaps
-
Replica does not check for log conflicts when
op_number <= log_length. Inprepare_impl/2you unconditionally sendPrepareOkifprepare.op_number <= log_length. If the entry atop_numberdiffers (different operation or view), you can acknowledge a conflicting history → violates VR’s log-matching invariant. -
Primary election must be deterministic; using arbitrary terms (especially PIDs) to order members is unsafe.
primary/1andprimary_for_view/2compute the leader byEnum.sort([self | replicas]). If node IDs are PIDs or otherwise not a globally stable, comparable ordering across processes/environments, different replicas can disagree on who the primary for a view is → split brain. -
Quorum size inconsistency (potentially unsafe with custom
cluster_size). Commit threshold usesconnected_replicas = MapSet.size(state.replicas) + 1, while other places referencestate.cluster_size. If a caller setscluster_sizelarger than|replicas|+1, you could commit with fewer than a strict majority ofcluster_size. -
Conflict recovery only handles “gaps”, not “mismatch”. You request
GetStatewhenprepare.op_number > log_length + 1. You don’t request recovery (or truncate) whenop_number <= log_lengthbut contents mismatch. That leaves the system able to drift into divergent logs that you still ACK.
Spec / behavior deviations & liveness nits (not strictly unsafe)
5. Start-View-Change handshake naming/flow.
You use StartViewChange + broadcast StartViewChangeAck, then emit DoViewChange after a majority of ACKs. This can work (your ACK is effectively a “me too, I’m in view-change v”), but it differs from the usual “count StartViewChange messages”. It’s okay if (and only if) nodes broadcast an ACK iff they have themselves entered view-change v (your code does that), but the intent is subtle.
-
GetState’s view field. When a follower sees a higher-viewPrepare, if it hits a gap you sendGetStatewithview: state.view_number(the old view), notprepare.view. Minor, but it makes it easier to reject state transfer incorrectly. -
Commit advancement semantics are fine but implicit.
maybe_commit_operation/2advancescommit_numberstraight to the op that just reached majority. That’s safe in VR since acceptance ati+1implies the same majority has≤ idue to “append only atlog_length + 1”, but it’s worth a comment in code to avoid accidental future changes. -
Client deduplication semantics are incomplete. If a duplicate request (same
client_id/request_id) arrives while the first is still processing, you “do nothing”. Only the originalfromgets the reply; the retried caller starves. That’s not a VR safety issue, but it breaks “at-most-once with retries”. -
Timers after adopting a new leader via
Prepare. When a node inappropriately believes it’s primary then adopts a higher-viewPrepare, its heartbeat timer is left to fire once (harmless) but isn’t actively canceled. Not unsafe—just noisy. -
NewStateacceptance does not reset per-view counters/timers. AfterNewState, you don’t clearprepare_ok_countor adjust timers based on role in the new view. Again not unsafe, but you’ll carry dead counters.
In prepare_impl/2, add a fetch and compare:
# inside the clause where prepare.view == state.view_number (or after adopting view)
log_length = get_log_length(state)
cond do
prepare.op_number == log_length + 1 ->
# existing path (append)
...
prepare.op_number <= log_length ->
case state.module.log_fetch(state.log, prepare.op_number) do
{:ok, %LogEntry{view: v, operation: op}} when v == prepare.view and op == prepare.operation ->
# exact match – idempotent ACK
send_prepare_ok(state, prepare)
{:noreply, state}
_ ->
# conflict: request state transfer (or truncate then catch up)
primary_node = prepare.leader_id || primary(state)
send_to(state, primary_node, %GetState{
view: prepare.view, # use leader's view
op_number: state.op_number,
sender: state.node_id
})
{:noreply, state}
end
Alternative: proactively truncate to op_number - 1 and then accept the leader’s entry (VR allows leader to overwrite uncommitted suffixes). If you have log_replace but not a targeted truncate, add a log_truncate_from(op_number) callback or re-build from log_get_from/2.
Avoid sorting PIDs/opaque terms. Store a shared, ordered list of replica IDs (e.g., atoms, integers, UUIDs) agreed by all members, and compute:
# At cluster setup
%{state | members: Enum.sort(stable_ids), cluster_size: length(stable_ids)}
defp primary_for_view(view, state) do
Enum.at(state.members, rem(view, state.cluster_size))
end
defp primary(state), do: primary_for_view(state.view_number, state)If you must keep replicas as a MapSet, add a distinct member_ids :: [term] that all nodes initialize identically and use only that for primary calculation.
Unify all majority checks on state.cluster_size:
has_quorum?/1: keepconnected_countif you’re actually tracking liveness; otherwise just usecluster_size(or remove the gate—right now it’s effectively always true).maybe_commit_operation/2: compare ACKs againstdiv(state.cluster_size, 2); notMapSet.size(state.replicas)+1. That way if someone explicitly setscluster_size, you won’t under-count required ACKs.
In both the “gap” and conflict paths:
%GetState{view: prepare.view, ...}…and optionally ignore NewState whose view < state.view_number.
You already clear prepare_ok_count in some places. Do the same when:
- Adopting a higher view via
Prepare - Accepting
NewState - Processing
StartView
(You handled two of those; add it to new_state_impl/2.)
VR commonly overwrites uncommitted suffixes. A concise approach:
- On conflict at
i: truncate log toi-1, setop_number := i-1, then treat thePrepareaslog_length + 1and append. - If there’s a gap (
i > log_length + 1), ask for state (GetState) or ask for entries fromlog_length+1.
This reduces full-state transfers and lines up with the leader-driven overwrite rule.
To make retries get the identical reply even while the original is “processing”:
- Keep a per-client waiter list (the
fromvalues) alongside the{:processing, request_id}entry, or - Immediately reply to duplicates with a special token like
{:inflight, request_id}and let the client poll, or - Cache
(request_id -> result)and on commit broadcast the result to all recordedfroms, then collapse to{:cached, request_id, result}.
Not a VR safety issue, but it’s the difference between “exactly-once view” that users expect and silent timeouts.
-
StartViewChange/Ackpath works because any node that enters view-change broadcasts an ack (which you treat as the “I’m in” message). The naming is unconventional; a future reader may assume you count the wrong thing. A comment there would help. -
NewState: after acceptance, consider callingreset_primary_inactivity_timer/1(you’re a follower) and clearingview_change_votes. -
Preparecarriescommit_number; followers can advance commits earlier by applying up to that number (it’s an optimization; safe either way since you also receiveCommit).
-
Conflict ACK is the only true red-flag: acknowledging a different value at the same
(view, op)can let two different leaders in the same/next view “prove” different logs with majorities that don’t intersect on content. That breaks linearizability. -
Deterministic primary prevents split-brain after view changes. Even with perfect quorums, if nodes disagree on who primary is for v, you can end up with two leaders issuing
Prepares in the same view. -
Quorum constants must reflect the configured cluster, not a local count, or you risk committing with < majority in mis-configuration scenarios.
Apply those three and the implementation lines up with the standard VR invariants; the rest are quality and convergence improvements.