Skip to content

Latest commit

 

History

History
171 lines (110 loc) · 9.73 KB

File metadata and controls

171 lines (110 loc) · 9.73 KB

Here’s a protocol-level review focused only on VSR/VR* semantics (not syntax, style, or message plumbing). I’ll go top-down and call out safety first, then liveness & spec deviations, and finish with concrete fixes you can drop in.

* I’m referencing the “Viewstamped Replication Revisited” terminology.


TL;DR (what’s correct vs. what’s risky)

Solid / on-track

  • Primary drives the log with Prepare; replicas accept only at log_length + 1 → preserves the “no holes” property.
  • ACK accounting is keyed by {view, op} so cross-view contamination is avoided.
  • Commit on the primary requires > N/2 PrepareOk (you pre-seed count with the leader’s own write), and you broadcast Commit.
  • View change: you gather DoViewChange on the primary-of-v, pick the log with max (last_normal_view, op_number) and set commit_number := max(commits) → that’s the core VR merge rule.
  • Receiving Prepare from a higher view causes the follower to adopt the new view immediately (good for quick convergence).

Safety bugs / correctness gaps

  1. Replica does not check for log conflicts when op_number <= log_length. In prepare_impl/2 you unconditionally send PrepareOk if prepare.op_number <= log_length. If the entry at op_number differs (different operation or view), you can acknowledge a conflicting history → violates VR’s log-matching invariant.

  2. Primary election must be deterministic; using arbitrary terms (especially PIDs) to order members is unsafe. primary/1 and primary_for_view/2 compute the leader by Enum.sort([self | replicas]). If node IDs are PIDs or otherwise not a globally stable, comparable ordering across processes/environments, different replicas can disagree on who the primary for a view is → split brain.

  3. Quorum size inconsistency (potentially unsafe with custom cluster_size). Commit threshold uses connected_replicas = MapSet.size(state.replicas) + 1, while other places reference state.cluster_size. If a caller sets cluster_size larger than |replicas|+1, you could commit with fewer than a strict majority of cluster_size.

  4. Conflict recovery only handles “gaps”, not “mismatch”. You request GetState when prepare.op_number > log_length + 1. You don’t request recovery (or truncate) when op_number <= log_length but contents mismatch. That leaves the system able to drift into divergent logs that you still ACK.

Spec / behavior deviations & liveness nits (not strictly unsafe) 5. Start-View-Change handshake naming/flow. You use StartViewChange + broadcast StartViewChangeAck, then emit DoViewChange after a majority of ACKs. This can work (your ACK is effectively a “me too, I’m in view-change v”), but it differs from the usual “count StartViewChange messages”. It’s okay if (and only if) nodes broadcast an ACK iff they have themselves entered view-change v (your code does that), but the intent is subtle.

  1. GetState’s view field. When a follower sees a higher-view Prepare, if it hits a gap you send GetState with view: state.view_number (the old view), not prepare.view. Minor, but it makes it easier to reject state transfer incorrectly.

  2. Commit advancement semantics are fine but implicit. maybe_commit_operation/2 advances commit_number straight to the op that just reached majority. That’s safe in VR since acceptance at i+1 implies the same majority has ≤ i due to “append only at log_length + 1”, but it’s worth a comment in code to avoid accidental future changes.

  3. Client deduplication semantics are incomplete. If a duplicate request (same client_id/request_id) arrives while the first is still processing, you “do nothing”. Only the original from gets the reply; the retried caller starves. That’s not a VR safety issue, but it breaks “at-most-once with retries”.

  4. Timers after adopting a new leader via Prepare. When a node inappropriately believes it’s primary then adopts a higher-view Prepare, its heartbeat timer is left to fire once (harmless) but isn’t actively canceled. Not unsafe—just noisy.

  5. NewState acceptance does not reset per-view counters/timers. After NewState, you don’t clear prepare_ok_count or adjust timers based on role in the new view. Again not unsafe, but you’ll carry dead counters.


Concrete fixes (minimal, surgical)

1) Detect and handle conflicts when op_number <= log_length

In prepare_impl/2, add a fetch and compare:

# inside the clause where prepare.view == state.view_number (or after adopting view)
log_length = get_log_length(state)

cond do
  prepare.op_number == log_length + 1 ->
    # existing path (append)
    ...

  prepare.op_number <= log_length ->
    case state.module.log_fetch(state.log, prepare.op_number) do
      {:ok, %LogEntry{view: v, operation: op}} when v == prepare.view and op == prepare.operation ->
        # exact match – idempotent ACK
        send_prepare_ok(state, prepare)
        {:noreply, state}

      _ ->
        # conflict: request state transfer (or truncate then catch up)
        primary_node = prepare.leader_id || primary(state)
        send_to(state, primary_node, %GetState{
          view: prepare.view,     # use leader's view
          op_number: state.op_number,
          sender: state.node_id
        })
        {:noreply, state}
    end

Alternative: proactively truncate to op_number - 1 and then accept the leader’s entry (VR allows leader to overwrite uncommitted suffixes). If you have log_replace but not a targeted truncate, add a log_truncate_from(op_number) callback or re-build from log_get_from/2.

2) Use a stable, explicit membership for leader selection

Avoid sorting PIDs/opaque terms. Store a shared, ordered list of replica IDs (e.g., atoms, integers, UUIDs) agreed by all members, and compute:

# At cluster setup
%{state | members: Enum.sort(stable_ids), cluster_size: length(stable_ids)}

defp primary_for_view(view, state) do
  Enum.at(state.members, rem(view, state.cluster_size))
end

defp primary(state), do: primary_for_view(state.view_number, state)

If you must keep replicas as a MapSet, add a distinct member_ids :: [term] that all nodes initialize identically and use only that for primary calculation.

3) Make quorum math consistently use cluster_size

Unify all majority checks on state.cluster_size:

  • has_quorum?/1: keep connected_count if you’re actually tracking liveness; otherwise just use cluster_size (or remove the gate—right now it’s effectively always true).
  • maybe_commit_operation/2: compare ACKs against div(state.cluster_size, 2); not MapSet.size(state.replicas)+1. That way if someone explicitly sets cluster_size, you won’t under-count required ACKs.

4) Use the leader’s view when requesting state

In both the “gap” and conflict paths:

%GetState{view: prepare.view, ...}

…and optionally ignore NewState whose view < state.view_number.

5) Clear per-view counters on any view adoption

You already clear prepare_ok_count in some places. Do the same when:

  • Adopting a higher view via Prepare
  • Accepting NewState
  • Processing StartView

(You handled two of those; add it to new_state_impl/2.)

6) Optional but recommended: prefer truncate-and-catch-up over GetState

VR commonly overwrites uncommitted suffixes. A concise approach:

  • On conflict at i: truncate log to i-1, set op_number := i-1, then treat the Prepare as log_length + 1 and append.
  • If there’s a gap (i > log_length + 1), ask for state (GetState) or ask for entries from log_length+1.

This reduces full-state transfers and lines up with the leader-driven overwrite rule.

7) Duplicate request handling (client semantics)

To make retries get the identical reply even while the original is “processing”:

  • Keep a per-client waiter list (the from values) alongside the {:processing, request_id} entry, or
  • Immediately reply to duplicates with a special token like {:inflight, request_id} and let the client poll, or
  • Cache (request_id -> result) and on commit broadcast the result to all recorded froms, then collapse to {:cached, request_id, result}.

Not a VR safety issue, but it’s the difference between “exactly-once view” that users expect and silent timeouts.


Smaller notes / tidy-ups

  • StartViewChange/Ack path works because any node that enters view-change broadcasts an ack (which you treat as the “I’m in” message). The naming is unconventional; a future reader may assume you count the wrong thing. A comment there would help.

  • NewState: after acceptance, consider calling reset_primary_inactivity_timer/1 (you’re a follower) and clearing view_change_votes.

  • Prepare carries commit_number; followers can advance commits earlier by applying up to that number (it’s an optimization; safe either way since you also receive Commit).


Why these fixes matter (safety lens)

  • Conflict ACK is the only true red-flag: acknowledging a different value at the same (view, op) can let two different leaders in the same/next view “prove” different logs with majorities that don’t intersect on content. That breaks linearizability.

  • Deterministic primary prevents split-brain after view changes. Even with perfect quorums, if nodes disagree on who primary is for v, you can end up with two leaders issuing Prepares in the same view.

  • Quorum constants must reflect the configured cluster, not a local count, or you risk committing with < majority in mis-configuration scenarios.

Apply those three and the implementation lines up with the standard VR invariants; the rest are quality and convergence improvements.