Skip to content

Add test for null snapshot recovery in create_sync_snapshot_req#97

Merged
antonio2368 merged 2 commits intomasterfrom
fix-snapshot-add-test
Mar 19, 2026
Merged

Add test for null snapshot recovery in create_sync_snapshot_req#97
antonio2368 merged 2 commits intomasterfrom
fix-snapshot-add-test

Conversation

@pamarcos
Copy link
Copy Markdown
Member

The previous commit replaced the fatal system_exit/abort in create_sync_snapshot_req with graceful recovery (clear sync state, retry on next heartbeat). That fix has two branches:

  1. snp == nullptr: snapshot is unavailable, log warning and retry
  2. last_log_idx > snp->get_last_log_idx(): peer advanced past snapshot, clear sync flag and retry

The existing snapshot_stale_sync_flag_test only covers the caller- side guard in create_append_entries_req, which prevents entering create_sync_snapshot_req when the peer has caught up. It does not exercise the recovery paths inside create_sync_snapshot_req itself (the old code had LCOV_EXCL markers confirming zero coverage).

This commit adds snapshot_null_snapshot_join_test, which exercises branch (1) via the handle_join_leave -> sync_log_to_new_srv path:

  • Set up a 2-node cluster with compacted logs
  • Null out the leader's last_snapshot_ via a new test helper
  • Add a new server whose start_idx < log_store_->start_index(), forcing create_sync_snapshot_req with a null snapshot
  • Verify the leader does NOT abort and recovers after the snapshot is restored by a subsequent snapshot creation cycle

To support this, a clear_last_snapshot helper is added to raft_server_handler and exposed through FakeNetwork.

The previous commit replaced the fatal system_exit/abort in
create_sync_snapshot_req with graceful recovery (clear sync state,
retry on next heartbeat). That fix has two branches:

  1. snp == nullptr: snapshot is unavailable, log warning and retry
  2. last_log_idx > snp->get_last_log_idx(): peer advanced past
     snapshot, clear sync flag and retry

The existing snapshot_stale_sync_flag_test only covers the caller-
side guard in create_append_entries_req, which prevents entering
create_sync_snapshot_req when the peer has caught up. It does not
exercise the recovery paths inside create_sync_snapshot_req itself
(the old code had LCOV_EXCL markers confirming zero coverage).

This commit adds snapshot_null_snapshot_join_test, which exercises
branch (1) via the handle_join_leave -> sync_log_to_new_srv path:

  - Set up a 2-node cluster with compacted logs
  - Null out the leader's last_snapshot_ via a new test helper
  - Add a new server whose start_idx < log_store_->start_index(),
    forcing create_sync_snapshot_req with a null snapshot
  - Verify the leader does NOT abort and recovers after the
    snapshot is restored by a subsequent snapshot creation cycle

To support this, a clear_last_snapshot helper is added to
raft_server_handler and exposed through FakeNetwork.
@pamarcos pamarcos requested a review from antonio2368 March 18, 2026 14:51
@antonio2368 antonio2368 requested a review from Copilot March 19, 2026 07:47
@antonio2368 antonio2368 changed the base branch from fix-snapshot to master March 19, 2026 07:49
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a unit test to cover the “null snapshot” recovery branch inside create_sync_snapshot_req, ensuring the leader does not abort when get_last_snapshot() returns nullptr during a join-driven snapshot sync.

Changes:

  • Add snapshot_null_snapshot_join_test to exercise join → sync_log_to_new_srvcreate_sync_snapshot_req when the leader snapshot pointer is null.
  • Expose a test-only helper to clear raft_server::last_snapshot_ via raft_server_handler and FakeNetwork.
  • Register the new test in snapshot_test’s main.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
tests/unit/snapshot_test.cxx Adds a new join-based unit test that simulates last_snapshot_ == nullptr and verifies the leader recovers and the new node catches up.
tests/unit/fake_network.hxx Exposes clearLastSnapshot helper to tests via FakeNetwork.
include/libnuraft/raft_server_handler.hxx Adds a protected test helper to clear the server’s last snapshot pointer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +1061 to +1070
// Retry the join: trigger heartbeats and drive the protocol
// until S3 finishes receiving the snapshot and catches up.
for (int ii = 0; ii < 10; ++ii) {
s1.fTimer->invoke( timer_task_type::heartbeat_timer );
s1.fNet->execReqResp();
s1.fNet->execReqResp();
CHK_Z( wait_for_sm_exec(pkgs, COMMIT_TIMEOUT_SEC) );
if (!s3.raftServer->is_receiving_snapshot()) break;
}


// Verify snapshot was created and logs compacted.
CHK_OK( s2.getTestSm()->isSame( *s1.getTestSm() ) );

@antonio2368 antonio2368 merged commit 9d78ffb into master Mar 19, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants