Sentinel: Refactor test design to allow two clusters to help avoid data leakage during failover #2717
I found this issue in the Redis repo (redis/redis#14313) and confirmed that it also exists in Valkey. The description there is lengthy and detailed; the TL;DR is that during failover a sentinel will promote whatever slave instance it has discovered, which can cause serious data leakage if that slave actually holds the data of another customer/tenant.

I spent a couple of weeks on this issue and got it working as expected. To make code review easier, I broke the changes up into smaller patches. This is the first part.
(Note that I'm using the terms master/slave; I'm aware they mean the same as primary/replica.)
How to reproduce the issue
We need to run two clusters and have two sentinel instances, each monitoring one of them.

Duplicate the file `sentinel.conf` as `sentinel-a.conf` and `sentinel-b.conf`. Add `sentinel monitor cluster-a 127.0.0.1 6379 1` to `sentinel-a.conf`, and `sentinel monitor cluster-b 127.0.0.1 6380 1` to `sentinel-b.conf`. Then replace all occurrences of "mymaster" with "cluster-a"/"cluster-b" correspondingly.

At the root directory of the repo:
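The conf duplication above can be sketched as a small shell script. Note the assumptions: the base `sentinel.conf` contents here are a minimal stand-in rather than the repo's actual file, and the sentinel ports (26379/26380, distinct so both sentinels can run on one host) are illustrative.

```shell
#!/bin/sh
# Minimal stand-in for the repo's sentinel.conf (an assumption, not the real file):
cat > sentinel.conf <<'EOF'
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
EOF

# Duplicate the file as sentinel-a.conf and sentinel-b.conf:
cp sentinel.conf sentinel-a.conf
cp sentinel.conf sentinel-b.conf

# Rename "mymaster" to the per-cluster name and point each sentinel at its
# own master, with quorum 1 as in the steps above:
sed -i 's/mymaster 127.0.0.1 6379 2/cluster-a 127.0.0.1 6379 1/' sentinel-a.conf
sed -i 's/mymaster 127.0.0.1 6379 2/cluster-b 127.0.0.1 6380 1/' sentinel-b.conf

# Give sentinel B its own port so both sentinels can run side by side
# (port choice is an assumption for this sketch):
sed -i 's/^port 26379/port 26380/' sentinel-b.conf
```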
Now shut down the master server (port 6379) in cluster A, and we will see sentinel A start a failover, trying to promote a candidate slave. Since the only slave, 6381, now belongs to a different cluster (cluster B), the failover should fail with a `-failover-abort-no-good-slave` event. Instead, sentinel A still thinks slave 6381 is following master A (6379) and promotes 6381 to become the new master in cluster A.

Changes I made
The challenging part of solving this issue was setting up tests. The entire sentinel test suite is designed around a single master-slave cluster with the master named "mymaster", but my code changes require spinning up two clusters, so I had to refactor `tests/sentinel/tests/includes/init-tests.tcl` to support that.

I also added a simple test in `04-replica-selection.tcl` to verify that the second cluster is working. Once this PR is merged, this file is where I will implement the tests that check sentinels verify the replication ID during failover. Interestingly, this file was originally intended to test the slave selection procedure, but the test was never implemented since 2014.
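To make the intended fix concrete, here is a conceptual sketch (plain shell, not the actual sentinel C code) of replication-ID-aware slave selection: only slaves whose replication ID matches the failed master's should be promotion candidates, and selection should abort otherwise. The ports and replication IDs below are made-up examples.

```shell
#!/bin/sh
# Replication ID of the failed master (made-up example value):
MASTER_REPLID="aaaa1111"

# "port:replid" pairs the sentinel has discovered. Slave 6381 actually
# replicates a master in another cluster, so its replid differs:
SLAVES="6381:bbbb2222 6382:aaaa1111"

select_candidate() {
    for slave in $SLAVES; do
        port=${slave%%:*}
        replid=${slave##*:}
        # The buggy behavior skips this check and would happily pick 6381,
        # leaking cluster B's data into cluster A:
        if [ "$replid" = "$MASTER_REPLID" ]; then
            echo "$port"
            return 0
        fi
    done
    # No slave shares the master's replication history; abort the failover:
    echo "-failover-abort-no-good-slave" >&2
    return 1
}

select_candidate
```

With the sample data above, the function selects 6382 and skips 6381; if no slave's replid matched, it would report the abort event instead of promoting a foreign slave.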