Conversation

zhijun42 (Contributor)

I found this issue in the Redis repo (redis/redis#14313) and confirmed that it also exists in Valkey. The description there is pretty lengthy and detailed; the TL;DR is that during failover a sentinel will promote whatever slave instance it has discovered, which can cause serious data leakage if that slave actually holds data belonging to another customer/tenant.

I spent a couple of weeks on this issue and got it working as expected. To make code review easier, I broke the changes up into smaller patches. This is the first part.

(Note that I use the terms master/slave below; I'm aware they are the same as primary/replica.)

How to reproduce the issue

We need to run two clusters, with one sentinel instance monitoring each of them.

Duplicate the file sentinel.conf as sentinel-a.conf and sentinel-b.conf. Add sentinel monitor cluster-a 127.0.0.1 6379 1 to sentinel-a.conf and sentinel monitor cluster-b 127.0.0.1 6380 1 to sentinel-b.conf, then replace all occurrences of "mymaster" with "cluster-a"/"cluster-b" accordingly (one way to do this is sketched below).
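The exact commands are not part of the original steps; this is just one way to get there, assuming the stock sentinel.conf (which ships with a single "mymaster" monitor block) and GNU sed:

# Copy the stock config and rename the monitored master in each copy
cp sentinel.conf sentinel-a.conf
cp sentinel.conf sentinel-b.conf
sed -i 's/mymaster/cluster-a/g' sentinel-a.conf
sed -i 's/mymaster/cluster-b/g' sentinel-b.conf

# Point each sentinel at its own master and lower the quorum to 1
sed -i 's/^sentinel monitor cluster-a .*/sentinel monitor cluster-a 127.0.0.1 6379 1/' sentinel-a.conf
sed -i 's/^sentinel monitor cluster-b .*/sentinel monitor cluster-b 127.0.0.1 6380 1/' sentinel-b.conf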

From the root directory of the repo:

# Set up the valkey server clusters
./src/valkey-server --port 6379  # this is cluster A
./src/valkey-server --port 6380  # this is cluster B
./src/valkey-server --port 6381 --replicaof 127.0.0.1 6379  # slave in cluster A

# Set up the sentinel instances
./src/valkey-server sentinel-a.conf --sentinel --port 26379
./src/valkey-server sentinel-b.conf --sentinel --port 26380

# Insert data
./src/valkey-cli -p 6379 set "key-cluster-a" "value-a"
./src/valkey-cli -p 6380 set "key-cluster-b" "value-b"

# Verify that the slave has replicated the data
./src/valkey-cli -p 6381 get "key-cluster-a"

# Manually ask the slave to follow a different master
./src/valkey-cli -p 6381 replicaof 127.0.0.1 6380
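
# (Optional check, not in the original steps) confirm the slave now follows cluster B:
# the output should show role:slave with master_port:6380
./src/valkey-cli -p 6381 info replication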

Now shut down the master server (port 6379) in cluster A, and we will see sentinel A start a failover and try to promote a candidate slave. Since the only slave, 6381, now belongs to a different cluster (cluster B), the failover should fail with a -failover-abort-no-good-slave event. Instead, sentinel A still thinks slave 6381 is replicating master A (6379) and promotes 6381 to become the new master of cluster A.
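Concretely, something like the following shows the bad promotion (these are standard valkey-cli and SENTINEL commands; with the stock config the failover only starts after cluster-a's 30-second down-after-milliseconds window has elapsed):

# Kill cluster A's master
./src/valkey-cli -p 6379 shutdown nosave

# After the failover completes, sentinel A reports 6381 as cluster A's new master...
./src/valkey-cli -p 26379 sentinel get-master-addr-by-name cluster-a

# ...even though 6381 is serving cluster B's data at that point
./src/valkey-cli -p 6381 get "key-cluster-b"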

Changes I made

The challenging part of solving this issue was setting up tests. The entire sentinel test suite is designed around a single master-slave cluster whose master is named "mymaster", but my changes require spinning up two clusters, so I had to refactor tests/sentinel/tests/includes/init-tests.tcl to support that.

I also added a simple test in 04-replica-selection.tcl to verify that the second cluster works. Once this PR is merged, that file is where I will implement the tests that make sentinels verify the replication ID during failover. Interestingly, the file was originally intended to test the slave selection procedure, but that test has never been implemented since the file was added in 2014.


codecov bot commented Oct 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.57%. Comparing base (8182f4a) to head (00efef6).

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2717      +/-   ##
============================================
- Coverage     72.58%   72.57%   -0.01%     
============================================
  Files           128      128              
  Lines         71277    71277              
============================================
- Hits          51736    51730       -6     
- Misses        19541    19547       +6     

see 17 files with indirect coverage changes


hwware (Member) commented Oct 15, 2025

Thanks for your contribution, I will take a look at it.
