Sentinel: Refactor test design to allow two clusters to help avoid data leakage during failover #2717
I found this issue in the Redis repo (redis/redis#14313) and confirmed that it also exists in Valkey. The description there is lengthy and detailed; the TL;DR is that during failover a sentinel will promote whatever slave instance it has discovered, which can cause serious data leakage if that slave actually holds the data of another customer/tenant.

I spent a couple of weeks on this issue and got it working as expected. To make code review easier, I broke the changes up into smaller patches. This is the first part.
(Note that I'm using the terms master/slave; I'm aware they mean the same as primary/replica.)
How to reproduce the issue
We need to run two clusters and have two sentinel instances, each monitoring one of them.

Duplicate the file `sentinel.conf` as `sentinel-a.conf` and `sentinel-b.conf`. Add `sentinel monitor cluster-a 127.0.0.1 6379 1` to `sentinel-a.conf`, and `sentinel monitor cluster-b 127.0.0.1 6380 1` to `sentinel-b.conf`. Then replace all occurrences of "mymaster" with "cluster-a"/"cluster-b" correspondingly.

At the root directory of the repo:
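The conf duplication above can be sketched as a small shell script. Note the assumptions: the base `sentinel.conf` contents here are a minimal stand-in rather than the repo's actual file, and the sentinel ports (26379/26380, distinct so both sentinels can run on one host) are illustrative.

```shell
#!/bin/sh
# Minimal stand-in for the repo's sentinel.conf (an assumption, not the real file):
cat > sentinel.conf <<'EOF'
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
EOF

# Duplicate the file as sentinel-a.conf and sentinel-b.conf:
cp sentinel.conf sentinel-a.conf
cp sentinel.conf sentinel-b.conf

# Rename "mymaster" to the per-cluster name and point each sentinel at its
# own master, with quorum 1 as in the steps above:
sed -i 's/mymaster 127.0.0.1 6379 2/cluster-a 127.0.0.1 6379 1/' sentinel-a.conf
sed -i 's/mymaster 127.0.0.1 6379 2/cluster-b 127.0.0.1 6380 1/' sentinel-b.conf

# Give sentinel B its own port so both sentinels can run side by side
# (port choice is an assumption for this sketch):
sed -i 's/^port 26379/port 26380/' sentinel-b.conf
```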
Now shut down the master server (port 6379) in cluster A, and we will see sentinel A start a failover, trying to promote a candidate slave. Since the only slave, 6381, now belongs to a different cluster (cluster B), the failover should fail with a `-failover-abort-no-good-slave` event. Instead, sentinel A still thinks slave 6381 is following master A (6379) and promotes 6381 to become the new master in cluster A.

Changes I made
The challenging part of solving this issue was setting up tests. The entire sentinel test suite is designed around a single master-slave cluster with the master named "mymaster", but my code changes require spinning up two clusters, so I had to refactor `tests/sentinel/tests/includes/init-tests.tcl` to support that.

I also added a simple test in `04-replica-selection.tcl` to verify that the second cluster is working. Once this PR is merged, this file is where I will implement the tests that check sentinels verify the replication ID during failover. Interestingly, this file was originally intended to test the slave selection procedure, but the test was never implemented since 2014.
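To make the intended fix concrete, here is a conceptual sketch (plain shell, not the actual sentinel C code) of replication-ID-aware slave selection: only slaves whose replication ID matches the failed master's should be promotion candidates, and selection should abort otherwise. The ports and replication IDs below are made-up examples.

```shell
#!/bin/sh
# Replication ID of the failed master (made-up example value):
MASTER_REPLID="aaaa1111"

# "port:replid" pairs the sentinel has discovered. Slave 6381 actually
# replicates a master in another cluster, so its replid differs:
SLAVES="6381:bbbb2222 6382:aaaa1111"

select_candidate() {
    for slave in $SLAVES; do
        port=${slave%%:*}
        replid=${slave##*:}
        # The buggy behavior skips this check and would happily pick 6381,
        # leaking cluster B's data into cluster A:
        if [ "$replid" = "$MASTER_REPLID" ]; then
            echo "$port"
            return 0
        fi
    done
    # No slave shares the master's replication history; abort the failover:
    echo "-failover-abort-no-good-slave" >&2
    return 1
}

select_candidate
```

With the sample data above, the function selects 6382 and skips 6381; if no slave's replid matched, it would report the abort event instead of promoting a foreign slave.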