-
Notifications
You must be signed in to change notification settings - Fork 928
Only call emptyData after RDB validation #2600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Only call emptyData after RDB validation #2600
Conversation
5acfe7b
to
0daaa13
Compare
if (!rdbIsVersionAccepted(rdbver, is_valkey_magic, is_redis_magic)) { | ||
serverLog(LL_WARNING, "Can't handle RDB format version %d", rdbver); | ||
return C_ERR; | ||
return RDB_INCOMPATIBLE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can look at rdb.tcl / replication.tcl to add integration tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussed above, but testing this is challenging with the TCL framework. I've reproducible demo of the fix, and there exist tests that cover the code paths already. If anyone has any ideas for covering an incompatible RDB load during a full sync, I'd be open to suggesitons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will look into adding a change to debug.c that allows us to hit this code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We hit this code path also when loading an RDB from a file? Then we can test this instead of a full sync. We can copy a file into the right place and then call DEBUG RELOAD NOSAVE to load it. If the load fails because of bad signature or future version, we shouldn't flush the databases. Is this right?
There are some RDB files here, for example one with the hyptetical future RDB version 987: https://github.com/valkey-io/valkey/tree/unstable/tests/assets
28a58fb
to
7a5c3a9
Compare
src/rdb.c
Outdated
} else { | ||
serverLog(LL_WARNING, "Wrong signature trying to load DB from file"); | ||
return C_ERR; | ||
/* Signal to terminate gracefully without clearing existing data */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update the comment here? Just don't want this (terminate gracefully
) to be misinterpreted as a shutdown
src/rdb.c
Outdated
/* Only empty data if RDBFLAGS_EMPTY_DATA is set */ | ||
if (rdbflags & RDBFLAGS_EMPTY_DATA) { | ||
int empty_db_flags = server.repl_replica_lazy_flush ? EMPTYDB_ASYNC : EMPTYDB_NO_FLAGS; | ||
serverLog(LL_NOTICE, "RDB compatability check passed. Flushing old data"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is more of a RDB signature and version validation. It does not check for full compatibility since unknown Module data types may exist in an RDB with valid signature & version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work @ChiliPaneer, with this change we are ensuring data flush happens only on verification of RDB compatibility.
Let's add a test case with incompatible RDB and verify old data is preserved after the failure of data loading. Thanks.
src/rdb.h
Outdated
ssize_t rdbSaveFunctions(rio *rdb); | ||
rdbSaveInfo *rdbPopulateSaveInfo(rdbSaveInfo *rsi); | ||
int saveSnapshotToConnectionSockets(rdbSnapshotOptions options); | ||
void replicationEmptyDbCallback(hashtable *d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
void replicationEmptyDbCallback(hashtable *d); | |
void replicationEmptyDbCallback(hashtable *ht); |
src/rdb.c
Outdated
* otherwise C_ERR is returned. | ||
/* Load an RDB file from the rio stream 'rdb'. We return one of the following: | ||
* - RDB_OK On success | ||
* - RDB_INCOMPATIBLE If the RDB is has an invalid signature or version |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* - RDB_INCOMPATIBLE If the RDB is has an invalid signature or version | |
* - RDB_INCOMPATIBLE If the RDB has an invalid signature or version |
src/rdb.c
Outdated
* - RDB_OK On success | ||
* - RDB_INCOMPATIBLE If the RDB is has an invalid signature or version | ||
* - RDB_FAILED | ||
* if the RDB is has an invalid signature or version we return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems duplicate of the above. We could maybe mention in all other failure cases.
src/replication.c
Outdated
emptyData(-1, empty_db_flags, replicationEmptyDbCallback); | ||
/* If we received RDB_INCOMPATIBLE, the old data was preserved */ | ||
if (retval == RDB_INCOMPATIBLE) { | ||
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: No data was loaded, skipping discard step"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: No data was loaded, skipping discard step"); | |
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: RDB version incompatible, preserving old data"); |
src/server.h
Outdated
#define RDB_OK 0 | ||
#define RDB_NOT_EXIST 1 /* RDB file doesn't exist. */ | ||
#define RDB_FAILED 2 /* Failed to load the RDB file. */ | ||
#define RDB_INCOMPATIBLE 3/* RDB is incompatible with this version */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this formatted correctly? Seems like a space should be followed after the end of the macro.
src/server.h
Outdated
int replicationCountAcksByOffset(long long offset); | ||
int replicationCountAOFAcksByOffset(long long offset); | ||
void replicationSendNewlineToPrimary(void); | ||
void replicationEmptyDbCallback(hashtable *d); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to define it here as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, will remove
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2600 +/- ##
============================================
+ Coverage 72.22% 72.24% +0.01%
============================================
Files 127 127
Lines 70777 70827 +50
============================================
+ Hits 51119 51166 +47
- Misses 19658 19661 +3
🚀 New features to boost your workflow:
|
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
89ffc58
to
2d5a96c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks LGTM. Let's figure out a mechanism to add test to verify flush data doesn't cause cleanup of data.
@zuiderkwast / @enjoy-binbin Please take a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good in general.
/* Remove the half-loaded data in case the load failed for other reasons. */ | ||
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: Discarding the half-loaded data"); | ||
emptyData(-1, empty_db_flags, replicationEmptyDbCallback); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bracket is mis-indented.
int empty_data_flag = server.repl_diskless_load != REPL_DISKLESS_LOAD_SWAPDB ? RDBFLAGS_EMPTY_DATA : RDBFLAGS_NONE; | ||
int retval = rdbLoadRioWithLoadingCtxScopedRdb(&rdb, RDBFLAGS_REPLICATION | empty_data_flag, rsi, &loadingCtx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this logic look simpler if we use a flags variable and conditionally add the EMPTY_DATA flag to it? I think so...
int empty_data_flag = server.repl_diskless_load != REPL_DISKLESS_LOAD_SWAPDB ? RDBFLAGS_EMPTY_DATA : RDBFLAGS_NONE; | |
int retval = rdbLoadRioWithLoadingCtxScopedRdb(&rdb, RDBFLAGS_REPLICATION | empty_data_flag, rsi, &loadingCtx); | |
int flags = RDBFLAGS_REPLICATION; | |
if (server.repl_diskless_load != REPL_DISKLESS_LOAD_SWAPDB) flags |= RDBFLAGS_EMPTY_DATA; | |
int retval = rdbLoadRioWithLoadingCtxScopedRdb(&rdb, flags, rsi, &loadingCtx); |
Addresses: #2588
Overview
Previously we call
emptyData()
during a fullSync before validating the RDB version is compatible.This change adds an rdb flag that allows us to flush the database from within
rdbLoadRioWithLoadingCtx
. THhis provides the option to only flush the data if the rdb has a valid version and signature. In the case where we do have an invalid version and signature, we don't emptyData, so if a full sync fails for that reason a replica can still serve stale data instead of clients experiencing cache misses.Changes
RDBFLAGS_EMPTY_DATA
that signals to flush the database after rdb validationemptyData
inrdbLoadRioWithLoadingCtx
inrdb.c
replication.c
using new return typeRDB_INCOMPATIBLE
rdbLoadRioWithLoadingCtx
to return RDB success codes and updated all calling sites.Testing
It's difficult to test that the data is NOT emptied in the tcl testing framework because we need a full sync with a corrupted rdb, and the rdb's for full sync are loaded from another node. To test this locally, I compiled a version that creates RDBs with a bad magic string

"BADKEY"
instead of"VALKEY"
and forced a full sync on a replica withconfig set repl-backlog-size 1
. The result is in the screenshot below.We can see that we avoided flushing the data when the RDB wasn't compatible.
A test already exists that checks that the data is flushed: https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504