Skip to content

Conversation

ChiliPaneer
Copy link

@ChiliPaneer ChiliPaneer commented Sep 10, 2025

Addresses: #2588

Overview

Previously we call emptyData() during a fullSync before validating the RDB version is compatible.

This change adds an rdb flag that allows us to flush the database from within rdbLoadRioWithLoadingCtx. THhis provides the option to only flush the data if the rdb has a valid version and signature. In the case where we do have an invalid version and signature, we don't emptyData, so if a full sync fails for that reason a replica can still serve stale data instead of clients experiencing cache misses.

Changes

  • Added a new flag RDBFLAGS_EMPTY_DATA that signals to flush the database after rdb validation
  • Added logic to call emptyData in rdbLoadRioWithLoadingCtx in rdb.c
  • Added logic to not clear data if the RDB validation fails in replication.c using new return type RDB_INCOMPATIBLE
  • Modified the signature of rdbLoadRioWithLoadingCtx to return RDB success codes and updated all calling sites.

Testing

It's difficult to test that the data is NOT emptied in the tcl testing framework because we need a full sync with a corrupted rdb, and the rdb's for full sync are loaded from another node. To test this locally, I compiled a version that creates RDBs with a bad magic string "BADKEY" instead of "VALKEY" and forced a full sync on a replica with config set repl-backlog-size 1. The result is in the screenshot below.
image

We can see that we avoided flushing the data when the RDB wasn't compatible.

A test already exists that checks that the data is flushed: https://github.com/valkey-io/valkey/blob/unstable/tests/integration/replication.tcl#L1504

@ChiliPaneer ChiliPaneer marked this pull request as ready for review September 10, 2025 23:53
@ChiliPaneer ChiliPaneer marked this pull request as draft September 10, 2025 23:54
@ChiliPaneer ChiliPaneer force-pushed the emptyData_after_rdb_validation branch from 5acfe7b to 0daaa13 Compare September 11, 2025 00:09
if (!rdbIsVersionAccepted(rdbver, is_valkey_magic, is_redis_magic)) {
serverLog(LL_WARNING, "Can't handle RDB format version %d", rdbver);
return C_ERR;
return RDB_INCOMPATIBLE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can look at rdb.tcl / replication.tcl to add integration tests

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed above, but testing this is challenging with the TCL framework. I've reproducible demo of the fix, and there exist tests that cover the code paths already. If anyone has any ideas for covering an incompatible RDB load during a full sync, I'd be open to suggesitons.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will look into adding a change to debug.c that allows us to hit this code path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hit this code path also when loading an RDB from a file? Then we can test this instead of a full sync. We can copy a file into the right place and then call DEBUG RELOAD NOSAVE to load it. If the load fails because of bad signature or future version, we shouldn't flush the databases. Is this right?

There are some RDB files here, for example one with the hyptetical future RDB version 987: https://github.com/valkey-io/valkey/tree/unstable/tests/assets

@ChiliPaneer ChiliPaneer force-pushed the emptyData_after_rdb_validation branch 2 times, most recently from 28a58fb to 7a5c3a9 Compare September 12, 2025 10:51
@ChiliPaneer ChiliPaneer marked this pull request as ready for review September 12, 2025 10:57
@hpatro hpatro requested review from hpatro and removed request for KarthikSubbarao September 12, 2025 17:39
src/rdb.c Outdated
} else {
serverLog(LL_WARNING, "Wrong signature trying to load DB from file");
return C_ERR;
/* Signal to terminate gracefully without clearing existing data */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update the comment here? Just don't want this (terminate gracefully ) to be misinterpreted as a shutdown

src/rdb.c Outdated
/* Only empty data if RDBFLAGS_EMPTY_DATA is set */
if (rdbflags & RDBFLAGS_EMPTY_DATA) {
int empty_db_flags = server.repl_replica_lazy_flush ? EMPTYDB_ASYNC : EMPTYDB_NO_FLAGS;
serverLog(LL_NOTICE, "RDB compatability check passed. Flushing old data");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This is more of a RDB signature and version validation. It does not check for full compatibility since unknown Module data types may exist in an RDB with valid signature & version.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @ChiliPaneer, with this change we are ensuring data flush happens only on verification of RDB compatibility.

Let's add a test case with incompatible RDB and verify old data is preserved after the failure of data loading. Thanks.

src/rdb.h Outdated
ssize_t rdbSaveFunctions(rio *rdb);
rdbSaveInfo *rdbPopulateSaveInfo(rdbSaveInfo *rsi);
int saveSnapshotToConnectionSockets(rdbSnapshotOptions options);
void replicationEmptyDbCallback(hashtable *d);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
void replicationEmptyDbCallback(hashtable *d);
void replicationEmptyDbCallback(hashtable *ht);

src/rdb.c Outdated
* otherwise C_ERR is returned.
/* Load an RDB file from the rio stream 'rdb'. We return one of the following:
* - RDB_OK On success
* - RDB_INCOMPATIBLE If the RDB is has an invalid signature or version
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* - RDB_INCOMPATIBLE If the RDB is has an invalid signature or version
* - RDB_INCOMPATIBLE If the RDB has an invalid signature or version

src/rdb.c Outdated
* - RDB_OK On success
* - RDB_INCOMPATIBLE If the RDB is has an invalid signature or version
* - RDB_FAILED
* if the RDB is has an invalid signature or version we return
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems duplicate of the above. We could maybe mention in all other failure cases.

emptyData(-1, empty_db_flags, replicationEmptyDbCallback);
/* If we received RDB_INCOMPATIBLE, the old data was preserved */
if (retval == RDB_INCOMPATIBLE) {
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: No data was loaded, skipping discard step");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: No data was loaded, skipping discard step");
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: RDB version incompatible, preserving old data");

src/server.h Outdated
#define RDB_OK 0
#define RDB_NOT_EXIST 1 /* RDB file doesn't exist. */
#define RDB_FAILED 2 /* Failed to load the RDB file. */
#define RDB_INCOMPATIBLE 3/* RDB is incompatible with this version */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this formatted correctly? Seems like a space should be followed after the end of the macro.

src/server.h Outdated
int replicationCountAcksByOffset(long long offset);
int replicationCountAOFAcksByOffset(long long offset);
void replicationSendNewlineToPrimary(void);
void replicationEmptyDbCallback(hashtable *d);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to define it here as well ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, will remove

Copy link

codecov bot commented Sep 16, 2025

Codecov Report

❌ Patch coverage is 78.12500% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.24%. Comparing base (3213f48) to head (68950b5).
⚠️ Report is 5 commits behind head on unstable.

Files with missing lines Patch % Lines
src/replication.c 61.53% 5 Missing ⚠️
src/rdb.c 87.50% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2600      +/-   ##
============================================
+ Coverage     72.22%   72.24%   +0.01%     
============================================
  Files           127      127              
  Lines         70777    70827      +50     
============================================
+ Hits          51119    51166      +47     
- Misses        19658    19661       +3     
Files with missing lines Coverage Δ
src/aof.c 81.10% <100.00%> (ø)
src/debug.c 53.94% <100.00%> (+0.04%) ⬆️
src/module.c 9.79% <ø> (+0.09%) ⬆️
src/rdb.h 100.00% <ø> (ø)
src/server.h 100.00% <ø> (ø)
src/rdb.c 76.98% <87.50%> (-0.14%) ⬇️
src/replication.c 86.85% <61.53%> (+0.05%) ⬆️

... and 15 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Venkat Pamulapati added 9 commits September 22, 2025 23:58
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Signed-off-by: Venkat Pamulapati <[email protected]>
Venkat Pamulapati added 2 commits September 22, 2025 23:58
@ChiliPaneer ChiliPaneer force-pushed the emptyData_after_rdb_validation branch from 89ffc58 to 2d5a96c Compare September 22, 2025 23:58
@ChiliPaneer ChiliPaneer requested a review from hpatro September 23, 2025 00:04
Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks LGTM. Let's figure out a mechanism to add test to verify flush data doesn't cause cleanup of data.

@hpatro
Copy link
Collaborator

hpatro commented Sep 23, 2025

@zuiderkwast / @enjoy-binbin Please take a look.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general.

/* Remove the half-loaded data in case the load failed for other reasons. */
serverLog(LL_NOTICE, "PRIMARY <-> REPLICA sync: Discarding the half-loaded data");
emptyData(-1, empty_db_flags, replicationEmptyDbCallback);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bracket is mis-indented.

Comment on lines +2300 to +2301
int empty_data_flag = server.repl_diskless_load != REPL_DISKLESS_LOAD_SWAPDB ? RDBFLAGS_EMPTY_DATA : RDBFLAGS_NONE;
int retval = rdbLoadRioWithLoadingCtxScopedRdb(&rdb, RDBFLAGS_REPLICATION | empty_data_flag, rsi, &loadingCtx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this logic look simpler if we use a flags variable and conditionally add the EMPTY_DATA flag to it? I think so...

Suggested change
int empty_data_flag = server.repl_diskless_load != REPL_DISKLESS_LOAD_SWAPDB ? RDBFLAGS_EMPTY_DATA : RDBFLAGS_NONE;
int retval = rdbLoadRioWithLoadingCtxScopedRdb(&rdb, RDBFLAGS_REPLICATION | empty_data_flag, rsi, &loadingCtx);
int flags = RDBFLAGS_REPLICATION;
if (server.repl_diskless_load != REPL_DISKLESS_LOAD_SWAPDB) flags |= RDBFLAGS_EMPTY_DATA;
int retval = rdbLoadRioWithLoadingCtxScopedRdb(&rdb, flags, rsi, &loadingCtx);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

4 participants