Skip to content

Conversation

sarthakaggarwal97
Copy link
Contributor

Resolves #2676

The workflow is passing after the change in my local repo.

Signed-off-by: Sarthak Aggarwal <[email protected]>
Copy link

codecov bot commented Oct 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.60%. Comparing base (28e5dcc) to head (bf74e8a).
⚠️ Report is 1 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2732      +/-   ##
============================================
- Coverage     72.61%   72.60%   -0.01%     
============================================
  Files           128      128              
  Lines         71278    71278              
============================================
- Hits          51755    51750       -5     
- Misses        19523    19528       +5     

see 11 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Collaborator

@hpatro hpatro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very familiar with this file. So, some rudimentary questions.

Is this a unique problem to blocked clients ? Would be nice if you could add why did we start observing this and what does this change does? Why is it under Request flow? -MOVED * shouldn't it be a response generated by the engine.

@sarthakaggarwal97
Copy link
Contributor Author

Thanks @hpatro for taking a look! If my understanding is correct, after #2329, we started returning -MOVED to blocked clients after the change of role in cluster mode outside of normal request / response flow.

The validator fails, because it is expecting to parse a request, but ends up with out of line -MOVED error. Essentially, -MOVED is still a response from server, just that the validator was expecting a request.

Comment on lines +93 to +96
# When blocked clients are redirected after a failover they may
# receive an async MOVED error that is logged without their request.
# These responses start with a RESP type prefix. Rewind the cursor
# to skip it.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the MOVED errors logged without their request? Isn't that the source of this error? Can we fix that instead when we log the requests and responses?

Copy link
Contributor

@zuiderkwast zuiderkwast Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the 1:1 match between request logging and response logging is broken.

  • reqresAppendRequest() is called in processCommand() in server.c if not reprocessing command
  • reqresAppendResponse() is called in commandProcessed() in networking.c if the client is not blocked

In the PR that introduced this bug, on these lines https://github.com/valkey-io/valkey/pull/2329/files#diff-ddd6be3a7a9289fbf41890c34da16f1ea5b0a8d6d9206ae1ff5303e50bae1970R342 we call clusterRedirectBlockedClientIfNeeded() and directly afterwards unblockClient() and then the command is reprocessed. This means it gets one more reply?

Compare to the standalone case, we send the -REDIRECT error and then call unblockClientOnError(), which sets c->flag.pending_command = 0; to prevent it from being reprocessed.

I think we need to set c->flag.pending_command = 0; also in the cluster case after sending MOVED. Otherwise I suspect we actually have a bug that we will send two replies, which is a real bug.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I have now verified this. The client gets two MOVED redirects for one request!

I verified by modifying this test case and it still passes:

diff --git a/tests/unit/cluster/replica-redirect.tcl b/tests/unit/cluster/replica-redirect.tcl
index ac9c8bd30..70fba8fe4 100644
--- a/tests/unit/cluster/replica-redirect.tcl
+++ b/tests/unit/cluster/replica-redirect.tcl
@@ -38,6 +38,7 @@ start_cluster 1 1 {tags {external:skip cluster}} {
 
         # Check that the client blocking on the old primary was MOVED to the new primary.
         assert_error "MOVED *" {$rd0 read}
+        assert_error "MOVED *" {$rd0 read}
 
         # Check that the readonly client blocking on the old primary is still blocked.
         assert_equal 1 [s 0 blocked_clients]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @zuiderkwast for looking and explaining this. I will be closing this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sarthakaggarwal97! It was your comment "they may receive an async MOVED error that is logged without their request" that led me to it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[TEST-FAILURE] reply-schemas-validator failing in daily test

3 participants