-
Notifications
You must be signed in to change notification settings - Fork 927
Fixes req-res-log-validator to skip unexpected client replies #2732
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes req-res-log-validator to skip unexpected client replies #2732
Conversation
Signed-off-by: Sarthak Aggarwal <[email protected]>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## unstable #2732 +/- ##
============================================
- Coverage 72.61% 72.60% -0.01%
============================================
Files 128 128
Lines 71278 71278
============================================
- Hits 51755 51750 -5
- Misses 19523 19528 +5 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not very familiar with this file. So, some rudimentary questions.
Is this a unique problem to blocked clients ? Would be nice if you could add why did we start observing this and what does this change does? Why is it under Request
flow? -MOVED *
shouldn't it be a response generated by the engine.
Thanks @hpatro for taking a look! If my understanding is correct, after #2329, we started returning The validator fails, because it is expecting to parse a request, but ends up with out of line |
# When blocked clients are redirected after a failover they may | ||
# receive an async MOVED error that is logged without their request. | ||
# These responses start with a RESP type prefix. Rewind the cursor | ||
# to skip it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are the MOVED errors logged without their request? Isn't that the source of this error? Can we fix that instead when we log the requests and responses?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the 1:1 match between request logging and response logging is broken.
- reqresAppendRequest() is called in processCommand() in server.c if not reprocessing command
- reqresAppendResponse() is called in commandProcessed() in networking.c if the client is not blocked
In the PR that introduced this bug, on these lines https://github.com/valkey-io/valkey/pull/2329/files#diff-ddd6be3a7a9289fbf41890c34da16f1ea5b0a8d6d9206ae1ff5303e50bae1970R342 we call clusterRedirectBlockedClientIfNeeded()
and directly afterwards unblockClient()
and then the command is reprocessed. This means it gets one more reply?
Compare to the standalone case, we send the -REDIRECT error and then call unblockClientOnError(), which sets c->flag.pending_command = 0;
to prevent it from being reprocessed.
I think we need to set c->flag.pending_command = 0;
also in the cluster case after sending MOVED. Otherwise I suspect we actually have a bug that we will send two replies, which is a real bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I have now verified this. The client gets two MOVED redirects for one request!
I verified by modifying this test case and it still passes:
diff --git a/tests/unit/cluster/replica-redirect.tcl b/tests/unit/cluster/replica-redirect.tcl
index ac9c8bd30..70fba8fe4 100644
--- a/tests/unit/cluster/replica-redirect.tcl
+++ b/tests/unit/cluster/replica-redirect.tcl
@@ -38,6 +38,7 @@ start_cluster 1 1 {tags {external:skip cluster}} {
# Check that the client blocking on the old primary was MOVED to the new primary.
assert_error "MOVED *" {$rd0 read}
+ assert_error "MOVED *" {$rd0 read}
# Check that the readonly client blocking on the old primary is still blocked.
assert_equal 1 [s 0 blocked_clients]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @zuiderkwast for looking and explaining this. I will be closing this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @sarthakaggarwal97! It was your comment "they may receive an async MOVED error that is logged without their request" that led me to it.
Resolves #2676
The workflow is passing after the change in my local repo.