vtorc: require topo for Healthy: true in /debug/health #17129
base: main
Conversation
Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

- General
- Tests
- Documentation
- New flags
- If a workflow is added or modified
- Backward compatibility
Codecov Report

Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main   #17129   +/- ##
=======================================
  Coverage   67.40%   67.40%
=======================================
  Files        1570     1570
  Lines      252903   252909      +6
=======================================
+ Hits       170460   170465      +5
- Misses      82443    82444      +1

☔ View full report in Codecov by Sentry.
Added backport labels because I think it's unsafe for VTOrc to report that it is healthy when it cannot see the topo. This could lead to outages if an invalid vtorc config is deployed and left running for some time.
Force-pushed from a8d6422 to 81e8f81
Signed-off-by: Tim Vaillancourt <[email protected]>
Force-pushed from 8c738f9 to 621e2a6
The change makes sense to me. I do have one question inline.
Re: backports, deferring to @deepthi
// We refresh all information from the topo once before we start the ticks to do it on a timer.
populateAllInformation()
// We refresh all information from the topo once before we start the ticks to do
// it on a timer. We can wait forever (context.Background()) for this call.
Can you please explain what you mean by "We can wait forever (context.Background()) for this call"?
@shlomi-noach this comment is confusing, yeah.

What I intended with adding the context.Context is to prevent ticks from overlapping here (currently there is no timeout).

This area of the code runs once before the ticker starts at vtorc init time, so I figured waiting forever had no risk. Adding a timeout here as well wouldn't hurt either, I suppose.
is to prevent ticks from overlapping here
Sorry, not "overlapping" per se, but one loop iteration stalling things
I do prefer that we have some form of timeout -- let's say this errors and we cannot connect to topo -- is there going to be some retry? If yes, let's timeout. If not, let's consider our options.
@shlomi-noach makes sense, I've updated to use a timeout == topo.RemoteOperationTimeout
👍
Signed-off-by: Tim Vaillancourt <[email protected]>
Description
This PR addresses #17121 by requiring that VTOrc is able to reach the topology server at least once (even if the topo is empty) before returning
Healthy: true
in /debug/health. Today a VTOrc is considered healthy purely if it can write to its own database.

This is to prevent a
vtorc
deployment with a broken topo config from being seen as healthy, which in Kube can cause a bad config deploy to roll out to all nodes when it could have failed on the first node it breaks.

Also, some error logging was consolidated so we don't log the same topo failure multiple times for a single call (see below).
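The gating described above can be sketched like this. This is a minimal illustration, not VTOrc's real wiring: the `topoReachedOnce` flag, `markTopoLoaded`, and `debugHealth` names are assumptions for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync/atomic"
)

// topoReachedOnce flips to true after the first successful topo load.
// The variable name and health-check wiring are assumptions for this
// sketch; VTOrc's real implementation lives in its health/discovery code.
var topoReachedOnce atomic.Bool

type healthStatus struct {
	Healthy bool `json:"Healthy"`
}

// markTopoLoaded is called once the first refresh from the topo succeeds.
func markTopoLoaded() { topoReachedOnce.Store(true) }

// debugHealth reports Healthy: true only if the database is writable AND
// the topo has been read at least once, mirroring the behavior this PR
// adds to /debug/health.
func debugHealth(dbWritable bool) string {
	s := healthStatus{Healthy: dbWritable && topoReachedOnce.Load()}
	b, _ := json.Marshal(s)
	return string(b)
}

func main() {
	fmt.Println(debugHealth(true)) // {"Healthy":false} before topo is seen
	markTopoLoaded()
	fmt.Println(debugHealth(true)) // {"Healthy":true} after first topo load
}
```

A deployment with a broken topo config never calls `markTopoLoaded`, so its health endpoint keeps reporting unhealthy and a rollout can halt at the first bad node.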
Example with this change:

And in another shell:

(Previous to this PR,
"Healthy": true
was returned.)

Backport reason
I think this should be backported to protect users from the scenario described above. This issue could lead to a user believing a VTOrc deployment is healthy when it is actually unable to load anything from the topo.
Related Issue(s)
Resolves #17121
Checklist
Deployment Notes