fix: keep sending unready/non-serving endpoints as unhealthy #7253
base: main
Conversation
Signed-off-by: y-rabie <[email protected]>
Codecov Report ✅ All modified and coverable lines are covered by tests.

Coverage Diff (main vs. #7253):

|          | main   | #7253  | +/-    |
|----------|--------|--------|--------|
| Coverage | 71.07% | 71.06% | -0.01% |
| Files    | 228    | 228    |        |
| Lines    | 40825  | 40825  |        |
| Hits     | 29015  | 29012  | -3     |
| Misses   | 10104  | 10105  | +1     |
| Partials | 1706   | 1708   | +2     |
/retest
I've been trying to find the GitHub issue where it was discussed, but I believe this behavior was in place in the past and was then intentionally changed to completely removing any non-ready endpoints.
-1 on this change, I'm unable to understand how delaying endpoint removal is helpful.
@arkodg I'm not sure what you mean by "delaying the endpoint removal", but it seems to imply that endpoints only become unready right before termination, which is not true: endpoints can fail their readiness probe, become unready, and then become ready again. Pods can fail readiness probes for any number of reasons (upstream service dependencies; in our case, the database connection). If 50% of the pods become unready at once, those pods are removed and only the 50% healthy ones remain. How is panic mode triggered in this case? It isn't, and consequently you risk overwhelming the 50% of healthy pods and causing cascading failures (in the worst case, reaching 0 ready pods and returning 500s). In the current state, we are forced to give up our k8s readiness probes and depend entirely on Envoy's active health checks if we want to use panic mode. But that is not a justified constraint, since other parts of the system depend on readiness probes, not just traffic (e.g., metrics are scraped from ready pods only).
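(For context, a minimal sketch of the ratio panic mode keys on; the helper and the numbers are illustrative, not code from Envoy or this repo, and assume the default 50% panic threshold.)

```go
package main

import "fmt"

// panicModeWouldTrigger reports whether Envoy-style panic routing would kick
// in: panic mode activates when the share of healthy hosts in a cluster drops
// below the configured panic threshold (50% by default).
func panicModeWouldTrigger(healthy, total int, threshold float64) bool {
	if total == 0 {
		return false
	}
	return float64(healthy)/float64(total) < threshold
}

func main() {
	// If unready endpoints are dropped from EDS entirely, the data plane only
	// ever sees the healthy hosts, so the ratio never falls: no panic.
	fmt.Println(panicModeWouldTrigger(4, 4, 0.5)) // false

	// If the same unready endpoints are kept but marked unhealthy, the ratio
	// reflects reality and panic mode can engage.
	fmt.Println(panicModeWouldTrigger(4, 10, 0.5)) // true
}
```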
What is your definition of panic mode here? If the health of those ready pods is degraded due to being overloaded, you can
@arkodg Let me rephrase my question: how is panic mode ever triggered if you keep removing unhealthy endpoints and all you have left are healthy endpoints 😅?
In the current state, if a pod becomes unready (or even starts terminating in k8s) but has open connections and is then removed from xDS, does Envoy still close those connections gracefully, or does it sever them abruptly?
The panic mode case does make sense to me though, @arkodg. The panic threshold currently wouldn't work unless the pod's readiness check differs from the configured active health check, and specifically the active health check is what starts failing.
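(For reference, a hedged sketch of the two knobs in question, the cluster-level panic threshold and an active HTTP health check, using go-control-plane types; the names and values are illustrative, not code from this repo.)

```go
package main

import (
	"time"

	clusterv3 "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	typev3 "github.com/envoyproxy/go-control-plane/envoy/type/v3"
	"google.golang.org/protobuf/types/known/durationpb"
	"google.golang.org/protobuf/types/known/wrapperspb"
)

// exampleCluster builds a hypothetical cluster showing where the panic
// threshold and an active health check live in the xDS config.
func exampleCluster() *clusterv3.Cluster {
	return &clusterv3.Cluster{
		Name: "backend", // hypothetical cluster name
		// Panic routing engages when the healthy-host percentage drops below
		// this threshold, which can only happen if unhealthy hosts are
		// actually present in the cluster's endpoint set.
		CommonLbConfig: &clusterv3.Cluster_CommonLbConfig{
			HealthyPanicThreshold: &typev3.Percent{Value: 50},
		},
		// Active health check: Envoy probes the endpoints itself,
		// independently of the pod's Kubernetes readiness probe.
		HealthChecks: []*corev3.HealthCheck{{
			Timeout:            durationpb.New(1 * time.Second),
			Interval:           durationpb.New(5 * time.Second),
			UnhealthyThreshold: wrapperspb.UInt32(3),
			HealthyThreshold:   wrapperspb.UInt32(1),
			HealthChecker: &corev3.HealthCheck_HttpHealthCheck_{
				HttpHealthCheck: &corev3.HealthCheck_HttpHealthCheck{Path: "/healthz"},
			},
		}},
	}
}

func main() { _ = exampleCluster() }
```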
@jukie I've tested it before, and it seems any in-flight requests are completed, so yes, gracefully. But my angle here is purely panic-mode related.
What type of PR is this?
fix: keep sending unready/non-serving endpoints as unhealthy
What this PR does / why we need it:
Currently, unready endpoints (those with `serving=false` or `ready=false`) are not included in the upstream cluster sent to the data plane. This is problematic, since it means we can never truly reach panic mode when a large number of pods become unready (by failing their k8s readiness probe). We should keep sending those endpoints as unhealthy/draining, similar to what we already do when they are terminating.
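A minimal sketch of the mapping this describes (illustrative only, using go-control-plane and EndpointSlice types; `healthStatusFor` and `lbEndpoint` are hypothetical helpers, not the PR's actual diff):

```go
package translator

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
	endpointv3 "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
	discoveryv1 "k8s.io/api/discovery/v1"
)

// healthStatusFor maps EndpointSlice conditions to an xDS health status:
// instead of dropping endpoints with ready=false or serving=false, keep them
// and mark them UNHEALTHY so the data plane's healthy-host ratio (and panic
// mode) can account for them; terminating endpoints stay DRAINING.
func healthStatusFor(c discoveryv1.EndpointConditions) corev3.HealthStatus {
	isTrue := func(b *bool) bool { return b != nil && *b }
	switch {
	case isTrue(c.Terminating):
		return corev3.HealthStatus_DRAINING
	case !isTrue(c.Ready) || !isTrue(c.Serving):
		return corev3.HealthStatus_UNHEALTHY
	default:
		return corev3.HealthStatus_HEALTHY
	}
}

// lbEndpoint wraps an address into an LbEndpoint carrying that health status.
func lbEndpoint(addr string, port uint32, status corev3.HealthStatus) *endpointv3.LbEndpoint {
	return &endpointv3.LbEndpoint{
		HealthStatus: status,
		HostIdentifier: &endpointv3.LbEndpoint_Endpoint{
			Endpoint: &endpointv3.Endpoint{
				Address: &corev3.Address{
					Address: &corev3.Address_SocketAddress{
						SocketAddress: &corev3.SocketAddress{
							Address:       addr,
							PortSpecifier: &corev3.SocketAddress_PortValue{PortValue: port},
						},
					},
				},
			},
		},
	}
}
```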
Which issue(s) this PR fixes:
Fixes #
Release Notes: Yes/No