@y-rabie
Contributor

@y-rabie y-rabie commented Oct 16, 2025

What type of PR is this?

fix: keep sending unready/non-serving endpoints as unhealthy

What this PR does / why we need it:

Currently, unready endpoints (those with serving=false or ready=false) are not included in the upstream cluster sent to the data plane.

This is problematic because it means we can never truly reach panic mode when a large number of pods become unready (by failing their k8s readiness probe). We should keep sending those endpoints as unhealthy/draining, just as we already do for terminating endpoints.
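A minimal sketch of the mapping described above, not the actual code in this PR. The `Conditions` and `HealthStatus` types and the `healthStatus` helper are hypothetical stand-ins for the EndpointSlice conditions and the health status sent to the data plane. Treating a nil `Ready` as true, letting a nil `Serving` defer to `Ready`, and reporting terminating endpoints as draining are assumptions of this sketch.

```go
package main

import "fmt"

// Conditions is a hypothetical stand-in for the optional booleans
// on an EndpointSlice endpoint; nil means the state is unknown.
type Conditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

// HealthStatus is a hypothetical stand-in for the per-endpoint
// health status sent to the data plane.
type HealthStatus string

const (
	Healthy   HealthStatus = "HEALTHY"
	Unhealthy HealthStatus = "UNHEALTHY"
	Draining  HealthStatus = "DRAINING"
)

// ptr returns a pointer to b, for building Conditions literals.
func ptr(b bool) *bool { return &b }

// healthStatus keeps every endpoint in the cluster and only varies
// the status it reports, instead of dropping unready endpoints.
func healthStatus(c Conditions) HealthStatus {
	// Assumption: terminating endpoints are reported as draining.
	if c.Terminating != nil && *c.Terminating {
		return Draining
	}
	// Assumption: unknown readiness is treated as ready.
	ready := c.Ready == nil || *c.Ready
	// Assumption: unknown serving state defers to readiness.
	serving := ready
	if c.Serving != nil {
		serving = *c.Serving
	}
	if !ready || !serving {
		return Unhealthy
	}
	return Healthy
}

func main() {
	fmt.Println(healthStatus(Conditions{Ready: ptr(true), Serving: ptr(true)}))
	fmt.Println(healthStatus(Conditions{Ready: ptr(false), Serving: ptr(false)}))
	fmt.Println(healthStatus(Conditions{Ready: ptr(false), Serving: ptr(true), Terminating: ptr(true)}))
}
```

The point of the sketch is that an endpoint failing its readiness probe stays in the host set with an unhealthy status, so the data plane can still count it when computing the healthy percentage.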

Which issue(s) this PR fixes:

Fixes #

Release Notes: Yes/No

@y-rabie y-rabie requested a review from a team as a code owner October 16, 2025 10:36
@codecov

codecov bot commented Oct 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.06%. Comparing base (70af785) to head (df433e0).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7253      +/-   ##
==========================================
- Coverage   71.07%   71.06%   -0.01%     
==========================================
  Files         228      228              
  Lines       40825    40825              
==========================================
- Hits        29015    29012       -3     
- Misses      10104    10105       +1     
- Partials     1706     1708       +2     


@y-rabie
Contributor Author

y-rabie commented Oct 16, 2025

/retest

@jukie
Contributor

jukie commented Oct 21, 2025

I've been trying to find the GitHub issue where this was discussed, but I believe this behavior was in place in the past and was then intentionally changed to remove non-ready endpoints completely.

@arkodg
Contributor

arkodg commented Oct 21, 2025

-1 on this change, I'm unable to understand how delaying endpoint removal is helpful.

@y-rabie
Contributor Author

y-rabie commented Oct 21, 2025

@arkodg I'm not sure what you mean by "delaying the endpoint removal". It seems to imply that endpoints become unready only before termination, which is not true: endpoints can fail their readiness probe, become unready, and then become ready again.

Pods can fail readiness probe for whatever reason (think upstream service dependencies, in our case, that's database connection). If we get 50% of pods unready once, those pods are removed and only the 50% healthy ones remain. How is panic mode triggered in this case? It's not. And consequently you risk overwhelming the 50% of healthy pods, causing cascading failures (in the worst case, reaching 0 ready pods and returning 500).

With the current behavior, we're forced to give up our k8s readiness probes and depend entirely on Envoy's active health checks if we want to use panic mode. That is not a justified constraint, since other parts of the system depend on readiness probes, not just traffic (e.g., metrics are scraped from ready pods only).
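To make the arithmetic concrete, here is a rough sketch of the healthy-percentage check behind panic mode. It is an illustration only: `healthyPercent` and `inPanic` are hypothetical helpers, the pod counts are made up, and the 50% threshold is Envoy's default healthy panic threshold.

```go
package main

import "fmt"

// healthyPercent returns the share of hosts in the cluster that are
// healthy, in percent. An empty cluster is treated as 0% healthy,
// which is an assumption of this sketch.
func healthyPercent(healthy, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(healthy) / float64(total)
}

// inPanic reports whether the healthy share has dropped below the
// panic threshold (also in percent).
func inPanic(healthy, total int, thresholdPercent float64) bool {
	return healthyPercent(healthy, total) < thresholdPercent
}

func main() {
	const (
		pods      = 10   // hypothetical number of backend pods
		unready   = 6    // pods failing their readiness probe
		threshold = 50.0 // Envoy's default panic threshold
	)
	ready := pods - unready

	// Unready endpoints removed: the cluster holds only the 4 ready
	// hosts, so it always looks 100% healthy.
	fmt.Println(inPanic(ready, ready, threshold)) // false

	// Unready endpoints kept as unhealthy: 4 of 10 hosts are healthy,
	// which is below the threshold.
	fmt.Println(inPanic(ready, pods, threshold)) // true
}
```

When unready endpoints are dropped, the denominator shrinks together with the numerator, so the ratio can never fall below the threshold no matter how many pods fail their probes.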

@arkodg
Contributor

arkodg commented Oct 21, 2025

> Pods can fail readiness probe for whatever reason (think upstream service dependencies, in our case, that's database connection). If we get 50% of pods unready once, those pods are removed and only the 50% healthy ones remain. How is panic mode triggered in this case? It's not. And consequently you risk overwhelming the 50% of healthy pods, causing cascading failures (in the worst case, reaching 0 ready pods and returning 500).

What is your definition of panic mode here? If the health of those ready pods is degraded due to being overloaded, you can:

  1. proactively deal with that using circuit breakers for existing pods
  2. scale out backend pods
  3. set the panic threshold in the health check (https://gateway.envoyproxy.io/docs/api/extension_types/#healthcheck) based on your use case

@y-rabie
Contributor Author

y-rabie commented Oct 21, 2025

@arkodg Let me rephrase my question: how is panic mode ever triggered if you keep removing unhealthy endpoints, so that all you have left are healthy endpoints 😅?

@jukie
Contributor

jukie commented Oct 21, 2025

In the current state, if a pod becomes unready (or even starts being terminated by k8s) but has open connections and is then removed from xDS, does Envoy still close those connections gracefully, or does it sever them disruptively?

@jukie
Contributor

jukie commented Oct 21, 2025

The panic mode case does make sense to me though, @arkodg. The panic threshold currently wouldn't work unless the pod's readiness check is different from the configured active health check, and specifically the active health check is what begins failing.

@y-rabie
Contributor Author

y-rabie commented Oct 21, 2025

@jukie I've tested it before, and it seems any in-flight requests are completed, so yes, gracefully. But my angle here is purely about panic mode.

3 participants