fix: keep sending unready/non-serving endpoints as unhealthy #7253

y-rabie · 2025-10-16T10:36:26Z

What type of PR is this?

fix: keep sending unready/non-serving endpoints as unhealthy

What this PR does / why we need it:

Currently, unready endpoints (those with serving=false or ready=false) are not included in the upstream cluster sent to the data plane.

This is problematic because it means we can never truly reach panic mode when a large number of pods become unready (by failing their k8s readiness probe). We should keep sending those endpoints as unhealthy/draining, just as we do when they are terminating.
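As a rough illustration of the idea (not the actual diff in this PR), the sketch below classifies an endpoint by its conditions instead of dropping it. `EndpointConditions`, `HealthStatus`, and `Classify` are hypothetical names that only mirror the shape of the Kubernetes EndpointSlice conditions; treating a nil condition as ready/serving and not terminating is an assumption, and so is the exact status chosen for each case.

```go
package main

import "fmt"

// EndpointConditions is a hypothetical local copy of the three optional
// condition flags found on a Kubernetes EndpointSlice endpoint.
type EndpointConditions struct {
	Ready       *bool
	Serving     *bool
	Terminating *bool
}

// HealthStatus is a hypothetical stand-in for the status sent to the data plane.
type HealthStatus string

const (
	StatusHealthy   HealthStatus = "HEALTHY"
	StatusUnhealthy HealthStatus = "UNHEALTHY"
	StatusDraining  HealthStatus = "DRAINING"
)

// orDefault returns *b, or def when the condition is unknown (nil).
func orDefault(b *bool, def bool) bool {
	if b == nil {
		return def
	}
	return *b
}

// Classify keeps every endpoint and assigns it a status rather than
// filtering out the unready ones (assumed mapping, for illustration only).
func Classify(c EndpointConditions) HealthStatus {
	switch {
	case orDefault(c.Terminating, false):
		return StatusDraining
	case !orDefault(c.Ready, true) || !orDefault(c.Serving, true):
		return StatusUnhealthy
	default:
		return StatusHealthy
	}
}

func boolPtr(b bool) *bool { return &b }

func main() {
	unready := EndpointConditions{Ready: boolPtr(false), Serving: boolPtr(false)}
	fmt.Println(Classify(unready)) // prints "UNHEALTHY"
}
```

With a mapping like this, an unready endpoint still counts toward the cluster's total host count, which is what the panic-mode calculation needs.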

Which issue(s) this PR fixes:

Fixes #

Release Notes: Yes/No

Signed-off-by: y-rabie <[email protected]>

codecov · 2025-10-16T10:45:38Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 71.06%. Comparing base (70af785) to head (df433e0).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7253      +/-   ##
==========================================
- Coverage   71.07%   71.06%   -0.01%     
==========================================
  Files         228      228              
  Lines       40825    40825              
==========================================
- Hits        29015    29012       -3     
- Misses      10104    10105       +1     
- Partials     1706     1708       +2


y-rabie · 2025-10-16T12:17:38Z

/retest

jukie · 2025-10-21T01:34:24Z

I've been trying to find the GitHub issue where this was discussed, but I believe this behavior was in place in the past and was then intentionally changed to remove non-ready endpoints completely.

internal/gatewayapi/route_test.go

arkodg · 2025-10-21T01:47:17Z

-1 on this change. I'm unable to understand how delaying endpoint removal is helpful.

internal/gatewayapi/route.go

y-rabie · 2025-10-21T09:49:43Z

@arkodg I'm not sure what you mean by "delaying the endpoint removal", but it seems to imply that endpoints become unready only before termination, which is not true. Endpoints can fail their readiness probe, become unready, and then become ready again.

Pods can fail their readiness probe for any number of reasons (think upstream service dependencies; in our case, that's the database connection). If 50% of pods become unready at once, those pods are removed and only the healthy 50% remain. How is panic mode triggered in this case? It's not. Consequently, you risk overwhelming the healthy 50% of pods and causing cascading failures (in the worst case, reaching 0 ready pods and returning 500).

If we go by the current state, we're forced to give up our k8s readiness probes and depend entirely on Envoy's active health checks if we want to use panic mode. But this is not a justified constraint, since other parts of the system depend on readiness probes, not just traffic (e.g., metrics are scraped from ready pods only).
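The panic-mode arithmetic above can be made concrete with a small sketch. It assumes panic mode is entered when the percentage of healthy hosts falls strictly below the threshold, and uses 50% as the threshold (Envoy's documented default); `HealthyPercent` and `InPanic` are hypothetical helpers, not Envoy or Envoy Gateway APIs. The numbers (10 pods, 6 failing readiness) are made up for illustration.

```go
package main

import "fmt"

// HealthyPercent returns healthy hosts as a percentage of all hosts
// the load balancer knows about. An empty cluster yields 0.
func HealthyPercent(healthy, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(healthy) / float64(total)
}

// InPanic reports whether the healthy percentage is strictly below the
// threshold (assumed comparison; an empty cluster is never in panic here).
func InPanic(healthy, total int, thresholdPercent float64) bool {
	return total > 0 && HealthyPercent(healthy, total) < thresholdPercent
}

func main() {
	const threshold = 50.0

	// 10 pods, 6 of them fail their readiness probe.

	// Unready endpoints removed: the load balancer sees 4 hosts, all healthy.
	fmt.Println(InPanic(4, 4, threshold)) // prints "false"

	// Unready endpoints kept as unhealthy: 4 healthy out of 10 known hosts.
	fmt.Println(InPanic(4, 10, threshold)) // prints "true"
}
```

In the first case the healthy percentage is always 100% no matter how many pods fail, so the threshold can never be crossed; in the second case it drops to 40% and the threshold applies.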

arkodg · 2025-10-21T16:56:50Z

> Pods can fail their readiness probe for any number of reasons (think upstream service dependencies; in our case, that's the database connection). If 50% of pods become unready at once, those pods are removed and only the healthy 50% remain. How is panic mode triggered in this case? It's not. Consequently, you risk overwhelming the healthy 50% of pods and causing cascading failures (in the worst case, reaching 0 ready pods and returning 500).

What is your definition of panic mode here? If the health of those ready pods is degraded due to being overloaded, you can:

- proactively deal with that using circuit breakers for existing pods
- scale out backend pods
- set the panic threshold in the health check (https://gateway.envoyproxy.io/docs/api/extension_types/#healthcheck) based on your use case

y-rabie · 2025-10-21T17:12:36Z

@arkodg Let me rephrase my question: how is panic mode ever triggered if you keep removing unhealthy endpoints, and all you have are healthy endpoints 😅?

jukie · 2025-10-21T20:13:08Z

In the current state, if a pod becomes unready (or even starts being terminated by k8s) but has open connections and is then removed from xDS, does Envoy still gracefully close those connections, or does it sever them disruptively?

jukie · 2025-10-21T20:14:33Z

The panic mode case does make sense to me though, @arkodg. The panic threshold currently wouldn't work unless the pod's readiness check is different from the configured active health check, and it is specifically the active health check that begins failing.

y-rabie · 2025-10-21T20:15:16Z

@jukie I've tested it before, and it seems any in-flight requests are completed, so yes, gracefully. But my angle here is purely panic-mode related.

fix: keep sending unready/non-serving endpoints as unhealthy

66df46d

Signed-off-by: y-rabie <[email protected]>

y-rabie requested a review from a team as a code owner October 16, 2025 10:36

Merge branch 'main' into keep-unready-endpoints

df433e0

jukie reviewed Oct 21, 2025

internal/gatewayapi/route_test.go

jukie reviewed Oct 21, 2025

internal/gatewayapi/route.go
