@codefromthecrypt
Contributor

Description

Before #1309, we had to work around a bug in Envoy Gateway (EG) where, in standalone mode, the admin server couldn't be detected. Since func-e always ensures there's an admin server, and that EG bug is fixed, we can remove the tech debt.

We can also consolidate the internal health check with the Docker HEALTHCHECK for the same reason.

@codefromthecrypt codefromthecrypt requested a review from a team as a code owner October 9, 2025 01:53
@codefromthecrypt
Contributor Author

Here's an example Docker healthcheck run. Notice that it detected the ephemeral admin server, since it is no longer blocked by the upstream bug.

$ docker inspect --format='{{json .State.Health.Log}}' aigw | jq
[
  {
    "Start": "2025-10-09T09:48:51.266683328+08:00",
    "End": "2025-10-09T09:48:51.430264538+08:00",
    "ExitCode": 1,
    "Output": "time=2025-10-09T01:48:51.428Z level=ERROR msg=\"Envoy admin server is not ready\" adminPort=38165 error=\"unexpected status code: 503\"\n2025/10/09 01:48:51 Health check failed: unexpected status code: 503\n"
  },
  {
    "Start": "2025-10-09T09:49:01.433312274+08:00",
    "End": "2025-10-09T09:49:01.623144062+08:00",
    "ExitCode": 1,
    "Output": "time=2025-10-09T01:49:01.621Z level=ERROR msg=\"Envoy admin server is not ready\" adminPort=38165 error=\"unexpected status code: 503\"\n2025/10/09 01:49:01 Health check failed: unexpected status code: 503\n"
  },
  {
    "Start": "2025-10-09T09:49:11.626673234+08:00",
    "End": "2025-10-09T09:49:11.813061603+08:00",
    "ExitCode": 0,
    "Output": ""
  }
]
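
The log above shows the check failing with a 503 twice and then passing once the admin server is ready. A minimal sketch of that kind of probe follows, assuming it maps to a GET on Envoy's admin `/ready` endpoint. The names `checkReady` and `probe` are hypothetical and are not taken from this PR; the stand-in server exists only so the example runs without Envoy.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// checkReady (hypothetical name) issues GET <adminURL>/ready and returns an
// error unless the response status is 200 OK.
func checkReady(ctx context.Context, client *http.Client, adminURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, adminURL+"/ready", nil)
	if err != nil {
		return fmt.Errorf("building request: %w", err)
	}
	resp, err := client.Do(req)
	if err != nil {
		return fmt.Errorf("calling admin server: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status code: %d", resp.StatusCode)
	}
	return nil
}

// probe starts a stand-in admin server that answers every request with the
// given status, then runs checkReady against it. Illustration only.
func probe(status int) error {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	defer srv.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return checkReady(ctx, srv.Client(), srv.URL)
}

func main() {
	fmt.Println(probe(http.StatusServiceUnavailable)) // prints "unexpected status code: 503"
	fmt.Println(probe(http.StatusOK))                 // prints "<nil>"
}
```

A health check command built on a function like this would exit non-zero on error, which is what Docker records as `"ExitCode": 1` in the health log.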

envoyAdmin, err := aigw.NewEnvoyAdminClient(ctx, os.Getpid(), envoyAdminPort)
if err != nil {
	stderrLogger.Error("Failed to find Envoy admin server", "error", err)
	serverCancel() // Likely a crashed envoy process
Contributor Author

@codefromthecrypt codefromthecrypt Oct 9, 2025

Note that this cancel only occurs if envoyAdminPort was zero (so the port needs to be looked up). A crashed Envoy therefore still goes undetected here if you supplied YAML with a hard-coded admin server in it (i.e. didn't use env config).

Signed-off-by: Adrian Cole <[email protected]>
@mathetake mathetake enabled auto-merge (squash) October 9, 2025 02:05
Signed-off-by: Adrian Cole <[email protected]>
auto-merge was automatically disabled October 9, 2025 02:27

Head branch was pushed to by a user without write access

@mathetake mathetake enabled auto-merge (squash) October 9, 2025 02:28
Signed-off-by: Adrian Cole <[email protected]>
auto-merge was automatically disabled October 9, 2025 02:39

Head branch was pushed to by a user without write access

@mathetake mathetake enabled auto-merge (squash) October 9, 2025 02:42
@mathetake
Member

Still failing

@codefromthecrypt
Contributor Author

Opened envoyproxy/gateway#7176 to track the race.

Signed-off-by: Adrian Cole <[email protected]>
auto-merge was automatically disabled October 9, 2025 03:11

Head branch was pushed to by a user without write access

@codefromthecrypt
Contributor Author

Added a commit to dump the logs on a flake, since I think gateway main has this working.

@codefromthecrypt
Contributor Author

Actually, I can get this. There's a race where the admin address path isn't yet a valid file.

Signed-off-by: Adrian Cole <[email protected]>
@codefromthecrypt
Contributor Author

so glad debug works now. 🤞 this gets it

@codecov-commenter

Codecov Report

❌ Patch coverage is 70.45455% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.64%. Comparing base (f752c57) to head (43c4581).

Files with missing lines    Patch %    Lines
internal/aigw/admin.go      70.45%     12 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (70.45%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (77.64%) is below the target coverage (86.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1314      +/-   ##
==========================================
+ Coverage   77.59%   77.64%   +0.05%     
==========================================
  Files         123      123              
  Lines       15717    15718       +1     
==========================================
+ Hits        12195    12204       +9     
+ Misses       2896     2888       -8     
  Partials      626      626              

	return "", fmt.Errorf("timeout waiting for admin address file %s: %w", path, lastErr)
case <-ticker.C:
	// Verify it's a file
	if info, err := os.Stat(path); err != nil {
Contributor Author

This is the key code. Before, we weren't waiting for Envoy to actually write the admin address file; we just errored if it wasn't there yet.
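
The wait described here can be sketched as a ticker-driven poll with a deadline. This is a rough illustration under assumptions, not the code merged in this PR: `waitForAdminAddress`, `demo`, the file name, and the intervals are all hypothetical, and the goroutine in `demo` merely stands in for Envoy writing the file late.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

// waitForAdminAddress (hypothetical name) polls path on every tick until it is
// a non-empty regular file, then returns its trimmed contents. If ctx ends
// first, the most recent failure is wrapped into the timeout error.
func waitForAdminAddress(ctx context.Context, path string, interval time.Duration) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	lastErr := errors.New("no check completed")
	for {
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("timeout waiting for admin address file %s: %w", path, lastErr)
		case <-ticker.C:
			// Verify it's a file before trying to read it.
			info, err := os.Stat(path)
			if err != nil {
				lastErr = err
				continue
			}
			if !info.Mode().IsRegular() {
				lastErr = fmt.Errorf("%s is not a regular file", path)
				continue
			}
			b, err := os.ReadFile(path)
			if err != nil {
				lastErr = err
				continue
			}
			if addr := strings.TrimSpace(string(b)); addr != "" {
				return addr, nil
			}
			// The file exists but has not been filled in yet; keep waiting.
			lastErr = fmt.Errorf("%s is empty", path)
		}
	}
}

// demo simulates a process that writes the address file 50ms after the wait
// begins. Illustration only; the address value is made up for the example.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "admin-address")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "admin-address.txt")

	go func() {
		time.Sleep(50 * time.Millisecond)
		_ = os.WriteFile(path, []byte("127.0.0.1:38165\n"), 0o600)
	}()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return waitForAdminAddress(ctx, path, 10*time.Millisecond)
}

func main() {
	fmt.Println(demo())
}
```

Keeping the last error around means a timeout reports why the final attempt failed (missing, not a regular file, or empty) rather than only that time ran out.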

@codefromthecrypt
Copy link
Contributor Author

woot green!

@mathetake mathetake merged commit 7cdbbd3 into envoyproxy:main Oct 9, 2025
31 checks passed
johnugeorge pushed a commit to johnugeorge/ai-gateway that referenced this pull request Oct 9, 2025
nutanix-Hrushikesh pushed a commit to nutanix-Hrushikesh/ai-gateway that referenced this pull request Oct 16, 2025
nutanix-Hrushikesh pushed a commit to nutanix-Hrushikesh/ai-gateway that referenced this pull request Oct 16, 2025