Skip to content

aigw: better envoy launch error handling #1304

@codefromthecrypt

Description

@codefromthecrypt

Right now, if you have an invalid config envoy will crash and there will be no output, and the process will hang. If you supply —debug, you can see why it crashed but it still hangs.

2025-10-08T07:57:23.453-0400	ERROR	provider	file/file.go:100	failed to reload resources initially	{"runner": "provider", "error": "failed to load resources from file /var/folders/qy/4hghk2xs5wd4jhjtg5ffxdwr0000gn/T/aigw-run/envoy-ai-gateway-resources/config.yaml: local validation error: spec.telemetry.accessLog.settings[0].format.omit_empty_values: Invalid value: value provided for unknown field"}
time=2025-10-08T07:57:26.318-04:00 level=WARN msg="Falling back to Envoy listener adminPort check" err="timeout waiting for child processes: no child process found"
time=2025-10-08T07:57:28.319-04:00 level=INFO msg="Waiting for Envoy to be ready..." err="dial tcp 127.0.0.1:1975: connect: connection refused"

A couple improvements we should make:

  1. Ensure we always see ERROR from envoy in aigw output, as this explains failures. So, reconfigure so we never have zero output on ERROR.
  2. Ensure when envoy crashes, we crash
  • If possible, with the way cobra is working, have its failures cancel our cancelable context to end the goroutines
  • If not, refactor out the subprocess detection code in the admin-lookup helper and crash unconditionally of admin server if the envoy subprocess died.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions