feat(transport): add transport-aware health check endpoints #1201
Expose readiness and liveness probes via the Caddy admin API so
Kubernetes (and other orchestrators) can detect when a hub's transport
connection is actually healthy, not just that the process is running.
- New TransportHealthChecker optional interface (Ready / Live) that
transports can implement to report their state.
- New admin.api.mercure_health Caddy module exposing
/mercure/health/{ready,live} aggregate endpoints and
/mercure/health/{name}/{ready,live} per-hub endpoints on the admin
API (port 2019).
- Caddyfile name directive for identifying hubs in per-hub endpoints
and future metrics labels (defaults to "default").
- Helm chart: transport-aware probes enabled by default, with fallback
to the legacy /healthz on HTTP port when healthCheck.enabled=false.
Admin port always exposed; metrics.port kept for backward compat.
- /healthz on the HTTP port is deprecated (only checks that Caddy is
running, not transport connectivity).
Transports that do not implement the interface (Bolt, Local) are
considered always-healthy; the Enterprise transports (Redis, Postgres,
Kafka, Pulsar) implement it in the Mercure Enterprise repository.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Adds transport-aware readiness/liveness health checks exposed via the Caddy admin API so orchestrators can distinguish “process is up” from “transport is healthy”, and updates the Helm chart/docs to use these endpoints (with /healthz marked deprecated).
Changes:
- Introduces `TransportHealthChecker` (`Ready`/`Live`) and a new Caddy admin module `admin.api.mercure_health` exposing `/mercure/health/*` endpoints.
- Adds a `name` directive for Mercure hubs to support per-hub health endpoints.
- Updates Helm chart probes to use the admin health endpoints by default and updates docs/Caddyfiles to deprecate `/healthz`.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| transport.go | Adds optional transport health-check interface (Ready/Live). |
| caddy/health.go | Implements admin API health endpoints aggregating per-hub/per-transport status. |
| caddy/health_test.go | Adds coverage for health endpoints (OK/404/405/unhealthy). |
| caddy/mercure.go | Stores hub metadata (name/transport) for health aggregation and parses name directive. |
| charts/mercure/values.yaml | Adds adminPort and new healthCheck values; deprecates metrics.port usage. |
| charts/mercure/templates/deployment.yaml | Adds admin container port and switches probes to /mercure/health/* when enabled. |
| charts/mercure/templates/service.yaml | Routes metrics Service to the admin container port. |
| docs/hub/config.md | Documents new admin API health endpoints and deprecates /healthz. |
| Caddyfile | Marks /healthz as deprecated in comments. |
| dev.Caddyfile | Marks /healthz as deprecated in comments. |
| examples/chat/chart/mercure-example-chat/README.md | Formatting update to values table (helm-docs output). |
| .github/workflows/lint.yml | Workflow config tweak (per diff). |
The Caddy admin API binds to localhost:2019 by default for security, so httpGet probes from the kubelet fail with connection refused (the kubelet connects to the pod IP, not localhost inside the container).

Switch to exec probes that run curl from inside the container, which matches the pattern already used by the preStop hook.

Also update the documentation to reflect this and note the alternative (binding admin to 0.0.0.0, with the security caveat).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Return 404 when a per-hub health endpoint is queried for a hub name that doesn't exist, instead of silently responding 200 OK (which hid typos and misconfigurations).
- Don't expose internal transport error details in the HTTP response body; log them server-side and return a generic error message to avoid leaking connection details if admin is exposed beyond localhost.
- Chart: use adminPort (with metrics.port fallback) for preStop URL and metrics Service port so everything stays aligned when adminPort is customized.
- values.yaml: fix helm-docs annotation marker on adminPort.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A few of the Copilot review threads on this PR look already addressed in the current head (
Separately, a likely root cause for the Generated by Claude Code |
…pine

The caddy:2-alpine base image ships ca-certificates, libcap and mailcap but does not include curl. With healthCheck.enabled=true (the new default) the curl-based exec probes would always fail readiness and liveness, blocking ct install and any real deployment. The preStop hook had the same latent issue but failed silently.

Switch both probes and the preStop hook to BusyBox wget (which is included in alpine), matching the wget-based Docker healthcheck already recommended in docs/hub/config.md.

Also regenerate charts/mercure/README.md so the values table documents the new adminPort and healthCheck.* fields.
The caddy:2-alpine base image ships wget, not curl. Update the Kubernetes and Docker Compose examples in the docs accordingly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>