Stop SSE reconnect resets from panicking; add staleness metrics by ewokndor · Pull Request #7 · berachain/go-pyth-client

ewokndor · 2026-06-18T21:21:43Z

Symptom:

Recurring error logs {"StreamID":N,"Code":2} ~every 20 min, with the service panicking (and restarting) roughly hourly.

Root cause:

hermes.pyth.network sits behind Cloudflare, which by design resets the HTTP/2 stream (INTERNAL_ERROR, which is HTTP/2 error code 2 and matches the logged "Code":2; the connection is left intact) every ~5–12 min. We confirmed this by reproducing it with plain curl (the cf-ray response header identifies Cloudflare). Two client bugs turned that benign reset into noise and crashes:

- The r3labs SSE library's default reconnect backoff has a 15-min MaxElapsedTime measured from subscription start, so once the stream had lived >15 min the next reset surfaced as a fatal error instead of triggering a reconnect.
- Our retry counter never reset across independent disconnects, so every 3rd reset hit maxRetries and panicked, crash-looping the pod ~hourly.

So every log line was a real disconnect (~20 min apart); the num_retries 1/2/3 ladder was independent disconnects accumulating toward the panic, not retries of one event.
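The accumulation described above can be illustrated with a small simulation. This is only a sketch of the bookkeeping, not the client's actual code: the `client` type, `onDisconnect`, `onReconnected`, and `firstFatal` are hypothetical names, and the limit of 3 is inferred from "every 3rd reset". Note that the PR removes the counter entirely rather than resetting it; the resetting variant is shown only for contrast.

```go
package main

import "fmt"

// maxRetries is a hypothetical limit chosen to match "every 3rd reset".
const maxRetries = 3

// client models only the retry bookkeeping, not real networking.
type client struct {
	retries        int
	resetOnSuccess bool // false reproduces the bug described above
}

// onDisconnect records one disconnect and reports whether the limit was hit
// (the point where the old code panicked).
func (c *client) onDisconnect() bool {
	c.retries++
	return c.retries >= maxRetries
}

// onReconnected is called once the stream is delivering events again.
func (c *client) onReconnected() {
	if c.resetOnSuccess {
		c.retries = 0
	}
}

// firstFatal replays n independent disconnects, each followed by a successful
// reconnect, and returns the 1-based index of the disconnect that hits
// maxRetries, or 0 if none does.
func firstFatal(c *client, n int) int {
	for i := 1; i <= n; i++ {
		if c.onDisconnect() {
			return i
		}
		c.onReconnected()
	}
	return 0
}

func main() {
	fmt.Println("sticky counter, fatal at disconnect:", firstFatal(&client{}, 10))
	fmt.Println("resetting counter, fatal at disconnect:", firstFatal(&client{resetOnSuccess: true}, 10))
}
```

With a sticky counter the third independent disconnect is treated as the third retry of a single failure, which is the num_retries 1/2/3 ladder seen in the logs.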

What we changed (hermes/):

- Reconnect through Cloudflare resets indefinitely (ReconnectStrategy with MaxElapsedTime=0); these resets are recoverable and logged at info.
- Removed the panic and the retry counter; unrecoverable errors now log at error and re-subscribe with capped backoff (no more crash-loop).
- Added health signals for metrics-based alerting:
  - LastStreamUpdate(): global stream liveness (transport health).
  - LastFeedPublishTime(feedID): per-feed staleness from Pyth's publish_time, which catches a single feed freezing even while the stream is alive.

Alerting:

Gauges live in the caller. Export time.Since(LastFeedPublishTime(id)) per feed as NoPriceUpdateSince{feed=...}; a single Prometheus rule (> 30s) then fans out per feed automatically. Add a global backstop from LastStreamUpdate().

Net effect:

Error-log spam and hourly panics are gone; reconnects are visible but quiet; stale prices now trip a metric alert regardless of why updates stopped.

ewokndor added 2 commits June 18, 2026 16:18

stop SSE reconnect resets from panicking; add staleness metrics

3c1405c

fix: make SSE reconnect strategy context-aware to stop goroutine leak on shutdown, and add a test

cb0f659

ewokndor wants to merge 2 commits into main from fix/better-reconnect-strategy
