
Stop SSE reconnect resets from panicking; add staleness metrics #7

Open
ewokndor wants to merge 2 commits into main from fix/better-reconnect-strategy

@ewokndor

Symptom:

Recurring error logs {"StreamID":N,"Code":2} ~every 20 min, with the service panicking (and restarting) roughly hourly.

Root cause:

hermes.pyth.network sits behind Cloudflare, which by design resets the HTTP/2 stream every ~5–12 min (INTERNAL_ERROR, which is HTTP/2 error code 2, i.e. the "Code":2 in the logs; the connection itself is left intact). We confirmed this by reproducing it with plain curl (the cf-ray response header identifies Cloudflare). Two client bugs turned that benign reset into noise + crashes:

  1. r3labs' default reconnect backoff has a 15-min MaxElapsedTime measured from subscription start, so after the stream lived >15 min the next reset surfaced as a fatal error instead of reconnecting.
  2. Our retry counter never reset across independent disconnects, so every 3rd reset hit maxRetries and panicked, crash-looping the pod ~hourly.

So every log line was a real disconnect (~20 min apart); the num_retries 1/2/3 ladder was independent disconnects accumulating toward the panic, not retries of one event.
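
The accumulation described above can be reproduced with a few lines of Go. This is only an illustrative sketch, not code from this PR: `maxRetries = 3` is inferred from "every 3rd reset", `panicsAfter` and `resetOnSuccess` are hypothetical names, and the reset-on-success variant is shown purely for contrast (the PR removes the counter instead).

```go
package main

import "fmt"

// maxRetries mirrors the limit described above. The value 3 is an
// assumption inferred from "every 3rd reset hit maxRetries".
const maxRetries = 3

// panicsAfter returns the index of the disconnect at which the retry counter
// reaches maxRetries, or 0 if it never does within the given number of
// disconnects. With resetOnSuccess=false the counter carries over between
// independent disconnects, which is the buggy behaviour described above.
func panicsAfter(disconnects int, resetOnSuccess bool) int {
	retries := 0
	for i := 1; i <= disconnects; i++ {
		retries++ // each disconnect counts as one retry
		if retries >= maxRetries {
			return i
		}
		// The stream reconnects successfully at this point.
		if resetOnSuccess {
			retries = 0
		}
	}
	return 0
}

func main() {
	const minutesBetweenResets = 20 // approximate spacing seen in the logs

	n := panicsAfter(10, false)
	fmt.Printf("carry-over counter: limit hit on disconnect %d (~%d min)\n",
		n, n*minutesBetweenResets)
	fmt.Printf("counter reset on success: limit hit on disconnect %d (0 = never)\n",
		panicsAfter(10, true))
}
```

With disconnects ~20 min apart, the carry-over counter reaches the limit on the third one, which matches the roughly hourly panic.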

What we changed (hermes/):

  • Reconnect through Cloudflare resets indefinitely (ReconnectStrategy with MaxElapsedTime=0) — recoverable, logged at info.
  • Removed the panic/retry-counter; unrecoverable errors now log at error and re-subscribe with capped backoff (no more crash-loop).
  • Added health signals for metrics-based alerting:
    • LastStreamUpdate() — global stream liveness (transport health).
    • LastFeedPublishTime(feedID) — per-feed staleness from Pyth's publish_time, catches a single feed freezing even while the stream is alive.
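
For reviewers who want a picture of "re-subscribe with capped backoff", here is a minimal stdlib-only sketch. It is not the code in this PR: `nextDelay`, `resubscribeLoop`, `errStreamReset`, the 1s base and the 30s cap are all hypothetical, and the real change relies on the SSE client's ReconnectStrategy for ordinary Cloudflare resets. The sketch only covers the outer loop that replaces the old panic path.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// errStreamReset stands in for whatever error the SSE client surfaces
// (hypothetical; used only by this sketch).
var errStreamReset = errors.New("stream reset")

// nextDelay doubles the previous delay, starting at base and never
// exceeding max.
func nextDelay(prev, base, max time.Duration) time.Duration {
	next := base
	if prev > 0 {
		next = prev * 2
	}
	if next > max {
		next = max
	}
	return next
}

// resubscribeLoop calls subscribe until it returns a nil error, which this
// sketch treats as a clean shutdown. subscribe reports how many events it
// received before the stream ended. No error path panics or gives up.
func resubscribeLoop(subscribe func() (int, error), sleep func(time.Duration)) {
	var delay time.Duration
	for {
		received, err := subscribe()
		if err == nil {
			return
		}
		if received > 0 {
			// The stream was healthy before it ended, so treat this as an
			// independent failure and start the backoff over.
			delay = 0
		}
		delay = nextDelay(delay, time.Second, 30*time.Second)
		// The standard log package has no levels; a real service would log
		// this at error level.
		log.Printf("stream ended: %v; re-subscribing in %s", err, delay)
		sleep(delay)
	}
}

func main() {
	calls := 0
	subscribe := func() (int, error) {
		calls++
		if calls <= 3 {
			return 0, errStreamReset // three failures in a row
		}
		return 0, nil // pretend the caller asked to stop
	}

	// Record the delays instead of sleeping so the demo finishes instantly.
	var delays []time.Duration
	resubscribeLoop(subscribe, func(d time.Duration) { delays = append(delays, d) })
	fmt.Println(delays) // [1s 2s 4s]
}
```

Resetting the delay after a healthy stream keeps independent disconnects from accumulating, which is the same mistake the old retry counter made.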

Alerting:

Gauges live in the caller. Export time.Since(LastFeedPublishTime(id)) per feed as NoPriceUpdateSince{feed=...}; a single Prometheus rule (> 30s) then fans out per feed automatically. A global backstop comes from LastStreamUpdate().
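
A rough sketch of how the two health signals and the per-feed staleness value could fit together is below. Only the names LastStreamUpdate and LastFeedPublishTime come from this PR; their signatures, the `Health` type, `Observe`, and `StalenessAt` are assumptions made for illustration. The sketch also assumes publish_time arrives as Unix seconds, and it prints the value where the real caller would set a gauge.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Health tracks stream liveness and per-feed publish times.
// Hypothetical type; the PR's actual implementation may differ.
type Health struct {
	mu          sync.RWMutex
	lastStream  time.Time
	lastPublish map[string]time.Time
}

func NewHealth() *Health {
	return &Health{lastPublish: make(map[string]time.Time)}
}

// Observe records one price update. The local receipt time feeds the global
// liveness signal; publishUnix (assumed to be Unix seconds) feeds the
// per-feed staleness signal.
func (h *Health) Observe(feedID string, publishUnix int64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.lastStream = time.Now()
	h.lastPublish[feedID] = time.Unix(publishUnix, 0)
}

// LastStreamUpdate returns when any update was last received.
func (h *Health) LastStreamUpdate() time.Time {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.lastStream
}

// LastFeedPublishTime returns the last publish time seen for feedID, or the
// zero time if the feed has never been seen.
func (h *Health) LastFeedPublishTime(feedID string) time.Time {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.lastPublish[feedID]
}

// StalenessAt returns, per feed, the seconds between its last publish time
// and now.
func (h *Health) StalenessAt(now time.Time) map[string]float64 {
	h.mu.RLock()
	defer h.mu.RUnlock()
	out := make(map[string]float64, len(h.lastPublish))
	for id, t := range h.lastPublish {
		out[id] = now.Sub(t).Seconds()
	}
	return out
}

func main() {
	h := NewHealth()
	h.Observe("feed-a", time.Now().Add(-45*time.Second).Unix()) // frozen feed
	h.Observe("feed-b", time.Now().Unix())                      // fresh feed

	for id, secs := range h.StalenessAt(time.Now()) {
		// The real caller would set NoPriceUpdateSince{feed=id} here.
		fmt.Printf("NoPriceUpdateSince{feed=%q} %.0f (over 30s: %t)\n", id, secs, secs > 30)
	}
}
```

Because staleness is derived from the publish time and not from the receipt time, a feed that keeps arriving with an old publish_time still shows up as stale even while the stream itself is alive.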

Net effect:

Error-log spam and hourly panics are gone; reconnects are visible but quiet; stale prices now trip a metric alert regardless of why updates stopped.
