tools/stress/device-observer: implement agent prom scraper#3808

Draft
nikw9944 wants to merge 2 commits into nikw9944/doublezero-3793 from nikw9944/doublezero-3794

Conversation

@nikw9944
Contributor

Summary of Changes

  • Replaces the promscrape Noop stub with a real Prometheus scraper: on every --sample-interval tick the observer fetches --agent-metrics-url, parses the response with expfmt.NewTextParser, and appends one NDJSON row {t_ns, metric_name, value, labels_json} per metric sample to <working-dir>/observer.agent_metrics.json.
  • Exposes Scraper.Snapshot(), which returns the latest counter family totals (summed across label series). The abort decider landing in "3747-4: abort decider + sentinel (~200 LOC code)" (#3796) consumes this; gauges and other metric types are intentionally excluded because the trigger inputs are counter deltas only.
  • Wires the scraper into main.go (one-line change to the existing collector list).
  • Defense-in-depth: caps the response body at 16 MiB via io.LimitReader; treats an empty-but-2xx response as a soft failure so a transient empty body cannot clobber the snapshot to {}.
  • Documents the metrics-output schema in the README (NDJSON row format, field meanings) and updates the layout block.
  • Parent issue: "stress: implement tools/stress/device-observer" (#3747). Stacks on top of "3747-1: scaffolding + eAPI device sampler (~200 LOC code)" (#3793, branch nikw9944/doublezero-3793).
  • Fixes #3794 ("3747-2: agent Prometheus scrape (~150 LOC code)").

Testing Verification

  • go test -race -count=1 ./tools/stress/device-observer/internal/promscrape/... — 9 unit tests covering: happy-path NDJSON shape, label serialization round-trip, counter-only snapshot, snapshot stable across HTTP 500, snapshot reflects latest after a successful tick, empty-body freezes snapshot, malformed exposition body is logged and skipped, context cancellation, NDJSON line integrity.
  • golangci-lint run ./tools/stress/device-observer/... clean.
  • Compile-time var _ collector.Collector = (*Scraper)(nil) assertion guarantees the scraper still satisfies the collector interface used by main.go's errgroup.

nikw9944 added 2 commits May 29, 2026 20:07
Replaces the promscrape Noop stub with a real scraper that, on every
--sample-interval tick, fetches the doublezero-agent Prometheus endpoint at
--agent-metrics-url and appends one NDJSON row per metric sample to
observer.agent_metrics.json in the working directory.

Also exposes Scraper.Snapshot() returning the latest counter family totals
so the abort decider (PR #3796) can detect mid-sample counter increments
without standing up its own scraper.

Refs #3794

- Cap the scrape response body at 16 MiB via io.LimitReader (defense in
  depth against a misbehaving or compromised agent endpoint).
- Treat an empty-but-2xx response as a soft failure that freezes the
  snapshot, so the abort decider does not interpret a transient empty body
  as 'counters reset to zero'.
- Document that disk-write failures intentionally freeze the snapshot so
  the decider only ever sees fully-persisted ticks.
- Tone down the appendRows atomicity comment: the relevant invariant is
  'single writer per working directory', not POSIX append atomicity.
- Add TestEmptyBodyFreezesSnapshot.

Refs #3794