Skip to content

feat(graph): per-node timing + datastore-read attribution in the check debug trace (internal proto)#3199

Open
poucet wants to merge 1 commit into
authzed:mainfrom
poucet:feat/check-trace-timing
Open

feat(graph): per-node timing + datastore-read attribution in the check debug trace (internal proto)#3199
poucet wants to merge 1 commit into
authzed:mainfrom
poucet:feat/check-trace-timing

Conversation

@poucet

@poucet poucet commented Jun 30, 2026

Copy link
Copy Markdown

What

Adds two pieces of timing to the internal dispatch CheckDebugTrace so a
single Check / CheckBulk debug trace (--with-tracing) can answer two
questions that duration alone cannot:

  1. Were sub-problems traversed concurrently? Each node now carries a
    wall-clock start_time; combined with the existing authoritative duration,
    every dispatch node lands on one absolute time axis, so sibling overlap (or
    serialization behind the concurrency limit) is reconstructable.
  2. Where was time spent blocked on datastore reads? Each node now carries
    repeated DatastoreQuery datastore_queries{query_shape, start_time, duration, relationship_count} for the relationship reads that node issued
    itself. Per-query events (not one summed number) because a node can issue
    concurrent reads, so the events show overlap instead of hiding it.

Implements the internal-proto slice of #3198.

How

  • proto/internal/dispatch/v1/dispatch.proto: CheckDebugTrace.start_time (9),
    CheckDebugTrace.datastore_queries (10), and a new DatastoreQuery message.
  • internal/dstrace: a small leaf package with a context-scoped,
    mutex-guarded Collector (a node may issue concurrent reads).
  • internal/graph/check.go: when Debug != NO_DEBUG, install a fresh collector
    per node (dispatched children re-enter Check and install their own, so reads
    attribute to the node that issued them), and stamp start_time +
    datastore_queries next to the existing duration.
  • internal/datastore/proxy/observable.go: the proxy already times each query
    from issue until the iterator is fully drained; it now also records that
    observation into the collector when one is present. This is a single
    conditional branch and a no-op when debug tracing is off.

Scope / follow-up

Internal-proto only. The public authzed.api.v1.CheckDebugTrace fields
(start_time/end_time/datastore_queries) live in authzed/api and must
land + be published there before the buf dep can be bumped and
internal/services/v1/debug.go convertCheckTrace can surface them. The
internal trace already round-trips through DebugInfo, so the approach is fully
exercised end-to-end here.

Tests

  • internal/dstrace: collector unit tests (recording, concurrency safety,
    nil-safety, nested independence).
  • internal/datastore/proxy/observable_test.go: proxy records query events
    (forward + reverse) into a context collector; no-op without one.
  • internal/graph/computed: an integration test wrapping memdb in the
    observable proxy and running a multi-hop check with trace debugging, then
    walking the dispatch trace to assert every executed node has a start_time
    and that datastore reads are attributed as DatastoreQuery events.

Compatibility

New fields only (9/10 internal); buf breaking ... use: WIRE passes. duration
is unchanged and remains authoritative. The collector and the proxy branch only
activate under debug tracing, so the non-debug hot path is untouched.

@poucet poucet requested a review from a team as a code owner June 30, 2026 08:41
@github-actions

Copy link
Copy Markdown

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment same as the below format.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request

@github-actions github-actions Bot added area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools) labels Jun 30, 2026
…eck debug trace

Adds start_time and datastore_queries to the internal dispatch CheckDebugTrace
so a single Check / CheckBulk debug trace can answer two questions that
duration alone cannot: (1) whether sub-problems were traversed concurrently,
and (2) where time was spent blocked on datastore reads.

start_time is the wall-clock anchor for each node (the monotonic duration stays
authoritative; end = start_time + duration). A context-scoped, concurrency-safe
collector is installed per dispatch node; the observable datastore proxy records
each relationship query (shape, start, duration, row count) into it, so reads
attribute to the node that issued them. Both only activate when debug tracing is
enabled, leaving the non-debug hot path untouched.

This is the internal-proto-only slice; the public authzed/api CheckDebugTrace
fields and the internal->public conversion follow once that field lands.

Refs authzed#3198
@poucet poucet force-pushed the feat/check-trace-timing branch from 28f7adc to 181d295 Compare June 30, 2026 08:44
@poucet poucet marked this pull request as draft June 30, 2026 08:44
@poucet poucet marked this pull request as ready for review June 30, 2026 08:50
@poucet

poucet commented Jun 30, 2026

Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/datastore Affects the storage system area/tooling Affects the dev or user toolchain (e.g. tests, ci, build tools)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant