[WIP][OOT HUD] Full pipeline: API endpoint, ClickHouse schema, replicator mapping, and frontend pages#1

Draft
subinz1 wants to merge 14 commits into main from oot-hud-pipeline

subinz1 commented Apr 24, 2026

Summary

Implements the HUD-side ingestion and display for Out-of-Tree CI results, as described in the OOT HUD RFC V3. This is the end-to-end pipeline: from receiving relay callbacks to displaying results on HUD pages.

Write Path

  • API endpoint (torchci/pages/api/oot/results.ts): Receives {trusted, untrusted} payloads from the result Lambda (pytorch/test-infra#7967, "[CRCR] Initial implementation of L2"), validates auth (x-hud-internal-bot header), enforces a 2MB payload cap and a daily budget, extracts and flattens fields, and writes to DynamoDB (torchci-oot-workflow-job)
  • ClickHouse schema (clickhouse_db_schema/default.oot_workflow_job/schema.sql): New table with OOT-specific columns (test counts, artifact URL, environment, relay-measured timing metrics)
  • Replicator mapping: Added torchci-oot-workflow-job → default.oot_workflow_job to the clickhouse-replicator-dynamo Lambda
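
The endpoint checks in the write path (auth header, 2MB cap, daily budget, payload shape) can be pictured as one gating function. The following is a minimal sketch under stated assumptions, not the code in results.ts: `gateRequest`, `GateInput`, `GateConfig`, and the 401 status for failed auth are all hypothetical, while the 400 and 429 statuses follow the test plan.

```typescript
// Illustrative sketch only: every name here is hypothetical, not taken from results.ts.
const MAX_PAYLOAD_BYTES = 2 * 1024 * 1024; // the 2MB cap from the summary

interface GateInput {
  headers: Record<string, string | undefined>; // assumed to have lowercased keys
  bodyByteLength: number; // size of the raw request body, measured by the caller
  body: unknown; // the parsed JSON body
  writesToday: number; // writes already accepted today (how this is stored is out of scope)
}

interface GateConfig {
  expectedToken: string;
  dailyBudget: number;
}

type GateResult = { ok: true } | { ok: false; status: number; reason: string };

function gateRequest(input: GateInput, config: GateConfig): GateResult {
  // 1. Auth. The 401 status is an assumption; the PR does not name one.
  if (input.headers["x-hud-internal-bot"] !== config.expectedToken) {
    return { ok: false, status: 401, reason: "missing or invalid x-hud-internal-bot" };
  }
  // 2. Payload cap: oversized bodies are rejected with 400, per the test plan.
  if (input.bodyByteLength > MAX_PAYLOAD_BYTES) {
    return { ok: false, status: 400, reason: "payload exceeds 2MB" };
  }
  // 3. Daily budget: 429 once the limit is reached, per the test plan.
  if (input.writesToday >= config.dailyBudget) {
    return { ok: false, status: 429, reason: "daily budget exhausted" };
  }
  // 4. Shape: the relay sends an object with both `trusted` and `untrusted` keys.
  const body = input.body as Record<string, unknown> | null;
  if (typeof body !== "object" || body === null || !("trusted" in body) || !("untrusted" in body)) {
    return { ok: false, status: 400, reason: "expected {trusted, untrusted}" };
  }
  return { ok: true };
}
```

Keeping the checks in a pure function like this would let each test-plan item be exercised without a running server or a DynamoDB table; the order of the checks is itself an assumption.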

Read Path

  • Global OOT Summary (/oot): Table of all OOT backend repos sorted by pass rate, with avg duration and last run time
  • Per-Backend Dashboard (/oot/[org]/[repo]): Matrix view — rows = PyTorch PRs, columns = downstream CI jobs, color-coded status chips
  • PR View Integration: Collapsible "Out-of-Tree Backends" accordion on existing PR pages, showing OOT results when they exist
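
To make the summary view concrete, here is a rough in-memory sketch of the aggregation behind the /oot table. In the PR this work is done by the saved ClickHouse queries, so the function below is only an illustration: `JobRow`, `RepoSummary`, and `summarize` are hypothetical names, treating `"success"` as the passing conclusion is an assumption, and so is the descending sort order.

```typescript
// Hypothetical row shape; the real columns live in default.oot_workflow_job.
interface JobRow {
  repo: string; // "org/repo" of the OOT backend
  conclusion: string; // assumed: "success" counts as a pass
  durationSeconds: number;
  completedAt: string; // assumed: ISO-8601 UTC, so string comparison orders by time
}

interface RepoSummary {
  repo: string;
  passRate: number; // fraction in [0, 1]
  avgDurationSeconds: number;
  lastRun: string;
}

function summarize(rows: JobRow[]): RepoSummary[] {
  // Group jobs by backend repo.
  const byRepo = new Map<string, JobRow[]>();
  for (const row of rows) {
    const bucket = byRepo.get(row.repo) ?? [];
    bucket.push(row);
    byRepo.set(row.repo, bucket);
  }

  const summaries: RepoSummary[] = [];
  byRepo.forEach((jobs, repo) => {
    const passed = jobs.filter((j) => j.conclusion === "success").length;
    const totalDuration = jobs.reduce((sum, j) => sum + j.durationSeconds, 0);
    const lastRun = jobs.map((j) => j.completedAt).reduce((a, b) => (a > b ? a : b));
    summaries.push({
      repo,
      passRate: passed / jobs.length,
      avgDurationSeconds: totalDuration / jobs.length,
      lastRun,
    });
  });

  // Highest pass rate first (the sort direction is an assumption).
  return summaries.sort((a, b) => b.passRate - a.passRate);
}
```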

Files Changed

| File | Action |
| --- | --- |
| torchci/pages/api/oot/results.ts | New: API endpoint |
| torchci/lib/oot/ootUtils.ts | New: types, validation, extraction |
| clickhouse_db_schema/default.oot_workflow_job/schema.sql | New: CH schema |
| aws/lambda/clickhouse-replicator-dynamo/lambda_function.py | Edit: +1 line to SUPPORTED_TABLES |
| torchci/pages/oot/index.tsx | New: global summary page |
| torchci/pages/oot/[org]/[repo].tsx | New: per-backend dashboard |
| torchci/components/oot/OotPrSection.tsx | New: PR view OOT section |
| torchci/pages/[repoOwner]/[repoName]/pull/[prNumber].tsx | Edit: added OotPrSection |
| torchci/clickhouse_queries/oot_summary/* | New: saved query |
| torchci/clickhouse_queries/oot_backend_dashboard/* | New: saved query |
| torchci/clickhouse_queries/oot_pr_results/* | New: saved query |

Test plan

  • Verify API endpoint accepts valid {trusted, untrusted} payloads and writes to DynamoDB
  • Verify auth rejects requests without x-hud-internal-bot header
  • Verify payloads larger than 2MB are rejected with 400
  • Verify daily budget enforcement (429 after limit)
  • Verify ClickHouse schema creates successfully
  • Verify replicator picks up new DynamoDB records and inserts into ClickHouse
  • Verify /oot page renders summary table with correct pass rates
  • Verify /oot/[org]/[repo] renders matrix view with correct status chips
  • Verify PR page shows the OOT accordion when results exist and hides it when there are none

Authored with Claude.

…mapping, and frontend pages

Implements the HUD-side ingestion and display for Out-of-Tree CI results,
as described in the OOT HUD RFC V3. The relay (PR pytorch#7967) forwards
{trusted, untrusted} payloads to the new /api/oot/results endpoint, which
validates, extracts fields, and writes to DynamoDB. DynamoDB Streams
replicates to ClickHouse via the existing replicator Lambda. Three frontend
views display the results: a global OOT summary, a per-backend matrix
dashboard, and a collapsible section on PR pages.
subinz1 marked this pull request as draft April 25, 2026 08:03
subinz1 changed the title from "[OOT HUD] Full pipeline: API endpoint, ClickHouse schema, replicator mapping, and frontend pages" to "[WIP][OOT HUD] Full pipeline: API endpoint, ClickHouse schema, replicator mapping, and frontend pages" Apr 25, 2026
subinz1 added a commit to subinz1/rfcs that referenced this pull request Apr 28, 2026
Defines the HUD-side ingestion and display layer for OOT CI results,
building on RFC-0050 (Cross-Repository CI Relay). Covers the complete
write path (Result Lambda → HUD API → DynamoDB → ClickHouse), three
frontend views (global summary, per-backend dashboard, PR integration),
storage schemas, DB protection (rate limits, payload caps, daily budgets),
and security design (OIDC, trusted/untrusted split, callback token proposal).

Reference implementation: subinz1/test-infra#1
subinz1 added a commit to subinz1/rfcs that referenced this pull request Apr 28, 2026
Rename from RFC-0051 to RFC-0001. Defines the HUD-side ingestion and
display layer for OOT CI results, building on the Cross-Repository CI
Relay. Covers write path, storage schemas, DB protection, security,
and three frontend views.

Reference implementation: subinz1/test-infra#1
subinz1 added 6 commits May 4, 2026 15:57
Address @ZainRizvi's review on pytorch/rfcs#96:

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: removed schema validation from HUD (moved to relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Timestamps: use downstream-reported started_at/completed_at instead of now()
- Timing metrics: only set queue_time/execution_time when non-null
- ClickHouse schema: added job_name, run_attempt columns
- Queries: select job_name, run_attempt as proper columns
- Frontend: updated interfaces to include new fields
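
As a rough illustration of two items in this commit, the sketch below builds the expanded DynamoKey and an UpdateItem-style expression that skips null fields, so a later partial callback cannot overwrite a stored value with null. `buildDynamoKey` and `buildUpdate` are hypothetical helpers, not functions from the PR; the returned field names (`UpdateExpression`, `ExpressionAttributeNames`, `ExpressionAttributeValues`) are the standard DynamoDB UpdateItem parameters, with `TableName` and `Key` left for the caller to supply.

```typescript
// Hypothetical helpers; the PR's own implementation may differ.
interface KeyParts {
  repo: string;
  deliveryId: string;
  workflowName: string;
  jobName: string;
  runAttempt: number;
}

// {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
function buildDynamoKey(p: KeyParts): string {
  return [p.repo, p.deliveryId, p.workflowName, p.jobName, String(p.runAttempt)].join("/");
}

interface UpdateParams {
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, unknown>;
}

// Emit a SET clause only for fields that carry a value. Null or undefined
// fields are left out, so whatever is already stored for them is preserved.
function buildUpdate(fields: Record<string, unknown>): UpdateParams | null {
  const names: Record<string, string> = {};
  const values: Record<string, unknown> = {};
  const sets: string[] = [];
  Object.keys(fields).forEach((field, i) => {
    const value = fields[field];
    if (value === null || value === undefined) return;
    names[`#f${i}`] = field; // placeholders sidestep DynamoDB reserved words
    values[`:v${i}`] = value;
    sets.push(`#f${i} = :v${i}`);
  });
  if (sets.length === 0) return null; // nothing to write
  return {
    UpdateExpression: "SET " + sets.join(", "),
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}
```

With PutItem the whole item is replaced, which is why a callback lacking `queue_time` would erase an earlier value; the SET-only update above touches just the attributes it names.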
The L2 relay now includes downstream_repo_level (L1–L4) in the trusted
dict, determined from the allowlist rather than self-reported by
downstream. Updated RelayTrusted, OotWorkflowJobRecord, and
extractDynamoRecord to read it from trusted.

Align with updated L2 workflow dict: add schema_version, check_run_id,
and run_id to RelayWorkflow, OotWorkflowJobRecord, ClickHouse schema,
queries, and frontend interfaces. Switch dynamoKey from run_attempt to
check_run_id for per-execution uniqueness. Fix test-results key to use
the hyphenated form matching the L2 action. Update buildMatrix to
prefer highest run_attempt for deduplication.
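
The deduplication rule mentioned for buildMatrix can be sketched as follows. This is not the PR's buildMatrix, whose body is not shown here; `OotJob` and `dedupeByHighestAttempt` are hypothetical, and keying a matrix cell by PR number plus job name is an assumption based on the rows and columns described for the per-backend dashboard.

```typescript
// Hypothetical record shape for one downstream job result.
interface OotJob {
  prNumber: number; // matrix row
  jobName: string; // matrix column
  runAttempt: number;
  conclusion: string;
}

// Keep one job per (PR, job name) cell, preferring the highest run_attempt.
function dedupeByHighestAttempt(jobs: OotJob[]): OotJob[] {
  const bestPerCell = new Map<string, OotJob>();
  for (const job of jobs) {
    const cellKey = `${job.prNumber}/${job.jobName}`;
    const current = bestPerCell.get(cellKey);
    if (current === undefined || job.runAttempt > current.runAttempt) {
      bestPerCell.set(cellKey, job);
    }
  }
  return Array.from(bestPerCell.values());
}
```

With this rule a rerun that passes replaces the earlier failed attempt in its cell, instead of both showing up as separate chips.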
The L2 action uses test_results (underscore) in the workflow dict, not
test-results (hyphen). Removed failures/failed_tests_json handling —
the L2 action sends summary counts only; detailed results go via
artifact_url.
subinz1 force-pushed the oot-hud-pipeline branch from 3370b70 to 7a45ca7 May 12, 2026 11:57
Surface the relay-determined repo level (L1–L4) in the OOT Summary
table so users can see each backend's integration tier at a glance.
subinz1 pushed a commit to subinz1/rfcs that referenced this pull request May 13, 2026
Defines the HUD-side ingestion and display layer for OOT CI results,
building on RFC-0050 (Cross-Repository CI Relay). Covers write path,
storage schemas, DB protection, security, and three frontend views.
Reference implementation: subinz1/test-infra#1

- Changed 502 to 500 for DynamoDB write errors in results.ts
  (502 Bad Gateway implies a proxy; this API is the originating server)
- Removed unused groupKey variable in buildMatrix()

Addresses review feedback from @KarhouTam.
subinz1 force-pushed the oot-hud-pipeline branch from 05e2bcb to e9f6292 May 15, 2026 04:10
subinz1 added 5 commits May 15, 2026 09:46
Both [repo].tsx and OotPrSection.tsx had identical copies of these
functions. Moved to lib/oot/ootUtils.ts and imported from there.

Addresses review feedback from @KarhouTam.

The L2 relay action sends run_attempt as a string (from env var).
Coerce to number with Number() to ensure consistent DynamoDB/ClickHouse
types. Also add artifact_url to RelayWorkflow and extract it in
extractDynamoRecord so downstream-provided artifact links flow through
to DynamoDB and ClickHouse.
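
A small sketch of the coercion this commit describes. The commit only states that `Number()` is applied; the helper name `coerceRunAttempt` and the choice to return `null` for empty or non-numeric input are assumptions made here for illustration.

```typescript
// Hypothetical helper: normalize run_attempt to a number, or null if unusable.
function coerceRunAttempt(raw: unknown): number | null {
  if (typeof raw === "number") {
    return Number.isFinite(raw) ? raw : null;
  }
  // Number("") is 0, so empty and whitespace-only strings are rejected explicitly.
  if (typeof raw === "string" && raw.trim() !== "") {
    const parsed = Number(raw);
    return Number.isFinite(parsed) ? parsed : null;
  }
  return null;
}
```

The explicit empty-string guard matters because an unset environment variable often arrives as `""`, which a bare `Number()` would silently turn into attempt 0.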
The L2 relay action sends test_results with {passed, failed, skipped}
but no total field. Compute total_tests as the sum when tr.total is
not provided, preserving backward compatibility if total is present.
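
The fallback described above could look roughly like this. `TestResults` and `totalTests` are hypothetical names, and treating a missing count as 0 is an assumption not stated in the commit.

```typescript
// Hypothetical shape of the test_results dict sent by the L2 relay action.
interface TestResults {
  passed?: number;
  failed?: number;
  skipped?: number;
  total?: number; // not sent by the current L2 action, honored if present
}

function totalTests(tr: TestResults): number {
  // Prefer an explicit total for backward compatibility.
  if (typeof tr.total === "number") return tr.total;
  // Otherwise sum the summary counts, counting missing fields as 0 (assumption).
  return (tr.passed ?? 0) + (tr.failed ?? 0) + (tr.skipped ?? 0);
}
```

For example, `{ passed: 10, failed: 2, skipped: 1 }` gives 13, while `{ passed: 10, failed: 2, skipped: 1, total: 20 }` keeps the reported 20.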