[WIP][OOT HUD] Full pipeline: API endpoint, ClickHouse schema, replicator mapping, and frontend pages #1
Draft
subinz1 wants to merge 14 commits into
…mapping, and frontend pages

Implements the HUD-side ingestion and display for Out-of-Tree CI results, as described in the OOT HUD RFC V3. The relay (PR pytorch#7967) forwards {trusted, untrusted} payloads to the new /api/oot/results endpoint, which validates, extracts fields, and writes to DynamoDB. DynamoDB Streams replicates to ClickHouse via the existing replicator Lambda. Three frontend views display the results: a global OOT summary, a per-backend matrix dashboard, and a collapsible section on PR pages.
subinz1 added a commit to subinz1/rfcs that referenced this pull request on Apr 28, 2026
Defines the HUD-side ingestion and display layer for OOT CI results, building on RFC-0050 (Cross-Repository CI Relay). Covers the complete write path (Result Lambda → HUD API → DynamoDB → ClickHouse), three frontend views (global summary, per-backend dashboard, PR integration), storage schemas, DB protection (rate limits, payload caps, daily budgets), and security design (OIDC, trusted/untrusted split, callback token proposal). Reference implementation: subinz1/test-infra#1
subinz1 added a commit to subinz1/rfcs that referenced this pull request on Apr 28, 2026
Rename from RFC-0051 to RFC-0001. Defines the HUD-side ingestion and display layer for OOT CI results, building on the Cross-Repository CI Relay. Covers write path, storage schemas, DB protection, security, and three frontend views. Reference implementation: subinz1/test-infra#1
Address @ZainRizvi's review on pytorch/rfcs#96:
- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: removed schema validation from HUD (moved to relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Timestamps: use downstream-reported started_at/completed_at instead of now()
- Timing metrics: only set queue_time/execution_time when non-null
- ClickHouse schema: added job_name, run_attempt columns
- Queries: select job_name, run_attempt as proper columns
- Frontend: updated interfaces to include new fields
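The switch from PutItem to UpdateItem can be illustrated with a minimal sketch. It assumes a document-client style of plain JavaScript values; `buildDynamoKey` and `buildUpdateParts` are hypothetical helper names, not functions taken from results.ts. The idea is that a SET expression covering only the non-null fields leaves previously written attributes untouched.

```typescript
// Hypothetical key parts; the field names follow the commit message above.
interface KeyParts {
  repo: string;
  delivery_id: string;
  workflow_name: string;
  job_name: string;
  run_attempt: number;
}

// Key layout: {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
function buildDynamoKey(k: KeyParts): string {
  return [k.repo, k.delivery_id, k.workflow_name, k.job_name, k.run_attempt].join("/");
}

interface UpdateParts {
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string | number>;
}

// Build a SET expression that skips null/undefined fields, so a partial
// callback cannot overwrite earlier data with null. Returns null when
// there is nothing to set (an empty UpdateExpression is not valid).
function buildUpdateParts(
  fields: Record<string, string | number | null | undefined>
): UpdateParts | null {
  const names: Record<string, string> = {};
  const values: Record<string, string | number> = {};
  const clauses: string[] = [];
  Object.keys(fields).forEach((key, i) => {
    const value = fields[key];
    if (value === null || value === undefined) {
      return; // e.g. queue_time is only set when non-null
    }
    names[`#f${i}`] = key;
    values[`:v${i}`] = value;
    clauses.push(`#f${i} = :v${i}`);
  });
  if (clauses.length === 0) {
    return null;
  }
  return {
    UpdateExpression: `SET ${clauses.join(", ")}`,
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}
```

The returned parts would be spread into an UpdateItem request alongside the table name and key. Placeholder names (`#f0`, `:v0`) avoid collisions with DynamoDB reserved words such as `status`.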
The L2 relay now includes downstream_repo_level (L1–L4) in the trusted dict, determined from the allowlist rather than self-reported by downstream. Updated RelayTrusted, OotWorkflowJobRecord, and extractDynamoRecord to read it from trusted.
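A minimal sketch of reading the level from the trusted half only, so that a downstream repo cannot promote itself by self-reporting. The `RelayPayload` shape and the `readRepoLevel` name are assumptions for illustration; only `downstream_repo_level` and the trusted/untrusted split come from the commit message.

```typescript
type RepoLevel = "L1" | "L2" | "L3" | "L4";

// Hypothetical payload shape, reduced to the fields this sketch needs.
interface RelayPayload {
  trusted: { downstream_repo_level?: string };
  untrusted: Record<string, unknown>;
}

const REPO_LEVELS: RepoLevel[] = ["L1", "L2", "L3", "L4"];

// Return the relay-determined level, or null if it is missing or unknown.
// The untrusted dict is deliberately never consulted.
function readRepoLevel(payload: RelayPayload): RepoLevel | null {
  const level = payload.trusted.downstream_repo_level;
  for (const candidate of REPO_LEVELS) {
    if (candidate === level) {
      return candidate;
    }
  }
  return null;
}
```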
Align with updated L2 workflow dict: add schema_version, check_run_id, and run_id to RelayWorkflow, OotWorkflowJobRecord, ClickHouse schema, queries, and frontend interfaces. Switch dynamoKey from run_attempt to check_run_id for per-execution uniqueness. Fix test-results key to use the hyphenated form matching the L2 action. Update buildMatrix to prefer highest run_attempt for deduplication.
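The deduplication rule for the matrix can be sketched as follows. This is not the actual buildMatrix code: the `JobRow` shape, the `pr_number` and `conclusion` field names, and the `dedupeByHighestAttempt` helper are assumptions. For each (PR, job) cell, only the row with the highest run_attempt is kept.

```typescript
// Hypothetical row shape for one downstream job result.
interface JobRow {
  pr_number: number;
  job_name: string;
  run_attempt: number;
  conclusion: string;
}

// Keep one row per (PR, job) cell, preferring the highest run_attempt.
function dedupeByHighestAttempt(rows: JobRow[]): JobRow[] {
  const latest: Record<string, JobRow> = {};
  for (const row of rows) {
    const cellKey = `${row.pr_number}|${row.job_name}`;
    const seen = latest[cellKey];
    if (seen === undefined || row.run_attempt > seen.run_attempt) {
      latest[cellKey] = row;
    }
  }
  return Object.keys(latest).map((k) => latest[k] as JobRow);
}
```

With this rule, a retried job that passes on its second attempt shows as passing in its cell, while check_run_id in the storage key still keeps each execution as a separate record.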
The L2 action uses test_results (underscore) in the workflow dict, not test-results (hyphen). Removed failures/failed_tests_json handling — the L2 action sends summary counts only; detailed results go via artifact_url.
Surface the relay-determined repo level (L1–L4) in the OOT Summary table so users can see each backend's integration tier at a glance.
subinz1 pushed a commit to subinz1/rfcs that referenced this pull request on May 13, 2026
Defines the HUD-side ingestion and display layer for OOT CI results, building on RFC-0050 (Cross-Repository CI Relay). Covers write path, storage schemas, DB protection, security, and three frontend views. Reference implementation: subinz1/test-infra#1
- Changed 502 to 500 for DynamoDB write errors in results.ts (502 Bad Gateway implies a proxy; this API is the originating server)
- Removed unused groupKey variable in buildMatrix()

Addresses review feedback from @KarhouTam.
Both [repo].tsx and OotPrSection.tsx had identical copies of these functions. Moved to lib/oot/ootUtils.ts and imported from there. Addresses review feedback from @KarhouTam.
The L2 relay action sends run_attempt as a string (from env var). Coerce to number with Number() to ensure consistent DynamoDB/ClickHouse types. Also add artifact_url to RelayWorkflow and extract it in extractDynamoRecord so downstream-provided artifact links flow through to DynamoDB and ClickHouse.
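A small sketch of the coercion, under stated assumptions: the commit only says Number() is applied, so the extra checks here (rejecting empty strings, non-integers, and non-positive values) and the `coerceRunAttempt` name are illustrative additions rather than the PR's behavior.

```typescript
// Coerce a run_attempt that may arrive as a string (from an env var)
// into a positive integer, or null if it cannot be interpreted as one.
function coerceRunAttempt(raw: unknown): number | null {
  if (typeof raw !== "string" && typeof raw !== "number") {
    return null;
  }
  if (raw === "") {
    return null; // Number("") would otherwise yield 0
  }
  const n = Number(raw);
  const isPositiveInteger = isFinite(n) && Math.floor(n) === n && n > 0;
  return isPositiveInteger ? n : null;
}
```

Normalizing at the ingestion boundary keeps the DynamoDB attribute and the ClickHouse column on one numeric type, regardless of how the relay serialized the value.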
The L2 relay action sends test_results with {passed, failed, skipped}
but no total field. Compute total_tests as the sum when tr.total is
not provided, preserving backward compatibility if total is present.
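The fallback described above can be sketched as below. The `TestResults` type and the `computeTotalTests` name are hypothetical, and treating a missing count as 0 is an assumption made for this example.

```typescript
// Summary counts as sent in test_results; total is optional.
interface TestResults {
  passed?: number;
  failed?: number;
  skipped?: number;
  total?: number;
}

// Prefer an explicit total; otherwise sum the individual counts.
function computeTotalTests(tr: TestResults): number {
  if (typeof tr.total === "number") {
    return tr.total;
  }
  return (tr.passed ?? 0) + (tr.failed ?? 0) + (tr.skipped ?? 0);
}

// Payload without total: counts are summed.
const summed = computeTotalTests({ passed: 10, failed: 2, skipped: 3 });
// Payload with total: the reported value is kept as-is.
const reported = computeTotalTests({ passed: 1, failed: 1, skipped: 1, total: 7 });
```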
Summary
Implements the HUD-side ingestion and display for Out-of-Tree CI results, as described in the OOT HUD RFC V3. This is the end-to-end pipeline: from receiving relay callbacks to displaying results on HUD pages.
Write Path
- torchci/pages/api/oot/results.ts: Receives {trusted, untrusted} payloads from the result Lambda (PR "[CRCR] Initial implementation of L2", pytorch/test-infra#7967), validates auth (x-hud-internal-bot), enforces a 2 MB payload cap and daily budget, extracts/flattens fields, and writes to DynamoDB (torchci-oot-workflow-job)
- clickhouse_db_schema/default.oot_workflow_job/schema.sql: New table with OOT-specific columns (test counts, artifact URL, environment, relay-measured timing metrics)
- … torchci-oot-workflow-job → default.oot_workflow_job to the clickhouse-replicator-dynamo Lambda

Read Path
- /oot: Table of all OOT backend repos sorted by pass rate, with average duration and last run time
- /oot/[org]/[repo]: Matrix view: rows = PyTorch PRs, columns = downstream CI jobs, color-coded status chips

Files Changed
- torchci/pages/api/oot/results.ts
- torchci/lib/oot/ootUtils.ts
- clickhouse_db_schema/default.oot_workflow_job/schema.sql
- aws/lambda/clickhouse-replicator-dynamo/lambda_function.py
- torchci/pages/oot/index.tsx
- torchci/pages/oot/[org]/[repo].tsx
- torchci/components/oot/OotPrSection.tsx
- torchci/pages/[repoOwner]/[repoName]/pull/[prNumber].tsx
- torchci/clickhouse_queries/oot_summary/*
- torchci/clickhouse_queries/oot_backend_dashboard/*
- torchci/clickhouse_queries/oot_pr_results/*

Test plan
- … {trusted, untrusted} payloads and writes to DynamoDB
- … x-hud-internal-bot header
- /oot page renders summary table with correct pass rates
- /oot/[org]/[repo] renders matrix view with correct status chips

Authored with Claude.