[WIP] OOT HUD: Full ingestion pipeline — API, schemas, queries, and frontend pages#8069
Draft
subinz1 wants to merge 7 commits into
…mapping, and frontend pages

Implements the HUD-side ingestion and display for Out-of-Tree CI results, as described in the OOT HUD RFC V3. The relay (PR pytorch#7967) forwards {trusted, untrusted} payloads to the new /api/oot/results endpoint, which validates, extracts fields, and writes to DynamoDB. DynamoDB Streams replicates to ClickHouse via the existing replicator Lambda. Three frontend views display the results: a global OOT summary, a per-backend matrix dashboard, and a collapsible section on PR pages.
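As a rough illustration of the first step of the write path, the sketch below checks the relay token and the {trusted, untrusted} envelope before any field extraction. The function names, the status codes, and the hand-rolled comparison are assumptions made for this example, not the code in results.ts; in a Node handler, crypto.timingSafeEqual would be the more usual choice for the comparison.

```typescript
// Hypothetical envelope shape; the real RelayPayload in ootUtils.ts may differ.
interface RelayEnvelope {
  trusted: Record<string, unknown>;
  untrusted: Record<string, unknown>;
}

// Compare the X-OOT-Relay-Token header value against the expected secret
// without returning early on the first mismatching character.
function tokensMatch(provided: string | undefined, expected: string): boolean {
  if (!provided || expected.length === 0) return false;
  let diff = provided.length ^ expected.length;
  for (let i = 0; i < expected.length; i++) {
    diff |= provided.charCodeAt(i % provided.length) ^ expected.charCodeAt(i);
  }
  return diff === 0;
}

function isPlainObject(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Envelope check only: both halves must be present and be objects.
function isRelayEnvelope(body: unknown): body is RelayEnvelope {
  return isPlainObject(body) && isPlainObject(body.trusted) && isPlainObject(body.untrusted);
}

// Returns the HTTP status this sketch would respond with (assumed codes).
function handleRequest(token: string | undefined, body: unknown, expected: string): number {
  if (!tokensMatch(token, expected)) return 401;
  if (!isRelayEnvelope(body)) return 400;
  return 200; // extraction and the DynamoDB write would follow here
}
```

The envelope check is deliberately shallow: it only confirms that both halves exist, leaving field-level validation to whichever component owns it.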
Address @ZainRizvi's review on pytorch/rfcs#96:
- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: removed schema validation from HUD (moved to relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Timestamps: use downstream-reported started_at/completed_at instead of now()
- Timing metrics: only set queue_time/execution_time when non-null
- ClickHouse schema: added job_name, run_attempt columns
- Queries: select job_name, run_attempt as proper columns
- Frontend: updated interfaces to include new fields
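The PutItem → UpdateItem change can be sketched as a builder that emits a SET clause only for fields that are actually present, so a later callback carrying nulls leaves earlier values alone. This is a minimal sketch under assumptions: the helper name, the placeholder naming scheme, and the sample field values are hypothetical, and the values are shown in plain JavaScript form (as a document-style client would accept) rather than marshalled DynamoDB attribute values.

```typescript
interface UpdateParts {
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, unknown>;
}

// Build a dynamic SET expression, skipping null/undefined fields so that
// attributes already stored on the item are not overwritten with nulls.
function buildSetExpression(fields: Record<string, unknown>): UpdateParts | null {
  const names: Record<string, string> = {};
  const values: Record<string, unknown> = {};
  const clauses: string[] = [];
  let i = 0;
  for (const key of Object.keys(fields)) {
    const value = fields[key];
    if (value === null || value === undefined) continue;
    const namePlaceholder = `#f${i}`;
    const valuePlaceholder = `:v${i}`;
    names[namePlaceholder] = key; // placeholders avoid reserved-word clashes
    values[valuePlaceholder] = value;
    clauses.push(`${namePlaceholder} = ${valuePlaceholder}`);
    i++;
  }
  if (clauses.length === 0) return null; // nothing to write
  return {
    UpdateExpression: `SET ${clauses.join(", ")}`,
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}

// Example: an in_progress callback with no completion data yet.
const inProgress = buildSetExpression({
  status: "in_progress",
  started_at: "2025-01-01T00:00:00Z", // downstream-reported, not now()
  completed_at: null,
  queue_time: 12,
  execution_time: null,
});
```

With PutItem the whole item is replaced, so the nulls above would erase anything written earlier; with a SET built this way, only status, started_at, and queue_time are touched.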
The L2 relay now includes downstream_repo_level (L1–L4) in the trusted dict, determined from the allowlist rather than self-reported by downstream. Updated RelayTrusted, OotWorkflowJobRecord, and extractDynamoRecord to read it from trusted.
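A small sketch of what reading the level from the trusted half might look like. The function name and the null fallback are assumptions for illustration; only the field name downstream_repo_level and the L1–L4 range come from the description above.

```typescript
type RepoLevel = "L1" | "L2" | "L3" | "L4";
const REPO_LEVELS: readonly string[] = ["L1", "L2", "L3", "L4"];

interface Envelope {
  trusted: Record<string, unknown>;
  untrusted: Record<string, unknown>;
}

// Read the level only from `trusted`; any self-reported value in
// `untrusted` is ignored on purpose.
function readRepoLevel(payload: Envelope): RepoLevel | null {
  const value = payload.trusted["downstream_repo_level"];
  if (typeof value === "string" && REPO_LEVELS.indexOf(value) !== -1) {
    return value as RepoLevel;
  }
  return null; // assumed fallback when the relay did not supply a valid level
}
```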
Align with updated L2 workflow dict: add schema_version, check_run_id, and run_id to RelayWorkflow, OotWorkflowJobRecord, ClickHouse schema, queries, and frontend interfaces. Switch dynamoKey from run_attempt to check_run_id for per-execution uniqueness. Fix test-results key to use the hyphenated form matching the L2 action. Update buildMatrix to prefer highest run_attempt for deduplication.
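To make the key change and the deduplication rule concrete, here is a minimal sketch. It assumes the key keeps the earlier {repo}/{delivery_id}/{workflow_name}/{job_name} prefix with check_run_id as the last segment, and that a matrix cell is identified by repo, workflow, and job; both are guesses for illustration, as are the row type and the helper name latestAttempts (the actual buildMatrix may group differently).

```typescript
// Hypothetical row shape with only the fields this sketch needs.
interface OotJobRow {
  repo: string;
  delivery_id: string;
  workflow_name: string;
  job_name: string;
  check_run_id: number;
  run_attempt: number;
  conclusion: string | null;
}

// One key per execution: a re-run gets a new check_run_id, so it lands in
// its own item instead of overwriting the earlier attempt.
function dynamoKey(r: OotJobRow): string {
  return [r.repo, r.delivery_id, r.workflow_name, r.job_name, r.check_run_id].join("/");
}

// For display, keep only the row with the highest run_attempt per cell.
function latestAttempts(rows: OotJobRow[]): OotJobRow[] {
  const best = new Map<string, OotJobRow>();
  for (const row of rows) {
    const cell = `${row.repo}/${row.workflow_name}/${row.job_name}`;
    const current = best.get(cell);
    if (current === undefined || row.run_attempt > current.run_attempt) {
      best.set(cell, row);
    }
  }
  return Array.from(best.values());
}

const firstTry: OotJobRow = {
  repo: "org/backend", delivery_id: "d-1", workflow_name: "ci",
  job_name: "test", check_run_id: 1001, run_attempt: 1, conclusion: "failure",
};
const retry: OotJobRow = { ...firstTry, check_run_id: 1002, run_attempt: 2, conclusion: "success" };
const shown = latestAttempts([retry, firstTry]);
```

Both attempts are stored under distinct keys, while the matrix shows only the retry.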
@subinz1 is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it.
The L2 action uses test_results (underscore) in the workflow dict, not test-results (hyphen). Removed failures/failed_tests_json handling — the L2 action sends summary counts only; detailed results go via artifact_url.
Summary
Reference implementation for the HUD-side OOT CI integration, as specified in RFC-0054. This PR adds the full ingestion and display pipeline for out-of-tree backend CI results.
Note
This is a WIP/draft — not ready for merge. Opened for visibility and cross-referencing with the RFC and L2 relay PR (#7967).
What's included
Write path (relay → HUD → storage):
- torchci/pages/api/oot/results.ts: API endpoint authenticated via X-OOT-Relay-Token
- torchci/lib/oot/ootUtils.ts: Payload extraction, type definitions (RelayPayload, OotWorkflowJobRecord), and DynamoDB UpdateItem write logic
- clickhouse_db_schema/default.oot_workflow_job/schema.sql: ClickHouse table with SharedReplacingMergeTree for upsert semantics
- aws/lambda/clickhouse-replicator-dynamo/lambda_function.py: Replicator mapping for DynamoDB → ClickHouse

Read path (ClickHouse → frontend):
- torchci/clickhouse_queries/oot_summary/: Global OOT summary query
- torchci/clickhouse_queries/oot_backend_dashboard/: Per-backend dashboard query
- torchci/clickhouse_queries/oot_pr_results/: Per-PR results query
- torchci/pages/oot/index.tsx: Global OOT summary page (/oot)
- torchci/pages/oot/[org]/[repo].tsx: Per-backend matrix dashboard
- torchci/components/oot/OotPrSection.tsx: PR page integration component

Key design decisions
- check_run_id used in dynamoKey for per-execution uniqueness (not run_attempt)
- downstream_repo_level sourced from trusted payload (relay-determined from allowlist)
- test-results uses hyphenated key matching the L2 action wire format
- UpdateItem with dynamic SET expressions: prevents completed callbacks from clobbering in_progress fields
- queue_time/execution_time only written when non-null

Related