[WIP] OOT HUD: Full ingestion pipeline — API, schemas, queries, and frontend pages#8069

Draft
subinz1 wants to merge 7 commits into pytorch:main from subinz1:oot-hud-pipeline

subinz1 commented May 12, 2026

Summary

Reference implementation for the HUD-side OOT CI integration, as specified in RFC-0054. This PR adds the full ingestion and display pipeline for out-of-tree backend CI results.

Note

This is a WIP/draft — not ready for merge. Opened for visibility and cross-referencing with the RFC and L2 relay PR (#7967).

What's included

Write path (relay → HUD → storage):

  • torchci/pages/api/oot/results.ts — API endpoint authenticated via X-OOT-Relay-Token
  • torchci/lib/oot/ootUtils.ts — Payload extraction, type definitions (RelayPayload, OotWorkflowJobRecord), and DynamoDB UpdateItem write logic
  • clickhouse_db_schema/default.oot_workflow_job/schema.sql — ClickHouse table with SharedReplacingMergeTree for upsert semantics
  • aws/lambda/clickhouse-replicator-dynamo/lambda_function.py — Replicator mapping for DynamoDB → ClickHouse
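
As an illustration of the first step in the write path, here is a minimal sketch of how a handler might check the X-OOT-Relay-Token header before touching storage. The header name comes from the list above; the helper names (`constantTimeEqual`, `isAuthorizedRelay`) and the lower-cased header map are assumptions made for this example and are not taken from the PR's code.

```typescript
// Hypothetical helpers: only the header name is taken from the PR description.
type HeaderMap = { [name: string]: string | string[] | undefined };

// Compare two strings without stopping at the first mismatching character.
// A length mismatch still returns early, so the token length is not hidden.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) {
    return false;
  }
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

// Assumes header names have already been lower-cased, as Node's HTTP layer does.
function isAuthorizedRelay(headers: HeaderMap, expectedToken: string): boolean {
  const raw = headers["x-oot-relay-token"];
  const token = Array.isArray(raw) ? raw[0] : raw;
  // Reject missing headers and an unset (empty) server-side secret.
  if (!token || !expectedToken) {
    return false;
  }
  return constantTimeEqual(token, expectedToken);
}
```

A handler built along these lines would presumably return 401 or 403 when `isAuthorizedRelay` is false and only then go on to payload extraction.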

Read path (ClickHouse → frontend):

  • torchci/clickhouse_queries/oot_summary/ — Global OOT summary query
  • torchci/clickhouse_queries/oot_backend_dashboard/ — Per-backend dashboard query
  • torchci/clickhouse_queries/oot_pr_results/ — Per-PR results query
  • torchci/pages/oot/index.tsx — Global OOT summary page (/oot)
  • torchci/pages/oot/[org]/[repo].tsx — Per-backend matrix dashboard
  • torchci/components/oot/OotPrSection.tsx — PR page integration component

Key design decisions

  • check_run_id used in dynamoKey for per-execution uniqueness (not run_attempt)
  • downstream_repo_level sourced from trusted payload (relay-determined from allowlist)
  • test_results uses the underscore key matching the L2 action wire format (an earlier revision used the hyphenated test-results)
  • UpdateItem with dynamic SET expressions — prevents completed callbacks from clobbering in_progress fields
  • Null-safe timing metrics: queue_time/execution_time only written when non-null
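
The UpdateItem and null-safety decisions above can be sketched roughly as follows. `UpdateExpression`, `ExpressionAttributeNames`, and `ExpressionAttributeValues` are the standard DynamoDB UpdateItem parameters; the function name `buildSetExpression`, the placeholder naming scheme, and the use of plain (document-client style) values are assumptions for illustration, not the PR's implementation.

```typescript
type FieldValue = string | number | boolean | null | undefined;
type DefinedValue = string | number | boolean;

interface UpdateParts {
  UpdateExpression: string;
  ExpressionAttributeNames: { [placeholder: string]: string };
  ExpressionAttributeValues: { [placeholder: string]: DefinedValue };
}

// Build a SET expression that mentions only fields with a real value.
// Attributes left out of the expression are not modified by UpdateItem,
// so a later callback cannot overwrite them with null.
function buildSetExpression(fields: { [name: string]: FieldValue }): UpdateParts | null {
  const names: { [placeholder: string]: string } = {};
  const values: { [placeholder: string]: DefinedValue } = {};
  const clauses: string[] = [];
  Object.keys(fields).forEach((name) => {
    const value = fields[name];
    if (value === null || value === undefined) {
      return; // skip: leave whatever is already stored
    }
    const n = clauses.length;
    names["#f" + n] = name;
    values[":v" + n] = value;
    clauses.push("#f" + n + " = :v" + n);
  });
  if (clauses.length === 0) {
    return null; // nothing to write
  }
  return {
    UpdateExpression: "SET " + clauses.join(", "),
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}

// Example: a completed callback that carries no queue_time.
const parts = buildSetExpression({
  status: "completed",
  queue_time: null,
  execution_time: 42,
});
```

In this example `queue_time` never appears in the expression, so a value written by an earlier in_progress callback would be left in place.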

Related

subinz1 added 4 commits April 24, 2026 15:52
…mapping, and frontend pages

Implements the HUD-side ingestion and display for Out-of-Tree CI results,
as described in the OOT HUD RFC V3. The relay (PR pytorch#7967) forwards
{trusted, untrusted} payloads to the new /api/oot/results endpoint, which
validates, extracts fields, and writes to DynamoDB. DynamoDB Streams
replicates to ClickHouse via the existing replicator Lambda. Three frontend
views display the results: a global OOT summary, a per-backend matrix
dashboard, and a collapsible section on PR pages.

Address @ZainRizvi's review on pytorch/rfcs#96:

- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: removed schema validation from HUD (moved to relay)
- Removed daily budget enforcement
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Timestamps: use downstream-reported started_at/completed_at instead of now()
- Timing metrics: only set queue_time/execution_time when non-null
- ClickHouse schema: added job_name, run_attempt columns
- Queries: select job_name, run_attempt as proper columns
- Frontend: updated interfaces to include new fields

The L2 relay now includes downstream_repo_level (L1–L4) in the trusted
dict, determined from the allowlist rather than self-reported by
downstream. Updated RelayTrusted, OotWorkflowJobRecord, and
extractDynamoRecord to read it from trusted.

Align with updated L2 workflow dict: add schema_version, check_run_id,
and run_id to RelayWorkflow, OotWorkflowJobRecord, ClickHouse schema,
queries, and frontend interfaces. Switch dynamoKey from run_attempt to
check_run_id for per-execution uniqueness. Fix test-results key to use
the hyphenated form matching the L2 action. Update buildMatrix to
prefer highest run_attempt for deduplication.
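
The deduplication rule mentioned for buildMatrix (prefer the highest run_attempt) could look something like the sketch below. The row shape, the grouping key of workflow_name plus job_name, and the function name `dedupeByHighestAttempt` are assumptions for illustration; the actual buildMatrix may group rows differently.

```typescript
interface JobRow {
  workflow_name: string;
  job_name: string;
  run_attempt: number;
  conclusion: string;
}

// Keep one row per (workflow_name, job_name), preferring the highest run_attempt.
function dedupeByHighestAttempt(rows: JobRow[]): JobRow[] {
  const best: { [key: string]: JobRow | undefined } = {};
  rows.forEach((row) => {
    // JSON-encode the pair so names containing separators cannot collide.
    const key = JSON.stringify([row.workflow_name, row.job_name]);
    const current = best[key];
    if (current === undefined || row.run_attempt > current.run_attempt) {
      best[key] = row;
    }
  });
  const out: JobRow[] = [];
  Object.keys(best).forEach((key) => {
    const row = best[key];
    if (row !== undefined) {
      out.push(row);
    }
  });
  return out;
}

// Example: the retried "build" job wins over its first attempt.
const deduped = dedupeByHighestAttempt([
  { workflow_name: "ci", job_name: "build", run_attempt: 1, conclusion: "failure" },
  { workflow_name: "ci", job_name: "build", run_attempt: 2, conclusion: "success" },
  { workflow_name: "ci", job_name: "test", run_attempt: 1, conclusion: "success" },
]);
```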

vercel Bot commented May 12, 2026

@subinz1 is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

meta-cla Bot added the CLA Signed label May 12, 2026
subinz1 added 3 commits May 12, 2026 16:33
The L2 action uses test_results (underscore) in the workflow dict, not
test-results (hyphen). Removed failures/failed_tests_json handling —
the L2 action sends summary counts only; detailed results go via
artifact_url.
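
To make the key change concrete, the following sketch reads summary counts from the underscore key test_results and ignores the hyphenated spelling. Only the key name test_results comes from the commit message; the count field names (`passed`, `failed`, `skipped`) and the helper names are hypothetical, since the commit does not list the fields of the summary.

```typescript
// Hypothetical summary shape; the real field names are not given in the PR text.
interface TestSummary {
  passed: number | null;
  failed: number | null;
  skipped: number | null;
}

// Accept only finite, non-negative numbers; anything else becomes null.
function asCount(value: unknown): number | null {
  return typeof value === "number" && isFinite(value) && value >= 0 ? value : null;
}

// Read summary counts from the underscore key only.
function extractTestSummary(workflow: { [key: string]: unknown }): TestSummary | null {
  const raw = workflow["test_results"];
  if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
    return null;
  }
  const counts = raw as { [key: string]: unknown };
  return {
    passed: asCount(counts["passed"]),
    failed: asCount(counts["failed"]),
    skipped: asCount(counts["skipped"]),
  };
}
```

Under these assumptions a payload that still used the hyphenated key would produce no summary at all, which makes a key mismatch visible rather than silently writing zeros.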
subinz1 force-pushed the oot-hud-pipeline branch from 3370b70 to 7a45ca7 May 12, 2026 11:57