[WIP] RFC-0001: HUD Integration for Out-of-Tree CI Results #1
Draft
subinz1 wants to merge 10 commits into
Defines the HUD-side ingestion and display layer for OOT CI results, building on RFC-0050 (Cross-Repository CI Relay). Covers write path, storage schemas, DB protection, security, and three frontend views. Reference implementation: subinz1/test-infra#1
- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: moved schema validation from HUD to relay (Hop 1)
- Replication: both in_progress and completed records go to ClickHouse
- Timestamps: use downstream-reported started_at/completed_at
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Schema: added job_name, run_attempt columns
- Rate limit: 10 → 20 req/min (matches L2 PR default)
- Error handling: updated to match actual retry + raise behavior
- Sample payloads: rewritten to match actual nested {trusted, untrusted} wire format
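The `PutItem` → `UpdateItem` change above can be illustrated with a small parameter builder. This is a minimal sketch, not the reference implementation: `buildUpdateParams`, the `OotFields` shape, and the `dynamoKey` key attribute name are assumptions made for illustration. It emits a `SET` clause only for fields that carry a value, so a later callback that omits a field cannot overwrite a stored value with null.

```typescript
// Hypothetical record shape; the field names passed in are illustrative.
type OotFields = Record<string, string | number | null | undefined>;

interface UpdateParams {
  TableName: string;
  Key: { dynamoKey: string }; // assumed partition key attribute name
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string | number>;
}

// Build UpdateItem parameters that SET only the fields carrying a value.
// Returns null when there is nothing to write.
function buildUpdateParams(
  tableName: string,
  dynamoKey: string,
  fields: OotFields
): UpdateParams | null {
  const names: Record<string, string> = {};
  const values: Record<string, string | number> = {};
  const assignments: string[] = [];

  Object.entries(fields).forEach(([field, value], i) => {
    // Skipping null/undefined leaves any previously stored value untouched.
    if (value === null || value === undefined) return;
    names[`#f${i}`] = field; // placeholders avoid reserved-word clashes
    values[`:v${i}`] = value;
    assignments.push(`#f${i} = :v${i}`);
  });

  if (assignments.length === 0) return null;

  return {
    TableName: tableName,
    Key: { dynamoKey },
    UpdateExpression: `SET ${assignments.join(", ")}`,
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}
```

With `PutItem`, the same callback would replace the whole item, which is how null fields ended up clobbering earlier values.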
The L2 relay now includes downstream_repo_level (L1–L4) in the trusted dict, determined from the allowlist rather than self-reported by downstream. Updated the RelayPayload interface, extractDynamoRecord, and sample payloads to read it from trusted.
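As a rough illustration of reading the level from `trusted` only, the sketch below models the nested wire format and ignores anything the downstream reports about itself. The helper name `readRepoLevel` and the runtime guard are assumptions; only `trusted`, `untrusted`, and `downstream_repo_level` come from the text above.

```typescript
const REPO_LEVELS = ["L1", "L2", "L3", "L4"] as const;
type RepoLevel = (typeof REPO_LEVELS)[number];

// Simplified view of the nested payload; the real interface has more fields.
interface RelayPayload {
  trusted: { downstream_repo_level: string; [key: string]: unknown };
  untrusted: Record<string, unknown>;
}

// Read the level from the relay-populated dict only. Anything under
// `untrusted` is self-reported by the downstream and is never consulted.
function readRepoLevel(payload: RelayPayload): RepoLevel {
  const level = payload.trusted.downstream_repo_level;
  const match = REPO_LEVELS.find((l) => l === level);
  if (match === undefined) {
    throw new Error(`unexpected downstream_repo_level: ${level}`);
  }
  return match;
}
```

A payload that carries a different level under `untrusted` still resolves to the value the relay derived from the allowlist.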
Align with updated L2 workflow dict: add schema_version, check_run_id, and run_id to the workflow interface, DynamoDB schema table, ClickHouse schema, extractDynamoRecord, and sample payloads. Switch dynamoKey from run_attempt to check_run_id for per-execution uniqueness. Fix test-results key to use the hyphenated form matching the L2 action.
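A sketch of the resulting key construction follows. It assumes the key keeps the `{repo}/{delivery_id}/{workflow_name}/{job_name}/…` shape listed earlier, with `check_run_id` as the final segment in place of `run_attempt`; the exact layout is not confirmed here, and `buildDynamoKey` and `KeyParts` are hypothetical names.

```typescript
// Hypothetical input shape for illustration.
interface KeyParts {
  repo: string; // e.g. "org/repo"
  deliveryId: string;
  workflowName: string;
  jobName: string;
  checkRunId: number;
}

// Join the parts into a single key. Because check_run_id differs for each
// execution, a re-run of the same job yields a distinct key.
function buildDynamoKey(p: KeyParts): string {
  return [
    p.repo,
    p.deliveryId,
    p.workflowName,
    p.jobName,
    String(p.checkRunId),
  ].join("/");
}
```

Two executions of the same job differ only in the last segment, which is what gives per-execution uniqueness.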
queue_time is only meaningful for first attempts (run_attempt == 1). On retries, the dispatch timestamp is stale and the relay should send null. HUD already handles this — it only writes queue_time when non-null. Also fixed a stale dynamoKey reference that still showed run_attempt instead of check_run_id.
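The relay-side rule can be sketched as below. How `queue_time` is actually derived is not specified here; this example assumes it is the number of seconds between the dispatch timestamp and the downstream-reported `started_at`, and `computeQueueTime` is a hypothetical helper.

```typescript
// Returns queue time in seconds for first attempts, null for retries.
// Assumption: queue time = started_at minus the dispatch timestamp.
function computeQueueTime(
  runAttempt: number,
  dispatchedAtMs: number,
  startedAtMs: number
): number | null {
  // On retries the dispatch timestamp belongs to the first attempt,
  // so any difference computed from it would be misleading.
  if (runAttempt !== 1) return null;
  return Math.max(0, (startedAtMs - dispatchedAtMs) / 1000);
}
```

Because HUD writes `queue_time` only when it is non-null, a null from a retry leaves the stored value alone.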
Replace the 2-state model with the L2 PR's strict 3-state machine: DISPATCHED → IN_PROGRESS → COMPLETED. No shortcuts allowed — missing IN_PROGRESS or missing DISPATCHED both result in rejection. State key updated from run_id to check_run_id for per-execution uniqueness. Added mermaid diagram and updated callback token note to reflect that DISPATCHED partially addresses dispatch provenance.
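The strict ordering can be expressed as a small transition table keyed by `check_run_id`. This is a sketch under assumptions: the in-memory `Map` stands in for whatever store tracks state, `applyCallback` is a hypothetical name, and repeated callbacks for the same state are treated as rejections, which the text above does not spell out.

```typescript
type OotState = "DISPATCHED" | "IN_PROGRESS" | "COMPLETED";

// The only state that may follow each state; COMPLETED is terminal.
const NEXT: Record<OotState, OotState | null> = {
  DISPATCHED: "IN_PROGRESS",
  IN_PROGRESS: "COMPLETED",
  COMPLETED: null,
};

// Accept the callback only if it is the next state in sequence for this
// check_run_id. Returns false (reject) without changing the stored state.
function applyCallback(
  states: Map<number, OotState>,
  checkRunId: number,
  incoming: OotState
): boolean {
  const current = states.get(checkRunId);
  // With no prior record, only DISPATCHED is acceptable.
  const expected = current === undefined ? "DISPATCHED" : NEXT[current];
  if (expected !== incoming) return false;
  states.set(checkRunId, incoming);
  return true;
}
```

An `IN_PROGRESS` callback with no prior `DISPATCHED`, or a `COMPLETED` callback that skips `IN_PROGRESS`, both return false here.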
Use <hardware>, <company name>, <version> placeholders instead of specific vendor or accelerator names in sample payloads, schema comments, and motivation text.
The L2 action uses test_results (underscore) in the workflow dict, not test-results (hyphen). The hyphenated name is only the GHA input name. Also removed failures array from test_results — the L2 action sends summary counts only; detailed test results go via artifact_url.
Standalone HTML mockup showing the existing HUD homepage alongside the three new OOT views (Summary, Per-Backend Dashboard, PR Integration). Built from actual torchci source code styles and component patterns.
Summary
This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).
Data Flow
flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end
    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end
    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end
    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload caps (2MB)"]
    end
    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(SharedReplacingMergeTree)")]
    end
    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end
    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-OOT-Relay-Token\n{trusted, untrusted}"| API
    API -->|"UpdateItem"| DDB
    DDB --> STR --> REP --> CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE

Key points:
- Artifact URLs travel in the `completed` callback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouse
- Frontend pages read `artifact_url` from ClickHouse and render it as an external link; there is no direct connection between HUD and downstream storage
- Both `in_progress` and `completed` records are replicated to ClickHouse; `SharedReplacingMergeTree` handles deduplication: when a `completed` record arrives for the same `dynamoKey`, it replaces the `in_progress` row

What this RFC covers
- `in_progress` callbacks → DynamoDB via `UpdateItem` → replicated to ClickHouse (shows "running" indicators)
- `completed` callbacks → DynamoDB via `UpdateItem` (merges into existing record) → replicated to ClickHouse (replaces `in_progress` row via `SharedReplacingMergeTree`)
- `/oot`: Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
- `/oot/[org]/[repo]`: Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
- `/pr/[number]`: Collapsible "Out-of-Tree Backends" section in existing PR pages
- `X-OOT-Relay-Token` header, error handling strategy, signed callback token proposal, 3-state machine for status transitions
- `UpdateItem` write logic

Reference implementation
A reference implementation is available at pytorch/test-infra#8069, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.
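For orientation, the auth check and payload cap shown on the `/api/oot/results` node of the data-flow diagram might look roughly like the following. This is not taken from the reference implementation: `gateRequest`, the 401/413 status codes, and reading "2MB" as 2 × 1024 × 1024 bytes are all assumptions.

```typescript
const MAX_PAYLOAD_BYTES = 2 * 1024 * 1024; // assumed reading of "2MB"

// Compare two strings without returning early on the first mismatch.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) {
    diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  }
  return diff === 0;
}

interface GateInput {
  relayToken: string | undefined; // value of the X-OOT-Relay-Token header
  expectedToken: string; // secret shared with the relay
  bodyBytes: number; // size of the request body
}

// Check auth first, then size, so unauthenticated callers learn nothing
// about payload limits. The status codes are illustrative choices.
function gateRequest(input: GateInput): { ok: boolean; status: number } {
  if (
    input.relayToken === undefined ||
    !constantTimeEqual(input.relayToken, input.expectedToken)
  ) {
    return { ok: false, status: 401 };
  }
  if (input.bodyBytes > MAX_PAYLOAD_BYTES) {
    return { ok: false, status: 413 };
  }
  return { ok: true, status: 200 };
}
```

Only requests that pass both checks would proceed to the `UpdateItem` write.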
HUD Mockup design
The draft OOT HUD UI mockup can be seen at the OOT HUD Mockup link.
Status
Ready for review