[WIP] RFC-0001: HUD Integration for Out-of-Tree CI Results #1

Draft
subinz1 wants to merge 10 commits into master from oot-hud-integration-rfc


@subinz1 commented Apr 28, 2026

Summary

This RFC defines the HUD-side ingestion and display layer for Out-of-Tree (OOT) CI results, building on RFC-0050 (Cross-Repository CI Relay for PyTorch Out-of-Tree Backends).

Data Flow

flowchart LR
    subgraph Downstream["Downstream CI (OOT Backend)"]
        DS["Run tests\n+ upload artifacts"]
    end

    subgraph ART["Artifact Storage (org-managed)"]
        STORE[("Logs, test reports,\nJUnit XML")]
    end

    subgraph Relay["Relay Server"]
        RH["Result Handler\n• OIDC verify\n• Allowlist check\n• Rate limit"]
    end

    subgraph HUD["HUD"]
        API["/api/oot/results\n• Auth check\n• Payload caps (2MB)"]
    end

    subgraph Storage["Storage"]
        DDB[("DynamoDB\ntorchci-oot-workflow-job\n(in_progress + completed)")]
        STR["DynamoDB Stream"]
        REP["clickhouse-replicator-dynamo"]
        CH[("ClickHouse\ndefault.oot_workflow_job\n(SharedReplacingMergeTree)")]
    end

    subgraph Frontend["HUD Frontend"]
        P1["/oot — Global Summary"]
        P2["/oot/org/repo — Per-Backend"]
        P3["/pr/N — OOT Section"]
    end

    DS -->|"Upload artifacts"| STORE
    DS -->|"① POST in_progress\n② POST completed\n+ artifact_url\n(OIDC token)"| RH
    RH -->|"X-OOT-Relay-Token\n{trusted, untrusted}"| API
    API -->|"UpdateItem"| DDB
    DDB --> STR --> REP --> CH
    CH -->|"Query results +\nartifact_url"| P1 & P2 & P3
    P2 & P3 -.->|"User clicks\nexternal link"| STORE

Key points:

  • Artifact URLs are included in the completed callback payload and flow through the Result Handler → HUD API → DynamoDB → ClickHouse
  • HUD pages read artifact_url from ClickHouse and render it as an external link — no direct connection between HUD and downstream storage
  • Both in_progress and completed records are replicated to ClickHouse; SharedReplacingMergeTree handles deduplication — when a completed record arrives for the same dynamoKey, it replaces the in_progress row
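The replacement behaviour in the last key point can be illustrated with a small in-memory sketch. It assumes each replicated row carries a monotonically increasing `version` value that decides which row wins for a given `dynamoKey`; the `OotRow` shape, the `version` field, and the `collapseByKey` helper are hypothetical and are not taken from the RFC. In the ReplacingMergeTree family the replacement happens during background merges, so a query may briefly see both rows; the sketch only shows the end state.

```typescript
// Hypothetical row shape: only dynamoKey and the two statuses come from the RFC.
interface OotRow {
  dynamoKey: string;
  status: "in_progress" | "completed";
  version: number; // assumption: higher value means a later write
}

// Keep one row per dynamoKey, preferring the row with the highest version.
// This mimics the end result of a ReplacingMergeTree-style merge.
function collapseByKey(rows: OotRow[]): OotRow[] {
  const latest = new Map<string, OotRow>();
  for (const row of rows) {
    const seen = latest.get(row.dynamoKey);
    if (seen === undefined || row.version >= seen.version) {
      latest.set(row.dynamoKey, row);
    }
  }
  return Array.from(latest.values());
}
```

Passing an `in_progress` row and a later `completed` row with the same `dynamoKey` yields a single `completed` row, which is what the HUD pages would be expected to read once the merge has run.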

What this RFC covers

  • Write path: Downstream CI → Result Handler → HUD API → DynamoDB → ClickHouse
    • in_progress callbacks → DynamoDB via UpdateItem → replicated to ClickHouse (shows "running" indicators)
    • completed callbacks → DynamoDB via UpdateItem (merges into existing record) → replicated to ClickHouse (replaces in_progress row via SharedReplacingMergeTree)
    • Artifact URLs travel in the callback payload via the relay; downstream CI never sends them to HUD directly
  • Read path: Three new HUD views:
    • /oot — Global OOT CI summary (cross-repo health overview, repos sorted by pass rate)
    • /oot/[org]/[repo] — Per-backend dashboard (matrix view: PRs × jobs, failure drill-down, external artifact links)
    • /pr/[number] — Collapsible "Out-of-Tree Backends" section in existing PR pages
  • Storage schemas: DynamoDB table and ClickHouse table designs
  • DB protection: Rate limiting (per-repo at relay), payload caps (2MB at HUD API)
  • Security: OIDC authentication, trusted/untrusted payload split, dedicated X-OOT-Relay-Token header, error handling strategy, signed callback token proposal, 3-state machine for status transitions
  • Sample payloads: Two-stage wire format examples (downstream → relay, relay → HUD) with full field definitions
  • Implementation plan: 6-phase rollout with task-level breakdown:
    1. Storage Layer — DynamoDB + ClickHouse + replicator mapping
    2. HUD API Endpoint — types, extraction, UpdateItem write logic
    3. Relay Integration — handler → HUD forwarding, rate limiting, reusable GHA action
    4. HUD Frontend Pages — 3 views + saved ClickHouse queries
    5. End-to-End Validation — real downstream repo testing
    6. Security Hardening — callback token, state machine (future)
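The UpdateItem merge on the write path can be sketched as follows. This is a minimal illustration, assuming the HUD API builds a DynamoDB `SET` expression from only the non-null fields of an incoming callback, so that a `completed` callback which omits a field does not overwrite a value stored by the earlier `in_progress` callback. The helper name `buildUpdateParams`, the `conclusion` field, and the use of `dynamoKey` as the table key attribute are assumptions; `UpdateExpression`, `ExpressionAttributeNames`, and `ExpressionAttributeValues` are the standard UpdateItem parameters, shown here in document-client style with plain values.

```typescript
type FieldValue = string | number | null | undefined;

interface UpdateParams {
  TableName: string;
  Key: { dynamoKey: string }; // assumption: dynamoKey is the partition key
  UpdateExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, string | number>;
}

// Build UpdateItem parameters that SET only the fields carrying a value.
// Returns null when there is nothing to write.
function buildUpdateParams(
  dynamoKey: string,
  fields: Record<string, FieldValue>
): UpdateParams | null {
  const names: Record<string, string> = {};
  const values: Record<string, string | number> = {};
  const assignments: string[] = [];

  Object.keys(fields).forEach((field, i) => {
    const value = fields[field];
    if (value === null || value === undefined) {
      return; // skip: never clobber a stored value with null
    }
    // Placeholders avoid clashes with DynamoDB reserved words.
    names[`#f${i}`] = field;
    values[`:v${i}`] = value;
    assignments.push(`#f${i} = :v${i}`);
  });

  if (assignments.length === 0) {
    return null;
  }
  return {
    TableName: "torchci-oot-workflow-job",
    Key: { dynamoKey },
    UpdateExpression: `SET ${assignments.join(", ")}`,
    ExpressionAttributeNames: names,
    ExpressionAttributeValues: values,
  };
}
```

With PutItem the whole item would be replaced on every callback; building the expression from non-null fields is one way to get the merge semantics the write path describes.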

Reference implementation

A reference implementation is available at pytorch/test-infra#8069, which includes the API endpoint, ClickHouse schema, replicator mapping, saved ClickHouse queries, and all three frontend pages.

HUD Mockup design

The draft OOT HUD UI mockup is available at the OOT HUD Mockup link.

Status

Ready for review

@subinz1 force-pushed the oot-hud-integration-rfc branch from c94be0f to d57ee39 on April 28, 2026 11:38
@subinz1 changed the title from "[WIP] RFC-0051: HUD Integration for Out-of-Tree CI Results" to "[WIP] RFC-0001: HUD Integration for Out-of-Tree CI Results" on Apr 28, 2026
jewelkm89 and others added 10 commits April 29, 2026 22:29
Defines the HUD-side ingestion and display layer for OOT CI results,
building on RFC-0050 (Cross-Repository CI Relay). Covers write path,
storage schemas, DB protection, security, and three frontend views.
Reference implementation: subinz1/test-infra#1
- Auth: X-Hud-Internal-Bot → dedicated X-OOT-Relay-Token header
- Validation: moved schema validation from HUD to relay (Hop 1)
- Replication: both in_progress and completed records go to ClickHouse
- Timestamps: use downstream-reported started_at/completed_at
- DynamoDB: PutItem → UpdateItem to prevent null clobbering
- DynamoKey: expanded to {repo}/{delivery_id}/{workflow_name}/{job_name}/{run_attempt}
- Schema: added job_name, run_attempt columns
- Rate limit: 10 → 20 req/min (matches L2 PR default)
- Error handling: updated to match actual retry + raise behavior
- Sample payloads: rewritten to match actual nested {trusted, untrusted} wire format
The L2 relay now includes downstream_repo_level (L1–L4) in the trusted
dict, determined from the allowlist rather than self-reported by
downstream. Updated the RelayPayload interface, extractDynamoRecord, and
sample payloads to read it from trusted.
Align with updated L2 workflow dict: add schema_version, check_run_id,
and run_id to the workflow interface, DynamoDB schema table, ClickHouse
schema, extractDynamoRecord, and sample payloads. Switch dynamoKey from
run_attempt to check_run_id for per-execution uniqueness. Fix
test-results key to use the hyphenated form matching the L2 action.
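The key change in this commit can be sketched as a small helper. It assumes the final key keeps the earlier layout `{repo}/{delivery_id}/{workflow_name}/{job_name}/...` with `check_run_id` substituted for `run_attempt` as the last segment; the exact layout, the `JobIdentity` type, and the `buildDynamoKey` name are assumptions for illustration.

```typescript
// Hypothetical input shape; field names mirror the placeholders in the key format.
interface JobIdentity {
  repo: string; // "org/repo"
  deliveryId: string;
  workflowName: string;
  jobName: string;
  checkRunId: number;
}

// Join the identity fields into one key; check_run_id is the last segment,
// so two executions of the same job produce two distinct keys.
function buildDynamoKey(id: JobIdentity): string {
  return [
    id.repo,
    id.deliveryId,
    id.workflowName,
    id.jobName,
    String(id.checkRunId),
  ].join("/");
}
```

Because the last segment differs per execution, a re-run writes to a new record instead of updating the record of the earlier execution.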
queue_time is only meaningful for first attempts (run_attempt == 1).
On retries, the dispatch timestamp is stale and the relay should send
null. HUD already handles this — it only writes queue_time when
non-null. Also fixed a stale dynamoKey reference that still showed
run_attempt instead of check_run_id.
Replace the 2-state model with the L2 PR's strict 3-state machine:
DISPATCHED → IN_PROGRESS → COMPLETED. No shortcuts allowed — missing
IN_PROGRESS or missing DISPATCHED both result in rejection. State key
updated from run_id to check_run_id for per-execution uniqueness.
Added mermaid diagram and updated callback token note to reflect that
DISPATCHED partially addresses dispatch provenance.
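The strict ordering described in this commit can be sketched as a transition table keyed by `check_run_id`. This is an illustration only: the in-memory `Map`, the `StateTracker` class, and the choice to reject repeated callbacks for the same state are assumptions, and the RFC does not say where the state is stored.

```typescript
type OotState = "DISPATCHED" | "IN_PROGRESS" | "COMPLETED";

// Each state names the single predecessor it must follow; DISPATCHED has none.
const REQUIRED_PREVIOUS: Record<OotState, OotState | null> = {
  DISPATCHED: null,
  IN_PROGRESS: "DISPATCHED",
  COMPLETED: "IN_PROGRESS",
};

function isAllowedTransition(current: OotState | null, next: OotState): boolean {
  return REQUIRED_PREVIOUS[next] === current;
}

// Tracks the last accepted state per check_run_id (kept as a string key here).
class StateTracker {
  private states = new Map<string, OotState>();

  // Returns true and records the state if the transition is allowed,
  // otherwise returns false and leaves the stored state unchanged.
  apply(checkRunId: string, next: OotState): boolean {
    const current = this.states.get(checkRunId) ?? null;
    if (!isAllowedTransition(current, next)) {
      return false;
    }
    this.states.set(checkRunId, next);
    return true;
  }
}
```

An `IN_PROGRESS` callback with no prior `DISPATCHED` is rejected, as is a `COMPLETED` callback that skips `IN_PROGRESS`, which matches the "no shortcuts" rule.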
Use <hardware>, <company name>, <version> placeholders instead of
specific vendor or accelerator names in sample payloads, schema
comments, and motivation text.
The L2 action uses test_results (underscore) in the workflow dict, not
test-results (hyphen). The hyphenated name is only the GHA input name.
Also removed failures array from test_results — the L2 action sends
summary counts only; detailed test results go via artifact_url.
Standalone HTML mockup showing the existing HUD homepage alongside the
three new OOT views (Summary, Per-Backend Dashboard, PR Integration).
Built from actual torchci source code styles and component patterns.