Add reusable claude-infra-alerts workflow#8006

Open: ZainRizvi wants to merge 16 commits into main from claude-infra-alerts-reusable

@ZainRizvi
Contributor

Summary

  • Introduces .github/workflows/_claude-infra-alerts.yml, a reusable workflow any pytorch/meta-pytorch repo can call to run a Claude Code investigation over CI and infrastructure alerts and file structured diagnostic issues against a target repo.
  • Methodology, JSON action schema, and Write-guard hooks live under tools/claude-infra-alerts/. The workflow hardcodes the bedrock environment, model, IAM role, and the HUD MCP server; callers supply team-specific priorities, alert sources, and noise filters via the custom-instructions input.
  • Ported from meta-pytorch/pytorch-gha-infra's claude-infra-investigations workflow; the generic version keeps the high-level methodology (gather/dedup/investigate, root-cause attribution, JSON action schema) while moving team-specific alert queries and infra context out to callers.

Inputs

| Input | Required | Purpose |
|---|---|---|
| custom-instructions | yes | Team alert sources, infra context, noise filters. Injected into the prompt. |
| marker-label | yes | Label identifying AI-filed issues. Used for dedup and auto-added to every create. |
| target-repo | no (defaults to caller) | org/repo where issues are filed. Apply step rejects mismatches. |
| default-labels | no | Newline-delimited additional labels merged into every create. Not shown to Claude. |
| setup-command | no | Bash run before the investigation step. |

Inherited secret: HUD_INTERNAL_BOT_TOKEN.

Test plan

  • Land this PR.
  • Open a caller workflow in pytorch/ciforge targeting pytorch/ciforge as the issue destination, merge it, and manually trigger the workflow to confirm the reusable workflow resolves, the Claude investigation runs end-to-end, and apply-actions posts an issue against ciforge.
  • Iterate on any bugs surfaced by the ciforge run.
  • Follow-up PR in meta-pytorch/pytorch-gha-infra adds a caller that files against pytorch/pytorch.

Notes

  • Pinned to @main for pytorch/test-infra checkout per current guidance; switch to a tag once there's a second caller.
  • The IAM role gha_workflow_claude_ec2_diagnostics needs its OIDC trust policy to list every caller repo. Currently only meta-pytorch/pytorch-gha-infra is listed; pytorch/ciforge (and later meta-pytorch/pytorch-gha-infra again) need to be added before the caller workflow will succeed.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 23, 2026
@ZainRizvi ZainRizvi force-pushed the claude-infra-alerts-reusable branch from 3da89b1 to 9c11bc4 Compare May 5, 2026 20:06
ZainRizvi added 16 commits May 7, 2026 20:05
Introduces a generic, reusable GitHub Actions workflow at
.github/workflows/_claude-infra-alerts.yml that any pytorch-owned
repo can call to run a Claude Code investigation over CI and
infrastructure alerts and file structured diagnostic issues.

Methodology, JSON action schema, and Write-guard hooks live under
tools/claude-infra-alerts/. The workflow hardcodes the bedrock
environment, model, IAM role, and HUD MCP server; callers supply
team-specific priorities, alert sources, and noise filters via
the custom-instructions input.

Ported from meta-pytorch/pytorch-gha-infra's treehugger
workflow, with the methodology generalized so exact alert
queries and infrastructure context become caller-owned.

Ensures the skill files and hooks used at runtime always match the
version of the reusable workflow YAML being executed, regardless of
whether the caller pins to @main, a branch, or a tag.

Previous version hardcoded the gha-infra-specific
gha_workflow_claude_ec2_diagnostics role, which only trusts
meta-pytorch/pytorch-gha-infra via OIDC. Switch the default to
the shared gha_workflow_claude_code role (already trusted by
other pytorch/meta-pytorch repos that use the _claude-code.yml
reusable workflow) and expose a new aws-role input so callers
that need a narrower role (e.g. the gha-infra caller) can
override.

When the token isn't provided by the caller, skip wiring up the HUD
MCP server. The investigator still has gh/aws/bash tools. This
unblocks repos (like pytorch/ciforge) that want to test-drive the
reusable workflow without provisioning a HUD token first.

github.workflow_sha resolves to a SHA that actions/checkout can't
fetch for unreleased commits. Parse github.workflow_ref instead to
get a clean branch/tag name that tracks the caller's pinned ref.
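The ref parsing described in this commit message could look roughly like the sketch below. It assumes the documented `github.workflow_ref` shape (`owner/repo/.github/workflows/file.yml@refs/heads/branch`); the function name `ref_from_workflow_ref` is hypothetical and not taken from the PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reduce a workflow_ref string to a bare branch or tag name.
ref_from_workflow_ref() {
  local workflow_ref=$1
  local ref=${workflow_ref##*@}   # drop everything up to the last '@'
  ref=${ref#refs/heads/}          # strip the branch prefix, if present
  ref=${ref#refs/tags/}           # strip the tag prefix, if present
  printf '%s\n' "$ref"
}

ref_from_workflow_ref 'pytorch/test-infra/.github/workflows/_claude-infra-alerts.yml@refs/heads/main'
```

Parameter expansion keeps branch names that contain slashes intact, which a naive split on `/` would not.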
github.workflow_ref inside a reusable workflow refers to the caller's
workflow file, not the reusable one — so ref resolution was returning
the caller's branch (e.g. 'main' on ciforge) instead of the branch
the caller pinned on test-infra. Use github.job_workflow_sha instead,
which is documented as the commit SHA of the called reusable workflow.

GitHub doesn't expose the caller-pinned ref as a context variable
inside reusable workflows (github.job_workflow_sha resolved empty;
github.workflow_sha returned the caller's commit). Introduce a
test-infra-ref input so callers explicitly pin the test-infra
checkout to the same ref they're using for the workflow file.

1. validate.sh: rely on check-jsonschema exit code; stop matching the
   literal word "ok" in its output (stdout wording is not a stable API).
2. apply-actions: upsert_first_comment -> upsert_details_comment, keyed
   off a hidden HTML marker (`<!-- claude-infra-alerts-details -->`) so
   we don't clobber a human-authored first comment if a previous run
   crashed partway through or someone replied before we did an update.
3. Onboarding header comment: stop naming a specific IAM role that
   differs from the aws-role default. Describe it as "the IAM role this
   workflow assumes" and mention the override input for clarity.
4. Three hooks (restrict-write, validate-post-write, validate-on-stop)
   now set -euo pipefail, matching validate.sh. The Write-restriction
   hook needed this the most: a silent jq failure there could mask a
   bypass.
5. SKILL.md line about the JSON schema path: points at
   .claude/skills/claude-infra-alerts/actions-schema.json (the staged
   location Claude can actually read) instead of the source path inside
   the test-infra checkout.
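As a rough illustration of item 2, the marker-keyed lookup can be sketched as below. The helper name `find_details_comment` and the inline comment data are invented for this example; the real workflow would read comments through `gh`, which is not shown here.

```shell
#!/usr/bin/env bash
set -euo pipefail

MARKER='<!-- claude-infra-alerts-details -->'

# Hypothetical helper: given alternating "id body" arguments, print the id of
# the first comment whose body contains the hidden marker (prints nothing if none).
find_details_comment() {
  while (( $# >= 2 )); do
    if [[ $2 == *"$MARKER"* ]]; then
      printf '%s\n' "$1"
      return 0
    fi
    shift 2
  done
}

# Made-up sample data: a human reply first, then the automation's comment.
comment_id=$(find_details_comment \
  101 'Thanks, looking into this.' \
  102 "$MARKER"$'\nDetailed analysis...')

if [[ -n "$comment_id" ]]; then
  echo "would PATCH comment $comment_id"
else
  echo "would post a new details comment"
fi
```

Keying on the marker rather than on position means a human-authored first comment is never selected for overwrite.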
- apply-actions now re-validates Claude's JSON output against the
  canonical schema (via check-jsonschema) before any gh writes. Catches
  drift that the investigate-side Stop hook may miss if the outer
  claude-code-action wrapper terminates Claude without a clean stop.
- Label auto-creation: ensure_labels() is called before every
  gh issue create so the workflow is self-onboarding on new target
  repos. Labels we auto-create get a neutral grey color plus a
  description noting the automation.
- update_run_log: fetches issue comments once per call, surfaces
  failures from the fetch / PATCH / post paths as warnings instead of
  silently dropping audit log updates. Migrated the PATCH from curl to
  gh api --method PATCH for consistency.
- upsert_details_comment: refuses to upsert a whitespace-only body (a
  future schema regression must not wipe an existing details comment),
  and surfaces PATCH/post failures.
- Regex normalization: the two remaining `grep -o '[0-9]*\$'` call sites
  match the safer `grep -oE '[0-9]+\$'` convention used elsewhere.
- Tightened the investigate job's Bash(gh *) allowlist to read-only
  subcommands (gh issue view/list, gh pr view/list, gh search, gh run
  view/list, gh workflow view/list, gh label list, gh api). Token
  permissions already block writes at the API boundary; this narrows
  the prompt-injection surface further.
- Hoisted target_repo to an investigate-job output so apply-actions
  doesn't have to re-resolve it.
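The whitespace-only guard mentioned for upsert_details_comment might be written along these lines. This is a minimal sketch; `body_is_nonblank` is a hypothetical name and the PR's actual check may differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical guard: succeed only if the body has at least one
# non-whitespace character after all whitespace is stripped.
body_is_nonblank() {
  local body=$1
  [[ -n "${body//[[:space:]]/}" ]]
}

for candidate in $'  \n\t ' 'Root cause: runner disk full'; do
  if body_is_nonblank "$candidate"; then
    echo "would upsert details comment"
  else
    echo "refusing to upsert whitespace-only body"
  fi
done
```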
- apply-actions: download-artifact uses continue-on-error and downstream
  steps gate on its outcome, so the "no actions to apply" path no longer
  fails the job when Claude produced no output.
- update_run_log: on `gh issue view` failure, return early without
  writing instead of fabricating an empty comment list — the latter
  would orphan the existing audit-log comment by posting a fresh one.
- upsert_details_comment: distinguish "no marker comment yet" from
  "comment lookup failed". On lookup failure, abort the function so we
  never accidentally post a duplicate authoritative comment.
- gh issue create: capture stderr and exit status explicitly. set -e is
  unreliable inside `var=$(cmd)`; without this fix a failed create
  surfaces as a misleading grep error from the downstream issue-number
  parse instead of the real gh failure.
- ensure_labels: capture stderr and only swallow "already exists";
  surface auth/scope/validation errors as warnings instead of dropping
  them silently.
- Per-action gh writes (issue edit / comment / close) now have
  contextual error wrappers with the action index and issue number,
  so multi-action runs are debuggable from the log.
- sed substitution of <TARGET_REPO> / <MARKER_LABEL> in SKILL.md now
  escapes |, &, and \ in the replacement so labels containing those
  characters (legal on GitHub) don't corrupt the prompt.
- Allowlist: added gh repo view, gh release list, gh release view (all
  read-only).
- validate.sh: log a loud warning when falling back to the jq-based
  structural check, since that path doesn't enforce minLength/types.
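The replacement escaping described for the SKILL.md substitution can be sketched as follows, assuming `|` is the sed delimiter as the bullet implies. Both function names are hypothetical, and labels containing newlines are out of scope for this sketch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: escape the characters that are special in a sed
# replacement when '|' is the delimiter: backslash, '&', and '|' itself.
escape_sed_replacement() {
  printf '%s' "$1" | sed 's/[|&\]/\\&/g'
}

# Substitute <MARKER_LABEL> in a template line with a literal label value.
render_marker() {
  local template=$1 label=$2 escaped
  escaped=$(escape_sed_replacement "$label")
  printf '%s\n' "$template" | sed "s|<MARKER_LABEL>|${escaped}|g"
}

render_marker 'Search for label: <MARKER_LABEL>' 'infra|ci & alerts'
```

Without the escaping step, an `&` in the label would expand to the matched placeholder text and a `|` would terminate the sed expression early.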
Test-infra reviewers (code-reviewer + silent-failure-hunter) flagged 8
issues; this commit addresses all of them.

- Pre-install check-jsonschema in the investigate job, eliminating the
  jq-based fallback path in validate.sh. The fallback was visually
  loud (printed a stderr warning) but Claude Code Actions hooks don't
  surface stderr to the LLM on exit-0, so a silent install failure
  could downgrade to weak structural validation without anyone
  noticing. validate.sh now hard-errors when check-jsonschema is
  missing.
- gh issue create: capture stderr to a tempfile so success-path
  warnings (gh release notices, label warnings, etc.) can't pollute
  stdout and corrupt the parsed issue URL. Also match the URL line
  specifically rather than trusting that the last line ends in digits.
- All gh write paths (issue create / edit / comment / close, plus
  upsert_details_comment's new-comment branch) now use --body-file
  (or stdin) instead of --body "$VAR", consistent with the PATCH
  path. Avoids ARG_MAX / shell-quoting issues with long detailed
  analyses containing newlines or backticks.
- Labels are now passed as repeated --label flags built from a bash
  array, instead of a comma-separated string. GitHub permits commas
  in label names; the previous CSV approach silently fragmented them.
  ensure_labels signature changed accordingly (repo + array).
- update_run_log audit-trail failures now emit `::warning ::`
  annotations so they surface in the GitHub Actions run summary UI
  instead of getting buried in 50-minute Claude run logs. Same for
  the no-actions-artifact reporting and ensure_labels failures.
- ensure_labels failure case now uses ::warning :: too.
- "No actions artifact retrieved" message reworded to acknowledge the
  ambiguity (could be no upload OR a transient download failure) so
  operators know to check the investigate-job logs.
- upsert_details_comment failure during a create action is now
  non-fatal: the issue exists with summary-as-body, the next run's
  dedup-by-marker self-heals via an update action. Emit a ::warning ::
  with the issue URL so the partial state is visible. Update path
  remains fatal (the issue already exists, no self-heal needed).
- Tightened Bash(git *) allowlist to read-only subcommands (log,
  show, diff, blame, status, rev-parse, ls-files, remote -v / show,
  tag, branch -a). The previous broad git allow let Claude run git
  clone, git push, git config, etc. — none documented in the skill.

No new design changes; all of these tighten the existing design's
failure modes.
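The switch from a comma-separated string to repeated `--label` flags can be illustrated with a small sketch. `build_label_args` and `LABEL_ARGS` are hypothetical names; the point is only that each label stays one array element, so a comma inside a label name survives.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: turn newline-delimited labels into repeated
# --label flags, stored in the global array LABEL_ARGS.
build_label_args() {
  local raw=$1 label
  LABEL_ARGS=()
  while IFS= read -r label; do
    [[ -n "$label" ]] || continue      # skip blank lines
    LABEL_ARGS+=(--label "$label")
  done <<< "$raw"
}

build_label_args $'ai-filed\ninfra, ci'

# One argument per line; "infra, ci" stays a single argument.
# A real call would pass "${LABEL_ARGS[@]}" to `gh issue create`.
printf '%s\n' "${LABEL_ARGS[@]}"
```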
R3 reviewers flagged 8 residual issues, all addressed here:

- ::warning :: -> ::warning:: (no space). GitHub Actions parses the
  command name strictly; the trailing-space form was rendering as a
  plain log line instead of populating the run-summary annotations
  panel, defeating the round-2 goal of making partial failures
  visible. (6 occurrences updated.)
- upsert_details_comment / update_run_log: stage each
  `var=$(cmd | jq | grep)` lookup into separate variables with
  explicit jq error checks. In the combined-pipeline form, a
  mid-pipeline failure inside command substitution masks jq errors
  even with pipefail, silently degrading "lookup failed" into "no
  marker found", which would post a duplicate authoritative details
  comment (the exact failure mode upsert_details_comment is designed
  to prevent) or fork the audit-trail comment.
- mapfile label list: stage the jq output into LABELS_RAW with an
  explicit exit-code check, then mapfile from a here-string. Process
  substitution (`< <(jq …)`) discards jq exit codes even with
  pipefail. Belt-and-suspenders: assert MARKER_LABEL is in the
  resulting array — a partial array missing the marker would break
  dedup and produce duplicate-issue spam every hour.
- Audit log: when upsert_details_comment fails on create, the
  audit-log row is now tagged "create (details pending)" instead of
  the misleading "create". The run-UI ::warning:: is ephemeral; the
  audit comment is the long-term record.
- ensure_labels: aggregate per-label failures into a single
  ::warning:: at the end of the call instead of emitting one per
  label. GitHub Actions surfaces only the first ~10 warning
  annotations in the summary UI, so a busy run with many label
  failures could overflow. Per-label errors still go to the step log.
- gh issue create URL parse: tighten the URL regex to literal
  `/issues/<num>` so a stray release-notice URL on stdout (rare but
  possible across gh CLI versions) can't be picked up as the issue
  URL and corrupt downstream issue-number parsing.
- /tmp/tracking-comment.md: cleaned up after each update_run_log
  call, including on the early-return error paths.
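A minimal sketch of the aggregated-warning pattern for ensure_labels follows. `fake_label_create` is a stand-in invented for this example (it is not `gh label create`), and the message wording is illustrative rather than the PR's.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for label creation; the failure rules are purely illustrative.
fake_label_create() {
  case $1 in
    exists*) echo "label already exists" >&2; return 1 ;;
    bad*)    echo "HTTP 403: forbidden" >&2; return 1 ;;
  esac
}

ensure_labels_sketch() {
  local label err failed=()
  for label in "$@"; do
    if ! err=$(fake_label_create "$label" 2>&1); then
      # "already exists" is the only error that is safe to swallow.
      if [[ $err == *"already exists"* ]]; then
        continue
      fi
      echo "label '$label' failed: $err" >&2   # per-label detail goes to the step log
      failed+=("$label")
    fi
  done
  if (( ${#failed[@]} > 0 )); then
    local IFS=,
    # One annotation for the whole call; note there is no space inside ::warning::.
    echo "::warning::ensure_labels: ${#failed[@]} label(s) could not be created: ${failed[*]}"
  fi
}

ensure_labels_sketch exists-a bad-b ok-c bad-d
```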
- Fail fast when MARKER_LABEL is empty. workflow_call inputs marked
  required:true only enforce presence, not non-emptiness; an empty
  marker would make `grep -Fxq ""` always match (every line "contains"
  empty), defeating the round-3 marker-label-presence assertion that
  guards against duplicate-issue spam.
- Stage the audit-log row-extraction pipeline (sed -> grep -v -> grep)
  in update_run_log. Previously the trailing `|| true` masked sed/grep
  crashes anywhere upstream, silently truncating the audit trail. Now
  failures from sed and the header-strip grep emit warnings and skip
  the update; only the final data-row filter still uses `|| true`
  legitimately (empty match means "no prior rows", which is correct).
- Stage body_json=$(jq -Rs ...) calls in update_run_log and
  upsert_details_comment. A jq failure (invalid UTF-8 in body, etc.)
  would otherwise silently produce body_json="" and then post an empty
  PATCH that wipes content.
- Document the bash >= 4.4 requirement (for empty-array expansion
  under set -u). ubuntu-latest ships bash 5.x so this is currently
  enforced by the runner; the comment is for future portability.
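The empty-marker fail-fast and the presence assertion could be combined roughly as below. `marker_present` is a hypothetical name, and the label list is sample data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical check: succeed only if the marker is non-empty and appears as
# an exact, whole-line entry in the newline-delimited label list.
marker_present() {
  local marker=$1 labels_raw=$2
  if [[ -z "$marker" ]]; then
    echo "::error::marker-label must not be empty" >&2
    return 1
  fi
  grep -Fxq -- "$marker" <<< "$labels_raw"
}

labels=$'ai-filed\ninfra-alert'
if marker_present 'ai-filed' "$labels"; then
  echo "marker label present"
fi
```

Checking for emptiness before calling grep keeps the presence assertion meaningful regardless of how grep treats an empty pattern.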
R5 reviewers (code-reviewer, silent-failure-hunter, comment-analyzer)
flagged 6 issues — all addressed.

- Drop dead branch in update_run_log audit-log staging. The previous
  round added an `if !` wrapper around `... | grep -v ... || true`,
  but the trailing `|| true` makes the pipeline always exit 0, so the
  warning branch was structurally unreachable. Replace with a cleaner
  three-stage layout: stage 1 (sed range extract) is checked
  explicitly because a sed parse failure is always a real error;
  stages 2 and 3 (header strip + row filter) legitimately exit 1
  when filters match nothing, so they share a single `|| true`.
- Empty MARKER_LABEL fail-fast moved earlier — into the investigate
  job's "Stage skill and hooks" step. The previous check fired only
  in apply-actions, after the investigate job had already burned up
  to 50 minutes of Bedrock budget on a sed-corrupted skill.
- Real bug: validate.sh resolved its schema path via
  $(dirname "$0")/../skill/actions-schema.json, which works in the
  source repo layout but breaks at runtime because hooks are staged
  to .claude/hooks/claude-infra-alerts/ while the schema lives at
  .claude/skills/claude-infra-alerts/. The relative path resolved
  to a non-existent .claude/hooks/skill/... Stage a copy of the
  schema next to the hooks too, and update validate.sh to look for
  actions-schema.json in its own directory.
- Drop stale "weaker jq-based fallback" rationale from the YAML
  install-step comment and validate.sh header — that fallback was
  removed in round 2; the comments documented removed behavior.
- Fix --body-file comment that claimed "(stdin)" but the code uses
  a tempfile.
- SKILL.md "Available Tools" no longer claims unrestricted gh/git
  access; lists the actual read-only subcommands the workflow's
  allowlist permits.
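The three-stage layout from the first bullet can be sketched as follows. The marker strings, table header, and function name are invented for this example; only the shape (stage 1 checked explicitly, stages 2 and 3 sharing one `|| true`) follows the commit message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print the data rows of a markdown audit table that sits
# between two marker lines (the marker strings are invented for this sketch).
extract_audit_rows() {
  local body=$1 section
  # Stage 1: range extract. A sed failure is a real error, so check it.
  if ! section=$(printf '%s\n' "$body" | sed -n '/<!-- audit-start -->/,/<!-- audit-end -->/p'); then
    echo "::warning::audit-log extract failed; skipping update" >&2
    return 1
  fi
  # Stages 2 and 3: grep exits 1 when nothing matches, which only means
  # "no prior rows" here, so the pair shares a single `|| true`.
  printf '%s\n' "$section" | grep -v '^| Run ' | grep '^| ' || true
}

body=$'<!-- audit-start -->\n| Run | Action |\n|---|---|\n| 41 | create |\n| 42 | update |\n<!-- audit-end -->'
extract_audit_rows "$body"
```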
The tightened read-only-subcommand allowlist was defense-in-depth, but
the actual safety boundary is the GITHUB_TOKEN scope (issues:read in
the investigate job) and the read-only IAM role. Mutations get
rejected at the API boundary regardless of what's in the allowlist.

The narrower form was also a maintenance burden: every read pattern
Claude reasonably needed had to be enumerated, and missing entries
showed up as silent investigation degradations (Claude unable to run
gh repo view, gh release list, etc.).

Restore Bash(gh *) and Bash(git *). SKILL.md description updated to
match.
@ZainRizvi ZainRizvi force-pushed the claude-infra-alerts-reusable branch from 6ef3e84 to 650fece Compare May 7, 2026 20:06