Add reusable claude-infra-alerts workflow#8006
Open
ZainRizvi wants to merge 16 commits into
Open
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
3da89b1 to
9c11bc4
Compare
Introduces a generic, reusable GitHub Actions workflow at .github/workflows/_claude-infra-alerts.yml that any pytorch-owned repo can call to run a Claude Code investigation over CI and infrastructure alerts and file structured diagnostic issues. Methodology, JSON action schema, and Write-guard hooks live under tools/claude-infra-alerts/. The workflow hardcodes the bedrock environment, model, IAM role, and HUD MCP server; callers supply team-specific priorities, alert sources, and noise filters via the custom-instructions input. Ported from meta-pytorch/pytorch-gha-infra's treehugger workflow, with the methodology generalized so exact alert queries and infrastructure context become caller-owned.
Ensures the skill files and hooks used at runtime always match the version of the reusable workflow YAML being executed, regardless of whether the caller pins to @main, a branch, or a tag.
Previous version hardcoded the gha-infra-specific gha_workflow_claude_ec2_diagnostics role, which only trusts meta-pytorch/pytorch-gha-infra via OIDC. Switch the default to the shared gha_workflow_claude_code role (already trusted by other pytorch/meta-pytorch repos that use the _claude-code.yml reusable workflow) and expose a new aws-role input so callers that need a narrower role (e.g. the gha-infra caller) can override.
When the token isn't provided by the caller, skip wiring up the HUD MCP server. The investigator still has gh/aws/bash tools. This unblocks repos (like pytorch/ciforge) that want to test-drive the reusable workflow without provisioning a HUD token first.
github.workflow_sha resolves to a SHA that actions/checkout can't fetch for unreleased commits. Parse github.workflow_ref instead to get a clean branch/tag name that tracks the caller's pinned ref.
github.workflow_ref inside a reusable workflow refers to the caller's workflow file, not the reusable one — so ref resolution was returning the caller's branch (e.g. 'main' on ciforge) instead of the branch the caller pinned on test-infra. Use github.job_workflow_sha instead, which is documented as the commit SHA of the called reusable workflow.
GitHub doesn't expose the caller-pinned ref as a context variable inside reusable workflows (github.job_workflow_sha resolved empty; github.workflow_sha returned the caller's commit). Introduce a test-infra-ref input so callers explicitly pin the test-infra checkout to the same ref they're using for the workflow file.
1. validate.sh: rely on check-jsonschema exit code; stop matching the literal word "ok" in its output (stdout wording is not a stable API). 2. apply-actions: upsert_first_comment -> upsert_details_comment, keyed off a hidden HTML marker (`<!-- claude-infra-alerts-details -->`) so we don't clobber a human-authored first comment if a previous run crashed partway through or someone replied before we did an update. 3. Onboarding header comment: stop naming a specific IAM role that differs from the aws-role default. Describe it as "the IAM role this workflow assumes" and mention the override input for clarity. 4. Three hooks (restrict-write, validate-post-write, validate-on-stop) now set -euo pipefail, matching validate.sh. The Write-restriction hook needed this the most: a silent jq failure there could mask a bypass. 5. SKILL.md line about the JSON schema path: points at .claude/skills/claude-infra-alerts/actions-schema.json (the staged location Claude can actually read) instead of the source path inside the test-infra checkout.
- apply-actions now re-validates Claude's JSON output against the canonical schema (via check-jsonschema) before any gh writes. Catches drift that the investigate-side Stop hook may miss if the outer claude-code-action wrapper terminates Claude without a clean stop. - Label auto-creation: ensure_labels() is called before every gh issue create so the workflow is self-onboarding on new target repos. Labels we auto-create get a neutral grey color plus a description noting the automation. - update_run_log: fetches issue comments once per call, surfaces failures from the fetch / PATCH / post paths as warnings instead of silently dropping audit log updates. Migrated the PATCH from curl to gh api --method PATCH for consistency. - upsert_details_comment: refuses to upsert a whitespace-only body (a future schema regression must not wipe an existing details comment), and surfaces PATCH/post failures. - Regex normalization: the two remaining `grep -o '[0-9]*\$'` call sites match the safer `grep -oE '[0-9]+\$'` convention used elsewhere. - Tightened the investigate job's Bash(gh *) allowlist to read-only subcommands (gh issue view/list, gh pr view/list, gh search, gh run view/list, gh workflow view/list, gh label list, gh api). Token permissions already block writes at the API boundary; this narrows the prompt-injection surface further. - Hoisted target_repo to an investigate-job output so apply-actions doesn't have to re-resolve it.
- apply-actions: download-artifact uses continue-on-error and downstream steps gate on its outcome, so the "no actions to apply" path no longer fails the job when Claude produced no output. - update_run_log: on `gh issue view` failure, return early without writing instead of fabricating an empty comment list — the latter would orphan the existing audit-log comment by posting a fresh one. - upsert_details_comment: distinguish "no marker comment yet" from "comment lookup failed". On lookup failure, abort the function so we never accidentally post a duplicate authoritative comment. - gh issue create: capture stderr and exit status explicitly. set -e is unreliable inside `var=$(cmd)`; without this fix a failed create surfaces as a misleading grep error from the downstream issue-number parse instead of the real gh failure. - ensure_labels: capture stderr and only swallow "already exists"; surface auth/scope/validation errors as warnings instead of dropping them silently. - Per-action gh writes (issue edit / comment / close) now have contextual error wrappers with the action index and issue number, so multi-action runs are debuggable from the log. - sed substitution of <TARGET_REPO> / <MARKER_LABEL> in SKILL.md now escapes |, &, and \ in the replacement so labels containing those characters (legal on GitHub) don't corrupt the prompt. - Allowlist: added gh repo view, gh release list, gh release view (all read-only). - validate.sh: log a loud warning when falling back to the jq-based structural check, since that path doesn't enforce minLength/types.
Test-infra reviewers (code-reviewer + silent-failure-hunter) flagged 8 issues; this commit addresses all of them. - Pre-install check-jsonschema in the investigate job, eliminating the jq-based fallback path in validate.sh. The fallback was visually loud (printed a stderr warning) but Claude Code Actions hooks don't surface stderr to the LLM on exit-0, so a silent install failure could downgrade to weak structural validation without anyone noticing. validate.sh now hard-errors when check-jsonschema is missing. - gh issue create: capture stderr to a tempfile so success-path warnings (gh release notices, label warnings, etc.) can't pollute stdout and corrupt the parsed issue URL. Also match the URL line specifically rather than trusting that the last line ends in digits. - All gh write paths (issue create / edit / comment / close, plus upsert_details_comment's new-comment branch) now use --body-file (or stdin) instead of --body "$VAR", consistent with the PATCH path. Avoids ARG_MAX / shell-quoting issues with long detailed analyses containing newlines or backticks. - Labels are now passed as repeated --label flags built from a bash array, instead of a comma-separated string. GitHub permits commas in label names; the previous CSV approach silently fragmented them. ensure_labels signature changed accordingly (repo + array). - update_run_log audit-trail failures now emit `::warning ::` annotations so they surface in the GitHub Actions run summary UI instead of getting buried in 50-minute Claude run logs. Same for the no-actions-artifact reporting and ensure_labels failures. - ensure_labels failure case now uses ::warning :: too. - "No actions artifact retrieved" message reworded to acknowledge the ambiguity (could be no upload OR a transient download failure) so operators know to check the investigate-job logs. - upsert_details_comment failure during a create action is now non-fatal: the issue exists with summary-as-body, the next run's dedup-by-marker self-heals via an update action. Emit a ::warning :: with the issue URL so the partial state is visible. Update path remains fatal (the issue already exists, no self-heal needed). - Tightened Bash(git *) allowlist to read-only subcommands (log, show, diff, blame, status, rev-parse, ls-files, remote -v / show, tag, branch -a). The previous broad git allow let Claude run git clone, git push, git config, etc. — none documented in the skill. No new design changes; all of these tighten the existing design's failure modes.
R3 reviewers flagged 8 residual issues, all addressed here: - ::warning :: -> ::warning:: (no space). GitHub Actions parses the command name strictly; the trailing-space form was rendering as a plain log line instead of populating the run-summary annotations panel, defeating the round-2 goal of making partial failures visible. (6 occurrences updated.) - upsert_details_comment / update_run_log: stage each `var=$(cmd | jq | grep)` lookup into separate variables with explicit jq error checks. The combined-pipeline form mid-pipeline failures inside command substitution mask jq errors even with pipefail, silently degrading "lookup failed" into "no marker found" — which would post a duplicate authoritative details comment (the exact mode upsert_details_comment is designed to prevent) or fork the audit-trail comment. - mapfile label list: stage the jq output into LABELS_RAW with an explicit exit-code check, then mapfile from a here-string. Process substitution (`< <(jq …)`) discards jq exit codes even with pipefail. Belt-and-suspenders: assert MARKER_LABEL is in the resulting array — a partial array missing the marker would break dedup and produce duplicate-issue spam every hour. - Audit log: when upsert_details_comment fails on create, the audit-log row is now tagged "create (details pending)" instead of the misleading "create". The run-UI ::warning:: is ephemeral; the audit comment is the long-term record. - ensure_labels: aggregate per-label failures into a single ::warning:: at the end of the call instead of emitting one per label. GitHub Actions surfaces only the first ~10 warning annotations in the summary UI, so a busy run with many label failures could overflow. Per-label errors still go to the step log. - gh issue create URL parse: tighten the URL regex to literal `/issues/<num>` so a stray release-notice URL on stdout (rare but possible across gh CLI versions) can't be picked up as the issue URL and corrupt downstream issue-number parsing. - /tmp/tracking-comment.md: cleaned up after each update_run_log call, including on the early-return error paths.
- Fail fast when MARKER_LABEL is empty. workflow_call inputs marked required:true only enforce presence, not non-emptiness; an empty marker would make `grep -Fxq ""` always match (every line "contains" empty), defeating the round-3 marker-label-presence assertion that guards against duplicate-issue spam. - Stage the audit-log row-extraction pipeline (sed -> grep -v -> grep) in update_run_log. Previously the trailing `|| true` masked sed/grep crashes anywhere upstream, silently truncating the audit trail. Now failures from sed and the header-strip grep emit warnings and skip the update; only the final data-row filter still uses `|| true` legitimately (empty match means "no prior rows", which is correct). - Stage body_json=$(jq -Rs ...) calls in update_run_log and upsert_details_comment. A jq failure (invalid UTF-8 in body, etc.) would otherwise silently produce body_json="" and then post an empty PATCH that wipes content. - Document the bash >= 4.4 requirement (for empty-array expansion under set -u). ubuntu-latest ships bash 5.x so this is currently enforced by the runner; the comment is for future portability.
R5 reviewers (code-reviewer, silent-failure-hunter, comment-analyzer) flagged 6 issues — all addressed. - Drop dead branch in update_run_log audit-log staging. The previous round added an `if !` wrapper around `... | g_rep -v ... || true`, but the trailing `|| true` makes the pipeline always exit 0, so the warning branch was structurally unreachable. Replace with a cleaner three-stage layout: stage 1 (sed range extract) is checked explicitly because a sed parse failure is always a real error; stages 2 and 3 (header strip + row filter) legitimately exit 1 when filters match nothing, so they share a single `|| true`. - Empty MARKER_LABEL fail-fast moved earlier — into the investigate job's "Stage skill and hooks" step. The previous check fired only in apply-actions, after the investigate job had already burned up to 50 minutes of Bedrock budget on a sed-corrupted skill. - Real bug: validate.sh resolved its schema path via $(dirname "$0")/../skill/actions-schema.json, which works in the source repo layout but breaks at runtime because hooks are staged to .claude/hooks/claude-infra-alerts/ while the schema lives at .claude/skills/claude-infra-alerts/. The relative path resolved to a non-existent .claude/hooks/skill/... Stage a copy of the schema next to the hooks too, and update validate.sh to look for actions-schema.json in its own directory. - Drop stale "weaker jq-based fallback" rationale from the YAML install-step comment and validate.sh header — that fallback was removed in round 2; the comments documented removed behavior. - Fix --body-file comment that claimed "(stdin)" but the code uses a tempfile. - SKILL.md "Available Tools" no longer claims unrestricted gh/git access; lists the actual read-only subcommands the workflow's allowlist permits.
The tightened read-only-subcommand allowlist was defense-in-depth, but the actual safety boundary is the GITHUB_TOKEN scope (issues:read in the investigate job) and the read-only IAM role. Mutations get rejected at the API boundary regardless of what's in the allowlist. The narrower form was also a maintenance burden: every read pattern Claude reasonably needed had to be enumerated, and missing entries showed up as silent investigation degradations (Claude unable to run gh repo view, gh release list, etc.). Restore Bash(gh *) and Bash(git *). SKILL.md description updated to match.
6ef3e84 to
650fece
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
.github/workflows/_claude-infra-alerts.yml, a reusable workflow anypytorch/meta-pytorchrepo can call to run a Claude Code investigation over CI and infrastructure alerts and file structured diagnostic issues against a target repo.Write-guard hooks live undertools/claude-infra-alerts/. The workflow hardcodes thebedrockenvironment, model, IAM role, and the HUD MCP server; callers supply team-specific priorities, alert sources, and noise filters via thecustom-instructionsinput.meta-pytorch/pytorch-gha-infra'sclaude-infra-investigationsworkflow; the generic version keeps the high-level methodology (gather/dedup/investigate, root-cause attribution, JSON action schema) while moving team-specific alert queries and infra context out to callers.Inputs
custom-instructionsmarker-labelcreate.target-repoorg/repowhere issues are filed. Apply step rejects mismatches.default-labelscreate. Not shown to Claude.setup-commandInherited secret:
HUD_INTERNAL_BOT_TOKEN.Test plan
pytorch/ciforgetargetingpytorch/ciforgeas the issue destination, merge it, and manually trigger the workflow to confirm the reusable workflow resolves, the Claude investigation runs end-to-end, andapply-actionsposts an issue against ciforge.meta-pytorch/pytorch-gha-infraadds a caller that files againstpytorch/pytorch.Notes
@mainforpytorch/test-infracheckout per current guidance; switch to a tag once there's a second caller.gha_workflow_claude_ec2_diagnosticsneeds its OIDC trust policy to list every caller repo. Currently onlymeta-pytorch/pytorch-gha-infrais listed;pytorch/ciforge(and latermeta-pytorch/pytorch-gha-infraagain) need to be added before the caller workflow will succeed.