Clustering analysis of 1,000 Copilot agent pull requests in github/gh-aw over the last 30 days (2026-05-12 → 2026-05-31). Prompts were extracted from PR titles + bodies, vectorized with TF-IDF (1–2 grams), and grouped with K-means; k=8 was selected by silhouette score.
PRs analyzed: 999 of 1,000 (1 skipped — empty body)
Iteration proxy: commits per PR (workflow turn-counts from logs were not reliably matchable per-PR — see Methodology)
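The summary says k was selected by silhouette score. As a rough illustration of what that score measures, the sketch below computes the mean silhouette coefficient in pure Python and picks the candidate k with the highest value. The function names (`silhouette_score`, `pick_k`), the Euclidean distance metric, and the toy 2-D points are assumptions for illustration; the report does not say which library or metric was used, and real TF-IDF vectors would be much higher-dimensional.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (O(n^2), fine for a demo)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


def pick_k(points, labelings):
    """Given {k: labels} from separate clustering runs, return (best_k, scores)."""
    scores = {k: silhouette_score(points, labels) for k, labels in labelings.items()}
    return max(scores, key=scores.get), scores


# Toy data (not from the report): two tight, well-separated groups.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
best_k, toy_scores = pick_k(toy_points, {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]})
```

Scores near 1 mean tight, well-separated clusters, while scores near 0 mean overlapping ones, which is why the 0.10 reported under Methodology is read as "themes, not hard boundaries".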
Key Findings
Two themes account for about two-fifths of all agent work. "Runtime fixes & generated code" (22.0%) and "Safe-outputs / error handling" (19.5%) together make up ~41.5% of PRs. Safe-output work merges at a high 88.7%, while broader runtime fixes lag at 75.9%: the more diffuse the task, the lower the hit rate.
Smoke-test / engine-config tasks are the weak spot. Lowest success (59.6%), highest churn (avg 83 files changed, +1,244/−953 lines). These are large, cross-cutting changes where the agent's PR is most likely to be closed without merging.
The PR Sous Chef workflow is the most iterative task type — avg 8.2 commits/PR, roughly double the corpus average (4.4), yet still merges at 88.9%. High iteration here correlates with success, not failure.
Focused, well-scoped tasks win. Prompt/token-optimization (86.4%) and safe-outputs (88.7%) clusters have the smallest diffs and merge rates among the top three (only PR Sous Chef, at 88.9%, is higher). Narrow scope ⇒ higher success.
Recommendations
Tighten scope on smoke-test / engine-config tasks. This cluster's 59.6% merge rate and 83-file average diff suggest prompts are under-specified for cross-cutting changes. Split these into smaller, file-scoped sub-tasks before handing them to the agent.
Codify the safe-outputs / prompt-optimization pattern. These narrow, high-success clusters (86–89%) are the template: small diffs, clear acceptance criteria. Steer ambiguous "runtime fix" requests toward this shape.
High iteration is not inherently bad. PR Sous Chef averages 8.2 commits yet merges at 88.9% — don't cap iterations blindly; cap them where churn correlates with closure (smoke/engine config), not where it correlates with success.
Investigate the 194 closed-unmerged PRs, concentrated in the lower-success clusters, to recover recurring failure modes (likely scope-too-large and CI flakiness).
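One way to decide where an iteration cap makes sense is to compare, per cluster, how many commits merged PRs took against how many closed-unmerged PRs took. The sketch below is a minimal version of that comparison. The record fields (`cluster`, `commits`, `merged`) and the sample rows are hypothetical and are not taken from copilot-prs.json.

```python
from collections import defaultdict
from statistics import mean


def commit_gap_by_cluster(prs):
    """Per cluster: mean commits for merged PRs, for unmerged PRs, and the gap.

    A positive gap means unmerged PRs took more commits on average, i.e.
    extra iteration is associated with closure in that cluster.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for pr in prs:
        groups[pr["cluster"]][bool(pr["merged"])].append(pr["commits"])

    report = {}
    for cluster, split in groups.items():
        if not split[True] or not split[False]:
            continue  # both outcomes are needed for a comparison
        merged_avg = mean(split[True])
        unmerged_avg = mean(split[False])
        report[cluster] = {
            "merged_avg": merged_avg,
            "unmerged_avg": unmerged_avg,
            "gap": unmerged_avg - merged_avg,
        }
    return report


# Made-up rows for illustration only.
sample = [
    {"cluster": "smoke-engine", "commits": 2, "merged": True},
    {"cluster": "smoke-engine", "commits": 4, "merged": True},
    {"cluster": "smoke-engine", "commits": 9, "merged": False},
    {"cluster": "sous-chef", "commits": 10, "merged": True},
    {"cluster": "sous-chef", "commits": 4, "merged": False},
]
gaps = commit_gap_by_cluster(sample)
```

Under this reading, a cap would only be considered for clusters with a clearly positive gap. A difference in means is a coarse signal, so it is a starting point for the investigation rather than a decision rule.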
Methodology & limitations
Source: copilot-prs.json (1,000 PRs, refreshed 2026-05-31) enriched with per-PR full data (comments/reviews/commits/files); all 1,000 had full metrics available.
Text: PR title + body, code fences/URLs/issue-refs stripped, lowercased; domain boilerplate added to stopwords.
Clustering: K-means, k chosen by silhouette over k∈[3,8]. Silhouette is low (0.10) — expected for short heterogeneous PR text; clusters are keyword-derived themes, not hard boundaries. Some centroid keywords (e.g. "firewall") reflect a dominant sub-template rather than every member.
Iteration proxy: commit count per PR. Direct workflow turn-counts (aw_info.json) were not joined — matching 1,000 PRs to individual workflow runs by timestamp is unreliable, so this was intentionally omitted rather than approximated.
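The text-preparation step described above (title + body, with code fences, URLs and issue references stripped, then lowercased and filtered through extra stopwords) could look roughly like the sketch below. The regular expressions, the `clean_prompt` name, and the contents of `EXTRA_STOPWORDS` are assumptions; the report does not list its actual patterns or its domain stopword list.

```python
import re

FENCE = "`" * 3  # built indirectly so the example can sit inside a fenced block
CODE_FENCE_RE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
URL_RE = re.compile(r"https?://\S+")
# Matches "#123" and "owner/repo#123" style references.
ISSUE_REF_RE = re.compile(r"(?:[\w.-]+/[\w.-]+)?#\d+")
# Hypothetical domain boilerplate; the real list is not given in the report.
EXTRA_STOPWORDS = {"copilot", "agent", "pr"}


def clean_prompt(title, body):
    """Return a lowercased, space-joined token string for vectorization."""
    text = f"{title}\n{body or ''}"
    text = CODE_FENCE_RE.sub(" ", text)  # drop fenced code blocks
    text = URL_RE.sub(" ", text)         # drop URLs
    text = ISSUE_REF_RE.sub(" ", text)   # drop issue/PR references
    tokens = re.findall(r"[a-z][a-z0-9_-]*", text.lower())
    return " ".join(t for t in tokens if t not in EXTRA_STOPWORDS)


example_body = (
    "See https://example.com/x\n"
    + FENCE + "go\nfmt.Println(1)\n" + FENCE
    + "\nCopilot agent retry"
)
cleaned = clean_prompt("Fix safe-outputs in #33350", example_body)
```

The cleaned string is what would then be handed to a TF-IDF vectorizer with 1–2 grams. Keeping hyphenated tokens such as `safe-outputs` intact is a design choice of this sketch, not something the report specifies.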
Cluster Analysis
Task complexity by cluster (chart)
Cluster theme details & representative PRs
Runtime fixes & generated code — 220 PRs · 75.9% success
General bug/runtime fixes touching generated code and model/runtime config. Largest cluster, broadest scope, mid-pack success. Examples: #31917, #35286, #35773.
Safe-outputs / error handling — 195 PRs · 88.7% success
Safe-output paths, error handling, behavior/coverage. Tight diffs (avg 18 files), second-largest cluster, and high success. Examples: #33350, #32273, #32655.
Shared package refactors — 183 PRs · 77.6% success
Refactors into shared packages/helpers; very large additions (avg +9,227 lines, inflated by generated-code PRs). Examples: #32117, #35778, #36006.
Prompt & token optimization — 162 PRs · 86.4% success
Agent prompt tuning, token/turn reduction, guidance edits. Smallest diffs, third-highest success. Examples: #34874, #35817, #35650.
Firewall / network egress rules — 94 PRs · 74.5% success
Network/egress allow-list, triggering-command and MCP network schema work. Examples: #33240, #33386, #33683.
Smoke tests & engine config — 57 PRs · 59.6% success
Smoke tests, Claude/engine config, domain allow-lists. Lowest success, highest churn. Examples: #33273, #33852, #35802.
PR Sous Chef workflow — 54 PRs · 88.9% success
Iterations on the PR Sous Chef workflow. Most commits/PR (8.2) but high merge rate. Examples: #36088 and related Sous Chef PRs.
CI / failing Actions fixes — 34 PRs · 76.5% success
Fixing failing GitHub Actions jobs. Smallest, lowest-iteration cluster (avg 3.0 commits).
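The cluster sizes and success rates listed above are enough to cross-check the headline percentages. The short calculation below uses only those published numbers. The per-cluster merged counts are back-calculated by rounding size × rate, so they are estimates rather than figures from the report.

```python
# (PR count, success rate in %) per cluster, as listed above.
CLUSTERS = {
    "Runtime fixes & generated code": (220, 75.9),
    "Safe-outputs / error handling": (195, 88.7),
    "Shared package refactors": (183, 77.6),
    "Prompt & token optimization": (162, 86.4),
    "Firewall / network egress rules": (94, 74.5),
    "Smoke tests & engine config": (57, 59.6),
    "PR Sous Chef workflow": (54, 88.9),
    "CI / failing Actions fixes": (34, 76.5),
}

total_prs = sum(count for count, _ in CLUSTERS.values())  # 999

# Share of the corpus per cluster, in percent.
share = {name: 100 * count / total_prs for name, (count, _) in CLUSTERS.items()}
top_two_share = (
    share["Runtime fixes & generated code"] + share["Safe-outputs / error handling"]
)

# Estimated merged PRs per cluster (rounded back-calculation, an assumption).
est_merged = {name: round(count * rate / 100) for name, (count, rate) in CLUSTERS.items()}
overall_rate = 100 * sum(est_merged.values()) / total_prs

print(f"total={total_prs} top_two={top_two_share:.1f}% overall~{overall_rate:.1f}%")
```

The sizes sum to the 999 analyzed PRs, and the two largest clusters together come to about 41.5% of the corpus.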
Representative PRs per cluster (data table)
label_command routing via `agentic_comma…
gh aw init to create the Agentic Workflows custom a…
aw.yml package resolution to `gh aw ad…
setup-gh-aw install idempotent when gh-aw is alread…
network schema deprecation semantics with…
on.pull_request_reviewer: slash_command synthetic trig…
create-check-run safe output type for multi-agent PR a…
checkout.clean-git-credentials to support submodule-sa…
References: §26710025727