[prompt-clustering] Copilot Agent Prompt Clustering — 30-Day Analysis (1,000 PRs, 8 task themes) #36103

2026-05-31T10:45:30Z

github-actions[bot]
Bot May 31, 2026

Summary

Clustering analysis of 1,000 Copilot agent pull requests in github/gh-aw over the last 30 days (2026-05-12 → 2026-05-31). Prompts were extracted from PR titles + bodies, vectorized with TF-IDF (1–2 grams), and grouped with K-means; k=8 was selected by silhouette score.

PRs analyzed: 999 of 1,000 (1 skipped — empty body)
Overall merge success rate: 80.1% (800 merged · 194 closed · 5 open)
Clusters identified: 8 coherent task themes
Iteration proxy: commits per PR (workflow turn-counts from logs were not reliably matchable per-PR — see Methodology)
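
The headline merge rate can be reproduced from the three outcome counts. A minimal check, assuming the rate is merged PRs divided by all analyzed PRs (open ones included); the variable names are illustrative only:

```python
# Outcome counts as reported in the Summary.
merged, closed, still_open = 800, 194, 5

# Every analyzed PR is in exactly one of the three states.
analyzed = merged + closed + still_open

# Merge rate in percent, rounded to one decimal like the report.
merge_rate = round(100 * merged / analyzed, 1)

print(f"{analyzed} PRs analyzed, {merge_rate}% merged")
```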

Key Findings

Two themes account for roughly two-fifths of all agent work. "Runtime fixes & generated code" (22.0%) and "Safe-outputs / error handling" (19.5%) together cover about 41.5% of PRs. Safe-output work merges at 88.7%, while the broader runtime fixes lag at 75.9%, which is consistent with more diffuse tasks having a lower hit rate.
Smoke-test / engine-config tasks are the weak spot. They have the lowest success rate (59.6%) and the most files changed per PR (avg 83.3, +1244/−953 lines). These are large, cross-cutting changes, and this is where agent PRs are most likely to be closed without merging.
The PR Sous Chef workflow is the most iterative task type: avg 8.2 commits/PR, roughly double the corpus average (4.4), yet it still merges at 88.9%. In this cluster, high iteration goes along with success, not failure.
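Focused, well-scoped tasks win. The prompt/token-optimization (86.4%) and safe-outputs (88.7%) clusters combine small diffs (avg 17 to 18 files, roughly +240 to +250 added lines) with merge rates near the top of the table; only PR Sous Chef (88.9%) is higher. Narrow scope goes with higher success.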

Cluster Analysis

Cluster (theme)	PRs	% of PRs	Success	Avg commits	Avg files	Avg lines +/-	Top keywords
Runtime fixes & generated code	220	22.0%	75.9%	3.9	48.9	+577/-439	fix, generated, updated, runtime, model
Safe-outputs / error handling	195	19.5%	88.7%	4.9	18.3	+251/-73	safe, output, path, behavior, coverage
Shared package refactors	183	18.3%	77.6%	4.0	46.3	+9227/-401	shared, package, helper, function, removed
Prompt & token optimization	162	16.2%	86.4%	3.7	17.2	+238/-116	prompt, agent, experiment, guidance, token
Firewall / network egress rules	94	9.4%	74.5%	4.1	15.9	+246/-191	block triggering, triggering command, http block, command http, triggering
Smoke tests & engine config	57	5.7%	59.6%	5.3	83.3	+1244/-953	claude, smoke claude, domains, smoke, id
PR Sous Chef workflow	54	5.4%	88.9%	8.2	38.9	+676/-194	sous chef, sous, chef, id, gpt
CI / failing Actions fixes	34	3.4%	76.5%	3.0	39.9	+204/-46	failing github, actions job, fix failing, github actions, job
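
The table can be cross-checked against the headline figures. The sketch below copies four columns by hand from the rows above and recomputes the PR total, the PR-weighted averages, and the combined share of the two largest clusters. The weighting by PR count is an assumption about how the corpus averages were derived, and the variable names are illustrative.

```python
# (theme, PRs, success %, avg commits), copied from the cluster table.
clusters = [
    ("Runtime fixes & generated code", 220, 75.9, 3.9),
    ("Safe-outputs / error handling", 195, 88.7, 4.9),
    ("Shared package refactors", 183, 77.6, 4.0),
    ("Prompt & token optimization", 162, 86.4, 3.7),
    ("Firewall / network egress rules", 94, 74.5, 4.1),
    ("Smoke tests & engine config", 57, 59.6, 5.3),
    ("PR Sous Chef workflow", 54, 88.9, 8.2),
    ("CI / failing Actions fixes", 34, 76.5, 3.0),
]

total_prs = sum(n for _, n, _, _ in clusters)

# Weight each cluster's average by its PR count to get corpus-level values.
weighted_commits = sum(n * c for _, n, _, c in clusters) / total_prs
weighted_success = sum(n * s for _, n, s, _ in clusters) / total_prs

# Combined share of the two largest clusters (first two rows).
top_two_share = 100 * (clusters[0][1] + clusters[1][1]) / total_prs

print(total_prs, round(weighted_commits, 1),
      round(weighted_success, 1), round(top_two_share, 1))
```

Under that weighting, the rows sum to the 999 analyzed PRs and reproduce the 4.4 commits/PR corpus average and the 80.1% overall merge rate quoted elsewhere in the report.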

Task complexity by cluster (chart)

Cluster theme details & representative PRs

Runtime fixes & generated code — 220 PRs · 75.9% success

General bug/runtime fixes touching generated code and model/runtime config. Largest cluster, broadest scope, mid-pack success. Examples: #31917, #35286, #35773.

Safe-outputs / error handling — 195 PRs · 88.7% success

Safe-output paths, error handling, behavior/coverage. Tight diffs (avg 18 files); second-largest cluster with a high success rate. Examples: #33350, #32273, #32655.

Shared package refactors — 183 PRs · 77.6% success

Refactors into shared packages/helpers; very large additions (avg +9,227 lines, inflated by generated-code PRs). Examples: #32117, #35778, #36006.

Prompt & token optimization — 162 PRs · 86.4% success

Agent prompt tuning, token/turn reduction, guidance edits. Small diffs (avg 17 files), third-highest success after PR Sous Chef and safe-outputs. Examples: #34874, #35817, #35650.

Firewall / network egress rules — 94 PRs · 74.5% success

Network/egress allow-list, triggering-command and MCP network schema work. Examples: #33240, #33386, #33683.

Smoke tests & engine config — 57 PRs · 59.6% success

Smoke tests, Claude/engine config, domain allow-lists. Lowest success, most files changed per PR (avg 83). Examples: #33273, #33852, #35802.

PR Sous Chef workflow — 54 PRs · 88.9% success

Iterations on the PR Sous Chef workflow. Most commits/PR (8.2) but high merge rate. Examples: #36088 and related Sous Chef PRs.

CI / failing Actions fixes — 34 PRs · 76.5% success

Fixing failing GitHub Actions jobs. Smallest, lowest-iteration cluster (avg 3.0 commits).

Representative PRs per cluster (data table)

PR #	Title	Cluster	Outcome	Commits	Files
#31917	Add decentralized `label_command` routing via `agentic_comma	Runtime fixes & generated code	Merged	22	30
#35286	Centralize compiler enterprise env controls, expand GH_AW_DE	Runtime fixes & generated code	Merged	19	244
#35773	Update `gh aw init` to create the Agentic Workflows custom a	Runtime fixes & generated code	Merged	17	10
#33350	feat(safe-outputs): add required-labels/required-title-prefi	Safe-outputs / error handling	Merged	28	70
#32273	Allow unsigned PR branch pushes when signed commits are disa	Safe-outputs / error handling	Merged	24	17
#32655	Add title-based deduplication to create_issue safe outputs (	Safe-outputs / error handling	Merged	18	14
#32117	Add repository-root `aw.yml` package resolution to `gh aw ad	Shared package refactors	Merged	31	21
#35778	Extend aw.yml to support includes, skills, and agents	Shared package refactors	Merged	13	10
#36006	refactor(pkg/cli): partial decomposition of largefunc linter	Shared package refactors	Closed	11	5272
#34874	Add inline skill extraction/runtime support mirroring inline	Prompt & token optimization	Merged	18	29
#35817	Refactor Agentic Workflows routing: move dispatch index to s	Prompt & token optimization	Merged	12	5
#35650	docs: executive-first daily outcome report with lifecycle he	Prompt & token optimization	Merged	11	4
#33240	Make `setup-gh-aw` install idempotent when `gh-aw` is alread	Firewall / network egress rules	Merged	16	44
#33386	Align stdio MCP `network` schema deprecation semantics with	Firewall / network egress rules	Closed	15	7
#33683	refactor(parser): break up oversized functions in pkg/parser	Firewall / network egress rules	Closed	13	7
#33273	Add `on.pull_request_reviewer: slash_command` synthetic trig	Smoke tests & engine config	Merged	39	257
#33852	Add `create-check-run` safe output type for multi-agent PR a	Smoke tests & engine config	Merged	17	29
#35802	[awf] Fix tool-cache mount handling, smoke-pi runtime config	Smoke tests & engine config	Merged	17	254
#32819	Add `checkout.clean-git-credentials` to support submodule-sa	PR Sous Chef workflow	Merged	29	13
#34554	Migrate the Gemini engine to Antigravity CLI	PR Sous Chef workflow	Closed	19	446
#34837	Move model alias/multiplier propagation from step env to act	PR Sous Chef workflow	Merged	18	258
#33165	[WIP] Fix failing GitHub Actions job JS Tests (shard 4/4)	CI / failing Actions fixes	Merged	6	3
#35200	[WIP] Fix failing GitHub Actions job Integration: CLI MCP Ot	CI / failing Actions fixes	Merged	6	2
#32096	fix: mask otlp headers	CI / failing Actions fixes	Merged	6	2

Recommendations

Tighten scope on smoke-test / engine-config tasks. This cluster's 59.6% merge rate and 83-file average diff suggest prompts are under-specified for cross-cutting changes. Split these into smaller, file-scoped sub-tasks before handing them to the agent.
Codify the safe-outputs / prompt-optimization pattern. These narrow, high-success clusters (86–89%) are the template: small diffs, clear acceptance criteria. Steer ambiguous "runtime fix" requests toward this shape.
High iteration is not inherently bad. PR Sous Chef averages 8.2 commits yet merges at 88.9% — don't cap iterations blindly; cap them where churn correlates with closure (smoke/engine config), not where it correlates with success.
Investigate the 194 closed-unmerged PRs, concentrated in the lower-success clusters, to recover recurring failure modes (likely scope-too-large and CI flakiness).
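
To decide where to start that investigation, the number of unmerged PRs per cluster can be estimated from the table. This is a rough sketch: the counts are reconstructed from rounded success rates, they include the 5 still-open PRs alongside the 194 closed ones, and `unmerged_estimate` is a hypothetical helper, not part of the analysis workflow.

```python
# theme -> (PRs, success %), copied from the cluster table.
clusters = {
    "Runtime fixes & generated code": (220, 75.9),
    "Safe-outputs / error handling": (195, 88.7),
    "Shared package refactors": (183, 77.6),
    "Prompt & token optimization": (162, 86.4),
    "Firewall / network egress rules": (94, 74.5),
    "Smoke tests & engine config": (57, 59.6),
    "PR Sous Chef workflow": (54, 88.9),
    "CI / failing Actions fixes": (34, 76.5),
}


def unmerged_estimate(prs: int, success_pct: float) -> int:
    """Estimate PRs that were not merged (closed or still open)."""
    return prs - round(prs * success_pct / 100)


# Largest estimated backlog of unmerged PRs first.
ranked = sorted(
    ((unmerged_estimate(prs, pct), theme) for theme, (prs, pct) in clusters.items()),
    reverse=True,
)

for count, theme in ranked:
    print(f"{count:3d}  {theme}")
```

Ranking by absolute count rather than by rate puts the largest clusters first, so a low-rate but small cluster such as smoke tests may sit below a mid-rate but large one.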

Methodology & limitations

Source: copilot-prs.json (1,000 PRs, refreshed 2026-05-31) enriched with per-PR full data (comments/reviews/commits/files). All 1,000 had full metrics available; 1 was excluded from clustering because its body was empty, leaving 999.
Text: PR title + body, code fences/URLs/issue-refs stripped, lowercased; domain boilerplate added to stopwords.
Vectorization: TF-IDF, max 400 features, 1–2 grams, min_df=3, max_df=0.6, sublinear TF.
Clustering: K-means, k chosen by silhouette over k∈[3,8]. Silhouette is low (0.10) — expected for short heterogeneous PR text; clusters are keyword-derived themes, not hard boundaries. Some centroid keywords (e.g. "firewall") reflect a dominant sub-template rather than every member.
Iteration proxy: commit count per PR. Direct workflow turn-counts (aw_info.json) were not joined — matching 1,000 PRs to individual workflow runs by timestamp is unreliable, so this was intentionally omitted rather than approximated.
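
The text-cleaning, vectorization, and clustering steps above can be sketched with scikit-learn. This is an illustration under stated assumptions, not the workflow's actual code: the function names, the regular expressions, the use of scikit-learn's built-in English stopword list, `n_init=10`, and `random_state=0` are all choices made for the example. Only the TF-IDF parameters and the k range come from the report.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def clean_text(title: str, body: str) -> str:
    """Join title and body, then strip fenced code, URLs and issue refs."""
    text = f"{title}\n{body}"
    text = re.sub(FENCE + r".*?" + FENCE, " ", text, flags=re.DOTALL)
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"#\d+", " ", text)           # issue / PR references
    return re.sub(r"\s+", " ", text).strip().lower()


def cluster_prompts(docs, k_values=range(3, 9), extra_stopwords=()):
    """Return (silhouette, k, labels) for the best-scoring k."""
    # Imported lazily so clean_text stays usable without scikit-learn.
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import (
        ENGLISH_STOP_WORDS,
        TfidfVectorizer,
    )
    from sklearn.metrics import silhouette_score

    vectorizer = TfidfVectorizer(
        max_features=400,
        ngram_range=(1, 2),
        min_df=3,
        max_df=0.6,
        sublinear_tf=True,
        # Assumed: English stopwords plus caller-supplied domain boilerplate.
        stop_words=list(ENGLISH_STOP_WORDS.union(extra_stopwords)),
    )
    X = vectorizer.fit_transform(docs)

    best = None
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        score = silhouette_score(X, labels)
        if best is None or score > best[0]:
            best = (score, k, labels)
    return best
```

A caller would pass `[clean_text(pr_title, pr_body) for ...]` as `docs`. With `min_df=3`, the corpus needs enough documents that at least some terms appear in three or more of them, otherwise the vectorizer raises an error about an empty vocabulary.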

References: §26710025727

Generated by 📊 Copilot Agent Prompt Clustering Analysis · opus48 1.4M

expires on Jun 1, 2026, 10:45 AM UTC
