CLI: add eval-runner result diagnostics by chubes4 · Pull Request #3349 · Automattic/studio

chubes4 · 2026-05-05T19:37:06Z

Summary

Capture final SDK result metadata in eval-runner output: resultSubtype, resultIsError, resultStopReason, resultText, and resultErrors.
Treat empty successful completions as failed evals when the SDK reports success but the run produced neither assistant text nor final result text.
Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1, with text/tool-result truncation to avoid bloating default artifacts.
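
For readers who want a concrete picture of the first bullet, here is a minimal sketch of how the five scalar fields could be derived from a final result message. The `FinalResultMessage` shape and the `extractResultDiagnostics` helper are assumptions made for illustration; they are not the SDK's actual types or the code in this PR.

```typescript
// Hypothetical shape of the SDK's final result message (assumed for illustration).
interface FinalResultMessage {
  subtype: string;
  is_error?: boolean;
  stop_reason?: string | null;
  result?: string;
  errors?: string[];
}

// The five scalar fields named in the summary above.
interface ResultDiagnostics {
  resultSubtype: string | null;
  resultIsError: boolean | null;
  resultStopReason: string | null;
  resultText: string | null;
  resultErrors: string[];
}

// Hypothetical helper: maps the final message (if any) onto the artifact fields.
function extractResultDiagnostics(msg: FinalResultMessage | undefined): ResultDiagnostics {
  if (!msg) {
    // No final result message was seen (for example, the run timed out first).
    return {
      resultSubtype: null,
      resultIsError: null,
      resultStopReason: null,
      resultText: null,
      resultErrors: [],
    };
  }
  return {
    resultSubtype: msg.subtype,
    resultIsError: msg.is_error ?? null,
    resultStopReason: msg.stop_reason ?? null,
    resultText: msg.result ?? null,
    resultErrors: msg.errors ?? [],
  };
}

// Example: the tool-only shape described under "Why".
const diagnostics = extractResultDiagnostics({
  subtype: "success",
  is_error: false,
  stop_reason: "end_turn",
  result: "",
});
```

Keeping these as flat scalars (plus one string array) is what lets default artifacts stay small while still recording the final result shape.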

Why

This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase/tool timings, first tool errors, loop exceptions, and timeouts visible. This adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, PI harness continuation behavior, runner classification, and downstream benchmark quality gates.

The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for built-in prompt variants (restaurant, wordpress-is-dead): site_list / site_info returned successfully, then the SDK emitted subtype: "success", stopReason: "end_turn", and an empty result. No assistant text, Write, wp_cli, or import report was produced, but the eval runner classified the run as successful because it trusted message.subtype === 'success'.
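
The false-success check described above can be sketched roughly as follows. `RunOutcome` and `classifyRun` are hypothetical names used only for this example; the real runner's types and control flow may differ.

```typescript
// Hypothetical summary of what a run produced (names assumed for illustration).
interface RunOutcome {
  resultSubtype: string | null; // subtype of the final result message, if any
  assistantText: string;        // concatenated assistant text across turns
  resultText: string | null;    // final result text reported by the SDK
}

interface Classification {
  passed: boolean;
  reason?: string;
}

// Treat "success" as a pass only when the run produced some text.
function classifyRun(outcome: RunOutcome): Classification {
  if (outcome.resultSubtype !== "success") {
    return { passed: false, reason: `sdk reported ${outcome.resultSubtype ?? "no result"}` };
  }
  const hasAssistantText = outcome.assistantText.trim().length > 0;
  const hasResultText = (outcome.resultText ?? "").trim().length > 0;
  if (!hasAssistantText && !hasResultText) {
    // The shape seen in the tool-only runs: success, end_turn, empty result.
    return { passed: false, reason: "empty successful completion" };
  }
  return { passed: true };
}

const toolOnlyRun = classifyRun({ resultSubtype: "success", assistantText: "", resultText: "" });
const normalRun = classifyRun({ resultSubtype: "success", assistantText: "Imported 4 pages.", resultText: "" });
```

The point of the sketch is only that `subtype === 'success'` becomes a necessary condition rather than a sufficient one.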

With the local transcript diagnostics enabled, that failure shape was clear. The same diagnostics also helped compare Claude Sonnet 4.6 on the same SSI site-build flow: Claude generated source HTML and wrote files, then timed out while repairing generated helper-script errors before reaching import. Different failure mode, same need for better eval evidence.

This PR keeps the transcript opt-in so normal eval artifacts only gain a few scalar fields, while deeper debugging remains available when investigating model/runtime/harness regressions.
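
As an illustration of the opt-in behavior, the sketch below gates a compact transcript on STUDIO_EVAL_INCLUDE_TRANSCRIPT=1 and truncates long entries. The `TranscriptEntry` type, the helper names, and the 500-character limit are assumptions for the example, not values taken from this PR.

```typescript
// Hypothetical transcript entry; the real artifact format may differ.
interface TranscriptEntry {
  role: "assistant" | "tool_result";
  text: string;
}

type Env = Record<string, string | undefined>;

// Assumed limit, chosen only for this example.
const MAX_ENTRY_CHARS = 500;

// Cut long text and note how much was dropped so readers know it was truncated.
function truncate(text: string, limit: number = MAX_ENTRY_CHARS): string {
  if (text.length <= limit) return text;
  const omitted = text.length - limit;
  return `${text.slice(0, limit)}... [${omitted} chars truncated]`;
}

// The transcript is included only when the flag is exactly "1".
function transcriptEnabled(env: Env): boolean {
  return env["STUDIO_EVAL_INCLUDE_TRANSCRIPT"] === "1";
}

// Returns undefined when disabled so the field can be omitted from the artifact.
function buildTranscript(entries: TranscriptEntry[], env: Env): TranscriptEntry[] | undefined {
  if (!transcriptEnabled(env)) return undefined;
  return entries.map((entry) => ({ role: entry.role, text: truncate(entry.text) }));
}

const sampleEntries: TranscriptEntry[] = [
  { role: "assistant", text: "Listing sites." },
  { role: "tool_result", text: "x".repeat(600) },
];
const disabled = buildTranscript(sampleEntries, {});
const enabled = buildTranscript(sampleEntries, { STUDIO_EVAL_INCLUDE_TRANSCRIPT: "1" });
```

In a Node process the second argument would typically be `process.env`; passing it in explicitly keeps the helper easy to test.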

Validation

npm install to bootstrap the clean worktree.
npm run cli:build --silent — passed.
npx eslint apps/cli/ai/eval-runner.ts — passed.
npm -w wp-studio run typecheck — passed.
git diff --check — passed.

AI assistance

AI assistance: Yes
Tool(s): OpenCode (GPT-5.5)
Used for: Diagnosing the eval-runner artifact gap during SSI benchmark runs, drafting the result metadata / opt-in transcript patch, running local validation, and preparing this PR body. Chris reviewed the failure evidence and PR framing.

wpmobilebot · 2026-05-05T20:01:31Z

📊 Performance Test Results

Comparing 4fa17d4 vs trunk

app-size

Metric	trunk	`4fa17d4`	Diff	Change
App Size (Mac)	1454.03 MB	1454.03 MB	+0.00 MB	⚪ 0.0%

site-editor

Metric	trunk	`4fa17d4`	Diff	Change
load	1522 ms	1516 ms	6 ms	⚪ 0.0%

site-startup

Metric	trunk	`4fa17d4`	Diff	Change
siteCreation	8078 ms	8078 ms	0 ms	⚪ 0.0%
siteStartup	4946 ms	4939 ms	7 ms	⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

chubes4 · 2026-05-06T15:13:26Z

Will update this after #3360 lands.

chubes4 · 2026-05-07T15:38:30Z

Closing in favor of #3376.

#3360 (AI sessions: adopt pi-coding-agent SessionManager end-to-end) reshaped apps/cli/ai/eval-runner.ts around the new AgentSessionEvent model. Every event-handling code path this PR touched changed shape: the final result SDKMessage and its subtype, stop_reason, is_error, and errors fields are all gone, replaced by findLastAssistant(event.messages) and getAgentEndTurnResult(). This PR couldn't be rebased, so I rewrote it in #3376 with the same goals (false-success detection plus opt-in transcript) against the new event surface, and validated it end-to-end, including a real multi-turn mutating run.

CLI: add eval-runner result diagnostics

4fa17d4

github-actions Bot assigned chubes4 May 5, 2026

chubes4 requested a review from youknowriad May 5, 2026 19:41

chubes4 mentioned this pull request May 7, 2026

CLI: add eval-runner result diagnostics #3376

Open

chubes4 closed this May 7, 2026

chubes4 wants to merge 1 commit into trunk from eval-runner-result-diagnostics
