Summary
Capture final SDK result metadata in eval-runner output: resultSubtype, resultIsError, resultStopReason, resultText, and resultErrors.
Treat empty successful completions as failed evals when the SDK reports success but the run produced neither assistant text nor final result text.
Add opt-in compact transcript diagnostics behind STUDIO_EVAL_INCLUDE_TRANSCRIPT=1, with text/tool-result truncation to avoid bloating default artifacts.
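The opt-in transcript behaviour described above could look roughly like the sketch below. It is illustrative only: the TranscriptEntry shape, the character limits, and the helper names (truncate, compactTranscript, transcriptForArtifact) are assumptions, not the actual eval-runner.ts code. Only the STUDIO_EVAL_INCLUDE_TRANSCRIPT=1 switch comes from this PR.

```typescript
// Hypothetical entry shape; the real eval-runner types are not shown in this PR description.
type TranscriptEntry =
  | { kind: "text"; role: "assistant" | "user"; text: string }
  | { kind: "tool_result"; toolName: string; output: string };

// Assumed limits, chosen only for illustration.
const MAX_TEXT_CHARS = 2000;
const MAX_TOOL_RESULT_CHARS = 500;

// Cut a string to maxChars and note how much was dropped.
function truncate(value: string, maxChars: number): string {
  if (value.length <= maxChars) return value;
  const omitted = value.length - maxChars;
  return `${value.slice(0, maxChars)}... [truncated ${omitted} chars]`;
}

// Apply the per-kind limit to every entry without mutating the input.
function compactTranscript(entries: TranscriptEntry[]): TranscriptEntry[] {
  return entries.map((entry) =>
    entry.kind === "text"
      ? { ...entry, text: truncate(entry.text, MAX_TEXT_CHARS) }
      : { ...entry, output: truncate(entry.output, MAX_TOOL_RESULT_CHARS) }
  );
}

// Return a compact transcript only when the opt-in flag is set to "1";
// otherwise return undefined so default artifacts stay small.
function transcriptForArtifact(
  entries: TranscriptEntry[],
  env: Record<string, string | undefined>
): TranscriptEntry[] | undefined {
  return env.STUDIO_EVAL_INCLUDE_TRANSCRIPT === "1"
    ? compactTranscript(entries)
    : undefined;
}
```

Passing the environment in as a parameter (for example process.env in the CLI) keeps the sketch easy to test. With the flag unset, the artifact would carry only the scalar result fields.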
Why
This continues the eval-runner observability work from #3273 and #3330. Those PRs made phase/tool timings, first tool errors, loop exceptions, and timeouts visible. This adds the final SDK result shape and an opt-in turn transcript so eval consumers can distinguish model behavior, PI harness continuation behavior, runner classification, and downstream benchmark quality gates.
The need surfaced while testing the Static Site Importer draft path in #3309 with the Studio site-build benchmark. GPT-5.5 repeatedly produced tool-only runs for built-in prompt variants (restaurant, wordpress-is-dead): site_list / site_info returned successfully, then the SDK emitted subtype: "success", stopReason: "end_turn", and an empty result. No assistant text, Write, wp_cli, or import report was produced, but the eval runner classified the run as successful because it trusted message.subtype === 'success'.
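The false-success case above can be sketched as a small classifier. This is a sketch under stated assumptions rather than the patch itself: the field names mirror the scalar fields listed in this PR, but their types, the RunEvidence wrapper, and the classifyRun helper are hypothetical.

```typescript
// Field names follow the PR summary; the types are assumptions.
interface FinalResult {
  resultSubtype: string;
  resultIsError: boolean;
  resultStopReason: string | null;
  resultText: string;
  resultErrors: string[];
}

// Hypothetical wrapper: the final result plus all assistant text seen during the run.
interface RunEvidence {
  final: FinalResult;
  assistantText: string;
}

type Classification = { success: true } | { success: false; reason: string };

function classifyRun(run: RunEvidence): Classification {
  const { final, assistantText } = run;

  // An SDK-reported failure stays a failure.
  if (final.resultIsError || final.resultSubtype !== "success") {
    return { success: false, reason: `sdk reported ${final.resultSubtype}` };
  }

  // SDK-reported success with neither assistant text nor final result text
  // is treated as a failed eval instead of being trusted.
  const producedText =
    assistantText.trim().length > 0 || final.resultText.trim().length > 0;
  if (!producedText) {
    return { success: false, reason: "empty successful completion" };
  }

  return { success: true };
}

// The tool-only shape described above: success subtype, end_turn, empty result.
const toolOnlyRun: RunEvidence = {
  final: {
    resultSubtype: "success",
    resultIsError: false,
    resultStopReason: "end_turn",
    resultText: "",
    resultErrors: [],
  },
  assistantText: "",
};
const toolOnlyVerdict = classifyRun(toolOnlyRun);
```

Here toolOnlyVerdict is a failure with the reason "empty successful completion", whereas a check based only on the subtype would have reported success.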
With the local transcript diagnostics enabled, that failure shape was clear. The same diagnostics also helped compare Claude Sonnet 4.6 on the same SSI site-build flow: Claude generated source HTML and wrote files, then timed out while repairing generated helper-script errors before reaching import. Different failure mode, same need for better eval evidence.
This PR keeps the transcript opt-in so normal eval artifacts only gain a few scalar fields, while deeper debugging remains available when investigating model/runtime/harness regressions.
Validation
npm install to bootstrap the clean worktree.
npm run cli:build --silent — passed.
npx eslint apps/cli/ai/eval-runner.ts — passed.
npm -w wp-studio run typecheck — passed.
git diff --check — passed.
AI assistance
AI assistance: Yes
Tool(s): OpenCode (GPT-5.5)
Used for: Diagnosing the eval-runner artifact gap during SSI benchmark runs, drafting the result metadata / opt-in transcript patch, running local validation, and preparing this PR body. Chris reviewed the failure evidence and PR framing.
#3360 (AI sessions: adopt pi-coding-agent SessionManager end-to-end) reshaped apps/cli/ai/eval-runner.ts around the new AgentSessionEvent model. Every event-handling code path this PR touched changed shape: the final result SDKMessage, subtype, stop_reason, is_error, and errors are all gone in favor of findLastAssistant(event.messages) and getAgentEndTurnResult(). This PR couldn't be rebased, so it was rewritten in #3376 with the same goals (false-success detection plus an opt-in transcript) against the new event surface, and validated end-to-end, including a real multi-turn mutating run.