feat(ai): AI Testing Framework — consolidation staging branch [0/7 → master]#411

Draft
ianwhitedeveloper wants to merge 3 commits into master from ai-testing-framework-implementation-consolidation

@ianwhitedeveloper
Collaborator

Context

This is the staging branch for a structured consolidation of draft PR #394 — the Riteway AI Testing Framework. Per Eric's consolidation request, PR #394 (80+ commits, 104 files, ~21K lines, ~60% docs/planning) is being decomposed into 7 small, focused PRs — one module per PR, in dependency order — each with functional requirements and unit tests, ruthlessly reviewed before merging here.

This branch is NOT ready to merge to master. It will be when all 7 PRs are merged into it and a final review passes.


Epic

Enable riteway ai <promptfile> — a CLI command that reads SudoLang test files, delegates execution to AI agents, and outputs results in TAP format. Treats prompts as first-class testable units, supporting configurable runs, pass thresholds, parallel execution, and rich TAP markdown output.

Full requirements: tasks/2026-01-22-riteway-ai-testing-framework.md
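
As a rough illustration of how "configurable runs" and "pass thresholds" might interact, here is a minimal sketch. The names requiredPasses and assertionPassed are hypothetical, and the sketch assumes the threshold is a percentage between 0 and 100; the actual schema in constants.js may define it differently.

```javascript
// Hypothetical helper, not the framework's API.
// Assumption: `threshold` is a percentage in [0, 100].
const requiredPasses = ({ runs, threshold }) => {
  if (!Number.isInteger(runs) || runs < 1) {
    throw new RangeError('runs must be a positive integer');
  }
  // The negated range check also rejects NaN.
  if (!(threshold >= 0 && threshold <= 100)) {
    throw new RangeError('threshold must be between 0 and 100');
  }
  // Round up so a partial run never counts as a pass.
  return Math.ceil((runs * threshold) / 100);
};

// An assertion passes overall when enough individual runs passed.
const assertionPassed = ({ passCount, runs, threshold }) =>
  passCount >= requiredPasses({ runs, threshold });
```

Under these assumptions, 4 runs at a 75% threshold need 3 passes, and 3 runs at the same threshold also need 3, because 2.25 rounds up.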


Why Not Cherry-Pick or Rebase PR #394?

  • 80+ commits interleave multiple modules — no clean per-module slices
  • Duplicate commits from prior rebases make cherry-pick impractical
  • ~60% of changed files are docs/planning that must stay out of production PRs
  • Circular dependency (ai-runner.js ↔ test-extractor.js) needed to be resolved first

Approach: Fresh branches from this consolidation base, copy files from the feature branch, fix WIP issues during consolidation, review each PR independently before merging here.


Dependency Graph (module architecture)

ai-errors.js  (leaf)       constants.js  (leaf)
    ↓                           ↓
debug-logger.js            tap-yaml.js
    ↓
agent-parser.js  ←  ai-errors
extraction-parser.js  ←  ai-errors
execute-agent.js  ←  ai-errors, debug-logger, agent-parser
aggregation.js  ←  ai-errors, constants
    ↓
agent-config.js  ←  ai-errors, agent-parser      [PR 4]
validation.js  ←  ai-errors, debug-logger         [PR 4]
    ↓
test-extractor.js  ←  execute-agent               [PR 5]
ai-runner.js  ←  all prior                        [PR 5]
    ↓
test-output.js                                    [PR 6]
ai-command.js  ←  all prior                       [PR 6]
bin/riteway.js  (modifications)                   [PR 6]
    ↓
e2e.test.js  +  fixtures  +  config               [PR 7]

No cycles. Every module has a colocated test file.
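
The "no cycles" claim can be checked mechanically. The sketch below is illustrative only: it encodes just the explicit "←" edges from the graph above (the "all prior" entries are omitted, and the vertical arrows are treated as ordering rather than imports, which is an assumption), then runs a depth-first topological sort that throws if it finds a cycle. The names moduleDeps and topoSort are hypothetical.

```javascript
// Explicit "←" edges only; "all prior" modules are left out of this sketch.
const moduleDeps = {
  'ai-errors': [],
  'constants': [],
  'debug-logger': [],
  'tap-yaml': [],
  'agent-parser': ['ai-errors'],
  'extraction-parser': ['ai-errors'],
  'execute-agent': ['ai-errors', 'debug-logger', 'agent-parser'],
  'aggregation': ['ai-errors', 'constants'],
  'agent-config': ['ai-errors', 'agent-parser'],
  'validation': ['ai-errors', 'debug-logger'],
  'test-extractor': ['execute-agent'],
};

// Depth-first topological sort; throws when a module is revisited
// while it is still on the current path (that is, a cycle).
const topoSort = (graph) => {
  const state = new Map(); // node -> 'visiting' | 'done'
  const order = [];
  const visit = (node, path) => {
    if (state.get(node) === 'done') return;
    if (state.get(node) === 'visiting') {
      throw new Error(`Cycle detected: ${[...path, node].join(' -> ')}`);
    }
    state.set(node, 'visiting');
    (graph[node] ?? []).forEach((dep) => visit(dep, [...path, node]));
    state.set(node, 'done');
    order.push(node); // dependencies are pushed before dependents
  };
  Object.keys(graph).forEach((node) => visit(node, []));
  return order;
};

const buildOrder = topoSort(moduleDeps);
```

buildOrder lists every module after its dependencies, which is the same property the 7-PR merge order relies on.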


7-PR Progress

| # | PR | Files | Status |
|---|----|-------|--------|
| 1 | Foundation — Error Types + Constants | ai-errors.js, constants.js + tests | ✅ Merged (#407) |
| 2 | Utilities — Debug Logger, Concurrency, TAP YAML | 5 files | ✅ Merged (#408) |
| 3 | Parsers + Execute Agent | agent-parser, extraction-parser, aggregation, execute-agent + tests | ✅ Merged (#409) |
| 4 | Config + Validation | agent-config, validation + tests + fixtures | 🔍 In review (#410) |
| 5 | Test Extractor + Core Runner | test-extractor, ai-runner + tests | ⏳ Pending |
| 6 | Test Output + CLI Integration | test-output, ai-command, bin/riteway + tests | ⏳ Pending |
| 7 | E2E Tests + Fixtures + Config | e2e.test.js, fixtures, vitest/eslint config | ⏳ Pending |

Current test count: 163 tests merged (PRs 1–3) + 23 in PR 4 review = 186 passing.


WIP Issues From Original PR (13 total)

| # | Issue | Status |
|---|-------|--------|
| 1 | `for (const` loops in tests | ✅ Zero instances — resolved |
| 2 | agent-config schema comment verbose | ✅ Resolved in PR 4 |
| 3 | Fixtures README outdated | ⏳ PR 7 |
| 4 | formatMedia dead code | ✅ Decision documented — remove in PR 6 |
| 5 | test-output.js dead call | ✅ Removed with #4 |
| 6 | Redundant test comments | ✅ None found in PRs 1–4 |
| 7 | Try(() => fn(args)) syntax | ✅ Valid — no change |
| 8 | ai-runner logger coupling | ✅ Resolved in PR 3 (injected logger) |
| 9 | unwrapRawEnvelope duplication | ✅ Resolved in PR 3 (shared unwrapEnvelope) |
| 10 | Cursor agent --trust flag | ✅ Resolved in PR 4 |
| 11 | Hardcoded defaults in tests | ✅ Explicit per TDD rules |
| 12 | Error handling/Zod placement | ⏳ PR 5 |
| 13 | Re-exports in test-extractor | ⏳ PR 5 |

Open Architectural Questions (surfaced in PR 4, flagged for Eric)

Two design issues in agent-config.js are not regressions from the feature branch, but consolidation is the right moment to decide before PR 6 (CLI integration) wires up --agent-config.

1. Built-in agent configs hardcode third-party CLI flags

getAgentConfig() returns hardcoded flag arrays for claude, opencode, and cursor. If any of those CLIs renames a flag, riteway breaks for every user of that agent until we ship an update.

Proposed: riteway ai init writes a starter config file to the project. Built-in defaults stay for first-run convenience; teams who want stability own their config file. Library stops being the source of truth for third-party CLI interfaces.

2. parseOutput function can't live in a JSON config file

Custom agents loaded via --agent-config my-agent.json are silently locked to default JSON stdout parsing. A custom OpenCode-compatible agent can't declare NDJSON output format in its config — because parseOutput is a JS function and JSON can't serialize functions.

Proposed: Replace the parseOutput function field with a declarative "outputFormat": "json" | "ndjson" | "text" string. Riteway maps strategy names to parsers. Schema becomes fully serializable; custom agents become fully capable.

These two changes should land as a PR 4 follow-up before PR 6 merges.
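
To make the second proposal concrete, here is a minimal sketch of a declarative outputFormat lookup. Everything in it is an assumption for illustration: the parser bodies, the resolveParser name, and the "command" field in the sample config are hypothetical and do not reflect the current agent-config.js.

```javascript
// Hypothetical strategy table: format name -> stdout parser.
// A Map avoids accidental hits on Object.prototype keys such as "constructor".
const outputParsers = new Map([
  ['json', (stdout) => JSON.parse(stdout)],
  [
    'ndjson',
    (stdout) =>
      stdout
        .split('\n')
        .map((line) => line.trim())
        .filter(Boolean) // skip blank lines, including a trailing newline
        .map((line) => JSON.parse(line)),
  ],
  ['text', (stdout) => stdout.trim()],
]);

// Resolve a parser from a plain, JSON-serializable config object.
// Defaulting to "json" mirrors today's default stdout parsing.
const resolveParser = ({ outputFormat = 'json' } = {}) => {
  const parser = outputParsers.get(outputFormat);
  if (!parser) {
    const known = [...outputParsers.keys()].join(', ');
    throw new Error(`Unknown outputFormat "${outputFormat}"; expected one of: ${known}`);
  }
  return parser;
};

// Usage: the config could come straight from a JSON file.
const customAgent = JSON.parse('{"command": "my-agent", "outputFormat": "ndjson"}');
const events = resolveParser(customAgent)('{"type":"start"}\n{"type":"end"}\n');
```

Because the config carries only a string, it survives a round trip through JSON, and unknown format names fail loudly instead of silently falling back to the default.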


Merge Plan

  1. Each topic PR targets this branch (not master)
  2. Agent + human review before each merge
  3. When all 7 are merged here and tests are green: final review, then PR this → master

ianwhitedeveloper and others added 3 commits February 18, 2026 08:07
* feat(ai): add error types, constants, and Zod schemas (PR 1/7)

Foundation layer for the AI testing framework. Introduces structured
error handling via error-causes and runtime-validated configuration
constants via Zod schemas. Updates eslint ecmaVersion to 2022 to
support numeric separators and optional chaining used throughout
the framework source.

Files:
- source/ai-errors.js — named error types (ParseError, ValidationError, etc.)
- source/ai-errors.test.js — full coverage for error descriptors and createError
- source/constants.js — defaults, constraints, and Zod schemas
- source/constants.test.js — 26 tests covering all schemas and boundaries
- eslint.config.js — bump ecmaVersion 2017 → 2022 (prerequisite)
- package.json — add error-causes and zod to production dependencies

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(config): bring working configs from feature branch

Adds vitest.config.js e2e exclusion (source/e2e.test.js uses Riteway/Tape,
not Vitest) alongside the eslint ecmaVersion 2022 bump already in place.
Both changes are sourced from the working feature branch.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai): address PR 1 review findings

- constants.js: lazy process.cwd() default (z.string().default(() => process.cwd()))
  prevents stale value when cwd changes after module load
- constants.js: add concurrencyMax (50) to constraints + enforce in concurrencySchema
- constants.js: remove JSDoc from internal constants (not public API)
- constants.test.js: add full aiTestOptionsSchema coverage (valid input, missing
  filePath, empty filePath, invalid agent, lazy cwd default, optional agentConfigPath)
- constants.test.js: add concurrencySchema upper-bound tests
- ai-errors.test.js: replace for..of loops with test.each (one named test per case)
- ai-errors.test.js: expand createError integration to cover two error types
- ai-errors.test.js: replace typeof handleAIErrors check with behavioral routing tests
- ai-errors.js: remove forward-reference comment (extraction-parser.js not yet in scope)
- eslint.config.js: Object.assign -> spread operator

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
… 2/7] (#408)

* feat(ai): Utilities — Debug Logger, Concurrency Limiter, TAP YAML [PR 2/7]

- Add createDebugLogger: console + file logging with buffer/flush
- Add limitConcurrency: sliding-window async concurrency limiter
- Add parseTAPYAML: parse judge agent TAP YAML diagnostic blocks
- Add limit-concurrency.test.js (missing from PR #394)
- Apply js.mdc cleanup: flush loop → single write, for-of → reduce pipeline
- Replace @paralleldrive/cuid2 (not in deps) with mkdtempSync in debug-logger.test.js

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(ai): apply PR 2 review suggestions

- Collapse formatMessage to concise arrow expression
- Add comment to limit-concurrency for-of loop (justified async pattern)
- Add flush no-op test when logFile is not configured
- Use vi.useFakeTimers() in concurrency-cap test for determinism

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
Add agent-parser, extraction-parser, aggregation, and execute-agent
modules with full unit test coverage.

- agent-parser: parseStringResult, parseOpenCodeNDJSON, unwrapEnvelope
  (new shared export), unwrapAgentResult. Shared unwrapEnvelope breaks
  duplication between agent-parser and execute-agent (WIP fix #9).
- extraction-parser: parseExtractionResult with multi-strategy JSON
  parsing (direct, markdown fence, pre-parsed object), and
  resolveImportPaths for prompt file resolution.
- aggregation: normalizeJudgment, calculateRequiredPasses,
  aggregatePerAssertionResults with Zod validation.
- execute-agent: extracted from ai-runner.js to break the circular
  dependency (ai-runner ↔ test-extractor). Logger injected at
  executeAgent call site rather than created inside spawnProcess
  (WIP fix #8). Uses shared unwrapEnvelope from agent-parser.
- Test files use test.each for all table-driven cases per convention.
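
For readers unfamiliar with the multi-strategy parsing mentioned for extraction-parser above, the following is a rough sketch of the idea (direct JSON, then a markdown fence, with pre-parsed objects passed through). It is not the module's actual code; parseExtraction and tryParseJson are hypothetical names.

```javascript
// Build the triple-backtick delimiter programmatically to keep this example readable.
const fence = '`'.repeat(3);
// Matches a fenced block with an optional "json" language tag.
const fencedJson = new RegExp(`${fence}(?:json)?\\s*([\\s\\S]*?)${fence}`);

// Wrap JSON.parse so callers can branch without try/catch.
const tryParseJson = (text) => {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (error) {
    return { ok: false, error };
  }
};

const parseExtraction = (raw) => {
  // Strategy 1: already parsed, pass through unchanged.
  if (raw !== null && typeof raw === 'object') return raw;
  if (typeof raw !== 'string') {
    throw new TypeError('Expected a string or a parsed object');
  }
  // Strategy 2: the whole string is JSON.
  const direct = tryParseJson(raw);
  if (direct.ok) return direct.value;
  // Strategy 3: JSON wrapped in a markdown fence somewhere in the text.
  const match = raw.match(fencedJson);
  const fenced = match ? tryParseJson(match[1]) : direct;
  if (fenced.ok) return fenced.value;
  // Preserve the underlying SyntaxError for debugging.
  throw new Error('Could not parse extraction output as JSON', {
    cause: fenced.error,
  });
};
```

Keeping the original SyntaxError as the cause lets tests assert on the failure reason rather than only on the fact that something threw.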

164 tests pass, 0 lint errors, TypeScript checks pass.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai): address PR 3 code review findings

- aggregation.js: validate once in aggregatePerAssertionResults — capture
  the Zod-validated result and compute Math.ceil inline, eliminating the
  redundant second schema parse inside calculateRequiredPasses
- aggregation.js: remove misleading optional chaining (raw?.passed etc.)
  after the null-guard throw; use plain property access
- agent-parser.js: replace acc.push() with [...acc, text] in reduce
  accumulator to prefer immutability per JS style guide
- agent-parser.test.js: drop redundant "parsed object:" prefix from
  unwrapEnvelope test.each given fields; remove duplicate standalone
  "no result key" test that overlapped with test.each row
- aggregation.test.js: remove redundant export-existence assertion for
  normalizeJudgment; add empty perAssertionResults edge case (vacuous
  truth — every() on [] returns true)
- execute-agent.test.js: strengthen parseOutput test to verify stdout
  and logger are threaded through as expected (documents WIP fix #8)

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai): address PR 3 author review findings

- aggregation.js: rename `raw` param to `judgeResponse` and
  fold into single options object for normalizeJudgment; removes
  the two-argument signature (breaking change, callers updated)
- aggregation.js: remove calculateRequiredPasses — math is inlined
  in aggregatePerAssertionResults, eliminating double schema parse
- aggregation.test.js: remove calculateRequiredPasses describe block;
  fix Try() usage (direct fn ref, not arrow wrapper); update all
  normalizeJudgment call sites to new single-options signature
- execute-agent.js: extract magic number 500 to maxOutputPreviewLength
  constant (camelCase per javascript.mdc); applied to all 3 truncation sites
- execute-agent.test.js: replace try/catch antipatterns with await Try();
  add Try import from riteway.js
- extraction-parser.test.js: strengthen weak typeof assertions to check
  specific fields; strengthen cause !== undefined to cause.name === SyntaxError

151 tests pass, 0 lint errors, TypeScript clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* fix(ai): address PR 3 follow-up review findings

- constants.js: rename calculateRequiredPassesSchema to
  aggregationParamsSchema — name now reflects what the schema
  validates (aggregation input params) rather than the deleted
  calculateRequiredPasses function; update all import sites
- aggregation.test.js: add 6 missing Zod validation edge cases
  for aggregatePerAssertionResults (zero runs, negative runs,
  non-integer runs, NaN runs, negative threshold, NaN threshold)
  — coverage gap introduced when calculateRequiredPasses and its
  tests were removed; all cases now exercised via
  aggregatePerAssertionResults test.each

157 tests pass, 0 lint errors, TypeScript clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(test): complete PR review remediation

🐛 - Remove weak instanceof Error assertions

🔄 - Add threshold calculation verification tests

Tests now verify threshold-based pass/fail logic directly

164 tests passing, 0 lint errors, TypeScript clean

Co-authored-by: Ian White <ian.white.developer@gmail.com>

* fix(ai): remove implementation detail from test

- execute-agent.test.js: remove logger type assertion from
  parseOutput test — typeof checks violate tdd.mdc:64 and
  logger threading is an implementation detail; the three
  remaining assertions (call count, stdout arg, parsed result)
  collectively verify correct integration

164 tests pass, 0 lint errors, TypeScript clean.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
@ericelliott
Collaborator

I'm okay with the strategies here.
