`skill-optimizer` is a Docker workbench for running and grading agent skill eval cases. The current public CLI centers on `run-case` and `run-suite`.
The workbench gives an agent an isolated Docker `/work` directory, captures traces, and grades deterministic local outcomes from files, command logs, generated artifacts, or other workspace state.
npm run build
npm run typecheck
npm test
npx tsx src/cli.ts --help
npx tsx src/cli.ts run-case --help
npx tsx src/cli.ts run-suite --help

- `src/cli.ts`: public CLI entrypoint
- `src/workbench/`: workbench case loading, suite loading, Docker runner, Pi agent, graders, and traces
- `docker/workbench-runner.Dockerfile`: generic non-root container image for setup, agent, grade, and cleanup phases
- `skills/skill-optimizer/SKILL.md`: canonical distributable Agent Skill
- `skills/skill-optimizer/references/workbench.md`: detailed workbench schema and usage reference
- `.claude-plugin/`, `.codex-plugin/`, `.cursor-plugin/`, `.opencode/`: cross-agent plugin manifests and install support
- `.agents/plugins/marketplace.json`: Codex repo marketplace entry for the root plugin
- `gemini-extension.json`, `GEMINI.md`: Gemini extension metadata and context file
- `examples/workbench/`: tracked example eval suites
- `README.md`: provider-specific installation instructions for Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and skill-only installs
- `CONTRIBUTING.md`: contributor workflow and current workbench invariants

Keep the README installation section aligned with packaged plugin metadata:
- Claude Code: `.claude-plugin/plugin.json` and `.claude-plugin/marketplace.json`
- Codex: `.agents/plugins/marketplace.json` and `.codex-plugin/plugin.json`
- Cursor: `.cursor-plugin/plugin.json` and `.cursor/INSTALL.md`
- OpenCode: `.opencode/plugins/skill-optimizer.js` and `.opencode/INSTALL.md`
- Gemini CLI: `gemini-extension.json` and `GEMINI.md`
- Skill-only installs: `npx skills add fastxyz/skill-optimizer --skill skill-optimizer ...`
- Keep evaluation static: extraction and matching are allowed; do not execute model-produced code outside the Docker workbench as part of evaluation.
- `run-suite` uses models from `suite.yml`; do not add a `run-suite --models` override.
- Keep OpenRouter model refs as `openrouter/...`; real model runs require `OPENROUTER_API_KEY`.
- Cases use `graders: [{ name, command }]`; legacy `check:` and `artifacts:` are invalid.
- Graders are the acceptance contract; evaluate outputs from `/work`, generated artifacts, `answer.json`, `trace.jsonl`, and result state.
- The agent phase sees only `/work`, not `/case` or `/results`.
- Keep plugin metadata pointed at the canonical `skills/skill-optimizer/SKILL.md`; do not create divergent skill copies.
- Codex plugin metadata lives in `.codex-plugin/plugin.json`; the repo marketplace lives in `.agents/plugins/marketplace.json` and points at `./`.
- Provider install docs should link to the same canonical skill/plugin metadata, not separate skill copies.
- Do not commit `.skill-eval/`, `.results/`, `.env`, or credentials.
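
The grader rule in the list above can be illustrated with a small validation sketch. It is not the workbench's actual loader: `validateCase` and the `Grader` interface are hypothetical names, and the only facts taken from this document are that cases use `graders: [{ name, command }]` and that the legacy `check:` and `artifacts:` keys are invalid. The example grader name and command are likewise made up.

```typescript
// Hypothetical shape: only the `name` and `command` fields come from this document.
interface Grader {
  name: string;
  command: string;
}

// Returns a list of problems; an empty list means the case passes this sketch.
function validateCase(raw: Record<string, unknown>): string[] {
  const problems: string[] = [];

  // Legacy keys are rejected outright rather than silently ignored.
  for (const legacyKey of ["check", "artifacts"]) {
    if (legacyKey in raw) {
      problems.push(`legacy key "${legacyKey}" is invalid; use graders`);
    }
  }

  const graders = raw["graders"];
  if (!Array.isArray(graders)) {
    problems.push("graders must be an array of { name, command }");
    return problems;
  }

  // Every grader needs both a string name and a string command.
  graders.forEach((g: unknown, i: number) => {
    const grader = g as Partial<Grader> | null;
    if (typeof grader?.name !== "string" || typeof grader?.command !== "string") {
      problems.push(`graders[${i}] needs string "name" and "command"`);
    }
  });

  return problems;
}

// Example: a case that mixes the current schema with a legacy key.
const problems = validateCase({
  graders: [{ name: "answer-exists", command: "test -f /work/answer.json" }],
  check: "legacy",
});
console.log(problems); // one problem, reporting the legacy "check" key
```

Returning a list of problems instead of throwing on the first one lets a caller report every schema issue in a case at once.
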
- Run `npm run typecheck` after TypeScript changes.
- Run `npm test` before finishing behavior changes.
- For Docker runner or image changes, also run `docker build -t skill-optimizer-workbench:local -f docker/workbench-runner.Dockerfile .`.
- For CLI/docs changes, verify `npx tsx src/cli.ts --help` if touched docs mention CLI behavior.
- For plugin/package metadata changes, run `npx tsx tests/smoke-skill-distribution.ts` and verify `npm pack --dry-run --json` includes required plugin files without result/cache directories.
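
The `npm pack --dry-run --json` check in the last item can be partly automated. The sketch below is one possible approach, not the repository's smoke test. `checkPack`, `REQUIRED_FILES`, and `FORBIDDEN_PREFIXES` are hypothetical names. The required list is an assumption drawn from the plugin metadata paths named in this document, and treating `.skill-eval/` and `.results/` as the result/cache directories is also an assumption. It relies only on the JSON output being an array of package entries, each with a `files` list of `{ path }` objects.

```typescript
// The part of the `npm pack --dry-run --json` output this sketch reads.
interface PackEntry {
  files: { path: string }[];
}

// Assumption: a plausible set of required plugin files; adjust to the real packaging rules.
const REQUIRED_FILES = [
  ".claude-plugin/plugin.json",
  ".codex-plugin/plugin.json",
  ".cursor-plugin/plugin.json",
  "gemini-extension.json",
  "skills/skill-optimizer/SKILL.md",
];

// Assumption: these are the result/cache directories that must stay out of the package.
const FORBIDDEN_PREFIXES = [".skill-eval/", ".results/"];

// Takes the raw JSON text printed by npm and reports missing and forbidden paths.
function checkPack(packJson: string): { missing: string[]; forbidden: string[] } {
  const entries = JSON.parse(packJson) as PackEntry[];
  const paths = entries.flatMap((entry) => entry.files.map((file) => file.path));
  const packed = new Set(paths);

  return {
    missing: REQUIRED_FILES.filter((required) => !packed.has(required)),
    forbidden: paths.filter((path) =>
      FORBIDDEN_PREFIXES.some((prefix) => path.startsWith(prefix)),
    ),
  };
}
```

A caller would capture the command's standard output as a string, pass it to `checkPack`, and fail the check if either returned list is non-empty.
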