From requirement to production-grade code — planned, tested, verified.
Spec-driven plans. Enforced quality gates. Persistent knowledge.
Install • Features • Docs • Blog • Website • Changelog
curl -fsSL https://raw.githubusercontent.com/maxritter/pilot-shell/main/install.sh | bashmacOS · Linux · Windows (WSL2) — installs in under 2 minutes.
Claude Code and Codex CLI write code fast — but without structure, they skip tests, lose context, and produce inconsistent results. Other frameworks add complexity (dozens of agents, thousands of lines of config) without meaningfully better output.
Pilot Shell is different. Every component solves a real problem with an engineered solution:
/prd— brainstorm ideas into clear requirements with optional deep research/spec— plans, implements, and verifies features end-to-end with TDD/fix— bugfix workflow with TDD; bails out when complexity exceeds the standard fix lane- Spec collaboration — share specs with teammates, annotations flow back grouped by author
- Quality hooks — enforce linting, formatting, type checking, and tests as quality gates
- Context engineering — preserves decisions and knowledge across sessions
- Code intelligence — semantic search (Semble) + code knowledge graph (CodeGraph)
- Token optimization — 60–90% cost reduction via RTK compression and Semble code search
- Pilot Bot — persistent automation agent with scheduled tasks and background jobs
- Extensions — reusable rules, skills, and MCP servers with team sharing and customization
- Console — local web dashboard with real-time notifications and session management
At least one AI agent: Pilot Shell supports Claude Code (primary — full feature coverage) and Codex CLI (all workflows, fewer platform features). Install at least one before running the Pilot installer:
- Claude Code: Install via the native installer. If you have the
npmorbrewversion, uninstall it first. Requires a Claude subscription — Max 5x or 20x for solo, Team Premium for teams, Enterprise for organizations. - Codex CLI: Install via the native installer. If you have the
npmorbrewversion, uninstall it first. Requires an OpenAI subscription — Plus or Pro for solo, Business or Enterprise for teams.
Terminal (Recommended): cmux works great with Pilot Shell — its vertical tab layout lets you run multiple sessions side by side. Any modern terminal works: Ghostty, iTerm2, or the built-in macOS/Linux terminal.
Works with any existing project. Pilot Shell integrates with Claude Code and Codex CLI, using their built-in concepts (rules, hooks, skills, subagents, MCP) to improve your experience:
curl -fsSL https://raw.githubusercontent.com/maxritter/pilot-shell/main/install.sh | bashInstalls globally on macOS, Linux, and Windows (WSL2). After installation, just run claude or codex directly — Pilot Shell is loaded automatically. Run pilot update to check for updates.
Downgrade
If you encounter an issue or unfixed bug in the latest version, you can always go back to a previous version (see releases):
export VERSION=9.1.3
curl -fsSL https://raw.githubusercontent.com/maxritter/pilot-shell/main/install.sh | bashUninstalling
Removes the Pilot binary, plugin files, managed commands/rules, settings and shell aliases:
curl -fsSL https://raw.githubusercontent.com/maxritter/pilot-shell/main/uninstall.sh | bashReset & Refresh
Over time, accumulated session logs and Pilot Shell's caches can slow things down. A periodic reset gives you a clean baseline:
# 1. If using Claude Code, log out first
/logout
# 2. Back up your current config (just in case)
mv ~/.claude.json ~/.claude.json.bak
mv ~/.claude ~/.claude.bak
mv ~/.codex ~/.codex.bak
mv ~/.pilot ~/.pilot.bak
# 3. Reinstall Pilot Shell from the official installer
curl -fsSL https://raw.githubusercontent.com/maxritter/pilot-shell/main/install.sh | bash
# 4. Re-activate your license, then start your agent
pilot activate <your-license-key>
claude # or: codexOnce Pilot Shell is running smoothly again, you can delete the .bak copies. Forgot your license key? Recover it in the Pilot members area.
Using a Dev Container
Pilot Shell works inside Dev Containers. Copy the .devcontainer folder from this repository into your project, adapt it to your needs (base image, extensions, dependencies), and run the installer inside the container. The installer auto-detects the container environment and skips system-level dependencies like Homebrew.
For tighter isolation when working with untrusted code, combine the dev container with Claude Code's /sandbox — bubblewrap, socat, iptables, and ipset are pre-installed in the Dockerfile so it works out of the box on Linux. See Anthropic's development containers and sandboxing docs for hardening patterns (egress allowlist, managed settings, persistent volumes).
What the installer does
9-step installer with progress tracking, rollback on failure, and idempotent re-runs. Steps 3 and 4 are agent-conditional — they skip cleanly when the matching agent CLI is not detected. The installer does not install Claude Code or Codex CLI itself; install at least one yourself per the prerequisites above.
- Prerequisites — Checks/installs Homebrew, Node.js, Python 3.12+, uv, git, jq. Verifies at least one supported agent (Claude Code or Codex CLI) is on the system; aborts with a clear error otherwise.
- Pilot files — Agent-neutral Pilot Shell-managed assets. Hooks →
~/.pilot/hooks/, Console scripts/UI →~/.pilot/, MCP server template →~/.pilot/.mcp.json, raw rule sources →~/.pilot/rules/(read by Codex's adapter), plus the canonical skill source at~/.claude/skills/(Claude reads natively; Codex adapts in step 4). Always runs. - Claude files — Claude-specific assets under
~/.claude/: rules, sub-agents,settings.json(three-way merged), plus the Claude post-install merges (hooks into settings,~/.claude.jsonMCP block, model config migration). Skipped when Claude Code CLI is not detected. - Codex files — Codex-specific assets: adapted skills →
~/.agents/skills/, review agents →~/.codex/agents/,~/.codex/AGENTS.md,~/.codex/config.toml,~/.codex/hooks.json. Per-category counts mirror the Claude section. Skipped when Codex CLI is not detected. - Config files — Creates
.nvmrcand project config. - Dependencies — Installs Semble, RTK, CodeGraph, Chrome DevTools MCP, playwright-cli, agent-browser, language servers, and the
codex@openai-codexClaude marketplace plugin (skipped on Codex-only systems alongside Chrome DevTools MCP and LSP plugins). - Shell integration — Auto-configures bash, fish, and zsh with the
pilotadmin alias. - VS Code extensions — Installs recommended extensions for your stack.
- Finalize — Success message with next steps.
Run these commands once in each new project after installing Pilot Shell:
# Claude Code # Codex CLI
claude codex
> /setup-rules > $setup-rules/setup-rules reads your codebase, discovers your conventions, and generates project-specific rules and MCP server docs — this is how Pilot learns your project. Run it once to start, then again after major architectural changes.
Once your rules are in place, use /create-skill to capture any repeatable workflow as a reusable skill, and /benchmark to measure whether your rules and skills are actually improving outputs. See Additional Workflows for full details on all three.
Three commands cover the full development cycle — from vague idea to shipped feature. Quality hooks and TDD enforcement run automatically on every task.
/prd is the brainstorming surface for ideas that aren't specs yet — vague problem statements and fuzzy shapes. It pitches directions, pressure-tests them with you, and converges on a PRD you can hand to /spec. PRDs are saved to docs/prd/ and visible in the Console's Requirements tab.
# Claude Code # Codex CLI
claude codex
> /prd "Add real-time notifications for team updates" > $prd "Add real-time notifications for team updates"How /prd works
When to use /prd over /spec: /prd is for what and why; /spec is for how. Reach for /prd first when you only have a problem statement, want to riff across multiple directions, or need scope boundaries defined before someone starts building.
Flow: two modes, picked automatically from how fuzzy the idea is:
- Ideate — free-form prose, the agent pitches 3-5 directions, you react (only runs when the idea is vague)
- Clarify → Converge → Write — structured multiple-choice questions once the shape is known, then the PRD is written
Research tiers (picked at the start):
| Tier | Behavior |
|---|---|
| Quick | Skip research |
| Standard | Web search for competitors, prior art, best practices |
| Deep | Parallel research agents for comprehensive findings |
The final PRD covers problem statement, core user flows, scope boundaries, and technical context — then offers to hand off directly to /spec for implementation.
/spec is for new features, refactoring, and architectural work. It provides a complete planning workflow with TDD, verification, and code review (on Claude Code, it replaces the built-in plan mode at Shift+Tab). Collaborative spec review shifts review left — share a single link, teammates annotate inline, feedback flows back into the Console grouped by author.
# Claude Code # Codex CLI
claude codex
> /spec "Add user authentication with OAuth and JWT" > $spec "Add user authentication with OAuth and JWT"Discuss → Plan → Approve → Implement (TDD) → Verify → Done
↑ ↓
└── Loop──┘
How /spec works
/spec auto-detects whether the request is a feature or a bugfix and routes to the right workflow. The three phases below apply to both — the verify step differs slightly (features get E2E scenarios; bugfixes get a Behavior Contract audit, see the /fix section below).
Plan: Explores codebase with semantic search → asks clarifying questions → writes detailed spec with scope, tasks, and definition of done → for UI features, writes E2E test scenarios (step-by-step, browser-executable) that become the verification contract → spec-review sub-agent validates completeness in Claude Code or Codex → waits for your approval. Optional Codex Companion Reviewers provide a Claude Code plugin second opinion when enabled.
Implement: Creates an isolated git worktree → implements each task with strict TDD (RED → GREEN → REFACTOR) → quality hooks auto-lint, format, and type-check every edit → full test suite after each task.
Verify: Full test suite + actual program execution → changes-review sub-agent in Claude Code or Codex (compliance + quality + goal) → for UI features, executes each E2E scenario step-by-step via browser automation (pass/fail tracked, results written to plan) → auto-fixes findings → squash merges to main on success.
/fix is the bugfix command. Investigate the bug, write the failing test, fix at the root cause, single-pass audit, done. No plan file, no approval mid-flow, no separate verify phase.
# Claude Code # Codex CLI
claude codex
> /fix "annotation persistence drops fields between save and reload" > $fix "annotation persistence drops fields"Investigate → RED → Fix → Audit → Quality Gate → Done
If investigation reveals the bug is multi-component or architectural, /fix stops cleanly and tells you to re-invoke with /spec. /fix is always quick; /spec is the full workflow.
How /fix works
For local bugs. Single file, obvious-once-traced root cause. No plan file, no approval mid-flow, no separate verify phase. TDD still enforced — bugfixes without a failing test don't ship.
- Investigate: Reproduce the bug → trace to root cause at
file:linewithcodegraph_context(structure) +semble search(intent, cross-language) + targeted reads → state confidence (High/Medium required to proceed). For UI / async / race bugs, add temporarySPEC-DEBUG:-marked logs at component boundaries before tracing. - RED: Write the failing test via an existing public entry point → run, must fail with the documented symptom.
- Fix: Minimal change at the root cause. Symptom patches are forbidden. Reproducing test must pass, then the targeted test module. Diff sanity check (root-cause file in diff, no unplanned files, < 20 lines, symptom-patching grep) catches issues with the fix itself.
- Verify End-to-End: The primary correctness signal. Run the actual program with the original input (browser automation for UI — Claude Code uses its Chrome extension; Codex uses Chrome DevTools MCP; both fall back to playwright-cli / agent-browser. CLI / API / REPL / job trigger for non-UI) and capture concrete evidence. A passing unit test alone is never accepted as proof.
- Quality Gate: Lint + types + build + full anti-regression suite, once.
When to use /spec for bugs instead: bugs that span 3+ files, need a written plan and approval, warrant a Behavior Contract (Given / When / Currently / Expected), or have failed two fix attempts. /spec adds a revert-test proof in verify, a cap at 3 iterations, and a code review gate — use it when the complexity makes that structure worthwhile.
Run after installing Pilot Shell to configure your environment, then on demand as your project evolves.
/setup-rules explores your codebase, discovers conventions, generates modular rules and documents MCP servers. Run once initially, then anytime your project changes significantly.
# Claude Code # Codex CLI
claude codex
> /setup-rules > $setup-rulesHow /setup-rules works
12 phases that read your codebase and produce comprehensive AI context:
- Reference — load best practices for rule structure, path-scoping, and quality standards
- Read existing rules — inventory all
.claude/rules/files, detect structure and path-scoping. Also detectsCLAUDE.mdandAGENTS.md(the cross-framework agent context file used by Codex, Cursor, etc.) - Migrate unscoped assets — prefix with project slug for better sharing
- Quality audit — check rules against best practices (size, specificity, stale references, conflicts)
- Explore codebase — hybrid semantic+lexical search with Semble, structural analysis with CodeGraph
- Compare patterns — discovered vs documented conventions
- Sync project rule — update
{slug}-project.mdwith current tech stack, structure, commands. MigratesCLAUDE.md/AGENTS.mdcontent into modular rules - Sync MCP docs — smoke-test user MCP servers, document working tools
- Discover new rules — find undocumented patterns worth capturing
- Cross-check — validate all references, ensure consistency across generated files
- Sync AGENTS.md — if
AGENTS.mdalready exists, offer to re-export the updated rules into it so non-Claude agents see the same context. Always asks first, never creates the file if absent, preserves user-authored sections - Summary — report all changes made
For monorepos: Organizes rules in nested subdirectories by product and team, with paths frontmatter to scope rules to specific file types. Generates a README.md documenting the structure.
/create-skill builds a reusable skill from any topic — explores the codebase and creates it interactively with you. If no topic is given, evaluates the current session for extractable knowledge.
# Claude Code # Codex CLI
claude codex
> /create-skill "Automate PR Bot comment review" > $create-skill "Automate PR Bot comment review"How /create-skill works
6 phases that turn domain knowledge into a reusable skill:
- Reference — load use case categories, complexity spectrum, file structure template, description formula, security restrictions
- Understand — explore the codebase for relevant patterns, ask clarifying questions, or evaluate the current session for extractable knowledge
- Check existing — search project and global skills to avoid duplicates
- Create — write to the active agent's skills directory (
.claude/skills/or~/.claude/skills/for Claude Code;~/.agents/skills/for Codex), apply portability and determinism checklists - Quality gates — structure checklist (SKILL.md naming, frontmatter fields), content checklist (error handling, examples, exclusions), triggering test (should/shouldn't trigger), iteration signals
- Test & iterate — run test prompts with sub-agents, evaluate results, optimize description triggering
Use case categories:
| Category | Best For |
|---|---|
| Document & Asset Creation | Consistent reports, designs, code with embedded style guides and templates |
| Workflow Automation | Multi-step processes with validation gates and iterative refinement |
| MCP Enhancement | Workflow guidance on top of MCP tool access, multi-MCP coordination |
Skill structure: Each skill is a folder with a SKILL.md file (case-sensitive), optional scripts/, references/, and assets/ directories. The YAML frontmatter description determines when the agent loads the skill — it must include what the skill does, when to use it, and specific trigger phrases. Progressive disclosure keeps context lean: frontmatter loads always (~100 tokens), SKILL.md loads on activation, linked files load on demand.
/benchmark runs your prompts with and without the target, grades outputs against falsifiable assertions, and shows a structured report you can absorb in 30 seconds — labeled verdict, quadrant breakdown, and only the divergent assertions in the drill-down. Finishes with a concrete improvement plan so you know exactly what to change next.
# Claude Code # Codex CLI
claude codex
> /benchmark pilot/skills/create-skill > $benchmark pilot/skills/create-skill
> /benchmark pilot/rules/testing.md > $benchmark pilot/rules/testing.mdHow /benchmark works
Six phases turn a rule or skill into a before/after comparison with an actionable plan:
-
Intake — pick up an existing
benchmarks/<target>/evals.jsonor author one -
Target discovery — classify as
skillorrules -
Author evals — draft 3 falsifiable assertions; falsifiability gate ensures baseline actually fails
-
Execute — run both configs in isolated sandboxes; grader subagent scores every assertion
-
Present findings — three layers, scannable top-to-bottom:
Layer Content Verdict One labeled sentence with a recommended next step. Delta bands: 🟢 Strong (≥ +0.50) / 🟢 Moderate (+0.20) / 🟡 Weak (+0.05) / ⚪ Indistinguishable (±0.05) / 🔴 Regression (< −0.05) Quadrant breakdown Counts each assertion as Signal (✓/✗) / Baseline (✓/✓) / Unreachable (✗/✗) / Regression (✗/✓). The dominant quadrant drives the plan Per-eval drill-down Only divergent assertions get a row; matching ones fold into header counts so the report stays under one screen -
Improvement plan — ≤ 5 ranked proposals in a uniform format (
[TARGET]or[EVALS]tag, location, current quote, replacement, "Lever" line). You pick: apply target edits, iterate on evals, both, or save the plan and stop. Re-runs land in a freshruns/<ts>/so iteration deltas stay legible.
Isolation: each run gets its own sandbox directory; a globally-installed copy of the target in ~/.claude/ (or ~/.codex/ / ~/.agents/) is auto-hidden for the duration and restored afterward (with on-disk recovery manifest covering SIGKILL / power loss / segfault). Conditional-loading frontmatter (path: / paths:) is stripped from the copy installed into the with sandbox so the target loads unconditionally for every prompt — without that, rules scoped to e.g. paths: ["**/*.py"] would stay dormant in both configs and the delta would collapse to 0.00. The source file is never modified.
Key flags: --runs N (default 1), --configs with,without, --workers N, --model, --no-isolate-global, --restore-hidden.
Local web dashboard at localhost:41777 with real-time notifications and 10 views.
Global command center with 8 clickable stat cards and 4 recent cards (Specifications, Requirements, Sessions, Memories). Active specs shown as pills in the top bar; notification bell in the top right.
Browse past sessions with search for both Claude Code and Codex. For Claude Code, copy the session ID and use /resume <session-id> to directly jump back into any session.
Browsable observations — decisions, discoveries, bugfixes — with type filters and search. Each memory shows its session — click to navigate directly to it.
Product requirement documents (PRDs) generated by /prd, with view and annotate modes. Access all previous documents and share them with your team for direct feedback and annotations.
All spec plans generated by /spec with task progress, phase tracking, and iteration history. Annotate mode lets you mark up plans visually before approving, share with teammates via a single link.
Browse, edit, compare, and share your rules, commands, skills, and agents. Connect a git remote to push/pull extensions across your team, with optional APM-compatible export format.
Git diff viewer with staged/unstaged files, branch info, and worktree context. Review mode adds inline annotations on diff lines — the agent reads them directly before marking a spec as verified.
Daily token costs, model routing breakdown, and usage trends across sessions for both Claude Code and Codex sessions. Correlates costs to commits and show savings via CLI proxy integration.
Configure spec workflow toggles, reviewer settings, and Console preferences. Toggle labels show which review agents run on Claude Code + Codex, and which Codex Companion Reviewers require the Claude Code Codex plugin.
Documentation, guides, and quick-start resources to explain the concepts in detail.
For full details on every component, see the Documentation.
See the full changelog at pilot.openchangelog.com.
Found a bug or missing a feature? Open an issue on GitHub.
See LICENSE.
How real engineers run Claude Code and Codex
