An AI SDK benchmarking tool that tests AI agents with MCP (Model Context Protocol) integration. It automatically discovers and runs all tests in the `tests/` directory, verifying LLM-generated Svelte components against test suites.
To install dependencies:

```bash
bun install
```

To set up `.env`:

```bash
cp .env.example .env
```

Then configure your API keys and model in `.env`:
```bash
# Required: Choose your model
MODEL=anthropic/claude-sonnet-4
ANTHROPIC_API_KEY=your_key_here

# Optional: Enable MCP integration (leave empty to disable)
MCP_SERVER_URL=https://mcp.svelte.dev/mcp
```

Required:

- `MODEL`: The AI model to use (e.g., `anthropic/claude-sonnet-4`, `openai/gpt-5`, `openrouter/anthropic/claude-sonnet-4`, `lmstudio/model-name`)
- Corresponding API key (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, or `OPENROUTER_API_KEY`)
  - Note: No API key required for `lmstudio/*` models (runs locally)

Optional:

- `MCP_SERVER_URL`: MCP server URL (leave empty to disable MCP integration)
Cloud Providers:

- `anthropic/*` - Direct Anthropic API (requires `ANTHROPIC_API_KEY`)
- `openai/*` - Direct OpenAI API (requires `OPENAI_API_KEY`)
- `openrouter/*` - OpenRouter unified API (requires `OPENROUTER_API_KEY`)

Local Providers:

- `lmstudio/*` - LM Studio local server (requires LM Studio running on `http://localhost:1234`)
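The prefix before the first `/` picks the provider. As a rough illustration only, not necessarily how this repo wires it up, resolution with the Vercel AI SDK provider packages could look like this sketch (the OpenRouter and LM Studio branches assume their OpenAI-compatible endpoints):

```ts
// Hypothetical sketch of provider selection from the MODEL prefix.
// The actual resolution logic in this repo may differ.
import { createAnthropic } from '@ai-sdk/anthropic';
import { createOpenAI } from '@ai-sdk/openai';

function resolveModel(model: string) {
  const [provider, ...rest] = model.split('/');
  const id = rest.join('/'); // openrouter ids contain a second slash

  switch (provider) {
    case 'anthropic':
      return createAnthropic({ apiKey: process.env.ANTHROPIC_API_KEY })(id);
    case 'openai':
      return createOpenAI({ apiKey: process.env.OPENAI_API_KEY })(id);
    case 'openrouter': // OpenRouter exposes an OpenAI-compatible API
      return createOpenAI({
        apiKey: process.env.OPENROUTER_API_KEY,
        baseURL: 'https://openrouter.ai/api/v1',
      })(id);
    case 'lmstudio': // local server, placeholder key, nothing leaves the machine
      return createOpenAI({
        apiKey: 'lm-studio',
        baseURL: 'http://localhost:1234/v1',
      })(id);
    default:
      throw new Error(`Unknown provider prefix: ${provider}`);
  }
}
```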
Example configurations:
```bash
# Anthropic
MODEL=anthropic/claude-sonnet-4
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
MODEL=openai/gpt-5
OPENAI_API_KEY=sk-...

# OpenRouter
MODEL=openrouter/anthropic/claude-sonnet-4
OPENROUTER_API_KEY=sk-or-...

# LM Studio (local)
MODEL=lmstudio/llama-3-8b
# No API key needed - make sure LM Studio is running!
```

To run the benchmark (automatically discovers and runs all tests):

```bash
bun run index.ts
```

The benchmark will:
- Discover all tests in the `tests/` directory
- For each test:
  - Run the AI agent with the test's prompt
  - Extract the generated Svelte component
  - Verify the component against the test suite
- Generate a combined report with all results
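Discovery itself is straightforward: every subdirectory of `tests/` is one test case. A minimal sketch of that step (the actual `index.ts` may do more):

```ts
// Hypothetical sketch of test discovery; the real index.ts may differ.
import { readdir } from 'node:fs/promises';

// Every directory under tests/ is treated as one test case.
async function discoverTests(): Promise<string[]> {
  const entries = await readdir('tests', { withFileTypes: true });
  return entries.filter((e) => e.isDirectory()).map((e) => `tests/${e.name}`);
}
```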
Results are saved to the `results/` directory with timestamped filenames:

- `results/result-2024-12-07-14-30-45.json` - Full execution trace with all test results
- `results/result-2024-12-07-14-30-45.html` - Interactive HTML report with expandable test sections
The HTML report includes:
- Summary bar showing passed/failed/skipped counts
- Expandable sections for each test
- Step-by-step execution trace
- Generated component code
- Test verification results with pass/fail details
- Token usage statistics
- MCP status badge
- Dark/light theme toggle
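The JSON schema isn't spelled out here, but based on the report contents above it plausibly resembles the shape below. Every field name is an assumption for illustration, not the project's actual format:

```ts
// Assumed shape only - inferred from the report contents, not the real schema.
interface BenchmarkResult {
  timestamp: string;
  mcpEnabled: boolean; // surfaced as the MCP status badge
  usage: { inputTokens: number; outputTokens: number };
  tests: Array<{
    name: string;
    status: 'passed' | 'failed' | 'skipped';
    component: string; // generated Svelte source
    steps: string[]; // step-by-step execution trace
  }>;
}
```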
To regenerate an HTML report from a JSON file:
```bash
# Regenerate most recent result
bun run generate-report.ts

# Regenerate specific result
bun run generate-report.ts results/result-2024-12-07-14-30-45.json
```

Each test in the `tests/` directory should have:
```
tests/
  {test-name}/
    Reference.svelte   - Reference implementation (known-good solution)
    test.ts            - Vitest test file (imports "./Component.svelte")
    prompt.md          - Prompt for the AI agent
```
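For illustration, a `test.ts` for a hypothetical counter test might look like the following. The `./Component.svelte` import name is the fixed convention from above; everything else here is made up for the example and assumes `@testing-library/svelte` is available:

```ts
// tests/counter/test.ts - hypothetical example, not an actual test in this repo
import { render, screen } from '@testing-library/svelte';
import { expect, test } from 'vitest';
// Every test imports the generated component under this fixed name:
import Component from './Component.svelte';

test('renders an initial count of 0', () => {
  render(Component);
  expect(screen.getByText('0')).toBeTruthy();
});
```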
The benchmark:
- Reads the prompt from `prompt.md`
- Asks the agent to generate a component
- Writes the generated component to a temporary location
- Runs the tests against the generated component
- Reports pass/fail status
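The "extract the generated Svelte component" step has to pull the component out of the agent's free-form reply. One plausible approach, assuming the agent returns a fenced `svelte` code block (the benchmark's real extraction logic may differ):

```ts
// Hypothetical sketch: pull the first fenced svelte code block out of the reply.
// The benchmark's actual extraction logic may differ.
function extractSvelteComponent(reply: string): string | null {
  const match = reply.match(/```svelte\s([\s\S]*?)```/);
  return match ? match[1].trim() : null;
}
```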
To verify that all reference implementations pass their tests:
```bash
bun run verify-tests
```

This temporarily copies each `Reference.svelte` to `Component.svelte` and runs the tests.
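A sketch of what that verification step might do for a single test directory, assuming Vitest is invoked per directory (the actual script may differ):

```ts
// Hypothetical sketch of verify-tests for one test directory.
import { execSync } from 'node:child_process';
import { copyFile, rm } from 'node:fs/promises';

async function verifyReference(testDir: string) {
  // Stand the reference solution in for the generated component...
  await copyFile(`${testDir}/Reference.svelte`, `${testDir}/Component.svelte`);
  try {
    // ...and run the same Vitest suite the benchmark uses.
    execSync(`bunx vitest run ${testDir}`, { stdio: 'inherit' });
  } finally {
    await rm(`${testDir}/Component.svelte`); // clean up the temporary copy
  }
}
```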
The tool supports optional integration with MCP (Model Context Protocol) servers:
- Enabled: Set `MCP_SERVER_URL` to a valid MCP server URL
- Disabled: Leave `MCP_SERVER_URL` empty or unset
MCP status is recorded in the JSON metadata and displayed as a badge in the HTML report.
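With the Vercel AI SDK, conditional wiring could look roughly like this sketch, using the SDK's experimental MCP client (an assumption on my part; the repo's actual integration may differ):

```ts
// Hypothetical sketch: expose MCP tools only when MCP_SERVER_URL is set.
import { experimental_createMCPClient } from 'ai';

async function loadMcpTools() {
  const url = process.env.MCP_SERVER_URL;
  if (!url) return {}; // MCP disabled: the agent runs without extra tools

  const client = await experimental_createMCPClient({
    transport: { type: 'sse', url },
  });
  return client.tools(); // tool definitions the agent can call
}
```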
Exit codes:

- `0`: All tests passed
- `1`: One or more tests failed
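This makes the benchmark easy to gate in CI. Conceptually (a sketch, with an assumed result shape):

```ts
// Assumed result shape, for illustration only.
type TestResult = { passed: boolean };

// Exit 0 only when every test passed, so CI can gate on the status.
function exitCode(results: TestResult[]): 0 | 1 {
  return results.every((r) => r.passed) ? 0 : 1;
}
```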
See AGENTS.md for detailed documentation on:
- Architecture and components
- Environment variables and model configuration
- MCP integration details
- Development commands
- Multi-test result format
This project was created using `bun init` in bun v1.3.3. [Bun](https://bun.sh) is a fast all-in-one JavaScript runtime.