2 changes: 1 addition & 1 deletion _vale/config/vocabularies/Docker/accept.txt
@@ -294,4 +294,4 @@ Zsh
[Ww]alkthrough
[Tt]oolsets?
[Rr]erank(ing|ed)?

[Ee]vals?
2 changes: 1 addition & 1 deletion content/manuals/ai/cagent/best-practices.md
@@ -2,7 +2,7 @@
title: Best practices
description: Patterns and techniques for building effective cagent agents
keywords: [cagent, best practices, patterns, agent design, optimization]
weight: 20
weight: 40
---

Patterns you learn from building and running cagent agents. These aren't
163 changes: 163 additions & 0 deletions content/manuals/ai/cagent/evals.md
@@ -0,0 +1,163 @@
---
title: Evals
description: Test your agents with saved conversations
keywords: [cagent, evaluations, testing, evals]
weight: 80
---

Evaluations (evals) help you track how your agent's behavior changes over time.
When you save a conversation as an eval, you can replay it later to see if the
agent responds differently. Evals measure consistency, not correctness - they
tell you if behavior changed, not whether it's right or wrong.

## What are evals

An eval is a saved conversation you can replay. When you run evals, cagent
replays the user messages and compares the new responses against the original
saved conversation. High scores mean the agent behaved similarly; low scores
mean behavior changed.

What you do with that information depends on why you saved the conversation.
You might save successful conversations to catch regressions, or save failure
cases to document known issues and track whether they improve.

## Common workflows

How you use evals depends on what you're trying to accomplish:

**Regression testing**: Save conversations where your agent performs well. When you
make changes later (upgrade models, update prompts, refactor code), run the
evals. High scores mean behavior stayed consistent, which is usually what you
want. Low scores mean something changed - examine the new behavior to see if
it's still correct.

**Tracking improvements**: Save conversations where your agent struggles or fails.
As you make improvements, run these evals to see how behavior evolves. Low
scores indicate the agent now behaves differently, which might mean you fixed
the issue. You'll need to manually verify the new behavior is actually better.

**Documenting edge cases**: Save interesting or unusual conversations regardless of
quality. Use them to understand how your agent handles edge cases and whether
that behavior changes over time.

Evals measure whether behavior changed. You determine if that change is good or
bad.
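
For example, the regression-testing workflow boils down to a short loop. The
commands are covered in the sections below, and `checkout-flow` is just a
placeholder eval name:

```console
$ cagent run ./agent.yaml      # have a conversation while the agent behaves well
> /eval checkout-flow
Eval saved to evals/checkout-flow.json

$ # ...update prompts, swap models, refactor...

$ cagent eval ./agent.yaml     # replay the baseline and compare the scores
```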

## Creating an eval

Save a conversation from an interactive session:

```console
$ cagent run ./agent.yaml
```
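
The `./agent.yaml` above stands for whatever agent you're testing. As a rough,
minimal sketch of what such a configuration might look like (the model
reference and instruction are placeholders, not a recommended setup):

```yaml
agents:
  root:
    model: openai/gpt-5-mini   # placeholder; use any model you have access to
    description: Answers questions about geography
    instruction: |
      You are a concise assistant. Answer factual questions about landmarks
      and mention the country they belong to when relevant.
```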

Have a conversation with your agent, then save it as an eval:

```console
> /eval test-case-name
Eval saved to evals/test-case-name.json
```

The conversation is saved to the `evals/` directory in your current working
directory. You can organize eval files in subdirectories if needed.
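
For example, you might group saved evals by feature area (the names below are
purely illustrative):

```text
evals/
├── geography/
│   └── landmark-facts.json
└── regressions/
    └── tool-call-order.json
```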

## Running evals

Run all evals in the default directory:

```console
$ cagent eval ./agent.yaml
```

Use a custom eval directory:

```console
$ cagent eval ./agent.yaml ./my-evals
```

Run evals against an agent from a registry:

```console
$ cagent eval agentcatalog/myagent
```

Example output:

```console
$ cagent eval ./agent.yaml
--- 0
First message: tell me something interesting about kil
Eval file: c7e556c5-dae5-4898-a38c-73cc8e0e6abe
Tool trajectory score: 1.000000
Rouge-1 score: 0.447368
Cost: 0.00
Output tokens: 177
```

## Understanding results

For each eval, cagent shows:

- **First message** - The initial user message from the saved conversation
- **Eval file** - The UUID of the eval file being run
- **Tool trajectory score** - How similarly the agent used tools (0-1 scale,
higher is better)
- **[ROUGE-1](https://en.wikipedia.org/wiki/ROUGE_(metric)) score** - Text
similarity between responses (0-1 scale, higher is better)
- **Cost** - The cost for this eval run
- **Output tokens** - Number of tokens generated

Higher scores mean the agent behaved more similarly to the original recorded
conversation. A score of 1.0 means identical behavior.

### What the scores mean

**Tool trajectory score** measures whether the agent called the same tools in
the same order as the original conversation. Lower scores might indicate the
agent found a different approach to solve the problem, which isn't necessarily
wrong but is worth investigating.

**ROUGE-1 score** measures how similar the response text is to the original.
This is a heuristic measure - different wording might still be correct, so use
this as a signal rather than absolute truth.
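
If it helps to see the idea concretely, the sketch below shows simplified
stand-ins for both scores. It isn't cagent's scoring code (tokenization,
weighting, and trajectory comparison may differ), but it illustrates what
"same tools in the same order" and "unigram overlap" mean:

```python
# Illustrative only: simplified stand-ins for the two scores, assuming
# whitespace tokenization. cagent's actual scoring may differ in detail.
from collections import Counter

def tool_trajectory_score(saved: list[str], new: list[str]) -> float:
    """Fraction of positions where the same tool was called in the same order."""
    if not saved and not new:
        return 1.0
    matches = sum(a == b for a, b in zip(saved, new))
    return matches / max(len(saved), len(new))

def rouge_1_f1(saved: str, new: str) -> float:
    """Unigram-overlap F1 between the saved response and the new response."""
    ref, cand = Counter(saved.lower().split()), Counter(new.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(tool_trajectory_score(["search", "fetch"], ["search", "fetch"]))  # 1.0
print(rouge_1_f1("Mount Kilimanjaro is in Tanzania",
                 "Kilimanjaro is located in Tanzania"))                 # 0.8
```

Identical tool sequences push the trajectory score to 1.0, and shared wording
pushes the ROUGE-1 score toward 1.0 even when the phrasing isn't identical.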

### Interpreting your results

Scores close to 1.0 mean your changes maintained consistent behavior - the
agent is using the same approach and producing similar responses. This is
generally good; your changes didn't break existing functionality.

Lower scores mean behavior changed compared to the saved conversation. This
could be a regression where the agent now performs worse, or it could be an
improvement where the agent found a better approach.

When scores drop, examine the actual behavior to determine if it's better or
worse. The eval files are stored as JSON in your evals directory - open the
file to see the original conversation. Then test your modified agent with the
same input to compare responses. If the new response is better, save a new
conversation to replace the eval. If it's worse, you found a regression.
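
A sketch of that loop, reusing the `test-case-name` eval from earlier (the
transcript lines are illustrative):

```console
$ cat evals/test-case-name.json    # review the original conversation

$ cagent run ./agent.yaml          # replay the same input by hand and compare
> <same first message as the saved conversation>
> /eval test-case-name
Eval saved to evals/test-case-name.json
```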

The scores guide you to what changed. Your judgment determines if the change is
good or bad.

## When to use evals

Evals help you track behavior changes over time. They're useful for catching
regressions when you upgrade models or dependencies, documenting known failure
cases you want to fix, and understanding how edge cases evolve as you iterate.

Evals aren't appropriate for determining which agent configuration works best -
they measure similarity to a saved conversation, not correctness. Use manual
testing to evaluate different configurations and decide which works better.

Save conversations worth tracking. Build a collection of important workflows,
interesting edge cases, and known issues. Run your evals when making changes to
see what shifted.

## What's next

- Check the [CLI reference](reference/cli.md#eval) for all `cagent eval`
options
- Learn [best practices](best-practices.md) for building effective agents
- Review [example configurations](https://github.com/docker/cagent/tree/main/examples)
for different agent types
53 changes: 50 additions & 3 deletions content/manuals/ai/cagent/integrations/_index.md
@@ -1,6 +1,53 @@
---
build:
render: never
title: Integrations
weight: 50
description: Connect cagent agents to editors, MCP clients, and other agents
keywords: [cagent, integration, acp, mcp, a2a, editor, protocol]
weight: 60
---

cagent agents can integrate with different environments depending on how you
want to use them. Each integration type serves a specific purpose.

## Integration types

### ACP - Editor integration

Run cagent agents directly in your editor (Neovim, Zed). The agent sees your
editor's file context and can read and modify files through the editor's
interface. Use ACP when you want an AI coding assistant embedded in your
editor.

See [ACP integration](./acp.md) for setup instructions.

### MCP - Tool integration

Expose cagent agents as tools in MCP clients like Claude Desktop or Claude
Code. Your agents appear in the client's tool list, and the client can call
them when needed. Use MCP when you want Claude Desktop (or another MCP client)
to have access to your specialized agents.

See [MCP integration](./mcp.md) for setup instructions.

### A2A - Agent-to-agent communication

Run cagent agents as HTTP servers that other agents or systems can call using
the Agent-to-Agent protocol. Your agent becomes a service that other systems
can discover and invoke over the network. Use A2A when you want to build
multi-agent systems or expose your agent as an HTTP service.

See [A2A integration](./a2a.md) for setup instructions.

## Choosing the right integration

| Feature | ACP | MCP | A2A |
| ------------- | ------------------ | ------------------ | -------------------- |
| Use case | Editor integration | Agents as tools | Agent-to-agent calls |
| Transport | stdio | stdio/SSE | HTTP |
| Discovery | Editor plugin | Server manifest | Agent card |
| Best for | Code editing | Tool integration | Multi-agent systems |
| Communication | Editor calls agent | Client calls tools | Between agents |

Choose ACP if you want your agent embedded in your editor while you code.
Choose MCP if you want Claude Desktop (or another MCP client) to be able to
call your specialized agents as tools. Choose A2A if you're building
multi-agent systems where agents need to call each other over HTTP.