
Conversation

@stefano-feijoo-db

Add a comprehensive Claude Code slash command `/agent-evaluation` that provides an
interactive workflow for building evaluation suites for LangGraph agents.

The workflow includes the following steps:
1. Agent Analysis - Reads agent.py to understand tools, model, and
capabilities
2. Requirements Gathering - Asks clarifying questions about agent
purpose, quality dimensions, expected inputs/outputs
3. Synthetic Dataset Creation - Optionally guides users through creating
ground truth test cases with inputs, expectations, and expected facts
4. Scorer Selection - Recommends appropriate MLflow scorers based on
agent type:
- No ground truth: RelevanceToQuery, Safety, Completeness, Fluency,
RetrievalGroundedness, ToolCallCorrectness
- With ground truth: Correctness, RetrievalSufficiency, Equivalence,
Guidelines, ExpectationsGuidelines
- Custom code-based scorers for specific needs
5. Configuration - Creates agent_eval_config.yaml with judge model
endpoints
6. Code Generation - Writes complete evaluate_agent.py with imports,
dataset, scorers, and execution logic
7. Documentation - Generates evaluation_methodology.md with test categories,
scorer justifications, and interpretation guidance

References MLflow GenAI evaluation APIs and Databricks model endpoints.
Enables reproducible, comprehensive agent quality assessment.
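
For orientation, a minimal sketch of the kind of `evaluate_agent.py` the command generates, assuming MLflow 3's `mlflow.genai.evaluate` API and the built-in scorers named above; the dataset rows and the `agent_server.agent.invoke` entry point are illustrative placeholders, not the command's actual output:

```python
# Minimal sketch only -- not the file this command generates.
# Assumes MLflow 3 GenAI APIs; dataset rows and agent entry point are placeholders.
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# With ground truth, "expectations" feeds scorers such as Correctness.
eval_dataset = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_facts": ["Directs the user to account settings"]},
    },
]

def predict_fn(question: str) -> str:
    """Invoke the agent under test; replace with the real agent call."""
    from agent_server.agent import invoke  # hypothetical entry point
    return invoke(question)

# Results and per-scorer assessments are logged to the active MLflow experiment.
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery(), Safety(), Correctness()],
)
```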

@stefano-feijoo-db (Author)

Preliminary /agent-evaluation command

@stefano-feijoo-db marked this pull request as ready for review January 9, 2026 23:28
@stefano-feijoo-db marked this pull request as draft January 9, 2026 23:29
@bbqiu self-requested a review January 10, 2026 03:10
@bbqiu (Contributor) left a comment

this looks really really good! i'm gonna ask someone from the eval team to take a look as well just so we can confirm we're following best practices

please re-request my review whenever! excited to try out the slash command to actually write some evals


First, read and analyze the agent implementation:

1. Read the file at @agent_server/agent.py to understand:

nit: i think not everyone will have a single file that encapsulates their agent, we might want claude to do a complete survey of the agent_server folder

- **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
- **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`

Based on the user's answers, recommend scorers from the following MLflow options:

does claude end up remembering that they did / didn't select ground truth?

one thing i've seen other slash commands do is persist some of the info collected into a file -- this could be useful, especially since we're gonna collect information about the agent / use case. we can maybe start the slash command by reading through the persisted file and then working with the user to update whatever content is already in the file
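
(Not something the PR does today, but as a rough sketch of that persist-and-resume idea; the file path and fields below are hypothetical:)

```python
# Rough sketch of the persist-and-resume pattern described above.
# The file path and the fields are hypothetical, not part of this PR.
import json
from pathlib import Path

CONTEXT_FILE = Path("agent_server/evaluation/eval_context.json")

def load_or_init_context() -> dict:
    """Read previously collected answers so a re-run can resume, else start fresh."""
    if CONTEXT_FILE.exists():
        return json.loads(CONTEXT_FILE.read_text())
    return {"agent_purpose": None, "quality_dimensions": [], "has_ground_truth": None}

def save_context(context: dict) -> None:
    """Persist the user's answers, e.g. whether they opted into ground truth."""
    CONTEXT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CONTEXT_FILE.write_text(json.dumps(context, indent=2))
```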

## Step 6: Create an Agent Evaluation Configuration

nit: step 5?

# Load the agent_eval_config.yaml file
with open("./artifacts/configs/agent_eval_config.yaml", "r") as f:

nit: update this

The configuration will contain the models used for each scorer or custom-judge created.

Example configuration file:
```agent_eval_config.yaml

nit: can use ```yaml here
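
For illustration, a loading sketch that matches the quoted snippet above; the `judge_model` key and the fallback URI are assumptions, not the config schema defined in this PR:

```python
# Illustrative only: consuming an agent_eval_config.yaml like the one above.
# The "judge_model" key and the fallback URI are assumptions, not this PR's schema.
import yaml

with open("./artifacts/configs/agent_eval_config.yaml", "r") as f:
    eval_config = yaml.safe_load(f)

# Pick the judge endpoint to pass to LLM-based scorers.
judge_model = eval_config.get("judge_model", "databricks:/<your-judge-endpoint>")
```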

@@ -0,0 +1,482 @@
# Agent Evaluation Workflow

You are an expert in LLM evaluation and MLflow. Your goal is to guide the user through creating a comprehensive evaluation suite for their LangGraph agent. Follow this workflow carefully.

it may be helpful to share some basic starter eval docs w/ the user as well

@smoorjani left a comment

This is really similar to something Alkis Polyzotis and Alex Miller have worked on for incorporating Claude skills into evaluation - see https://databricks.slack.com/archives/C083A8HQC6N/p1767826124661069


First, read and analyze the agent implementation:

1. Read the file at @agent_server/agent.py to understand:

+1 - tools and configs are typically stored separately

@@ -0,0 +1,482 @@
# Agent Evaluation Workflow

You are an expert in LLM evaluation and MLflow. Your goal is to guide the user through creating a comprehensive evaluation suite for their LangGraph agent. Follow this workflow carefully.

one pattern that is super useful with claude is having a separate folder (e.g., docs) containing a bunch of markdown files which claude can refer to as sort of a "how-to guide". Each file references a different component of the workflow and CLAUDE can selectively pick what it needs to load into its context, leaving more room for high-level instructions in this command.

See this repo as an example: https://github.com/databricks-solutions/custom-mlflow-review-app/tree/main/docs

You can ask claude to read the docs and generate these files.

Ask the user the following clarifying questions to understand their evaluation needs:

**Question 1 - Agent Purpose:**
"What is the primary purpose of your agent? For example:

This may be easy enough to infer from the agent's code; however, there are other things like who is the audience/end-user to ascertain things like how technical the agent should be.

- Other (please describe)"

**Question 2 - Critical Quality Dimensions:**
"Which quality dimensions are most important for your agent? (Select all that apply):

In many cases, users will have custom quality dimensions, not necessarily the same as what we offer in built-ins. Luckily CLAUDE is pretty good at generating custom metrics!
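
As a rough sketch of that point, a custom dimension can be captured either with a guidelines-based judge or a code-based scorer (assuming MLflow 3's `Guidelines` scorer and `@scorer` decorator; the dimension itself is made up):

```python
# Sketch of two ways to express a custom quality dimension (assumes MLflow 3 GenAI).
from mlflow.genai.scorers import Guidelines, scorer

# 1) Guidelines-based LLM judge for a dimension the built-ins don't cover.
brand_voice = Guidelines(
    name="brand_voice",
    guidelines="Responses must be friendly, avoid internal jargon, and never promise delivery dates.",
)

# 2) Deterministic code-based scorer for a hard rule (hypothetical: no pricing talk).
@scorer
def avoids_pricing(outputs) -> bool:
    """Pass only if the agent's output never quotes prices."""
    text = str(outputs)
    return "$" not in text and "price" not in text.lower()
```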


**Options:**
1. **Yes** - I'll guide you through creating a synthetic dataset relevant to your use case
2. **No** - Proceed with scorers that don't require ground truth"

I think we should also point towards how to collect ground truth from subject matter experts or guide the user through the setup process.

- **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
- **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`

Based on the user's answers, recommend scorers from the following MLflow options:

can we recommend it directly from the docs? just to keep this list up to date.

- **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
- **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`

Based on the user's answers, recommend scorers from the following MLflow options:

+1

Best models to leverage as judges as of 12/01/2026:

we can also suggest models from other providers if people don't want to use DBX model serving (e.g., openai:/gpt-5). Most FMAPI models have heavy rate limits which may not work well with larger scale evaluations.
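
A sketch of what that could look like, assuming the guidelines scorer accepts a `<provider>:/<model>` URI via its `model` argument:

```python
# Sketch: pointing an LLM judge at a non-Databricks provider to avoid FMAPI rate limits.
# Assumes the scorer accepts a "<provider>:/<model>" URI through its model argument.
from mlflow.genai.scorers import Guidelines

tone = Guidelines(
    name="tone",
    guidelines="Responses must be concise and professional.",
    model="openai:/gpt-4o",  # swap in whichever provider/endpoint you actually have access to
)
```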

"Ensure you have a function decorated with `@invoke()`."
)

# Wrap async invoke function

I think this is no longer necessary - we handle this wrapping in mlflow.genai.evaluate
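
If so, the generated script could probably hand the entry point straight to the harness; a hedged sketch under that assumption (the import path is hypothetical):

```python
# Sketch based on the comment above: let mlflow.genai.evaluate drive the agent
# instead of hand-wrapping the async invoke function. Assumes the harness accepts
# the agent callable directly; the import path is hypothetical.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

from agent_server.agent import invoke  # hypothetical @invoke()-decorated entry point

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What can this agent help me with?"}}],
    predict_fn=invoke,  # no manual async wrapping, per the note above
    scorers=[RelevanceToQuery(), Safety()],
)
```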

stefano-feijoo-db and others added 5 commits January 13, 2026 19:17
Convert the agent-evaluation slash command into a full-featured skill system
with extensive documentation, patterns, and utility scripts for evaluating
LLM agents using MLflow.

Changes:
- Rename agent-evaluation.md command to _agent-evaluation.md (internal use)
- Add SKILL.md as the main entry point with evaluation workflow
- Add patterns documentation:
  - CRITICAL-interfaces.md: API signatures and data schemas
  - GOTCHAS.md: Common mistakes and how to avoid them
  - patterns-datasets.md: Dataset creation best practices
  - patterns-evaluation.md: Running evaluations and comparisons
  - patterns-scorers.md: Custom scorer creation patterns
- Add reference documentation:
  - agent-strategy.md: Agent evaluation strategy guide
  - dataset-preparation.md: Dataset preparation guidelines
  - scorers.md: Available scorers and usage
  - scorers-constraints.md: Scorer constraints and requirements
  - setup-guide.md: MLflow environment setup
  - tracing-integration.md: MLflow tracing integration
  - troubleshooting.md: Common issues and solutions
- Add utility scripts:
  - setup_mlflow.py: MLflow environment configuration
  - validate_auth.py: Authentication validation
  - validate_environment.py: Environment validation
  - validate_tracing_static.py: Static tracing validation
  - validate_tracing_runtime.py: Runtime tracing validation
  - create_dataset_template.py: Dataset creation template
  - list_datasets.py: List available evaluation datasets
  - run_evaluation_template.py: Evaluation execution template
  - analyze_results.py: Results analysis utilities
- Add evaluation report template in assets/

This skill enables systematic agent evaluation with MLflow, covering:
- Environment setup and tracing integration
- Dataset creation and management
- Built-in and custom scorer definition
- Evaluation execution and results analysis

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add pattern file descriptions with line counts and usage guidance
- Add Step 0 to agent strategy workflow for checking existing strategy files
- Clarify when to read CRITICAL-interfaces.md and GOTCHAS.md
- Add "Other" option for custom quality dimensions

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@stefano-feijoo-db (Author) commented Jan 15, 2026

Key Differences from the Main MLflow Skill, Adapted to Agents on Apps

Reference: MLflow Skills PR

SKILL.md Workflow Updates

  • Documentation Access Protocol: Added to outline patterns for improved model API knowledge.
  • Evaluation Workflow:
    • Step 1.1: Now points to /references/agent-strategy.md to initiate the process, rather than invoking the agent with sample input immediately.
    • Step 3 (Dataset & Ground Truth): Added a prompt asking the user if they wish to create a dataset. *Point of discussion.
    • Step 5 (Outputs): The evaluation now writes the following specific files:
      • Script: agent_server/evaluate_agent.py
      • Results: agent_server/evaluation/evaluation_results.json
      • Report: agent_server/evaluation/evaluation_report.md

Documentation & Strategy Changes

  • agent-strategy.md: Updated to utilize specific patterns.
  • dataset-preparation.md: Now recommends 100+ test cases for production environments.
  • scorers.md: Explicitly lists available reference-free and ground-truth scorers to prevent import errors.
  • Design Process: Now references agent_server/evaluation/agent_strategy.md to support the design workflow.

@bbqiu self-requested a review January 16, 2026 19:17
Reorganize skill documentation for better maintainability and clarity:

## Documentation Structure Changes
- Extract trace analysis patterns to new `patterns-traces.md` (7 patterns)
- Extract bundled resources docs to new `references/bundle-resources.md`
- Slim down SKILL.md to focus on workflow orchestration

## SKILL.md Improvements
- Update description: "IMPROVE/OPTIMIZE" → "EVALUATE" for clearer intent
- Reorganize TOC: add Evaluation Overview, Command Conventions, Pre-Flight
- Move coding patterns table to Step 5 for better context
- Add user prompts for naming output files (test cases, results, reports)
- Emphasize "CRITICAL" for MLflow llms.txt documentation access

## Workflow Refinements
- agent-strategy.md: Add "Discovering Agent Server Structure" section
- agent-strategy.md: Simplify success criteria format
- dataset-preparation.md: Parameterize test-cases-file paths
- run_evaluation_template.py: Rename main() → evaluate()

## Content Cleanup
- CRITICAL-interfaces.md: Remove redundant Installation/Setup sections
- patterns-datasets.md: Focus solely on dataset generation patterns
- Renumber patterns after trace patterns extraction

Co-Authored-By: Claude Opus 4.5 <[email protected]>