Add Claude Code agent evaluation slash command #77
Conversation
bbqiu left a comment (Preliminary)
this looks really really good! i'm gonna ask someone from the eval team to take a look as well just so we can confirm we're following best practices
please re-request my review whenever! excited to try out the slash command to actually write some evals
> First, read and analyze the agent implementation:
>
> 1. Read the file at `@agent_server/agent.py` to understand:
nit: i think not everyone will have a single file that encapsulates their agent, we might want claude to do a complete survey of the agent_server folder
+1 - tools and configs are typically stored separately
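For illustration, a minimal sketch of what surveying the whole `agent_server` folder (rather than a single `agent.py`) could look like if the command delegated to a helper script; the directory layout and file groupings are assumptions, not part of this PR.

```python
# Hypothetical helper: survey agent_server/ so the workflow sees separately
# stored tools and configs, not just a single agent.py.
from pathlib import Path

def survey_agent_server(root: str = "agent_server") -> dict[str, list[str]]:
    """Group files under the agent folder so each can be read and summarized."""
    buckets: dict[str, list[str]] = {"code": [], "configs": [], "other": []}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".py":
            buckets["code"].append(str(path))
        elif path.suffix in {".yaml", ".yml", ".json", ".toml"}:
            buckets["configs"].append(str(path))
        else:
            buckets["other"].append(str(path))
    return buckets

if __name__ == "__main__":
    for bucket, files in survey_agent_server().items():
        print(f"{bucket}: {files}")
```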
> - **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
> - **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`
>
> Based on the user's answers, recommend scorers from the following MLflow options:
does claude end up remembering that they did / didn't select ground truth?
one thing i've seen other slash commands do is persist some of the info collected into a file -- this could be useful, especially since we're gonna collect information about the agent / use case. we can maybe start the slash command by reading through the persisted file and then working with the user to update whatever content is already in the file
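For illustration, a minimal sketch of the persist-and-reload idea, assuming a hypothetical JSON state file under `./artifacts/`; the path and keys are placeholders, not something the PR defines.

```python
# Hypothetical state file the slash command could read at startup and update as
# it collects answers (e.g. whether ground truth was selected).
import json
from pathlib import Path

STATE_FILE = Path("./artifacts/agent_eval_context.json")  # placeholder path

def load_context() -> dict:
    """Return previously collected answers, or an empty dict on the first run."""
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_context(updates: dict) -> dict:
    """Merge newly collected answers into the persisted context."""
    context = load_context()
    context.update(updates)
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(context, indent=2))
    return context

# Example: remember the ground-truth decision across steps and sessions.
save_context({"has_ground_truth": False, "agent_purpose": "RAG over internal docs"})
```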
+1
> ## Step 6: Create an Agent Evaluation Configuration
nit: step 5?
> # Load the agent_eval_config.yaml file
> with open("./artifacts/configs/agent_eval_config.yaml", "r") as f:
nit: update this
> The configuration will contain the models used for each scorer or custom-judge created.
>
> Example configuration file:
> ```agent_eval_config.yaml
nit: can use ```yaml here
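For illustration, one possible shape for `agent_eval_config.yaml` (a judge model per scorer) and how the generated script might parse it; the keys and endpoint names are assumptions, not the format this PR defines.

```python
# Hypothetical config shape; requires PyYAML (pip install pyyaml).
import yaml

SAMPLE_CONFIG = """\
judge_models:
  correctness: databricks-claude-3-7-sonnet   # placeholder endpoint name
  guidelines: databricks-claude-3-7-sonnet
scorers:
  - Correctness
  - Guidelines
"""

config = yaml.safe_load(SAMPLE_CONFIG)
print(config["judge_models"]["correctness"])
print(config["scorers"])
```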
> @@ -0,0 +1,482 @@
> # Agent Evaluation Workflow
>
> You are an expert in LLM evaluation and MLflow. Your goal is to guide the user through creating a comprehensive evaluation suite for their LangGraph agent. Follow this workflow carefully.
it may be helpful to share some basic starter eval docs w/ the user as well
smoorjani left a comment
This is really similar to something Alkis Polyzotis and Alex Miller have worked on incorporating claude skills into evaluation - see https://databricks.slack.com/archives/C083A8HQC6N/p1767826124661069
> @@ -0,0 +1,482 @@
> # Agent Evaluation Workflow
>
> You are an expert in LLM evaluation and MLflow. Your goal is to guide the user through creating a comprehensive evaluation suite for their LangGraph agent. Follow this workflow carefully.
one pattern that is super useful with claude is having a separate folder (e.g., docs) containing a bunch of markdown files which claude can refer to as sort of a "how-to guide". Each file references a different component of the workflow and CLAUDE can selectively pick what it needs to load into its context, leaving more room for high-level instructions in this command.
See this repo as an example: https://github.com/databricks-solutions/custom-mlflow-review-app/tree/main/docs
You can ask claude to read the docs and generate these files.
> Ask the user the following clarifying questions to understand their evaluation needs:
>
> **Question 1 - Agent Purpose:**
> "What is the primary purpose of your agent? For example:
This may be easy enough to infer from the agent's code; however, there are other things like who is the audience/end-user to ascertain things like how technical the agent should be.
> - Other (please describe)"
>
> **Question 2 - Critical Quality Dimensions:**
> "Which quality dimensions are most important for your agent? (Select all that apply):
In many cases, users will have custom quality dimensions, not necessarily the same as what we offer in built-ins. Luckily CLAUDE is pretty good at generating custom metrics!
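For illustration, a sketch of how a custom quality dimension could be expressed either as a natural-language guideline or as a code-based check; the exact constructor and decorator signatures depend on the installed MLflow 3.x version, so treat this as an assumption-laden example.

```python
# Two ways to capture a custom dimension ("cites internal doc IDs"):
from mlflow.genai.scorers import Guidelines, scorer

# 1) Natural-language guideline evaluated by an LLM judge.
cites_sources = Guidelines(
    name="cites_sources",
    guidelines="The response must cite at least one internal document ID.",
)

# 2) Deterministic code-based scorer over the agent's output.
@scorer
def mentions_doc_id(outputs) -> bool:
    """True if the response references an ID like 'DOC-123' (hypothetical format)."""
    import re
    return bool(re.search(r"DOC-\d+", str(outputs)))
```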
> **Options:**
> 1. **Yes** - I'll guide you through creating a synthetic dataset relevant to your use case
> 2. **No** - Proceed with scorers that don't require ground truth"
I think we should also point towards how to collect ground truth from subject matter experts or guide the user through the setup process.
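For illustration, the rough shape that ground-truth records collected from subject matter experts could take; the field names follow MLflow 3 GenAI conventions (`inputs` / `expectations`), but the questions and expected facts are made up.

```python
# Hypothetical ground-truth records an SME review pass could produce.
eval_records = [
    {
        "inputs": {"question": "How do I rotate my API token?"},
        "expectations": {
            "expected_facts": [
                "Tokens are rotated from the account settings page.",
                "Old tokens stop working immediately after rotation.",
            ],
        },
    },
    {
        "inputs": {"question": "Which regions is the service available in?"},
        "expectations": {"expected_response": "us-east-1 and eu-west-1."},
    },
]
```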
> - **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
> - **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`
>
> Based on the user's answers, recommend scorers from the following MLflow options:
can we recommend it directly from the docs? just to keep this list up to date.
+1
> **Indicates experimental features that may change in future releases.
>
> ### Custom Code-Based Scorers:
make_judge is also super powerful here: https://mlflow.org/docs/latest/genai/eval-monitor/scorers/llm-judge/make-judge/
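For illustration, a sketch of a `make_judge` scorer along the lines of the linked docs; the template variables, signature, and endpoint name are best-effort assumptions and may differ across MLflow versions.

```python
from mlflow.genai.judges import make_judge

# Custom LLM judge for a dimension not covered by the built-in scorers.
conciseness_judge = make_judge(
    name="conciseness",
    instructions=(
        "Evaluate whether the response in {{ outputs }} answers the question in "
        "{{ inputs }} without unnecessary detail. Answer 'yes' or 'no'."
    ),
    model="databricks:/databricks-claude-3-7-sonnet",  # placeholder endpoint
)
```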
> Best models to leverage as judges as of 12/01/2026:
we can also suggest models from other providers if people don't want to use DBX model serving (e.g., openai:/gpt-5). Most FMAPI models have heavy rate limits which may not work well with larger scale evaluations.
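For illustration, the same kind of judge pointed at a non-Databricks provider using the `provider:/model` URI style mentioned in the comment (e.g. `openai:/gpt-5`); whether a given provider and model works depends on the MLflow version and the API keys configured in the environment.

```python
from mlflow.genai.judges import make_judge

# Same judge definition, but routed to an external provider to avoid FMAPI
# rate limits on large evaluation runs (assumes OPENAI_API_KEY is set).
conciseness_judge_openai = make_judge(
    name="conciseness",
    instructions=(
        "Evaluate whether the response in {{ outputs }} answers the question in "
        "{{ inputs }} without unnecessary detail. Answer 'yes' or 'no'."
    ),
    model="openai:/gpt-5",
)
```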
| "Ensure you have a function decorated with `@invoke()`." | ||
| ) | ||
|
|
||
| # Wrap async invoke function |
I think this is no longer necessary - we handle this wrapping in mlflow.genai.evaluate
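For illustration, what the generated call could reduce to if, as noted above, `mlflow.genai.evaluate()` accepts the agent's invoke function directly and no manual wrapper is needed; the `invoke` stub and sample record are placeholders, and the LLM-judge scorers still need a judge model and credentials at run time.

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

def invoke(question: str) -> str:
    """Stand-in for the agent's @invoke() entry point."""
    return f"(agent answer to: {question})"

# Pass the invoke function straight through, without a hand-rolled wrapper.
results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "How do I rotate my API token?"}}],
    predict_fn=invoke,
    scorers=[RelevanceToQuery(), Safety()],
)
```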
Convert the agent-evaluation slash command into a full-featured skill system with extensive documentation, patterns, and utility scripts for evaluating LLM agents using MLflow.

Changes:
- Rename agent-evaluation.md command to _agent-evaluation.md (internal use)
- Add SKILL.md as the main entry point with evaluation workflow
- Add patterns documentation:
  - CRITICAL-interfaces.md: API signatures and data schemas
  - GOTCHAS.md: Common mistakes and how to avoid them
  - patterns-datasets.md: Dataset creation best practices
  - patterns-evaluation.md: Running evaluations and comparisons
  - patterns-scorers.md: Custom scorer creation patterns
- Add reference documentation:
  - agent-strategy.md: Agent evaluation strategy guide
  - dataset-preparation.md: Dataset preparation guidelines
  - scorers.md: Available scorers and usage
  - scorers-constraints.md: Scorer constraints and requirements
  - setup-guide.md: MLflow environment setup
  - tracing-integration.md: MLflow tracing integration
  - troubleshooting.md: Common issues and solutions
- Add utility scripts:
  - setup_mlflow.py: MLflow environment configuration
  - validate_auth.py: Authentication validation
  - validate_environment.py: Environment validation
  - validate_tracing_static.py: Static tracing validation
  - validate_tracing_runtime.py: Runtime tracing validation
  - create_dataset_template.py: Dataset creation template
  - list_datasets.py: List available evaluation datasets
  - run_evaluation_template.py: Evaluation execution template
  - analyze_results.py: Results analysis utilities
- Add evaluation report template in assets/

This skill enables systematic agent evaluation with MLflow, covering:
- Environment setup and tracing integration
- Dataset creation and management
- Built-in and custom scorer definition
- Evaluation execution and results analysis

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add pattern file descriptions with line counts and usage guidance
- Add Step 0 to agent strategy workflow for checking existing strategy files
- Clarify when to read CRITICAL-interfaces.md and GOTCHAS.md
- Add "Other" option for custom quality dimensions

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Key Differences from Main MLflow Skill adapted to Agents on Apps
Reference: MLflow Skills PR
Reorganize skill documentation for better maintainability and clarity:

## Documentation Structure Changes
- Extract trace analysis patterns to new `patterns-traces.md` (7 patterns)
- Extract bundled resources docs to new `references/bundle-resources.md`
- Slim down SKILL.md to focus on workflow orchestration

## SKILL.md Improvements
- Update description: "IMPROVE/OPTIMIZE" → "EVALUATE" for clearer intent
- Reorganize TOC: add Evaluation Overview, Command Conventions, Pre-Flight
- Move coding patterns table to Step 5 for better context
- Add user prompts for naming output files (test cases, results, reports)
- Emphasize "CRITICAL" for MLflow llms.txt documentation access

## Workflow Refinements
- agent-strategy.md: Add "Discovering Agent Server Structure" section
- agent-strategy.md: Simplify success criteria format
- dataset-preparation.md: Parameterize test-cases-file paths
- run_evaluation_template.py: Rename main() → evaluate()

## Content Cleanup
- CRITICAL-interfaces.md: Remove redundant Installation/Setup sections
- patterns-datasets.md: Focus solely on dataset generation patterns
- Renumber patterns after trace patterns extraction

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Add a comprehensive Claude Code slash command `/agent-evaluation` that provides an
interactive workflow for building evaluation suites for LangGraph agents.
The workflow includes the following steps:
1. Agent Analysis - Reads agent.py to understand tools, model, and
capabilities
2. Requirements Gathering - Asks clarifying questions about agent
purpose, quality dimensions, expected inputs/outputs
3. Synthetic Dataset Creation - Optionally guides users through creating
ground truth test cases with inputs, expectations, and expected facts
4. Scorer Selection - Recommends appropriate MLflow scorers based on
agent type:
- No ground truth: RelevanceToQuery, Safety, Completeness, Fluency,
RetrievalGroundedness, ToolCallCorrectness
- With ground truth: Correctness, RetrievalSufficiency, Equivalence,
Guidelines, ExpectationsGuidelines
- Custom code-based scorers for specific needs
5. Configuration - Creates agent_eval_config.yaml with judge model
endpoints
6. Code Generation - Writes complete evaluate_agent.py with imports,
dataset, scorers, and execution logic
7. Documentation - Generates evaluation_methodology.md with test categories,
scorer justifications, and interpretation guidance
References MLflow GenAI evaluation APIs and Databricks model endpoints.
Enables reproducible, comprehensive agent quality assessment.
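For illustration, a condensed sketch of what the generated `evaluate_agent.py` could look like end to end, tying the config file, dataset, and scorers together; the config keys, scorer subset, and `invoke` signature are assumptions rather than the exact code this command emits.

```python
import mlflow
import yaml  # PyYAML
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

def invoke(question: str) -> str:
    """Stand-in for the agent entry point discovered in agent_server/."""
    return f"(agent answer to: {question})"

# Read the judge/scorer configuration created earlier in the workflow (step 5).
with open("./artifacts/configs/agent_eval_config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Map scorer names from the config onto MLflow built-ins (subset shown).
scorer_registry = {
    "Correctness": Correctness,
    "RelevanceToQuery": RelevanceToQuery,
    "Safety": Safety,
}
scorers = [
    scorer_registry[name]()
    for name in config.get("scorers", [])
    if name in scorer_registry
]

# Ground-truth test cases from the dataset step (placeholder content).
eval_records = [
    {
        "inputs": {"question": "How do I rotate my API token?"},
        "expectations": {"expected_response": "Rotate it from the account settings page."},
    },
]

# Run the evaluation with the selected scorers.
results = mlflow.genai.evaluate(data=eval_records, predict_fn=invoke, scorers=scorers)
```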