
Conversation

@stefano-feijoo-db

Add a comprehensive Claude Code slash command `/agent-evaluation` that provides an
interactive workflow for building evaluation suites for LangGraph agents.

The workflow includes the following steps:
1. Agent Analysis - Reads agent.py to understand tools, model, and
capabilities
2. Requirements Gathering - Asks clarifying questions about agent
purpose, quality dimensions, expected inputs/outputs
3. Synthetic Dataset Creation - Optionally guides users through creating
ground truth test cases with inputs, expectations, and expected facts
4. Scorer Selection - Recommends appropriate MLflow scorers based on
agent type:
- No ground truth: RelevanceToQuery, Safety, Completeness, Fluency,
RetrievalGroundedness, ToolCallCorrectness
- With ground truth: Correctness, RetrievalSufficiency, Equivalence,
Guidelines, ExpectationsGuidelines
- Custom code-based scorers for specific needs
5. Configuration - Creates agent_eval_config.yaml with judge model
endpoints
6. Code Generation - Writes complete evaluate_agent.py with imports,
dataset, scorers, and execution logic
7. Documentation - Generates evaluation_methodology.md with test categories,
scorer justifications, and interpretation guidance

References MLflow GenAI evaluation APIs and Databricks model endpoints.
Enables reproducible, comprehensive agent quality assessment.
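
For orientation, a minimal sketch of the kind of `evaluate_agent.py` the command generates, assuming MLflow 3's `mlflow.genai.evaluate` API and the built-in scorers named above; the dataset rows and the `agent_server.agent.invoke` entry point are illustrative placeholders, not the command's actual output:

```python
# Minimal sketch only -- not the file this command generates.
# Assumes MLflow 3 GenAI APIs; dataset rows and agent entry point are placeholders.
import mlflow
from mlflow.genai.scorers import Correctness, RelevanceToQuery, Safety

# With ground truth, "expectations" feeds scorers such as Correctness.
eval_dataset = [
    {
        "inputs": {"question": "How do I reset my password?"},
        "expectations": {"expected_facts": ["Directs the user to account settings"]},
    },
]

def predict_fn(question: str) -> str:
    """Invoke the agent under test; replace with the real agent call."""
    from agent_server.agent import invoke  # hypothetical entry point
    return invoke(question)

# Results and per-scorer assessments are logged to the active MLflow experiment.
results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery(), Safety(), Correctness()],
)
```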

@stefano-feijoo-db (Author)

Preliminary /agent-evaluation command

@stefano-feijoo-db marked this pull request as ready for review January 9, 2026 23:28
@stefano-feijoo-db marked this pull request as draft January 9, 2026 23:29
@bbqiu self-requested a review January 10, 2026 03:10
@bbqiu (Contributor) left a comment

this looks really really good! i'm gonna ask someone from the eval team to take a look as well just so we can confirm we're following best practices

please re-request my review whenever! excited to try out the slash command to actually write some evals


First, read and analyze the agent implementation:

1. Read the file at @agent_server/agent.py to understand:

nit: i think not everyone will have a single file that encapsulates their agent, we might want claude to do a complete survey of the agent_server folder

- **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
- **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`

Based on the user's answers, recommend scorers from the following MLflow options:

does claude end up remembering that they did / didn't select ground truth?

one thing i've seen other slash commands do is persist some of the info collected into a file -- this could be useful, especially since we're gonna collect information about the agent / use case. we can maybe start the slash command by reading through the persisted file and then working with the user to update whatever content is already in the file
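
(Not something the PR does today, but as a rough sketch of that persist-and-resume idea; the file path and fields below are hypothetical:)

```python
# Rough sketch of the persist-and-resume pattern described above.
# The file path and the fields are hypothetical, not part of this PR.
import json
from pathlib import Path

CONTEXT_FILE = Path("agent_server/evaluation/eval_context.json")

def load_or_init_context() -> dict:
    """Read previously collected answers so a re-run can resume, else start fresh."""
    if CONTEXT_FILE.exists():
        return json.loads(CONTEXT_FILE.read_text())
    return {"agent_purpose": None, "quality_dimensions": [], "has_ground_truth": None}

def save_context(context: dict) -> None:
    """Persist the user's answers, e.g. whether they opted into ground truth."""
    CONTEXT_FILE.parent.mkdir(parents=True, exist_ok=True)
    CONTEXT_FILE.write_text(json.dumps(context, indent=2))
```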

## Step 6: Create an Agent Evaluation Configuration

nit: step 5?

# Load the agent_eval_config.yaml file
with open("./artifacts/configs/agent_eval_config.yaml", "r") as f:

nit: update this

The configuration will contain the models used for each scorer or custom-judge created.

Example configuration file:
```agent_eval_config.yaml

nit: can use ```yaml here
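
For illustration, a loading sketch that matches the quoted snippet above; the `judge_model` key and the fallback URI are assumptions, not the config schema defined in this PR:

```python
# Illustrative only: consuming an agent_eval_config.yaml like the one above.
# The "judge_model" key and the fallback URI are assumptions, not this PR's schema.
import yaml

with open("./artifacts/configs/agent_eval_config.yaml", "r") as f:
    eval_config = yaml.safe_load(f)

# Pick the judge endpoint to pass to LLM-based scorers.
judge_model = eval_config.get("judge_model", "databricks:/<your-judge-endpoint>")
```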

@@ -0,0 +1,482 @@
# Agent Evaluation Workflow

You are an expert in LLM evaluation and MLflow. Your goal is to guide the user through creating a comprehensive evaluation suite for their LangGraph agent. Follow this workflow carefully.

it may be helpful to share some basic starter eval docs w/ the user as well

@smoorjani left a comment

This is really similar to something Alkis Polyzotis and Alex Miller have worked on for incorporating Claude skills into evaluation - see https://databricks.slack.com/archives/C083A8HQC6N/p1767826124661069


First, read and analyze the agent implementation:

1. Read the file at @agent_server/agent.py to understand:

+1 - tools and configs are typically stored separately

@@ -0,0 +1,482 @@
# Agent Evaluation Workflow

You are an expert in LLM evaluation and MLflow. Your goal is to guide the user through creating a comprehensive evaluation suite for their LangGraph agent. Follow this workflow carefully.

one pattern that is super useful with claude is having a separate folder (e.g., docs) containing a bunch of markdown files which claude can refer to as sort of a "how-to guide". Each file references a different component of the workflow and CLAUDE can selectively pick what it needs to load into its context, leaving more room for high-level instructions in this command.

See this repo as an example: https://github.com/databricks-solutions/custom-mlflow-review-app/tree/main/docs

You can ask claude to read the docs and generate these files.

Ask the user the following clarifying questions to understand their evaluation needs:

**Question 1 - Agent Purpose:**
"What is the primary purpose of your agent? For example:

This may be easy enough to infer from the agent's code; however, there are other things like who is the audience/end-user to ascertain things like how technical the agent should be.

- Other (please describe)"

**Question 2 - Critical Quality Dimensions:**
"Which quality dimensions are most important for your agent? (Select all that apply):

In many cases, users will have custom quality dimensions, not necessarily the same as what we offer in built-ins. Luckily CLAUDE is pretty good at generating custom metrics!
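
As a rough sketch of that point, a custom dimension can be captured either with a guidelines-based judge or a code-based scorer (assuming MLflow 3's `Guidelines` scorer and `@scorer` decorator; the dimension itself is made up):

```python
# Sketch of two ways to express a custom quality dimension (assumes MLflow 3 GenAI).
from mlflow.genai.scorers import Guidelines, scorer

# 1) Guidelines-based LLM judge for a dimension the built-ins don't cover.
brand_voice = Guidelines(
    name="brand_voice",
    guidelines="Responses must be friendly, avoid internal jargon, and never promise delivery dates.",
)

# 2) Deterministic code-based scorer for a hard rule (hypothetical: no pricing talk).
@scorer
def avoids_pricing(outputs) -> bool:
    """Pass only if the agent's output never quotes prices."""
    text = str(outputs)
    return "$" not in text and "price" not in text.lower()
```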


**Options:**
1. **Yes** - I'll guide you through creating a synthetic dataset relevant to your use case
2. **No** - Proceed with scorers that don't require ground truth"

I think we should also point towards how to collect ground truth from subject matter experts or guide the user through the setup process.

- **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
- **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`

Based on the user's answers, recommend scorers from the following MLflow options:

can we recommend it directly from the docs? just to keep this list up to date.

- **Custom LLM Judge**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/custom-judge/`
- **Guidelines-based LLM Scorers**: `https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/guidelines`

Based on the user's answers, recommend scorers from the following MLflow options:

+1

Best models to leverage as judges as of 12/01/2026:

we can also suggest models from other providers if people don't want to use DBX model serving (e.g., openai:/gpt-5). Most FMAPI models have heavy rate limits which may not work well with larger scale evaluations.
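
A sketch of what that could look like, assuming the guidelines scorer accepts a `<provider>:/<model>` URI via its `model` argument:

```python
# Sketch: pointing an LLM judge at a non-Databricks provider to avoid FMAPI rate limits.
# Assumes the scorer accepts a "<provider>:/<model>" URI through its model argument.
from mlflow.genai.scorers import Guidelines

tone = Guidelines(
    name="tone",
    guidelines="Responses must be concise and professional.",
    model="openai:/gpt-4o",  # swap in whichever provider/endpoint you actually have access to
)
```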

"Ensure you have a function decorated with `@invoke()`."
)

# Wrap async invoke function

I think this is no longer necessary - we handle this wrapping in mlflow.genai.evaluate
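
If so, the generated script could probably hand the entry point straight to the harness; a hedged sketch under that assumption (the import path is hypothetical):

```python
# Sketch based on the comment above: let mlflow.genai.evaluate drive the agent
# instead of hand-wrapping the async invoke function. Assumes the harness accepts
# the agent callable directly; the import path is hypothetical.
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

from agent_server.agent import invoke  # hypothetical @invoke()-decorated entry point

results = mlflow.genai.evaluate(
    data=[{"inputs": {"question": "What can this agent help me with?"}}],
    predict_fn=invoke,  # no manual async wrapping, per the note above
    scorers=[RelevanceToQuery(), Safety()],
)
```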

stefano-feijoo-db and others added 5 commits January 13, 2026 19:17
Convert the agent-evaluation slash command into a full-featured skill system
with extensive documentation, patterns, and utility scripts for evaluating
LLM agents using MLflow.

Changes:
- Rename agent-evaluation.md command to _agent-evaluation.md (internal use)
- Add SKILL.md as the main entry point with evaluation workflow
- Add patterns documentation:
  - CRITICAL-interfaces.md: API signatures and data schemas
  - GOTCHAS.md: Common mistakes and how to avoid them
  - patterns-datasets.md: Dataset creation best practices
  - patterns-evaluation.md: Running evaluations and comparisons
  - patterns-scorers.md: Custom scorer creation patterns
- Add reference documentation:
  - agent-strategy.md: Agent evaluation strategy guide
  - dataset-preparation.md: Dataset preparation guidelines
  - scorers.md: Available scorers and usage
  - scorers-constraints.md: Scorer constraints and requirements
  - setup-guide.md: MLflow environment setup
  - tracing-integration.md: MLflow tracing integration
  - troubleshooting.md: Common issues and solutions
- Add utility scripts:
  - setup_mlflow.py: MLflow environment configuration
  - validate_auth.py: Authentication validation
  - validate_environment.py: Environment validation
  - validate_tracing_static.py: Static tracing validation
  - validate_tracing_runtime.py: Runtime tracing validation
  - create_dataset_template.py: Dataset creation template
  - list_datasets.py: List available evaluation datasets
  - run_evaluation_template.py: Evaluation execution template
  - analyze_results.py: Results analysis utilities
- Add evaluation report template in assets/

This skill enables systematic agent evaluation with MLflow, covering:
- Environment setup and tracing integration
- Dataset creation and management
- Built-in and custom scorer definition
- Evaluation execution and results analysis

Co-Authored-By: Claude Opus 4.5 <[email protected]>
- Add pattern file descriptions with line counts and usage guidance
- Add Step 0 to agent strategy workflow for checking existing strategy files
- Clarify when to read CRITICAL-interfaces.md and GOTCHAS.md
- Add "Other" option for custom quality dimensions

Co-Authored-By: Claude Opus 4.5 <[email protected]>
@stefano-feijoo-db (Author) commented Jan 15, 2026

Key Differences from the Main MLflow Skill, Adapted to Agents on Apps

Reference: MLflow Skills PR

SKILL.md Workflow Updates

  • Documentation Access Protocol: Added to outline patterns for improved model API knowledge.
  • Evaluation Workflow:
    • Step 1.1: Now points to /references/agent-strategy.md to initiate the process, rather than invoking the agent with sample input immediately.
    • Step 3 (Dataset & Ground Truth): Added a prompt asking the user if they wish to create a dataset. *Point of discussion.
    • Step 5 (Outputs): The evaluation now writes the following specific files:
      • Script: agent_server/evaluate_agent.py
      • Results: agent_server/evaluation/evaluation_results.json
      • Report: agent_server/evaluation/evaluation_report.md

Documentation & Strategy Changes

  • agent-strategy.md: Updated to utilize specific patterns.
  • dataset-preparation.md: Now recommends 100+ test cases for production environments.
  • scorers.md: Explicitly lists available reference-free and ground-truth scorers to prevent import errors.
  • Design Process: Now references agent_server/evaluation/agent_strategy.md to support the design workflow.

@bbqiu self-requested a review January 16, 2026 19:17
Reorganize skill documentation for better maintainability and clarity:

## Documentation Structure Changes
- Extract trace analysis patterns to new `patterns-traces.md` (7 patterns)
- Extract bundled resources docs to new `references/bundle-resources.md`
- Slim down SKILL.md to focus on workflow orchestration

## SKILL.md Improvements
- Update description: "IMPROVE/OPTIMIZE" → "EVALUATE" for clearer intent
- Reorganize TOC: add Evaluation Overview, Command Conventions, Pre-Flight
- Move coding patterns table to Step 5 for better context
- Add user prompts for naming output files (test cases, results, reports)
- Emphasize "CRITICAL" for MLflow llms.txt documentation access

## Workflow Refinements
- agent-strategy.md: Add "Discovering Agent Server Structure" section
- agent-strategy.md: Simplify success criteria format
- dataset-preparation.md: Parameterize test-cases-file paths
- run_evaluation_template.py: Rename main() → evaluate()

## Content Cleanup
- CRITICAL-interfaces.md: Remove redundant Installation/Setup sections
- patterns-datasets.md: Focus solely on dataset generation patterns
- Renumber patterns after trace patterns extraction

Co-Authored-By: Claude Opus 4.5 <[email protected]>