482 changes: 482 additions & 0 deletions agent-langgraph/.claude/commands/_agent-evaluation.md

Large diffs are not rendered by default.

289 changes: 289 additions & 0 deletions agent-langgraph/.claude/skills/agent-evaluation/SKILL.md
@@ -0,0 +1,289 @@
---
name: agent-evaluation
description: Use this when you need to EVALUATE an existing LLM agent's performance - including task completion rate, sub-goal success rate, tool selection accuracy, answer quality, self-correction, safety, cost, and efficiency. Evaluates agents systematically using MLflow evaluation with datasets, scorers, and tracing. Covers end-to-end evaluation workflow or individual components (tracing setup, dataset creation, scorer definition, evaluation execution).
allowed-tools: Read, Write, Bash, Grep, Glob, WebFetch
---

# Agent Evaluation with MLflow

Comprehensive guide for evaluating GenAI agents with MLflow. Use this skill for the complete evaluation workflow or individual components - tracing setup, environment configuration, dataset creation, scorer definition, or evaluation execution. Each section can be used independently based on your needs.

## Table of Contents

1. [Evaluation Overview](#evaluation-overview)
2. [Command Conventions](#command-conventions)
3. [Pre-Flight Validation](#pre-flight-validation)
4. [Documentation Access Protocol](#documentation-access-protocol)
5. [Discovering Agent Server Structure](#discovering-agent-server-structure)
6. [Verify Current Agent](#verify-current-agent)
7. [Evaluation Workflow](#evaluation-workflow)

## Evaluation Overview

**Setup (prerequisite)**: Install MLflow 3.8+, configure the environment, and integrate tracing

1. **Understand**: Determine the agent's purpose and strategy
2. **Dataset**: Discover existing evaluation datasets (or create one)
3. **Define**: Select and create scorers for your quality criteria
4. **Evaluate**: Run the agent on the dataset, apply scorers, and analyze results
5. **Record**: Save the evaluation procedure for reference, tracking, and history

## Command Conventions

**Always use `uv run` for MLflow and Python commands:**

```bash
uv run mlflow --version # MLflow CLI commands
uv run python scripts/xxx.py # Python script execution
uv run python -c "..." # Python one-liners
```

This ensures commands run in the correct environment with proper dependencies.

**CRITICAL: Separate stderr from stdout when capturing CLI output:**

When saving CLI command output to files for parsing (JSON, CSV, etc.), always redirect stderr separately to avoid mixing logs with structured data:

```bash
# WRONG - mixes progress bars and logs with JSON output
uv run mlflow traces evaluate ... --output json > results.json

# CORRECT - separates stderr from JSON output
uv run mlflow traces evaluate ... --output json 2>/dev/null > results.json

# ALTERNATIVE - save both separately for debugging
uv run mlflow traces evaluate ... --output json > results.json 2> evaluation.log
```

**When to separate streams:**
- Any command with `--output json` flag
- Commands that output structured data (CSV, JSON, XML)
- When piping output to parsing tools (`jq`, `grep`, etc.)

**When NOT to separate:**
- Interactive commands where you want to see progress
- Debugging scenarios where logs provide context
- Commands that only output unstructured text

## Pre-Flight Validation

Validate environment before starting:

```bash
uv run mlflow --version # Should be >=3.8.0
uv run python -c "import mlflow; print(f'MLflow {mlflow.__version__} installed')"
```

If MLflow is missing or the version is below 3.8.0, see the setup overview in `references/setup-guide`.

## Documentation Access Protocol

**CRITICAL: All MLflow documentation must be accessed through llms.txt:**

1. Start at: `https://mlflow.org/docs/latest/llms.txt`
2. Query llms.txt for your topic with specific prompt
3. If llms.txt references another doc, use WebFetch with that URL
4. Do not use WebSearch - use WebFetch with llms.txt first

**This applies to all steps**, especially:

- Dataset creation (read GenAI dataset docs from llms.txt)
- Scorer registration (check MLflow docs for scorer APIs)
- Evaluation execution (understand mlflow.genai.evaluate API)

## Discovering Agent Server Structure

**Each project has unique structure.** Use dynamic exploration instead of assumptions:

### Find Agent Entry Points
```bash
# Search for main agent functions
grep -r "def.*agent" . --include="*.py"
grep -r "def (run|stream|handle|process)" . --include="*.py"

# Check common locations
ls main.py app.py src/*/agent.py 2>/dev/null

# Look for API routes
grep -r "@app\.(get|post)" . --include="*.py" # FastAPI/Flask
grep -r "def.*route" . --include="*.py"
```

### Find Tracing Integration
```bash
# Find autolog calls
grep -r "mlflow.*autolog" . --include="*.py"

# Find trace decorators
grep -r "@mlflow.trace" . --include="*.py"

# Check imports
grep -r "import mlflow" . --include="*.py"
```

### Understand Project Structure
```bash
# Check entry points in package config
cat pyproject.toml setup.py 2>/dev/null | grep -A 5 "scripts\|entry_points"

# Read project documentation
cat README.md docs/*.md 2>/dev/null | head -100

# Explore main directories
ls -la src/ app/ agent/ 2>/dev/null
```

**IMPORTANT: Always let the user know once the server structure has been evaluated**

## Verify Current Agent

Complete two verification steps:

1. **Environment Check** (tracking URI and experiment)
2. **Integrate tracing** (autolog and @mlflow.trace decorators)
   - ⚠️ **MANDATORY**: Read the `references/tracing-integration.md` documentation and implement any required changes
   - ✓ **VERIFY**: Run `uv run python scripts/validate_agent_tracing.py` to validate the work

⚠️ **Tracing must work before evaluation.** If tracing fails, stop and troubleshoot.

**Checkpoint - verify before proceeding:**

- [ ] MLflow >=3.8.0 installed
- [ ] MLFLOW_TRACKING_URI and MLFLOW_EXPERIMENT_ID set
- [ ] Autolog enabled and @mlflow.trace decorators added
- [ ] Test run creates a trace (verify the trace ID is not None - see the sketch below)
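
A minimal sketch of what the tracing integration might look like, assuming a LangChain/LangGraph agent with an entry point named `run_agent` (both are assumptions - adapt to the structure discovered earlier; `scripts/validate_agent_tracing.py` remains the authoritative check):

```python
import os
import mlflow

# Point MLflow at the configured tracking server and experiment
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
mlflow.set_experiment(experiment_id=os.environ["MLFLOW_EXPERIMENT_ID"])

# Capture LangChain/LangGraph spans automatically (assumes a LangChain-based agent)
mlflow.langchain.autolog()

@mlflow.trace  # wraps the entry point so every call produces a root span
def run_agent(question: str) -> str:
    # Placeholder body - call the real agent graph here
    return f"echo: {question}"

run_agent("What is MLflow tracing?")

# Verify a trace was actually created before evaluating
trace_id = mlflow.get_last_active_trace_id()
assert trace_id is not None, "Tracing is not working - troubleshoot before evaluating"
print(f"Trace created: {trace_id}")
```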

## Evaluation Workflow

### Step 1: Understand Agent Purpose

1. Invoke agent with sample input
2. Inspect the MLflow trace (especially LLM prompts that describe the agent's purpose), as sketched below
3. Print your understanding and ask user for verification
4. **Wait for confirmation before proceeding**
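
A hedged sketch of items 1-2, reusing the hypothetical `run_agent` entry point from the tracing sketch above:

```python
import mlflow

# 1. Invoke the agent with a sample input (hypothetical entry point)
run_agent("Give me a quick summary of what you can help with.")

# 2. Fetch the resulting trace and inspect its spans; LLM/chat-model spans
#    usually carry the system prompts that describe the agent's purpose
trace = mlflow.get_trace(mlflow.get_last_active_trace_id())
for span in trace.data.spans:
    print(span.span_type, span.name)
    if span.span_type in ("LLM", "CHAT_MODEL"):
        print(span.inputs)
```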

### Step 2: Define Quality Scorers

1. **Discover built-in scorers using documentation protocol:**
- Query `https://mlflow.org/docs/latest/llms.txt` for "What built-in LLM judges or scorers are available?"
- Read scorer documentation to understand their purpose and requirements
- Note: Do NOT use `mlflow scorers list -b` - use documentation instead for accurate information

2. **Check registered scorers in your experiment:**
```bash
uv run mlflow scorers list -x $MLFLOW_EXPERIMENT_ID
```

3. Identify the quality dimensions for your agent and select appropriate scorers
4. Register the scorers and test them on a sample trace before running the full evaluation (a custom-scorer sketch follows this section)
5. Provide a table with Scorer, Purpose, and Selection Reason columns

**For scorer selection and registration:** See `references/scorers.md`
**For CLI constraints (yes/no format, template variables):** See `references/scorers-constraints.md`
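
If the built-in judges don't cover a quality dimension, a custom scorer can be added. A minimal sketch using the `@scorer` decorator (the conciseness metric is a made-up example; see `references/scorers.md` for selection and registration details):

```python
from mlflow.genai.scorers import scorer

@scorer
def is_concise(outputs) -> bool:
    """Made-up example: pass if the agent's answer stays under 200 words."""
    text = outputs if isinstance(outputs, str) else str(outputs)
    return len(text.split()) <= 200
```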

### Step 3: Evaluation Dataset and Ground Truth

Ask the user:

"Would you like to create a ground truth evaluation dataset?

**Benefits of a ground truth dataset:**
- Enables **Correctness** scoring (comparing against expected answers)
- Enables **RetrievalSufficiency** scoring (for RAG agents)
- Enables **Guidelines** and **ExpectationsGuidelines** scoring (adherence to guidelines and expectations)
- Enables **Equivalence** scoring (response agrees with the expected response)
- Provides consistent, repeatable evaluation baselines
- Allows tracking improvement over time

**Options:**
1. **Yes** - I'll guide you through creating a synthetic dataset relevant to your use case.
2. **No** - Proceed with scorers that don't require ground truth"

#### If User Says YES

Proceed to Step 4 (Prepare Evaluation Dataset). A sample ground-truth record format is sketched at the end of this step.

#### If User Says NO

Warn the user, then skip to Step 5:

"**Important Note:** While you can evaluate without ground truth, having a ground truth dataset significantly improves evaluation quality. You'll be limited to scorers that assess general quality rather than correctness against expected answers. Consider creating even a small ground truth dataset (10-15 examples) for your most critical use cases.

Proceeding with scorers that don't require ground truth..."
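
For reference, a ground-truth record pairs the agent's inputs with expectations. A minimal sketch of the record shape consumed by MLflow's GenAI evaluation (field contents are invented, and the expectation keys should be confirmed against the scorer docs via the documentation protocol):

```python
# One evaluation record: `inputs` are passed to the agent, `expectations`
# hold the ground truth that correctness-style scorers compare against.
eval_records = [
    {
        "inputs": {"question": "How do I enable MLflow tracing for LangChain?"},
        "expectations": {
            "expected_response": "Call mlflow.langchain.autolog() before invoking the agent.",
            "guidelines": ["The answer must mention autolog."],
        },
    },
]
```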

### Step 4: Prepare Evaluation Dataset

**ALWAYS discover existing datasets first** to prevent duplicate work:

1. **Run dataset discovery** (mandatory):

```bash
uv run python scripts/list_datasets.py # Lists all datasets as table
uv run python scripts/list_datasets.py --format json # For machine-readable output
```

2. **Present findings to user**:

- Show all discovered datasets with their characteristics (size, topics covered)
- If datasets found, highlight most relevant options based on agent type

3. **Ask user about existing datasets**:

- "I found [N] existing evaluation dataset(s). Do you want to use one of these? (y/n)"
- If yes: Ask which dataset to use and record the dataset name
   - If no: Proceed to create a new dataset (item 4 below)

4. **Create a new dataset only if the user declined existing ones or no existing datasets were found**:
- Prompt user to name test cases file
   - Write the test cases file to `agent_server/evaluation/test_cases/`

```bash
uv run python scripts/create_dataset_template.py --test-cases-file <path to test cases file>
# Optional: --dataset-name my-eval --catalog main --schema ml --table eval_v1
```
Review and execute the generated script.

**IMPORTANT**: Do not skip dataset discovery. Always run `list_datasets.py` first, even if you plan to create a new dataset. This prevents duplicate work and ensures users are aware of existing evaluation datasets.

**For complete dataset guide:** See `references/dataset-preparation.md`

### Step 5: Create and Run Evaluation

**Coding Support** - for coding patterns, see `skills/agent-evaluation/patterns/`:

| Reference | Purpose | When to Read |
|-----------|---------|--------------|
| `GOTCHAS.md` | Common mistakes | **Always read first** before writing code |
| `CRITICAL-interfaces.md` | API signatures, schemas | **Always read first** when writing any evaluation code |
| `patterns-evaluation.md` | Running evals, comparing | When executing evaluations |
| `patterns-scorers.md` | Custom scorer creation | When built-in scorers aren't enough |
| `patterns-datasets.md` | Dataset building | When preparing evaluation data |

1. Generate evaluation script:
- Write output to `agent_server/evaluate_agent.py`

```bash
uv run python scripts/run_evaluation_template.py # Auto-detects module, entry point, dataset
# Optional: --module my_agent.agent --entry-point run_agent --dataset-name my-dataset
```
Review and execute the generated script.
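
For orientation only, a minimal sketch of what a generated `evaluate_agent.py` might reduce to, reusing the hypothetical `run_agent` entry point and `eval_records` from the sketches above (confirm the exact `mlflow.genai.evaluate` and scorer APIs in `CRITICAL-interfaces.md` and the MLflow docs):

```python
import mlflow
from mlflow.genai.scorers import RelevanceToQuery, Safety

def predict_fn(question: str) -> str:
    # The keys of each record's `inputs` dict are passed as keyword arguments
    return run_agent(question)

results = mlflow.genai.evaluate(
    data=eval_records,                       # [{"inputs": ..., "expectations": ...}, ...]
    predict_fn=predict_fn,
    scorers=[RelevanceToQuery(), Safety()],  # built-in LLM judges (require a judge model)
)
print(results.metrics)                       # aggregate scores per scorer
```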

2. Apply scorers:
- Prompt user to name results file
- Write results file to `agent_server/evaluation/results`

```bash
# IMPORTANT: Redirect stderr to avoid mixing logs with JSON output
uv run mlflow traces evaluate \
--trace-ids <comma_separated_trace_ids> \
--scorers <scorer1>,<scorer2>,... \
--output json 2>/dev/null > <path to results file>
```
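
The trace IDs for the command above can be gathered from the experiment after the evaluation run. A hedged sketch using the MLflow client (trace attribute names vary slightly across MLflow versions, so verify against the installed version):

```python
import os
from mlflow import MlflowClient

client = MlflowClient()
traces = client.search_traces(
    experiment_ids=[os.environ["MLFLOW_EXPERIMENT_ID"]],
    max_results=50,
)
# Build the comma-separated list expected by --trace-ids
print(",".join(t.info.trace_id for t in traces))
```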

3. Analyze results:
- Prompt user to name evaluation report file
   - Write the report file to `agent_server/evaluation/reports`

```bash
uv run python scripts/analyze_results.py <path to results file>
```
Generates `evaluation_report.md` with pass rates, failure patterns, and recommendations.