
feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex #1807

Merged (21 commits) on May 30, 2025
5 changes: 5 additions & 0 deletions .github/actions/spelling/allow.txt
@@ -447,6 +447,7 @@ TLDR
TOKENLIST
TPU
TPUs
TRK
TSLA
TSMC
TSNE
@@ -545,6 +546,7 @@ aextract
afrom
agentic
agg
aggfunc
ainit
ainvoke
aio
@@ -603,6 +605,7 @@ bnb
booktitle
boop
boundings
boxplot
bpa
bpd
bqdf
@@ -925,6 +928,7 @@ linestyle
linkedin
linted
linting
litellm
llm
llms
loghub
@@ -1282,6 +1286,7 @@ vectoral
vectordb
veo
vesselin
viridis
vllm
vnc
voiceover
3 changes: 0 additions & 3 deletions .github/actions/spelling/line_forbidden.patterns
@@ -283,9 +283,6 @@
# Should be Colab
\s(?!Colab)Co[Ll][Ll]?abs?\b

# Should be Kaggle
\skaggle\b

# Should be TPU or TPUs
\btpus?\b

3 changes: 3 additions & 0 deletions gemini/evaluation/synthetic-data-evals/.gitignore
@@ -0,0 +1,3 @@
data
evaluation_results
*.png
170 changes: 170 additions & 0 deletions gemini/evaluation/synthetic-data-evals/README.md
Collaborator comment on this file: Since you have a Jupyter notebook showing how to run this, can you change this README into a notebook showing how this can be done.

@@ -0,0 +1,170 @@
# Agent Evaluation Framework

This repository contains a framework for generating, evaluating, and analyzing the performance of LLM-powered agents in customer support scenarios.

## Overview

The framework consists of three main components:

1. **Customer Support Agent** - An LLM-powered agent with tools for handling e-commerce customer queries
2. **Dataset Generator** - A system for creating synthetic evaluation datasets with realistic customer queries
3. **Agent Evaluator** - A comprehensive evaluation system for measuring agent performance

## Customer Support Agent

The customer support agent is built using the `smolagents` framework and provides several tools for handling e-commerce queries:

- `ProductSearchTool` - Search product catalog by name, category, or description
- `OrderStatusTool` - Check order status by order ID
- `CategoryBrowseTool` - Browse products by category
- `PriceCheckTool` - Check product price by product ID
- `CustomerOrderHistoryTool` - Get order history for a customer

The agent can be configured with different LLM models, including Gemini 1.5 Pro, and supports planning capabilities to handle complex multi-step queries.

```python
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
    use_weave=True,
    temperature=0.2,
    planning_interval=1,
    max_steps=3
)
```

Collaborator comment on the `model_id` line. Suggested change: `model_id="google/gemini-1.5-pro"` → `model_id="google/gemini-2.0-flash"`

Review comment: While gemini-2.0-flash is a valid model ID, it's important to ensure that the model is available and suitable for the intended use case within the `create_customer_support_agent` function. It would be good to verify that this model is the most appropriate choice for this specific agent and evaluation scenario. Suggested change: `model_id="google/gemini-1.5-pro"` → `model_id="google/gemini-2.0-flash"`

## Dataset Generator

The dataset generator creates realistic evaluation examples by:

1. Generating diverse e-commerce customer queries
2. Running the agent on these queries and recording its trajectory
3. Evaluating each step and the final response using a judge model
4. Filtering examples based on quality thresholds
5. Saving high-quality examples to a dataset for evaluation

```python
generator = DatasetGenerator(
    agent=agent,
    judge_model="gemini/gemini-1.5-pro",
    thresholds={
        "final_response": 0.7,
        "single_step": 0.7,
        "trajectory": 0.7
    },
    debug=True
)

examples = create_customer_support_agent_evaluation_dataset(generator, agent, num_prompts=10)
```

Collaborator comment on the `judge_model` line. Suggested change: `judge_model="gemini/gemini-1.5-pro"` → `judge_model="gemini/gemini-2.0-flash"`. Do the same throughout.

Review comment: Yes, I agree that consistency in model naming is important for clarity. I suggest applying this change throughout the document wherever gemini-1.5-pro is used for the judge model. Suggested change: `judge_model="gemini/gemini-1.5-pro"` → `judge_model="gemini/gemini-2.0-flash"`

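The filtering in step 4 above is conceptually a simple gate: an example is kept only if every judged dimension clears its configured threshold. A minimal, hypothetical sketch of that logic (the `query`/`scores` field names are illustrative, not the framework's actual schema):

```python
# Hypothetical sketch of the threshold filtering in step 4.
# The example and field names are illustrative, not the framework's actual schema.
thresholds = {"final_response": 0.7, "single_step": 0.7, "trajectory": 0.7}

candidate_examples = [
    {"query": "Where is order #1042?",
     "scores": {"final_response": 0.9, "single_step": 0.8, "trajectory": 0.85}},
    {"query": "Do you stock USB-C cables?",
     "scores": {"final_response": 0.6, "single_step": 0.9, "trajectory": 0.75}},
]


def passes_thresholds(scores: dict, limits: dict) -> bool:
    """Keep an example only if every judged dimension meets or exceeds its threshold."""
    return all(scores.get(name, 0.0) >= floor for name, floor in limits.items())


kept = [ex for ex in candidate_examples if passes_thresholds(ex["scores"], thresholds)]
print(len(kept))  # 1 -- the second example fails the final_response threshold
```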
## Agent Evaluator

The evaluator provides comprehensive metrics for agent performance:

- **Response Correctness** - Accuracy and completeness of the agent's final response
- **Tool Selection** - Appropriate use of available tools
- **Trajectory Analysis** - Efficiency and effectiveness of the agent's path to solution
- **Reasoning Quality** - Quality of the agent's reasoning process
- **Coherence** - Consistency and clarity of the agent's communication

The evaluator generates detailed reports, visualizations, and metrics to analyze agent performance.

```python
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
    temperature=0.1,
    verbosity=2,
    use_weave=True
)

results = evaluator.run_evaluation(agent, eval_dataset)
```

Collaborator comment on the `model_name` line. Suggested change: `model_name="gemini-1.5-pro"` → `model_name="gemini-2.0-flash"`

Review comment: Thanks for the suggestion! I've updated the model_name to gemini-2.0-flash in the code snippet. Suggested change: `model_name="gemini-1.5-pro"` → `model_name="gemini-2.0-flash"`

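The reports and visualizations mentioned above are easiest to reason about as tabular per-metric scores. A hypothetical sketch of summarizing such scores with pandas and matplotlib (the column names and values are illustrative, not the evaluator's actual output schema):

```python
# Hypothetical sketch: summarizing per-example metric scores for analysis.
# Column names and values are illustrative; the evaluator's real output may differ.
import matplotlib.pyplot as plt
import pandas as pd

scores = pd.DataFrame(
    [
        {"metric": "response_correctness", "score": 0.9},
        {"metric": "response_correctness", "score": 0.7},
        {"metric": "tool_selection", "score": 0.8},
        {"metric": "coherence", "score": 0.95},
    ]
)

# Mean and count per metric.
summary = scores.pivot_table(index="metric", values="score", aggfunc=["mean", "count"])
print(summary)

# Distribution of scores per metric.
scores.boxplot(column="score", by="metric")
plt.savefig("metric_scores.png")
```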
## Getting Started

1. Install dependencies:
```
uv sync
```

2. Set up environment variables (some of these will auto-populate if you run `setup.py`; a sketch for loading them from `.env` follows these steps):
```
# Create a .env file with your API keys or colab secrets
GEMINI_API_KEY
HUGGING_FACE_HUB_TOKEN
VERTEX_PROJECT_ID
VERTEX_LOCATION
VERTEX_MODEL_ID
VERTEX_ENDPOINT_ID
DEEPSEEK_ENDPOINT_ID
```

3. Generate evaluation dataset:
```python
from dataset_generator import DatasetGenerator, create_customer_support_agent_evaluation_dataset
from customer_support_agent import create_customer_support_agent

agent = create_customer_support_agent()
generator = DatasetGenerator(agent=agent)
examples = create_customer_support_agent_evaluation_dataset(generator, agent)
generator.save_dataset(examples, "evaluation_dataset.json")
```

4. Run evaluation:
```python
from evaluator import AgentEvaluator, load_dataset

eval_dataset = load_dataset("evaluation_dataset.json")
evaluator = AgentEvaluator()
results = evaluator.run_evaluation(agent, eval_dataset)
```

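If you keep the variables from step 2 in a `.env` file, they have to be loaded into the process environment before the framework reads them. A minimal sketch, assuming the `python-dotenv` package (not listed above) is available; in Colab, the equivalent is to use the Secrets panel:

```python
# Minimal sketch: load the variables from step 2 out of ./.env.
# Assumes the python-dotenv package; adjust to Colab secrets if running there.
import os

from dotenv import load_dotenv

load_dotenv()  # reads GEMINI_API_KEY, VERTEX_PROJECT_ID, etc. from ./.env

# Sanity-check a few of the variables listed above.
for name in ("GEMINI_API_KEY", "VERTEX_PROJECT_ID", "VERTEX_LOCATION"):
    if not os.getenv(name):
        raise RuntimeError(f"Missing environment variable: {name}")
```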
## Features

- **Realistic Data Generation**: Creates synthetic but realistic customer queries based on e-commerce data
- **Comprehensive Evaluation**: Measures multiple aspects of agent performance
- **Visualization**: Generates plots and tables for analysis
- **Weave Integration**: Tracks experiments and results with Weave
  - Logs agent trajectories and evaluation metrics
  - Enables experiment comparison across different agent configurations
  - Provides interactive dashboards for analyzing agent performance
  - Supports versioning of evaluation datasets and results
  - Facilitates collaboration through shareable experiment links
- **Configurable Thresholds**: Adjustable quality thresholds for dataset generation

## Weave Integration

The framework leverages Weave for experiment tracking and visualization:

1. **Experiment Tracking**: Each agent run is logged as a Weave experiment with detailed metrics
2. **Trajectory Visualization**: Agent trajectories are visualized step-by-step for analysis
3. **Comparative Analysis**: Compare performance across different agent configurations and models
4. **Custom Dashboards**: Create custom dashboards to monitor specific metrics
5. **Artifact Management**: Store and version datasets, agent configurations, and evaluation results

```python
# Enable Weave logging in agent creation
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
    use_weave=True,  # Enable Weave logging
    temperature=0.2
)

# Enable Weave in evaluator
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
    use_weave=True,  # Enable Weave logging
    verbosity=2
)
```

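The snippets above only pass `use_weave=True`; when running them outside the provided scripts, Weave usually also has to be initialized once per process. A minimal sketch, assuming the project name defined in `config.py` (the framework may already call this internally):

```python
# Minimal sketch: initialize Weave once before creating the agent or evaluator.
# The framework may already do this internally when use_weave=True.
import weave

from config import WEAVE_PROJECT_NAME  # "agent_evaluation_workshop"

weave.init(WEAVE_PROJECT_NAME)  # traced calls after this are logged to the project
```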
## Requirements

- Python 3.8+
- Vertex AI API access
- [Weights & Biases account](https://wandb.ai)
- Required Python packages (see pyproject.toml)

## Contributors

- [Anish Shah](https://github.com/ash0ts)
15 changes: 15 additions & 0 deletions gemini/evaluation/synthetic-data-evals/clean.sh
@@ -0,0 +1,15 @@
#!/bin/bash

# Find and clean all Python files in main folder, excluding .venv and other hidden directories
echo "Cleaning Python files..."
find . -name "*.py" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs black
find . -name "*.py" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs isort

# Find and clean all Jupyter notebook files in main folder, excluding hidden directories
echo "Cleaning Jupyter notebook files..."
find . -name "*.ipynb" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs nbqa black
find . -name "*.ipynb" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs nbqa isort

# Run nox format session for any remaining files
echo "Running final format check..."
nox -s format
1 change: 1 addition & 0 deletions gemini/evaluation/synthetic-data-evals/config.py
@@ -0,0 +1 @@
WEAVE_PROJECT_NAME = "agent_evaluation_workshop"