feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex #1807
**New file: `.gitignore`** (filename inferred from the ignore patterns)
```
data
evaluation_results
*.png
```
**New file: `README.md`**
# Agent Evaluation Framework

This repository contains a framework for generating, evaluating, and analyzing the performance of LLM-powered agents in customer support scenarios.

## Overview

The framework consists of three main components:

1. **Customer Support Agent** - An LLM-powered agent with tools for handling e-commerce customer queries
2. **Dataset Generator** - A system for creating synthetic evaluation datasets with realistic customer queries
3. **Agent Evaluator** - A comprehensive evaluation system for measuring agent performance

## Customer Support Agent

The customer support agent is built using the `smolagents` framework and provides several tools for handling e-commerce queries (a sketch of one such tool follows the list):

- `ProductSearchTool` - Search the product catalog by name, category, or description
- `OrderStatusTool` - Check order status by order ID
- `CategoryBrowseTool` - Browse products by category
- `PriceCheckTool` - Check a product's price by product ID
- `CustomerOrderHistoryTool` - Get a customer's order history

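The repository defines these tools as classes; as an illustration only, a minimal functional equivalent of something like `PriceCheckTool` could be written with smolagents' `@tool` decorator. This is a hypothetical sketch with a stubbed catalog, not the PR's actual implementation:

```python
from smolagents import tool

# Hypothetical in-memory catalog standing in for the repository's real data.
PRODUCT_PRICES = {"P-1001": 19.99, "P-1002": 54.50}

@tool
def price_check(product_id: str) -> str:
    """Check a product's price by its product ID.

    Args:
        product_id: The catalog ID of the product to look up.
    """
    price = PRODUCT_PRICES.get(product_id)
    if price is None:
        return f"No product found with ID {product_id}."
    return f"Product {product_id} costs ${price:.2f}."
```
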
The agent can be configured with different LLM models, including Gemini 1.5 Pro, and supports planning capabilities to handle complex multi-step queries.

```python
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
    use_weave=True,
    temperature=0.2,
    planning_interval=1,
    max_steps=3
)
```

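Once configured, a smolagents agent is invoked with its `run` method; a short usage sketch (the query is illustrative):

```python
# The agent plans, calls its tools, and composes a final answer.
response = agent.run("What's the status of order #10042?")
print(response)
```
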
## Dataset Generator

The dataset generator creates realistic evaluation examples by:

1. Generating diverse e-commerce customer queries
2. Running the agent on these queries and recording its trajectory
3. Evaluating each step and the final response using a judge model
4. Filtering examples based on quality thresholds (a sketch of this step appears below)
5. Saving high-quality examples to a dataset for evaluation

```python
generator = DatasetGenerator(
    agent=agent,
    judge_model="gemini/gemini-1.5-pro",
    thresholds={
        "final_response": 0.7,
        "single_step": 0.7,
        "trajectory": 0.7
    },
    debug=True
)

examples = create_customer_support_agent_evaluation_dataset(generator, agent, num_prompts=10)
```

> **Review thread:** A reviewer asked for a consistent model naming format wherever model IDs appear ("Do the same throughout"), and the author agreed that consistency in model naming is important for clarity and should be applied across the document.

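One plausible reading of the filtering in step 4, using the threshold keys shown above; this is a sketch, not the repository's actual logic:

```python
def passes_thresholds(scores, thresholds):
    """Keep an example only if every judged dimension clears its threshold."""
    return all(scores.get(name, 0.0) >= minimum for name, minimum in thresholds.items())

# This candidate is rejected: its trajectory score falls below 0.7.
candidate_scores = {"final_response": 0.9, "single_step": 0.8, "trajectory": 0.6}
thresholds = {"final_response": 0.7, "single_step": 0.7, "trajectory": 0.7}
print(passes_thresholds(candidate_scores, thresholds))  # False
```
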
## Agent Evaluator

The evaluator provides comprehensive metrics for agent performance:

- **Response Correctness** - Accuracy and completeness of the agent's final response
- **Tool Selection** - Appropriate use of available tools
- **Trajectory Analysis** - Efficiency and effectiveness of the agent's path to a solution
- **Reasoning Quality** - Quality of the agent's reasoning process
- **Coherence** - Consistency and clarity of the agent's communication

The evaluator generates detailed reports, visualizations, and metrics to analyze agent performance (one possible aggregation is sketched after the example below).

```python
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
    temperature=0.1,
    verbosity=2,
    use_weave=True
)

results = evaluator.run_evaluation(agent, eval_dataset)
```

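How per-example judgments roll up into report-level metrics is not shown in this diff; a simple mean aggregation along these lines is one plausibility (the metric names and result schema are illustrative assumptions):

```python
from statistics import mean

# Hypothetical per-example scores as a judge model might emit them.
per_example = [
    {"response_correctness": 0.9, "tool_selection": 1.0, "coherence": 0.8},
    {"response_correctness": 0.7, "tool_selection": 0.5, "coherence": 0.9},
]

# Average each metric across the dataset for a summary report.
summary = {metric: mean(ex[metric] for ex in per_example) for metric in per_example[0]}
print(summary)  # e.g. response_correctness -> 0.8, tool_selection -> 0.75
```
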
## Getting Started

1. Install dependencies:
```
uv sync
```

2. Set up environment variables (some of these will auto-populate if you run `setup.py`):
```
# Create a .env file with your API keys or Colab secrets
GEMINI_API_KEY
HUGGING_FACE_HUB_TOKEN
VERTEX_PROJECT_ID
VERTEX_LOCATION
VERTEX_MODEL_ID
VERTEX_ENDPOINT_ID
DEEPSEEK_ENDPOINT_ID
```

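Assuming the commonly used `python-dotenv` package (not confirmed in this PR's dependency list), the `.env` file can be loaded at the top of a script like so:

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads KEY=value pairs from .env into os.environ
project_id = os.environ["VERTEX_PROJECT_ID"]
location = os.environ.get("VERTEX_LOCATION", "us-central1")  # fallback is illustrative
```
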
3. Generate an evaluation dataset:
```python
from dataset_generator import DatasetGenerator, create_customer_support_agent_evaluation_dataset
from customer_support_agent import create_customer_support_agent

agent = create_customer_support_agent()
generator = DatasetGenerator(agent=agent)
examples = create_customer_support_agent_evaluation_dataset(generator, agent)
generator.save_dataset(examples, "evaluation_dataset.json")
```

4. Run the evaluation:
```python
from evaluator import AgentEvaluator, load_dataset

eval_dataset = load_dataset("evaluation_dataset.json")
evaluator = AgentEvaluator()
results = evaluator.run_evaluation(agent, eval_dataset)
```

## Features

- **Realistic Data Generation**: Creates synthetic but realistic customer queries based on e-commerce data
- **Comprehensive Evaluation**: Measures multiple aspects of agent performance
- **Visualization**: Generates plots and tables for analysis
- **Weave Integration**: Tracks experiments and results with Weave
  - Logs agent trajectories and evaluation metrics
  - Enables experiment comparison across different agent configurations
  - Provides interactive dashboards for analyzing agent performance
  - Supports versioning of evaluation datasets and results
  - Facilitates collaboration through shareable experiment links
- **Configurable Thresholds**: Adjustable quality thresholds for dataset generation

## Weave Integration

The framework leverages Weave for experiment tracking and visualization (a minimal tracing sketch follows the example below):

1. **Experiment Tracking**: Each agent run is logged as a Weave experiment with detailed metrics
2. **Trajectory Visualization**: Agent trajectories are visualized step-by-step for analysis
3. **Comparative Analysis**: Compare performance across different agent configurations and models
4. **Custom Dashboards**: Create custom dashboards to monitor specific metrics
5. **Artifact Management**: Store and version datasets, agent configurations, and evaluation results

```python
# Enable Weave logging in agent creation
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
    use_weave=True,  # Enable Weave logging
    temperature=0.2
)

# Enable Weave in the evaluator
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
    use_weave=True,  # Enable Weave logging
    verbosity=2
)
```

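For reference, Weave's public API centers on `weave.init` plus the `@weave.op` decorator for traced functions. The sketch below shows what `use_weave=True` plausibly wires up, using the project name from this PR's config file; the actual internals are not shown in the diff, so treat this as an assumption:

```python
import weave

# Log to the Weave project named in this PR's config file.
weave.init("agent_evaluation_workshop")

@weave.op()
def answer_query(query: str) -> str:
    # Each call is traced: inputs, outputs, and nested LLM/tool calls.
    return agent.run(query)  # reuses the agent created above

answer_query("Do you sell wireless headphones under $50?")
```
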
## Requirements

- Python 3.8+
- Vertex AI API access
- [Weights & Biases account](https://wandb.ai)
- Required Python packages (see `pyproject.toml`)

## Contributors

- [Anish Shah](https://github.com/ash0ts)
**New file: formatting script** (shell)
```bash
#!/bin/bash

# Find and clean all Python files in the main folder, excluding .venv and other hidden directories
echo "Cleaning Python files..."
find . -name "*.py" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs black
find . -name "*.py" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs isort

# Find and clean all Jupyter notebook files in the main folder, excluding hidden directories
echo "Cleaning Jupyter notebook files..."
find . -name "*.ipynb" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs nbqa black
find . -name "*.ipynb" -type f -not -path "*/\.*" -not -path "*/venv/*" -not -path "*/.venv/*" | xargs nbqa isort

# Run the nox format session for any remaining files
echo "Running final format check..."
nox -s format
```
**New file: Weave configuration**

```python
WEAVE_PROJECT_NAME = "agent_evaluation_workshop"
```
> **Reviewer comment:** Since you have a Jupyter notebook showing how to run this, can you change this README into a notebook showing how this can be done?