feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex #1807
base: main
Conversation
comment out broken parts. NOTE: `_ipython_utils` in the aiplatform package needs to change the core IPython import for display to just `IPython.display`
add save_to_weave
fix links for colab assuming merge to main
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Hello @ash0ts, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces a framework for generating synthetic data and evaluating agents, particularly in customer support scenarios. It leverages Weights & Biases (W&B) Weave for experiment tracking and Vertex AI for model serving. The framework includes components for dataset generation, agent evaluation, and analysis, with a focus on realistic data generation and comprehensive evaluation metrics.
Highlights
- Synthetic Data Generation: Implements a dataset generator that creates realistic e-commerce customer queries, runs agents on these queries, evaluates the agent's trajectory, and filters examples based on quality thresholds.
- Agent Evaluation Framework: Introduces a comprehensive evaluation system for measuring agent performance, including response correctness, tool selection, trajectory analysis, reasoning quality, and coherence. It generates detailed reports, visualizations, and metrics.
- Weave Integration: Integrates W&B Weave for experiment tracking and visualization, logging agent trajectories, evaluation metrics, and enabling experiment comparison across different agent configurations.
- Customer Support Agent: Creates a customer support agent using the smolagents framework, equipped with tools for handling e-commerce customer queries such as product search, order status, category browsing, price check, and customer order history.
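To make the shape of these tools concrete, here is a minimal sketch of a product-search tool of the kind the agent might register. In smolagents such a function would typically be wrapped with the `@tool` decorator (which relies on the type hints and docstring); it is shown as plain Python here, and the catalog rows and field names are illustrative assumptions, not taken from the PR.

```python
# Hypothetical in-memory catalog; the real agent loads data from Kaggle.
CATALOG = [
    {"product_id": "P1", "name": "USB-C Cable", "category": "Electronics", "price": 9.99},
    {"product_id": "P2", "name": "Desk Lamp", "category": "Home", "price": 24.50},
]


def product_search(query: str) -> list:
    """Return catalog entries whose name or category matches the query.

    Args:
        query: Free-text search string from the customer.
    """
    q = query.lower()
    return [p for p in CATALOG if q in p["name"].lower() or q in p["category"].lower()]
```

A case-insensitive substring match keeps the sketch short; the PR's actual tools query pandas DataFrames built from the Amazon Reviews Multi dataset.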
Changelog
Click here to see the changelog
- gemini/evaluation/synthetic-data-evals/.gitignore
- Added .gitignore file to exclude data, evaluation results, and PNG files from version control.
- gemini/evaluation/synthetic-data-evals/README.md
- Created a detailed README file explaining the agent evaluation framework, its components, features, getting started guide, and Weave integration.
- gemini/evaluation/synthetic-data-evals/clean.sh
- Added a shell script to clean Python and Jupyter notebook files using `nox -s format`.
- gemini/evaluation/synthetic-data-evals/config.py
- Created a configuration file to define the Weave project name.
- gemini/evaluation/synthetic-data-evals/customer_support_agent.py
- Implemented the customer support agent with tools for product search, order status, category browsing, price check, and customer order history.
- Added functions to create the agent and load realistic datasets from Kaggle's Amazon Reviews Multi dataset.
- gemini/evaluation/synthetic-data-evals/customer_support_eval.json
- Added a JSON file containing evaluation examples for the customer support agent.
- gemini/evaluation/synthetic-data-evals/dataset_generator.py
- Implemented the DatasetGenerator class for creating synthetic evaluation datasets, including judging final responses, individual steps, and overall trajectories.
- Added functions to generate realistic e-commerce prompts and save the generated dataset to a JSON file and Weave.
- gemini/evaluation/synthetic-data-evals/evaluator.py
- Implemented the AgentEvaluator class for running evaluations on agents using Vertex AI Evaluation, defining custom metrics, and generating visualizations.
- Added functions to format the dataset for evaluation, extract tool usage and reasoning from agent memory, and generate summary tables and plots.
- gemini/evaluation/synthetic-data-evals/noxfile.py
- Updated the Nox configuration file to include linting, formatting, unit tests, system tests, and documentation generation.
- gemini/evaluation/synthetic-data-evals/pyproject.toml
- Added project metadata and dependencies for the synthetic data evaluation framework.
- gemini/evaluation/synthetic-data-evals/render_evals.py
- Implemented functions to render model comparisons, difficulty analysis, temperature analysis, and a comprehensive conclusion with key findings from the evaluation results.
- gemini/evaluation/synthetic-data-evals/setup.py
- Added a setup script to configure Vertex AI resources, including registering the DeepSeek model and creating endpoints, and updating the .env file with the necessary environment variables.
- gemini/evaluation/synthetic-data-evals/vertex_model.py
- Implemented the VertexAIServerModel class for connecting to a Vertex AI-compatible API server, including authentication, token refresh, and prediction methods.
- Added the WeaveVertexAIServerModel class for tracking Vertex AI models with Weave.
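Since `vertex_model.py` handles authentication and token refresh, here is a hedged, library-free sketch of the underlying pattern: cache a bearer token and refresh it shortly before expiry. The fetcher is injected so the example stays self-contained; in the real class it would come from google-auth credentials. The class and parameter names are illustrative, not the PR's.

```python
import time


class TokenCache:
    """Cache a bearer token and refresh it slightly before it expires."""

    def __init__(self, fetch_token, ttl_seconds: float = 3600.0, slack: float = 60.0):
        self._fetch = fetch_token      # callable returning a fresh token string
        self._ttl = ttl_seconds        # assumed token lifetime
        self._slack = slack            # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        # Refresh when no token is cached or expiry (minus slack) has passed.
        if self._token is None or time.monotonic() >= self._expires_at - self._slack:
            self._token = self._fetch()
            self._expires_at = time.monotonic() + self._ttl
        return self._token
```

A request would then send `{"Authorization": f"Bearer {cache.get()}"}`; callers never need to know when the refresh actually happened.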
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
Feature | Command | Description
---|---|---
Code Review | `/gemini review` | Performs a code review for the current pull request in its current state.
Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state.
Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | `/gemini help` | Displays a list of available commands.
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request introduces a comprehensive framework for generating and evaluating LLM-powered agents in customer support scenarios, integrating with W&B Weave and Vertex AI. The addition of synthetic data generation and evaluation capabilities significantly enhances the project. The code is well-structured and includes detailed documentation. However, there are areas where improvements can be made to enhance maintainability, readability, and error handling.
Summary of Findings
- Error Handling in Data Loading: The
load_realistic_datasets
function incustomer_support_agent.py
includes atry-except
block for downloading and loading the Amazon Reviews Multi dataset. While this is good, theexcept
block currently only prints an error message and raises the exception. It would be beneficial to implement a fallback mechanism to load a simpler, pre-existing dataset in case the Kaggle download fails. This would ensure that the agent can still function even if the dataset is not available. - Metadata Generation: The
_generate_metadata
function indataset_generator.py
generates metadata about the agent's trajectory, such as the number of steps and tools used. This metadata is valuable for analysis and experiment tracking. However, the function could be extended to include additional information, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption. - Weave Logging: The framework leverages Weave for experiment tracking and visualization. However, the Weave logging functionality could be enhanced to provide more detailed information about the agent's internal state. For example, the agent's memory, the current step number, and the available tools could be logged at each step. This would provide a more granular view of the agent's decision-making process and allow for more detailed analysis.
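The per-step logging the review asks for can be sketched in a few lines: record the step number, the tool invoked, and a snapshot of the tools available at that moment, then hand the records to the tracker (Weave, in this PR). The record schema below is an assumption for illustration, not the PR's actual format.

```python
def log_step(records: list, step: int, tool: str, available_tools: list) -> None:
    """Append one trajectory record; in the PR these would be sent to Weave."""
    records.append({
        "step": step,
        "tool": tool,
        # Copy so later mutation of the tool list can't rewrite history.
        "available_tools": list(available_tools),
    })


records = []
tools = ["product_search", "order_status"]
log_step(records, 1, "product_search", tools)
log_step(records, 2, "order_status", tools)
```

With records shaped like this, each agent run becomes a list of dicts that Weave (or any tracker) can ingest and compare across configurations.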
Merge Readiness
The pull request is not quite ready for merging. There are some high and medium severity issues that should be addressed before merging. I am unable to approve this pull request, and users should have others review and approve this code before merging.
```python
except Exception as e:
    print(f"Error loading dataset from Kaggle: {str(e)}")
    raise  # Raise the exception instead of falling back to a simpler dataset
else:
```
Consider adding a fallback mechanism to load a simpler, pre-existing dataset in case the Kaggle download fails. This would ensure that the agent can still function even if the dataset is not available.
Suggested change:

```python
except Exception as e:
    print(f"Error loading dataset from Kaggle: {str(e)}")
    # Fallback to a simpler dataset
    products = pd.DataFrame([{'product_id': 'P1', 'name': 'Simple Product', 'category': 'Simple', 'price': 10.0, 'stock': 100, 'description': 'A basic product', 'warranty': 'None', 'return_period': 7}])
    orders = pd.DataFrame([{'order_id': 'O1', 'customer_id': 'C1', 'product_id': 'P1', 'status': 'delivered', 'order_date': '2024-01-01', 'delivery_date': '2024-01-05', 'estimated_delivery_date': None, 'tracking_number': None}])
    products.to_csv(products_path, index=False)
    orders.to_csv(orders_path, index=False)
    print("Falling back to a simpler dataset")
```
```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0
}
```
Consider adding more metadata, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.
Suggested change:

```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0,
    "average_tool_latency": np.mean(tool_latencies) if tool_latencies else 0,
    "memory_usage": memory_usage
}
```
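The suggestion above assumes a `tool_latencies` list already exists. One stdlib-only way to collect it is to wrap each tool callable so its wall-clock duration is recorded on every call; the wrapper below is a minimal sketch, and the stand-in tool and variable names are illustrative rather than the PR's.

```python
import time
from functools import wraps


def timed(tool_fn, latencies: list):
    """Wrap a tool so each call appends its duration (seconds) to latencies."""
    @wraps(tool_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return tool_fn(*args, **kwargs)
        finally:
            # Record duration even if the tool raises.
            latencies.append(time.perf_counter() - start)
    return wrapper


latencies = []
# Hypothetical stand-in for a real order-status tool.
lookup = timed(lambda order_id: {"order_id": order_id, "status": "delivered"}, latencies)
result = lookup("O1")
average_tool_latency = sum(latencies) / len(latencies) if latencies else 0
```

Wrapping at registration time means the metadata code never touches the tools themselves; it just reads the shared `latencies` list when building the summary dict.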
@inardini Here is the PR!
Description
Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
- Follow the CONTRIBUTING Guide.
- Assign a reviewer per the CODEOWNERS for the file(s).
- Ensure the code is formatted (run `nox -s format` from the repository root to format).