feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex #1807


Merged: 21 commits merged into GoogleCloudPlatform:main on May 30, 2025

Conversation

@ash0ts (Contributor) commented Mar 7, 2025

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • Make your Pull Request title follow the https://www.conventionalcommits.org/ specification.
  • Ensure the tests and linter pass (Run nox -s format from the repository root to format).
  • [ ] Appropriate docs were updated (if necessary)

@ash0ts ash0ts requested a review from a team as a code owner March 7, 2025 19:08

google-cla bot commented Mar 7, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


@gemini-code-assist bot left a comment


Hello @ash0ts, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a framework for generating synthetic data and evaluating agents, particularly in customer support scenarios. It leverages Weights & Biases (W&B) Weave for experiment tracking and Vertex AI for model serving. The framework includes components for dataset generation, agent evaluation, and analysis, with a focus on realistic data generation and comprehensive evaluation metrics.

Highlights

  • Synthetic Data Generation: Implements a dataset generator that creates realistic e-commerce customer queries, runs agents on these queries, evaluates the agent's trajectory, and filters examples based on quality thresholds.
  • Agent Evaluation Framework: Introduces a comprehensive evaluation system for measuring agent performance, including response correctness, tool selection, trajectory analysis, reasoning quality, and coherence. It generates detailed reports, visualizations, and metrics.
  • Weave Integration: Integrates W&B Weave for experiment tracking and visualization, logging agent trajectories, evaluation metrics, and enabling experiment comparison across different agent configurations.
  • Customer Support Agent: Creates a customer support agent using the smolagents framework, equipped with tools for handling e-commerce customer queries such as product search, order status, category browsing, price check, and customer order history.

Changelog

  • gemini/evaluation/synthetic-data-evals/.gitignore
    • Added .gitignore file to exclude data, evaluation results, and PNG files from version control.
  • gemini/evaluation/synthetic-data-evals/README.md
    • Created a detailed README file explaining the agent evaluation framework, its components, features, getting started guide, and Weave integration.
  • gemini/evaluation/synthetic-data-evals/clean.sh
    • Added a shell script to clean Python and Jupyter notebook files using nox -s format.
  • gemini/evaluation/synthetic-data-evals/config.py
    • Created a configuration file to define the Weave project name.
  • gemini/evaluation/synthetic-data-evals/customer_support_agent.py
    • Implemented the customer support agent with tools for product search, order status, category browsing, price check, and customer order history.
    • Added functions to create the agent and load realistic datasets from Kaggle's Amazon Reviews Multi dataset.
  • gemini/evaluation/synthetic-data-evals/customer_support_eval.json
    • Added a JSON file containing evaluation examples for the customer support agent.
  • gemini/evaluation/synthetic-data-evals/dataset_generator.py
    • Implemented the DatasetGenerator class for creating synthetic evaluation datasets, including judging final responses, individual steps, and overall trajectories.
    • Added functions to generate realistic e-commerce prompts and save the generated dataset to a JSON file and Weave.
  • gemini/evaluation/synthetic-data-evals/evaluator.py
    • Implemented the AgentEvaluator class for running evaluations on agents using Vertex AI Evaluation, defining custom metrics, and generating visualizations.
    • Added functions to format the dataset for evaluation, extract tool usage and reasoning from agent memory, and generate summary tables and plots.
  • gemini/evaluation/synthetic-data-evals/noxfile.py
    • Updated the Nox configuration file to include linting, formatting, unit tests, system tests, and documentation generation.
  • gemini/evaluation/synthetic-data-evals/pyproject.toml
    • Added project metadata and dependencies for the synthetic data evaluation framework.
  • gemini/evaluation/synthetic-data-evals/render_evals.py
    • Implemented functions to render model comparisons, difficulty analysis, temperature analysis, and a comprehensive conclusion with key findings from the evaluation results.
  • gemini/evaluation/synthetic-data-evals/setup.py
    • Added a setup script to configure Vertex AI resources, including registering the DeepSeek model and creating endpoints, and updating the .env file with the necessary environment variables.
  • gemini/evaluation/synthetic-data-evals/vertex_model.py
    • Implemented the VertexAIServerModel class for connecting to a Vertex AI-compatible API server, including authentication, token refresh, and prediction methods.
    • Added the WeaveVertexAIServerModel class for tracking Vertex AI models with Weave.
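
Putting the pieces above together, here is a minimal, hedged usage sketch of how these components could be wired up end to end. The constructor arguments mirror the snippets quoted later in this review; the import paths, the Weave project name, the `generate_dataset` and `evaluate` method names, and the example count are assumptions, since they are not shown in this conversation.

```python
import weave

from customer_support_agent import create_customer_support_agent
from dataset_generator import DatasetGenerator
from evaluator import AgentEvaluator

# Track everything in a Weave project (the actual project name lives in config.py;
# "synthetic-data-evals" is a placeholder here).
weave.init("synthetic-data-evals")

# 1. Build the smolagents-based customer support agent.
agent = create_customer_support_agent(model_id="google/gemini-2.0-flash")

# 2. Generate a synthetic evaluation dataset, judged by a separate model.
generator = DatasetGenerator(agent=agent, judge_model="gemini/gemini-2.0-flash")
dataset = generator.generate_dataset(num_examples=25)  # hypothetical method name

# 3. Run the Vertex AI-based evaluation over the generated dataset.
evaluator = AgentEvaluator(model_name="gemini-2.0-flash")
results = evaluator.evaluate(agent=agent, dataset=dataset)  # hypothetical method name
```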
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Did you know?

The first AI program, Logic Theorist, was created in 1956 by Allen Newell, Cliff Shaw, and Herbert Simon, and it was designed to prove theorems in symbolic logic.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive framework for generating and evaluating LLM-powered agents in customer support scenarios, integrating with W&B Weave and Vertex AI. The addition of synthetic data generation and evaluation capabilities significantly enhances the project. The code is well-structured and includes detailed documentation. However, there are areas where improvements can be made to enhance maintainability, readability, and error handling.

Summary of Findings

  • Error Handling in Data Loading: The load_realistic_datasets function in customer_support_agent.py includes a try-except block for downloading and loading the Amazon Reviews Multi dataset. While this is good, the except block currently only prints an error message and re-raises the exception. It would be beneficial to implement a fallback mechanism that loads a simpler, pre-existing dataset in case the Kaggle download fails (a sketch of this pattern follows this list). This would ensure that the agent can still function even if the dataset is not available.
  • Metadata Generation: The _generate_metadata function in dataset_generator.py generates metadata about the agent's trajectory, such as the number of steps and tools used. This metadata is valuable for analysis and experiment tracking. However, the function could be extended to include additional information, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.
  • Weave Logging: The framework leverages Weave for experiment tracking and visualization. However, the Weave logging functionality could be enhanced to provide more detailed information about the agent's internal state. For example, the agent's memory, the current step number, and the available tools could be logged at each step. This would provide a more granular view of the agent's decision-making process and allow for more detailed analysis.
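
A minimal sketch of the fallback pattern described in the first finding above, assuming a generic exception around the Kaggle download; the `_download_amazon_reviews_multi` helper and the fallback examples are hypothetical placeholders, not the PR's actual `load_realistic_datasets` implementation.

```python
def load_realistic_datasets() -> list[dict]:
    """Load e-commerce examples from Kaggle, falling back to a tiny built-in set."""
    try:
        # Hypothetical helper standing in for the Kaggle download used by the PR.
        return _download_amazon_reviews_multi()
    except Exception as error:  # network failures, missing credentials, etc.
        print(f"Kaggle download failed ({error}); using built-in fallback examples.")
        return [
            {"query": "Where is my order #1001?", "category": "order_status"},
            {"query": "Do you have wireless headphones under $50?", "category": "product_search"},
        ]


def _download_amazon_reviews_multi() -> list[dict]:
    """Hypothetical stand-in for the real Kaggle download logic."""
    raise NotImplementedError
```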

Merge Readiness

The pull request is not quite ready for merging. There are some high and medium severity issues that should be addressed before merging. I am unable to approve this pull request, and users should have others review and approve this code before merging.

Comment on lines 308 to 312
```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0
}
```


Severity: medium

Consider adding more metadata, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.

Suggested change:

```diff
 return {
     "num_steps": len(trajectory.steps),
     "tools_used": list(set(tools_used)),  # Remove duplicates
-    "has_planning": self.agent.planning_interval > 0
+    "has_planning": self.agent.planning_interval > 0,
+    "average_tool_latency": np.mean(tool_latencies) if tool_latencies else 0,
+    "memory_usage": memory_usage
 }
```
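
The suggestion above references `tool_latencies` and `memory_usage` without showing where they come from. A hedged sketch of one way such values could be collected follows; the `start_time`/`end_time` step attributes and the use of `psutil` are assumptions for illustration, not part of this PR.

```python
import numpy as np
import psutil  # assumption: psutil is available for sampling process memory


def _collect_tool_latencies(trajectory) -> list[float]:
    """Hypothetical helper: assumes each step exposes start/end timestamps."""
    latencies = []
    for step in trajectory.steps:
        start = getattr(step, "start_time", None)
        end = getattr(step, "end_time", None)
        if start is not None and end is not None:
            latencies.append(end - start)
    return latencies


def _current_memory_mb() -> float:
    """Resident set size of the current process, in megabytes."""
    return psutil.Process().memory_info().rss / (1024 * 1024)


def extended_metadata(trajectory) -> dict:
    """Sketch of the extra fields the review suggestion refers to."""
    tool_latencies = _collect_tool_latencies(trajectory)
    return {
        "average_tool_latency": float(np.mean(tool_latencies)) if tool_latencies else 0.0,
        "memory_usage": _current_memory_mb(),
    }
```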

@ash0ts (Contributor, Author) commented Mar 7, 2025

@inardini Here is the PR!

@holtskinner (Collaborator)

@ash0ts Closing due to inactivity. Please re-open once the CLA is signed

@holtskinner closed this Apr 3, 2025
@ash0ts (Contributor, Author) commented Apr 18, 2025

Hello! I signed the CLA

@ash0ts (Contributor, Author) commented Apr 18, 2025

I don't have the option to re-open this branch.

@inardini reopened this Apr 18, 2025
@ash0ts (Contributor, Author) commented Apr 25, 2025

@holtskinner @inardini I signed the CLA, but it seems not to have been reflected here, which is blocking the merge. Anything I can do?

@ash0ts (Contributor, Author) commented Apr 30, 2025

Wanted to bump this!

@holtskinner (Collaborator)

@ash0ts You have to re-request a CLA scan or push a new commit; I did that manually.

@holtskinner (Collaborator) left a comment


Thanks for the contribution! Sorry for the delay in reviewing, missed your comment about signing the CLA.

Overall, I'd like to see this simplified to be easier to follow/understand.

If any of the code for connecting Vertex AI to Weave could be added to the main weave library, I'd like to see that get merged/released before publishing this example to the repository. (Or at least put that library code in a separate folder for organization)
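
For readers unfamiliar with what "connecting Vertex AI to Weave" involves, here is a minimal, hedged sketch of the general tracing pattern using `weave.init` plus a `weave.op`-decorated call through the google-genai SDK. The helper name, project IDs, and SDK choice are assumptions for illustration and do not reflect the PR's actual VertexAIServerModel/WeaveVertexAIServerModel implementation.

```python
import weave
from google import genai  # assumption: the google-genai SDK is used to reach the model

# Initialize a Weave project so decorated calls are traced automatically.
weave.init("synthetic-data-evals")  # placeholder project name

# Placeholder GCP project and region.
client = genai.Client(vertexai=True, project="my-gcp-project", location="us-central1")


@weave.op()
def call_vertex_model(prompt: str, model: str = "gemini-2.0-flash") -> str:
    """Hypothetical helper: one traced call to a Vertex AI-hosted model."""
    response = client.models.generate_content(model=model, contents=prompt)
    return response.text


print(call_vertex_model("Where is my order #12345?"))
```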


Collaborator review comment on:

```python
agent = create_customer_support_agent(
    model_id="google/gemini-1.5-pro",
```

Suggested change:

```diff
-    model_id="google/gemini-1.5-pro",
+    model_id="google/gemini-2.0-flash",
```

Reply:

While gemini-2.0-flash is a valid model ID, it's important to ensure that the model is available and suitable for the intended use case within the create_customer_support_agent function. It would be good to verify that this model is the most appropriate choice for this specific agent and evaluation scenario.

Suggested change:

```diff
-    model_id="google/gemini-1.5-pro",
+    model_id="google/gemini-2.0-flash",
```

Collaborator review comment on:

```python
generator = DatasetGenerator(
    agent=agent,
    judge_model="gemini/gemini-1.5-pro",
```

Suggested change:

```diff
-    judge_model="gemini/gemini-1.5-pro",
+    judge_model="gemini/gemini-2.0-flash",
```

Do the same throughout.

Reply:

Yes, I agree that consistency in model naming is important for clarity. I suggest applying this change throughout the document wherever gemini-1.5-pro is used for the judge model.

Suggested change:

```diff
-    judge_model="gemini/gemini-1.5-pro",
+    judge_model="gemini/gemini-2.0-flash",
```


Collaborator review comment on:

```python
evaluator = AgentEvaluator(
    model_name="gemini-1.5-pro",
```

Suggested change:

```diff
-    model_name="gemini-1.5-pro",
+    model_name="gemini-2.0-flash",
```

Reply:

Thanks for the suggestion! I've updated the model_name to gemini-2.0-flash in the code snippet.

Suggested change:

```diff
-    model_name="gemini-1.5-pro",
+    model_name="gemini-2.0-flash",
```

Collaborator review comment:

Since you have a Jupyter notebook showing how to run this, can you change this README into a notebook showing how this can be done?

Collaborator review comment:

Don't think this is needed. nox -s format from the repository root should handle all of this.
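
For reference, a hedged sketch of what a nox format session typically looks like is shown below; the session body and tool choices (black, isort) are assumptions for illustration, not the repository's actual noxfile.py.

```python
# noxfile.py (sketch) -- the kind of "format" session that `nox -s format` would run.
import nox


@nox.session(name="format")
def format_code(session: nox.Session) -> None:
    """Install formatters and rewrite files in place."""
    session.install("black", "isort")
    session.run("isort", ".")
    session.run("black", ".")
```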

Collaborator review comment on:

```python
if tool_name:
    tool_names_from_steps.append(tool_name)

# Return both the Vertex score and the actual tool usage for analysis
```

Suggested change:

```diff
-# Return both the Vertex score and the actual tool usage for analysis
+# Return both the Vertex AI score and the actual tool usage for analysis
```

Collaborator review comment:

This file can probably be deleted

Collaborator review comment on:

```python
import seaborn as sns


def render_model_comparison(all_results, console=None):
```

Can you add parameter typing wherever possible?
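
As an illustration of the request above, here is a hedged sketch of what a typed signature could look like; the result structure and the use of rich's Console are assumptions, not confirmed by the PR.

```python
from rich.console import Console  # assumption: the console parameter is a rich Console


def render_model_comparison(
    all_results: dict[str, dict[str, float]],  # assumed shape: {model_name: {metric: score}}
    console: Console | None = None,
) -> None:
    """Render a side-by-side comparison of per-model evaluation metrics."""
    console = console or Console()
    for model_name, metrics in all_results.items():
        console.print(f"[bold]{model_name}[/bold]: {metrics}")
```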

Collaborator review comment:

Change the repo link to be for the main repo.

Collaborator review comment:

Can any of the code adding Vertex AI support into weave be added to the actual weave library?

@holtskinner (Collaborator)

Fixed most of the spelling errors, but there are still several lint errors that need to be resolved.

@holtskinner merged commit eea839c into GoogleCloudPlatform:main on May 30, 2025
4 of 6 checks passed