
feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex #1807

Open · wants to merge 18 commits into main
Conversation


@ash0ts commented Mar 7, 2025

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • Ensure your Pull Request title follows the Conventional Commits specification (https://www.conventionalcommits.org/).
  • Ensure the tests and linter pass (run nox -s format from the repository root to format).
  • Appropriate docs were updated (if necessary).

@ash0ts ash0ts requested a review from a team as a code owner March 7, 2025 19:08

google-cla bot commented Mar 7, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


@gemini-code-assist bot left a comment


Hello @ash0ts, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a framework for generating synthetic data and evaluating agents, particularly in customer support scenarios. It leverages Weights & Biases (W&B) Weave for experiment tracking and Vertex AI for model serving. The framework includes components for dataset generation, agent evaluation, and analysis, with a focus on realistic data generation and comprehensive evaluation metrics.

Highlights

  • Synthetic Data Generation: Implements a dataset generator that creates realistic e-commerce customer queries, runs agents on these queries, evaluates the agent's trajectory, and filters examples based on quality thresholds.
  • Agent Evaluation Framework: Introduces a comprehensive evaluation system for measuring agent performance, including response correctness, tool selection, trajectory analysis, reasoning quality, and coherence. It generates detailed reports, visualizations, and metrics.
  • Weave Integration: Integrates W&B Weave for experiment tracking and visualization, logging agent trajectories, evaluation metrics, and enabling experiment comparison across different agent configurations.
  • Customer Support Agent: Creates a customer support agent using the smolagents framework, equipped with tools for handling e-commerce customer queries such as product search, order status, category browsing, price check, and customer order history (a minimal sketch follows this list).
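
For context on the smolagents pattern the last highlight refers to, here is a minimal sketch; the tool names, docstrings, and canned return values are illustrative assumptions, not the PR's actual implementation:

```python
# Minimal sketch of a smolagents-style customer support tool setup.
# The tools and data below are hypothetical placeholders.
from smolagents import CodeAgent, tool

@tool
def order_status(order_id: str) -> str:
    """Look up the shipping status of an order.

    Args:
        order_id: The order identifier, e.g. "O1".
    """
    # The real agent would query the orders dataset here.
    return f"Order {order_id} is out for delivery."

@tool
def product_search(query: str) -> str:
    """Search the product catalog for matching items.

    Args:
        query: Free-text search terms from the customer.
    """
    # The real agent would search the products dataset here.
    return f"Found 3 products matching '{query}'."

# `model` would be the PR's Vertex AI-backed model wrapper:
# agent = CodeAgent(tools=[order_status, product_search], model=model)
# agent.run("Where is my order O1?")
```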

Changelog

  • gemini/evaluation/synthetic-data-evals/.gitignore
    • Added .gitignore file to exclude data, evaluation results, and PNG files from version control.
  • gemini/evaluation/synthetic-data-evals/README.md
    • Created a detailed README file explaining the agent evaluation framework, its components, features, getting started guide, and Weave integration.
  • gemini/evaluation/synthetic-data-evals/clean.sh
    • Added a shell script to clean Python and Jupyter notebook files using nox -s format.
  • gemini/evaluation/synthetic-data-evals/config.py
    • Created a configuration file to define the Weave project name.
  • gemini/evaluation/synthetic-data-evals/customer_support_agent.py
    • Implemented the customer support agent with tools for product search, order status, category browsing, price check, and customer order history.
    • Added functions to create the agent and load realistic datasets from Kaggle's Amazon Reviews Multi dataset.
  • gemini/evaluation/synthetic-data-evals/customer_support_eval.json
    • Added a JSON file containing evaluation examples for the customer support agent.
  • gemini/evaluation/synthetic-data-evals/dataset_generator.py
    • Implemented the DatasetGenerator class for creating synthetic evaluation datasets, including judging final responses, individual steps, and overall trajectories.
    • Added functions to generate realistic e-commerce prompts and save the generated dataset to a JSON file and Weave (a sketch of this flow follows the changelog).
  • gemini/evaluation/synthetic-data-evals/evaluator.py
    • Implemented the AgentEvaluator class for running evaluations on agents using Vertex AI Evaluation, defining custom metrics, and generating visualizations (an EvalTask sketch follows the changelog).
    • Added functions to format the dataset for evaluation, extract tool usage and reasoning from agent memory, and generate summary tables and plots.
  • gemini/evaluation/synthetic-data-evals/noxfile.py
    • Updated the Nox configuration file to include linting, formatting, unit tests, system tests, and documentation generation.
  • gemini/evaluation/synthetic-data-evals/pyproject.toml
    • Added project metadata and dependencies for the synthetic data evaluation framework.
  • gemini/evaluation/synthetic-data-evals/render_evals.py
    • Implemented functions to render model comparisons, difficulty analysis, temperature analysis, and a comprehensive conclusion with key findings from the evaluation results.
  • gemini/evaluation/synthetic-data-evals/setup.py
    • Added a setup script to configure Vertex AI resources, including registering the DeepSeek model and creating endpoints, and updating the .env file with the necessary environment variables.
  • gemini/evaluation/synthetic-data-evals/vertex_model.py
    • Implemented the VertexAIServerModel class for connecting to a Vertex AI-compatible API server, including authentication, token refresh, and prediction methods (a token-refresh sketch follows the changelog).
    • Added the WeaveVertexAIServerModel class for tracking Vertex AI models with Weave.
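
To make the dataset_generator.py flow concrete, here is a hedged sketch of a generate-run-judge-filter loop; the helper names (generate_queries, judge_trajectory) and the quality threshold are assumptions for illustration, not the module's actual interface:

```python
# Illustrative generate-then-filter loop for building a synthetic eval set.
# `generate_queries` and `judge_trajectory` are hypothetical callables.
import json

def build_dataset(agent, generate_queries, judge_trajectory, threshold=0.7):
    examples = []
    for query in generate_queries(n=50):         # synthetic e-commerce prompts
        result = agent.run(query)                # run the agent on each query
        score = judge_trajectory(query, result)  # LLM-as-judge quality score
        if score >= threshold:                   # keep only high-quality rows
            examples.append({"input": query, "output": result, "score": score})
    with open("customer_support_eval.json", "w") as f:
        json.dump(examples, f, indent=2)
    return examples
```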
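
For the evaluator.py item, a minimal Vertex AI Evaluation flow might look like the sketch below, which scores pre-computed agent responses; the project ID, experiment name, columns, and metric choices are placeholder assumptions:

```python
# Hedged sketch of scoring pre-computed agent responses with Vertex AI
# Evaluation. Project, experiment name, and metrics are placeholders.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "prompt": ["Where is my order O1?"],
        "response": ["Your order O1 is out for delivery."],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "fluency"],  # built-in pointwise metrics
    experiment="customer-support-evals",
)
result = eval_task.evaluate()
print(result.summary_metrics)
```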
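
For the vertex_model.py item, the authentication and token-refresh behavior it describes typically follows the Application Default Credentials pattern sketched below; this TokenProvider class is a hypothetical illustration, not the PR's VertexAIServerModel:

```python
# Hedged sketch of OAuth2 token refresh for calling a Vertex AI-compatible
# API server. Uses Application Default Credentials (ADC).
import google.auth
import google.auth.transport.requests

class TokenProvider:
    """Supplies a fresh access token for Vertex AI REST calls."""

    def __init__(self):
        self.credentials, self.project = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )

    def token(self) -> str:
        # Refresh only when the cached token is missing or expired.
        if not self.credentials.valid:
            self.credentials.refresh(google.auth.transport.requests.Request())
        return self.credentials.token
```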


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive framework for generating and evaluating LLM-powered agents in customer support scenarios, integrating with W&B Weave and Vertex AI. The addition of synthetic data generation and evaluation capabilities significantly enhances the project. The code is well-structured and includes detailed documentation. However, there are areas where improvements can be made to enhance maintainability, readability, and error handling.

Summary of Findings

  • Error Handling in Data Loading: The load_realistic_datasets function in customer_support_agent.py includes a try-except block for downloading and loading the Amazon Reviews Multi dataset. While this is good, the except block currently only prints an error message and raises the exception. It would be beneficial to implement a fallback mechanism to load a simpler, pre-existing dataset in case the Kaggle download fails. This would ensure that the agent can still function even if the dataset is not available.
  • Metadata Generation: The _generate_metadata function in dataset_generator.py generates metadata about the agent's trajectory, such as the number of steps and tools used. This metadata is valuable for analysis and experiment tracking. However, the function could be extended to include additional information, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.
  • Weave Logging: The framework leverages Weave for experiment tracking and visualization. However, the Weave logging functionality could be enhanced to provide more detailed information about the agent's internal state. For example, the agent's memory, the current step number, and the available tools could be logged at each step. This would provide a more granular view of the agent's decision-making process and allow for more detailed analysis (a brief sketch of this pattern follows this list).
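
As a rough illustration of the per-step Weave logging suggested above (the project name and logged fields are placeholder assumptions, not the framework's actual schema):

```python
# Hedged sketch of per-step tracing with W&B Weave: any call to a
# @weave.op() function is recorded with its inputs and outputs.
import weave

weave.init("customer-support-evals")  # placeholder project name

@weave.op()
def run_step(step_number: int, tool_name: str, tool_input: str) -> str:
    # Step number, tool choice, and tool input become visible per step
    # in the Weave trace UI.
    return f"step {step_number}: called {tool_name} with {tool_input!r}"
```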

Merge Readiness

The pull request is not quite ready for merging. There are some high and medium severity issues that should be addressed before merging. I am unable to approve this pull request, and users should have others review and approve this code before merging.

Comment on lines +563 to +566
```python
except Exception as e:
    print(f"Error loading dataset from Kaggle: {str(e)}")
    raise  # Raise the exception instead of falling back to a simpler dataset
else:
```


Severity: high

Consider adding a fallback mechanism to load a simpler, pre-existing dataset in case the Kaggle download fails. This would ensure that the agent can still function even if the dataset is not available.

Suggested change (replacing the raise with a fallback):

```python
except Exception as e:
    print(f"Error loading dataset from Kaggle: {str(e)}")
    # Fall back to a simpler, pre-existing dataset
    products = pd.DataFrame([{
        "product_id": "P1", "name": "Simple Product", "category": "Simple",
        "price": 10.0, "stock": 100, "description": "A basic product",
        "warranty": "None", "return_period": 7,
    }])
    orders = pd.DataFrame([{
        "order_id": "O1", "customer_id": "C1", "product_id": "P1",
        "status": "delivered", "order_date": "2024-01-01",
        "delivery_date": "2024-01-05", "estimated_delivery_date": None,
        "tracking_number": None,
    }])
    products.to_csv(products_path, index=False)
    orders.to_csv(orders_path, index=False)
    print("Falling back to a simpler dataset")
```

Comment on lines 308 to 312
```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0
}
```


Severity: medium

Consider adding more metadata, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.

Suggested change (note: tool_latencies and memory_usage are not defined in the current function; they would need to be collected while walking the trajectory, with numpy imported as np):

```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0,
    "average_tool_latency": np.mean(tool_latencies) if tool_latencies else 0,
    "memory_usage": memory_usage
}
```

@ash0ts (Author) commented Mar 7, 2025

@inardini Here is the PR!
