
feat: Add Synthetic Data Gen and Evals for Agents using W&B Weave + Vertex #1807

Open · wants to merge 18 commits into main
Conversation


@ash0ts commented Mar 7, 2025

Description

Thank you for opening a Pull Request!
Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Follow the CONTRIBUTING Guide.
  • You are listed as the author in your notebook or README file.
    • Your account is listed in CODEOWNERS for the file(s).
  • Ensure your Pull Request title follows the Conventional Commits specification (https://www.conventionalcommits.org/).
  • Ensure the tests and linter pass (run nox -s format from the repository root to format).
  • Appropriate docs were updated (if necessary).

@ash0ts ash0ts requested a review from a team as a code owner March 7, 2025 19:08

google-cla bot commented Mar 7, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


@gemini-code-assist bot left a comment


Hello @ash0ts, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces a framework for generating synthetic data and evaluating agents, particularly in customer support scenarios. It leverages Weights & Biases (W&B) Weave for experiment tracking and Vertex AI for model serving. The framework includes components for dataset generation, agent evaluation, and analysis, with a focus on realistic data generation and comprehensive evaluation metrics.

Highlights

  • Synthetic Data Generation: Implements a dataset generator that creates realistic e-commerce customer queries, runs agents on these queries, evaluates the agent's trajectory, and filters examples based on quality thresholds.
  • Agent Evaluation Framework: Introduces a comprehensive evaluation system for measuring agent performance, including response correctness, tool selection, trajectory analysis, reasoning quality, and coherence. It generates detailed reports, visualizations, and metrics.
  • Weave Integration: Integrates W&B Weave for experiment tracking and visualization, logging agent trajectories, evaluation metrics, and enabling experiment comparison across different agent configurations.
  • Customer Support Agent: Creates a customer support agent using the smolagents framework, equipped with tools for handling e-commerce customer queries such as product search, order status, category browsing, price check, and customer order history (a minimal sketch follows this list).
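
For context on the smolagents pattern the last highlight refers to, here is a minimal sketch; the tool names, docstrings, and canned return values are illustrative assumptions, not the PR's actual implementation:

```python
# Minimal sketch of a smolagents-style customer support tool setup.
# The tools and data below are hypothetical placeholders.
from smolagents import CodeAgent, tool

@tool
def order_status(order_id: str) -> str:
    """Look up the shipping status of an order.

    Args:
        order_id: The order identifier, e.g. "O1".
    """
    # The real agent would query the orders dataset here.
    return f"Order {order_id} is out for delivery."

@tool
def product_search(query: str) -> str:
    """Search the product catalog for matching items.

    Args:
        query: Free-text search terms from the customer.
    """
    # The real agent would search the products dataset here.
    return f"Found 3 products matching '{query}'."

# `model` would be the PR's Vertex AI-backed model wrapper:
# agent = CodeAgent(tools=[order_status, product_search], model=model)
# agent.run("Where is my order O1?")
```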

Changelog

  • gemini/evaluation/synthetic-data-evals/.gitignore
    • Added .gitignore file to exclude data, evaluation results, and PNG files from version control.
  • gemini/evaluation/synthetic-data-evals/README.md
    • Created a detailed README file explaining the agent evaluation framework, its components, features, getting started guide, and Weave integration.
  • gemini/evaluation/synthetic-data-evals/clean.sh
    • Added a shell script to clean Python and Jupyter notebook files using nox -s format.
  • gemini/evaluation/synthetic-data-evals/config.py
    • Created a configuration file to define the Weave project name.
  • gemini/evaluation/synthetic-data-evals/customer_support_agent.py
    • Implemented the customer support agent with tools for product search, order status, category browsing, price check, and customer order history.
    • Added functions to create the agent and load realistic datasets from Kaggle's Amazon Reviews Multi dataset.
  • gemini/evaluation/synthetic-data-evals/customer_support_eval.json
    • Added a JSON file containing evaluation examples for the customer support agent.
  • gemini/evaluation/synthetic-data-evals/dataset_generator.py
    • Implemented the DatasetGenerator class for creating synthetic evaluation datasets, including judging final responses, individual steps, and overall trajectories.
    • Added functions to generate realistic e-commerce prompts and save the generated dataset to a JSON file and Weave (a sketch of this flow follows the changelog).
  • gemini/evaluation/synthetic-data-evals/evaluator.py
    • Implemented the AgentEvaluator class for running evaluations on agents using Vertex AI Evaluation, defining custom metrics, and generating visualizations (an EvalTask sketch follows the changelog).
    • Added functions to format the dataset for evaluation, extract tool usage and reasoning from agent memory, and generate summary tables and plots.
  • gemini/evaluation/synthetic-data-evals/noxfile.py
    • Updated the Nox configuration file to include linting, formatting, unit tests, system tests, and documentation generation.
  • gemini/evaluation/synthetic-data-evals/pyproject.toml
    • Added project metadata and dependencies for the synthetic data evaluation framework.
  • gemini/evaluation/synthetic-data-evals/render_evals.py
    • Implemented functions to render model comparisons, difficulty analysis, temperature analysis, and a comprehensive conclusion with key findings from the evaluation results.
  • gemini/evaluation/synthetic-data-evals/setup.py
    • Added a setup script to configure Vertex AI resources, including registering the DeepSeek model and creating endpoints, and updating the .env file with the necessary environment variables.
  • gemini/evaluation/synthetic-data-evals/vertex_model.py
    • Implemented the VertexAIServerModel class for connecting to a Vertex AI-compatible API server, including authentication, token refresh, and prediction methods (a token-refresh sketch follows the changelog).
    • Added the WeaveVertexAIServerModel class for tracking Vertex AI models with Weave.
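
To make the dataset_generator.py flow concrete, here is a hedged sketch of a generate-run-judge-filter loop; the helper names (generate_queries, judge_trajectory) and the quality threshold are assumptions for illustration, not the module's actual interface:

```python
# Illustrative generate-then-filter loop for building a synthetic eval set.
# `generate_queries` and `judge_trajectory` are hypothetical callables.
import json

def build_dataset(agent, generate_queries, judge_trajectory, threshold=0.7):
    examples = []
    for query in generate_queries(n=50):         # synthetic e-commerce prompts
        result = agent.run(query)                # run the agent on each query
        score = judge_trajectory(query, result)  # LLM-as-judge quality score
        if score >= threshold:                   # keep only high-quality rows
            examples.append({"input": query, "output": result, "score": score})
    with open("customer_support_eval.json", "w") as f:
        json.dump(examples, f, indent=2)
    return examples
```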
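
For the evaluator.py item, a minimal Vertex AI Evaluation flow might look like the sketch below, which scores pre-computed agent responses; the project ID, experiment name, columns, and metric choices are placeholder assumptions:

```python
# Hedged sketch of scoring pre-computed agent responses with Vertex AI
# Evaluation. Project, experiment name, and metrics are placeholders.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="my-project", location="us-central1")

eval_dataset = pd.DataFrame(
    {
        "prompt": ["Where is my order O1?"],
        "response": ["Your order O1 is out for delivery."],
    }
)

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["coherence", "fluency"],  # built-in pointwise metrics
    experiment="customer-support-evals",
)
result = eval_task.evaluate()
print(result.summary_metrics)
```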
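
For the vertex_model.py item, the authentication and token-refresh behavior it describes typically follows the Application Default Credentials pattern sketched below; this TokenProvider class is a hypothetical illustration, not the PR's VertexAIServerModel:

```python
# Hedged sketch of OAuth2 token refresh for calling a Vertex AI-compatible
# API server. Uses Application Default Credentials (ADC).
import google.auth
import google.auth.transport.requests

class TokenProvider:
    """Supplies a fresh access token for Vertex AI REST calls."""

    def __init__(self):
        self.credentials, self.project = google.auth.default(
            scopes=["https://www.googleapis.com/auth/cloud-platform"]
        )

    def token(self) -> str:
        # Refresh only when the cached token is missing or expired.
        if not self.credentials.valid:
            self.credentials.refresh(google.auth.transport.requests.Request())
        return self.credentials.token
```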


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive framework for generating and evaluating LLM-powered agents in customer support scenarios, integrating with W&B Weave and Vertex AI. The addition of synthetic data generation and evaluation capabilities significantly enhances the project. The code is well-structured and includes detailed documentation. However, there are areas where improvements can be made to enhance maintainability, readability, and error handling.

Summary of Findings

  • Error Handling in Data Loading: The load_realistic_datasets function in customer_support_agent.py includes a try-except block for downloading and loading the Amazon Reviews Multi dataset. While this is good, the except block currently only prints an error message and raises the exception. It would be beneficial to implement a fallback mechanism to load a simpler, pre-existing dataset in case the Kaggle download fails. This would ensure that the agent can still function even if the dataset is not available.
  • Metadata Generation: The _generate_metadata function in dataset_generator.py generates metadata about the agent's trajectory, such as the number of steps and tools used. This metadata is valuable for analysis and experiment tracking. However, the function could be extended to include additional information, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.
  • Weave Logging: The framework leverages Weave for experiment tracking and visualization. However, the Weave logging functionality could be enhanced to provide more detailed information about the agent's internal state. For example, the agent's memory, the current step number, and the available tools could be logged at each step. This would provide a more granular view of the agent's decision-making process and allow for more detailed analysis (a brief sketch of this pattern follows this list).
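
As a rough illustration of the per-step Weave logging suggested above (the project name and logged fields are placeholder assumptions, not the framework's actual schema):

```python
# Hedged sketch of per-step tracing with W&B Weave: any call to a
# @weave.op() function is recorded with its inputs and outputs.
import weave

weave.init("customer-support-evals")  # placeholder project name

@weave.op()
def run_step(step_number: int, tool_name: str, tool_input: str) -> str:
    # Step number, tool choice, and tool input become visible per step
    # in the Weave trace UI.
    return f"step {step_number}: called {tool_name} with {tool_input!r}"
```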

Merge Readiness

The pull request is not quite ready for merging. There are some high and medium severity issues that should be addressed before merging. I am unable to approve this pull request, and users should have others review and approve this code before merging.

Comment on lines +563 to +566
```python
except Exception as e:
    print(f"Error loading dataset from Kaggle: {str(e)}")
    raise  # Raise the exception instead of falling back to a simpler dataset
else:
```


Severity: high

Consider adding a fallback mechanism to load a simpler, pre-existing dataset in case the Kaggle download fails. This would ensure that the agent can still function even if the dataset is not available.

Suggested change (replacing the raise with a fallback):

```python
except Exception as e:
    print(f"Error loading dataset from Kaggle: {str(e)}")
    # Fall back to a simpler, pre-existing dataset
    products = pd.DataFrame([{
        "product_id": "P1", "name": "Simple Product", "category": "Simple",
        "price": 10.0, "stock": 100, "description": "A basic product",
        "warranty": "None", "return_period": 7,
    }])
    orders = pd.DataFrame([{
        "order_id": "O1", "customer_id": "C1", "product_id": "P1",
        "status": "delivered", "order_date": "2024-01-01",
        "delivery_date": "2024-01-05", "estimated_delivery_date": None,
        "tracking_number": None,
    }])
    products.to_csv(products_path, index=False)
    orders.to_csv(orders_path, index=False)
    print("Falling back to a simpler dataset")
```

Comment on lines 308 to 312
```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0
}
```


Severity: medium

Consider adding more metadata, such as the latency of each tool call, the memory usage of the agent, and the success rate of each step. This would provide a more comprehensive view of the agent's performance and resource consumption.

Suggested change (note: tool_latencies and memory_usage are not defined in the current function; they would need to be collected while walking the trajectory, with numpy imported as np):

```python
return {
    "num_steps": len(trajectory.steps),
    "tools_used": list(set(tools_used)),  # Remove duplicates
    "has_planning": self.agent.planning_interval > 0,
    "average_tool_latency": np.mean(tool_latencies) if tool_latencies else 0,
    "memory_usage": memory_usage
}
```

@ash0ts (Author) commented Mar 7, 2025

@inardini Here is the PR!
