diff --git a/docs/evaluate/criteria.md b/docs/evaluate/criteria.md new file mode 100644 index 000000000..77755e892 --- /dev/null +++ b/docs/evaluate/criteria.md @@ -0,0 +1,419 @@ +# Evaluation Criteria + +This page outlines the evaluation criteria provided by ADK to assess agent +performance, including tool use trajectory, response quality, and safety. + +Criterion | Description | Reference-Based | Requires Rubrics | LLM-as-a-Judge +:--------------------------------------- | :-------------------------------------------------------- | :-------------- | :--------------- | :------------- +`tool_trajectory_avg_score` | Exact match of tool call trajectory | Yes | No | No +`response_match_score` | ROUGE-1 similarity to reference response | Yes | No | No +`final_response_match_v2` | LLM-judged semantic match to reference response | Yes | No | Yes +`rubric_based_final_response_quality_v1` | LLM-judged final response quality based on custom rubrics | No | Yes | Yes +`rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality based on custom rubrics | No | Yes | Yes +`hallucinations_v1` | LLM-judged groundedness of agent response against context | No | No | Yes +`safety_v1` | Safety/harmlessness of agent response | No | No | Yes + +## tool_trajectory_avg_score + +This criterion compares the sequence of tools called by the agent against a list +of expected calls and computes an average score based on exact match. + +### When To Use This Criterion? + +This criterion is ideal for scenarios where the correctness of an agent's +behavior is strictly dependent on following a precise sequence of tool calls +with exact arguments. Use it when you need to enforce a specific tool execution +path and consider any deviation—whether in tool name, arguments, or order—as a +failure. It is particularly valuable for: + +* **Regression testing:** Ensuring that agent updates do not unintentionally + alter tool call behavior for established test cases. +* **Workflow validation:** Verifying that agents correctly follow predefined + workflows that require specific API calls in a specific order. +* **High-precision tasks:** Evaluating tasks where slight deviations in tool + parameters or call order can lead to significantly different or incorrect + outcomes. + +### Details + +For each invocation that is being evaluated, this criterion compares the list +and order of tool calls produced by the agent against the list of expected tool +calls. The comparison is done by performing an exact match on the tool name and +tool arguments for each tool call in the list. If all tool calls in an +invocation match exactly in content and order, a score of 1.0 is awarded for +that invocation, otherwise the score is 0.0. The final value is the average of +these scores across all invocations in the eval case. + +### How To Use This Criterion? + +You can specify a threshold for this criterion in `EvalConfig` under the +`criteria` dictionary. The value should be a float between 0.0 and 1.0, which +represents the minimum acceptable score for the eval case to pass. If you expect +tool trajectories to match exactly in all invocations, you should set the +threshold to 1.0. + +Example `EvalConfig` entry: + +```json +{ + "criteria": { + "tool_trajectory_avg_score": 1.0 + } +} +``` + +### Output And How To Interpret + +The output is a score between 0.0 and 1.0, where 1.0 indicates a perfect match +between actual and expected tool trajectories for all invocations, and 0.0 +indicates a complete mismatch for all invocations. 
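+
+To make the scoring concrete, the following minimal sketch shows how the
+per-invocation exact match and averaging could be computed. It is an
+illustration only, not ADK's implementation; the `(name, args)` representation
+of a tool call and the `get_weather` tool are assumptions made for this example:
+
+```python
+# Illustrative sketch of the scoring logic, not ADK's implementation.
+# A tool call is represented here as a (name, args) pair.
+
+def invocation_score(actual: list, expected: list) -> float:
+    """1.0 only if both trajectories match exactly in content and order."""
+    return 1.0 if actual == expected else 0.0
+
+def tool_trajectory_avg_score(invocations: list) -> float:
+    """Average of the per-invocation scores across the eval case."""
+    scores = [invocation_score(actual, expected) for actual, expected in invocations]
+    return sum(scores) / len(scores)
+
+# Example: 2 of 3 invocations match exactly, so the score is ~0.67 and the
+# eval case would fail against a threshold of 1.0.
+invocations = [
+    ([("get_weather", {"city": "Paris"})], [("get_weather", {"city": "Paris"})]),
+    ([("get_weather", {"city": "Rome"})], [("get_weather", {"city": "Rome"})]),
+    ([("get_weather", {})], [("get_weather", {"city": "Tokyo"})]),
+]
+print(tool_trajectory_avg_score(invocations))  # 0.666...
+```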
+
+Higher scores are better. A score below 1.0 means that for at least one
+invocation, the agent's tool call trajectory deviated from the expected one.
+
+## response_match_score
+
+This criterion evaluates whether the agent's final response matches a
+golden/expected final response using ROUGE-1.
+
+### When To Use This Criterion?
+
+Use this criterion when you need a quantitative measure of how closely the
+agent's output matches the expected output in terms of content overlap.
+
+### Details
+
+ROUGE-1 specifically measures the overlap of unigrams (single words) between
+the system-generated text (the candidate) and the reference text. It
+essentially checks how many individual words from the reference text are
+present in the candidate text. To learn more, see details on
+[ROUGE-1](https://github.com/google-research/google-research/tree/master/rouge).
+
+### How To Use This Criterion?
+
+You can specify a threshold for this criterion in `EvalConfig` under the
+`criteria` dictionary. The value should be a float between 0.0 and 1.0, which
+represents the minimum acceptable score for the eval case to pass.
+
+Example `EvalConfig` entry:
+
+```json
+{
+  "criteria": {
+    "response_match_score": 0.8
+  }
+}
+```
+
+### Output And How To Interpret
+
+The value range for this criterion is [0, 1], with values closer to 1 being
+more desirable.
+
+## final_response_match_v2
+
+This criterion evaluates whether the agent's final response matches a
+golden/expected final response using an LLM as a judge.
+
+### When To Use This Criterion?
+
+Use this criterion when you need to evaluate the correctness of an agent's final
+response against a reference, but require flexibility in how the answer is
+presented. It is suitable for cases where different phrasings or formats are
+acceptable, as long as the core meaning and information match the reference.
+This criterion is a good choice for evaluating question-answering,
+summarization, or other generative tasks where semantic equivalence is more
+important than exact lexical overlap, making it a more sophisticated alternative
+to `response_match_score`.
+
+### Details
+
+This criterion uses a Large Language Model (LLM) as a judge to determine if the
+agent's final response is semantically equivalent to the provided reference
+response. It is designed to be more flexible than lexical matching metrics (like
+`response_match_score`), as it focuses on whether the agent's response contains
+the correct information, while tolerating differences in formatting, phrasing,
+or the inclusion of additional correct details.
+
+For each invocation, the criterion prompts a judge LLM to rate the agent's
+response as "valid" or "invalid" compared to the reference. This is repeated
+multiple times for robustness (configurable via `num_samples`), and a majority
+vote determines whether the invocation receives a score of 1.0 (valid) or 0.0
+(invalid). The final criterion score is the fraction of invocations deemed valid
+across the entire eval case.
+
+### How To Use This Criterion?
+
+This criterion uses `LlmAsAJudgeCriterion`, allowing you to configure the
+evaluation threshold, the judge model, and the number of samples per invocation.
+
+Example `EvalConfig` entry:
+
+```json
+{
+  "criteria": {
+    "final_response_match_v2": {
+      "threshold": 0.8,
+      "judge_model_options": {
+        "judge_model": "gemini-2.5-flash",
+        "num_samples": 5
+      }
+    }
+  }
+}
+```
+
+### Output And How To Interpret
+
+The criterion returns a score between 0.0 and 1.0.
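+
+The sketch below illustrates only the aggregation step described above:
+repeated judge samples per invocation, a majority vote, and the fraction of
+valid invocations. The judge call is replaced by a trivial stand-in so the
+example runs; ADK's actual judge prompt and implementation are not shown.
+
+```python
+# Sketch of the aggregation step only. In ADK the per-sample verdict comes from
+# a judge LLM; a trivial stand-in is used here so the example runs.
+def judge_sample(response: str, reference: str) -> bool:
+    # Stand-in for one judge-model call labeling the response valid/invalid.
+    return response.strip().lower() == reference.strip().lower()
+
+def invocation_is_valid(response: str, reference: str, num_samples: int = 5) -> bool:
+    """Majority vote over repeated judge samples for one invocation."""
+    votes = [judge_sample(response, reference) for _ in range(num_samples)]
+    return sum(votes) > num_samples / 2
+
+def final_response_match_v2_score(pairs: list) -> float:
+    """Fraction of invocations judged valid across the eval case."""
+    return sum(invocation_is_valid(r, ref) for r, ref in pairs) / len(pairs)
+
+print(final_response_match_v2_score([("Paris", "paris"), ("42 km", "45 km")]))  # 0.5
+```
+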
A score of 1.0 means the LLM +judge considered the agent's final response to be valid for all invocations, +while a score closer to 0.0 indicates that many responses were judged as invalid +when compared to the reference responses. Higher values are better. + +## rubric_based_final_response_quality_v1 + +This criterion assesses the quality of an agent's final response against a +user-defined set of rubrics using LLM as a judge. + +### When To Use This Criterion? + +Use this criterion when you need to evaluate aspects of response quality that go +beyond simple correctness or semantic equivalence with a reference. It is ideal +for assessing nuanced attributes like tone, style, helpfulness, or adherence to +specific conversational guidelines defined in your rubrics. This criterion is +particularly useful when no single reference response exists, or when quality +depends on multiple subjective factors. + +### Details + +This criterion provides a flexible way to evaluate response quality based on +specific criteria that you define as rubrics. For example, you could define +rubrics to check if a response is concise, if it correctly infers user intent, +or if it avoids jargon. + +The criterion uses an LLM-as-a-judge to evaluate the agent's final response +against each rubric, producing a `yes` (1.0) or `no` (0.0) verdict for each. +Like other LLM-based metrics, it samples the judge model multiple times per +invocation and uses a majority vote to determine the score for each rubric in +that invocation. The overall score for an invocation is the average of its +rubric scores. The final criterion score for the eval case is the average of +these overall scores across all invocations. + +### How To Use This Criterion? + +This criterion uses `RubricsBasedCriterion`, which requires a list of rubrics to +be provided in the `EvalConfig`. Each rubric should be defined with a unique ID +and its content. + +Example `EvalConfig` entry: + +```json +{ + "criteria": { + "rubric_based_final_response_quality_v1": { + "threshold": 0.8, + "judge_model_options": { + "judge_model": "gemini-2.5-flash", + "num_samples": 5 + }, + "rubrics": [ + { + "rubric_id": "conciseness", + "rubric_content": { + "text_property": "The agent's response is direct and to the point." + } + }, + { + "rubric_id": "intent_inference", + "rubric_content": { + "text_property": "The agent's response accurately infers the user's underlying goal from ambiguous queries." + } + } + ] + } + } +} +``` + +### Output And How To Interpret + +The criterion outputs an overall score between 0.0 and 1.0, where 1.0 indicates +that the agent's responses satisfied all rubrics across all invocations, and 0.0 +indicates that no rubrics were satisfied. The results also include detailed +per-rubric scores for each invocation. Higher values are better. + +## rubric_based_tool_use_quality_v1 + +This criterion assesses the quality of an agent's tool usage against a +user-defined set of rubrics using LLM as a judge. + +### When To Use This Criterion? + +Use this criterion when you need to evaluate *how* an agent uses tools, rather +than just *if* the final response is correct. It is ideal for assessing whether +the agent selected the right tool, used the correct parameters, or followed a +specific sequence of tool calls. 
This is useful for validating agent reasoning +processes, debugging tool-use errors, and ensuring adherence to prescribed +workflows, especially in cases where multiple tool-use paths could lead to a +similar final answer but only one path is considered correct. + +### Details + +This criterion provides a flexible way to evaluate tool usage based on specific +rules that you define as rubrics. For example, you could define rubrics to check +if a specific tool was called, if its parameters were correct, or if tools were +called in a particular order. + +The criterion uses an LLM-as-a-judge to evaluate the agent's tool calls and +responses against each rubric, producing a `yes` (1.0) or `no` (0.0) verdict for +each. Like other LLM-based metrics, it samples the judge model multiple times +per invocation and uses a majority vote to determine the score for each rubric +in that invocation. The overall score for an invocation is the average of its +rubric scores. The final criterion score for the eval case is the average of +these overall scores across all invocations. + +### How To Use This Criterion? + +This criterion uses `RubricsBasedCriterion`, which requires a list of rubrics to +be provided in the `EvalConfig`. Each rubric should be defined with a unique ID +and its content, describing a specific aspect of tool use to evaluate. + +Example `EvalConfig` entry: + +```json +{ + "criteria": { + "rubric_based_tool_use_quality_v1": { + "threshold": 1.0, + "judge_model_options": { + "judge_model": "gemini-2.5-flash", + "num_samples": 5 + }, + "rubrics": [ + { + "rubric_id": "geocoding_called", + "rubric_content": { + "text_property": "The agent calls the GeoCoding tool before calling the GetWeather tool." + } + }, + { + "rubric_id": "getweather_called", + "rubric_content": { + "text_property": "The agent calls the GetWeather tool with coordinates derived from the user's location." + } + } + ] + } + } +} +``` + +### Output And How To Interpret + +The criterion outputs an overall score between 0.0 and 1.0, where 1.0 indicates +that the agent's tool usage satisfied all rubrics across all invocations, and +0.0 indicates that no rubrics were satisfied. The results also include detailed +per-rubric scores for each invocation. Higher values are better. + +## hallucinations_v1 + +This criterion assesses whether a model response contains any false, +contradictory, or unsupported claims. + +### When To Use This Criterion? + +Use this criterion to ensure the agent's response is grounded in the provided +context (e.g., tool outputs, user query, instructions) and does not contain +hallucinations. + +### Details + +This criterion assesses whether a model response contains any false, +contradictory, or unsupported claims based on context that includes developer +instructions, user prompt, tool definitions, and tool invocations and their +results. It uses LLM-as-a-judge and follows a two-step process: + +1. **Segmenter**: Segments the agent response into individual sentences. +2. **Sentence Validator**: Evaluates each segmented sentence against the + provided context for grounding. Each sentence is labeled as `supported`, + `unsupported`, `contradictory`, `disputed` or `not_applicable`. + +The metric computes an Accuracy Score: the percentage of sentences that are +`supported` or `not_applicable`. By default, only the final response is +evaluated. If `evaluate_intermediate_nl_responses` is set to true in the +criterion, intermediate natural language responses from agents are also +evaluated. 
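+
+As a rough illustration of how the Accuracy Score is derived from the
+per-sentence labels, consider the sketch below. The sentence segmentation and
+LLM-based validation are not shown; the labels are hand-written and
+hypothetical, and this is not ADK's implementation:
+
+```python
+# Sketch of the accuracy-score computation from per-sentence labels. The
+# segmentation and per-sentence judging (done by an LLM judge in ADK) are not
+# shown; the labels below are hand-written for illustration.
+GROUNDED_LABELS = {"supported", "not_applicable"}
+
+def accuracy_score(sentence_labels: list) -> float:
+    """Fraction of sentences labeled supported or not_applicable."""
+    grounded = [label in GROUNDED_LABELS for label in sentence_labels]
+    return sum(grounded) / len(grounded)
+
+labels = ["supported", "supported", "unsupported", "not_applicable"]
+print(accuracy_score(labels))  # 0.75, which would fail a threshold of 0.8
+```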
+
+### How To Use This Criterion?
+
+This criterion uses `HallucinationsCriterion`, allowing you to configure the
+evaluation threshold, the judge model, the number of samples per invocation,
+and whether to evaluate intermediate natural language responses.
+
+Example `EvalConfig` entry:
+
+```json
+{
+  "criteria": {
+    "hallucinations_v1": {
+      "threshold": 0.8,
+      "judge_model_options": {
+        "judge_model": "gemini-2.5-flash"
+      },
+      "evaluate_intermediate_nl_responses": true
+    }
+  }
+}
+```
+
+### Output And How To Interpret
+
+The criterion returns a score between 0.0 and 1.0. A score of 1.0 means all
+sentences in the agent's response are grounded in the context, while a score
+closer to 0.0 indicates that many sentences are false, contradictory, or
+unsupported. Higher values are better.
+
+## safety_v1
+
+This criterion evaluates the safety (harmlessness) of an agent's response.
+
+### When To Use This Criterion?
+
+This criterion should be used when you need to ensure that agent responses
+comply with safety guidelines and do not produce harmful or inappropriate
+content. It is essential for user-facing applications or any system where
+response safety is a priority.
+
+### Details
+
+This criterion assesses whether the agent's response contains any harmful
+content, such as hate speech, harassment, or dangerous information. Unlike other
+metrics implemented natively within ADK, `safety_v1` delegates the evaluation to
+the Vertex AI Gen AI Eval SDK.
+
+### How To Use This Criterion?
+
+Using this criterion requires a Google Cloud project. You must have the
+`GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` environment variables set,
+typically in a `.env` file in your agent's directory, for the Vertex AI SDK to
+function correctly.
+
+You can specify a threshold for this criterion in `EvalConfig` under the
+`criteria` dictionary. The value should be a float between 0.0 and 1.0,
+representing the minimum safety score for a response to be considered passing.
+
+Example `EvalConfig` entry:
+
+```json
+{
+  "criteria": {
+    "safety_v1": 0.8
+  }
+}
+```
+
+### Output And How To Interpret
+
+The criterion returns a score between 0.0 and 1.0. Scores closer to 1.0 indicate
+that the response is safe, while scores closer to 0.0 indicate potential safety
+issues.
\ No newline at end of file
diff --git a/docs/evaluate/index.md b/docs/evaluate/index.md
index 5fa6f1424..51eed5ef0 100644
--- a/docs/evaluate/index.md
+++ b/docs/evaluate/index.md
@@ -311,10 +311,24 @@ Based on who is maintaining the eval set data, there are two routes:
 
 ### Evaluation Criteria
 
-The evaluation criteria define how the agent's performance is measured against the evalset. The following metrics are supported:
-
-* `tool_trajectory_avg_score`: This metric compares the agent's actual tool usage during the evaluation against the expected tool usage defined in the `expected_tool_use` field. Each matching tool usage step receives a score of 1, while a mismatch receives a score of 0\. The final score is the average of these matches, representing the accuracy of the tool usage trajectory.
-* `response_match_score`: This metric compares the agent's final natural language response to the expected final response, stored in the `reference` field. We use the [ROUGE](https://en.wikipedia.org/wiki/ROUGE_\(metric\)) metric to calculate the similarity between the two responses.
+ADK provides several built-in criteria for evaluating agent performance, ranging
+from tool trajectory matching to LLM-based response quality assessment.
For a +detailed list of available criteria and guidance on when to use them, please see +[Evaluation Criteria](./criteria.md). + +Here is a summary of all the available criteria: + +* **tool_trajectory_avg_score**: Exact match of tool call trajectory. +* **response_match_score**: ROUGE-1 similarity to reference response. +* **final_response_match_v2**: LLM-judged semantic match to a reference + response. +* **rubric_based_final_response_quality_v1**: LLM-judged final response + quality based on custom rubrics. +* **rubric_based_tool_use_quality_v1**: LLM-judged tool usage quality based on + custom rubrics. +* **hallucinations_v1**: LLM-judged groundedness of agent response against + context. +* **safety_v1**: Safety/harmlessness of agent response. If no evaluation criteria are provided, the following default configuration is used: @@ -332,6 +346,33 @@ Here is an example of a `test_config.json` file specifying custom evaluation cri } ``` +#### Recommendations on Criteria + +Choose criteria based on your evaluation goals: + +* **Enable tests in CI/CD pipelines or regression testing:** Use + `tool_trajectory_avg_score` and `response_match_score`. These criteria are + fast, predictable, and suitable for frequent automated checks. +* **Evaluate trusted reference responses:** Use `final_response_match_v2` to + evaluate semantic equivalence. This LLM-based check is more flexible than + exact matching and better captures whether the agent's response means the + same thing as the reference response. +* **Evaluate response quality without a reference response:** Use + `rubric_based_final_response_quality_v1`. This is useful when you don't have + a trusted reference, but you can define attributes of a good response (e.g., + "The response is concise," "The response has a helpful tone"). +* **Evaluate the correctness of tool usage:** Use + `rubric_based_tool_use_quality_v1`. This allows you to validate the agent's + reasoning process by checking, for example, that a specific tool was called + or that tools were called in the correct order (e.g., "Tool A must be called + before Tool B"). +* **Check if responses are grounded in context:** Use `hallucinations_v1` to + detect if the agent makes claims that are unsupported by or contradictory to + the information available to it (e.g., tool outputs). +* **Check for harmful content:** Use `safety_v1` to ensure that agent + responses are safe and do not violate safety policies. + + ## How to run Evaluation with the ADK As a developer, you can evaluate your agents using the ADK in the following ways: diff --git a/mkdocs.yml b/mkdocs.yml index d0a0ac977..c828338ac 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -192,6 +192,7 @@ nav: - W&B Weave: observability/weave.md - Evaluate: - evaluate/index.md + - Criteria: evaluate/criteria.md - MCP: - mcp/index.md - Plugins: