
Commit 35f00f9

Add documentation for User Simulation (#863)

* Add documentation for User Simulation
* Update user-sim.md with version number

Co-authored-by: Joe Fernandez <[email protected]>

1 parent 81d68a5 commit 35f00f9

File tree

4 files changed: +182 -12 lines changed


docs/evaluate/criteria.md

Lines changed: 10 additions & 10 deletions
@@ -7,15 +7,15 @@
 This page outlines the evaluation criteria provided by ADK to assess agent
 performance, including tool use trajectory, response quality, and safety.

-Criterion | Description | Reference-Based | Requires Rubrics | LLM-as-a-Judge
-:--------------------------------------- | :-------------------------------------------------------- | :-------------- | :--------------- | :-------------
-`tool_trajectory_avg_score` | Exact match of tool call trajectory | Yes | No | No
-`response_match_score` | ROUGE-1 similarity to reference response | Yes | No | No
-`final_response_match_v2` | LLM-judged semantic match to reference response | Yes | No | Yes
-`rubric_based_final_response_quality_v1` | LLM-judged final response quality based on custom rubrics | No | Yes | Yes
-`rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality based on custom rubrics | No | Yes | Yes
-`hallucinations_v1` | LLM-judged groundedness of agent response against context | No | No | Yes
-`safety_v1` | Safety/harmlessness of agent response | No | No | Yes
+Criterion | Description | Reference-Based | Requires Rubrics | LLM-as-a-Judge | Supports [User Simulation](./user-sim.md)
+:--------------------------------------- | :-------------------------------------------------------- | :-------------- | :--------------- | :------------- | :----------------------------------------
+`tool_trajectory_avg_score` | Exact match of tool call trajectory | Yes | No | No | No
+`response_match_score` | ROUGE-1 similarity to reference response | Yes | No | No | No
+`final_response_match_v2` | LLM-judged semantic match to reference response | Yes | No | Yes | No
+`rubric_based_final_response_quality_v1` | LLM-judged final response quality based on custom rubrics | No | Yes | Yes | No
+`rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality based on custom rubrics | No | Yes | Yes | No
+`hallucinations_v1` | LLM-judged groundedness of agent response against context | No | No | Yes | Yes
+`safety_v1` | Safety/harmlessness of agent response | No | No | Yes | Yes

 ## tool_trajectory_avg_score

@@ -420,4 +420,4 @@ Example `EvalConfig` entry:
 The criterion returns a score between 0.0 and 1.0. Scores closer to 1.0 indicate
 that the response is safe, while scores closer to 0.0 indicate potential safety
-issues.
+issues.

docs/evaluate/index.md

Lines changed: 18 additions & 2 deletions
@@ -152,7 +152,7 @@ and initial session file can be ignored (or removed.)
 The evalset approach utilizes a dedicated dataset called an "evalset" for evaluating agent-model interactions. Similar to a test file, the evalset contains example interactions. However, an evalset can contain multiple, potentially lengthy sessions, making it ideal for simulating complex, multi-turn conversations. Due to its ability to represent complex sessions, the evalset is well-suited for integration tests. These tests are typically run less frequently than unit tests due to their more extensive nature.

-An evalset file contains multiple "evals," each representing a distinct session. Each eval consists of one or more "turns," which include the user query, expected tool use, expected intermediate agent responses, and a reference response. These fields have the same meaning as they do in the test file approach. Each eval is identified by a unique name. Furthermore, each eval includes an associated initial session state.
+An evalset file contains multiple "evals," each representing a distinct session. Each eval consists of one or more "turns," which include the user query, expected tool use, expected intermediate agent responses, and a reference response. These fields have the same meaning as they do in the test file approach. Alternatively, an eval can define a *conversation scenario* which is used to [dynamically simulate](./user-sim.md) a user interaction with the agent. Each eval is identified by a unique name. Furthermore, each eval includes an associated initial session state.

 Creating evalsets manually can be complex, therefore UI tools are provided to help capture relevant sessions and easily convert them into evals within your evalset. Learn more about using the web UI for evaluation below. Here is an example evalset containing two sessions. The eval set files are backed by a formal Pydantic data model. The two key schema files are
 [Eval Set](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) and

@@ -313,7 +313,7 @@ Based on who is maintaining the eval set data, there are two routes:
 ### Evaluation Criteria

-ADK provides several built-in criterion for evaluating agent performance, ranging
+ADK provides several built-in criteria for evaluating agent performance, ranging
 from tool trajectory matching to LLM-based response quality assessment. For a
 detailed list of available criteria and guidance on when to use them, please see
 [Evaluation Criteria](./criteria.md).

@@ -374,6 +374,22 @@ Choose criteria based on your evaluation goals:
 * **Check for harmful content:** Use `safety_v1` to ensure that agent
   responses are safe and do not violate safety policies.

+In addition, criteria which require information on expected agent tool use
+and/or responses are not supported in combination with
+[User Simulation](./user-sim.md).
+Currently, only the `hallucinations_v1` and `safety_v1` criteria support such evals.
+
+### User Simulation
+
+When evaluating conversational agents, it is not always practical to use a fixed
+set of user prompts, as the conversation can proceed in unexpected ways.
+For example, if the agent needs the user to supply two values to perform a task,
+it may ask for those values one at a time or both at once.
+To resolve this issue, ADK allows you to test the behavior of the agent in a
+specific *conversation scenario* with user prompts that are dynamically
+generated by an AI model.
+For details on how to set up an eval with user simulation, see
+[User Simulation](./user-sim.md).

 ## How to run Evaluation with the ADK
docs/evaluate/user-sim.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# User Simulation

<div class="language-support-tag">
<span class="lst-supported">Supported in ADK</span><span class="lst-python">Python v1.18.0</span>
</div>

When evaluating conversational agents, it is not always practical to use a fixed
set of user prompts, as the conversation can proceed in unexpected ways.
For example, if the agent needs the user to supply two values to perform a task,
it may ask for those values one at a time or both at once.
To resolve this issue, ADK can dynamically generate user prompts using a
generative AI model.

To use this feature, you must specify a
[`ConversationScenario`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/conversation_scenarios.py)
which dictates the user's goals in their conversation with the agent.
A sample conversation scenario for the
[`hello_world`](https://github.com/google/adk-python/tree/main/contributing/samples/hello_world)
agent is shown below:

```json
{
  "starting_prompt": "What can you do for me?",
  "conversation_plan": "Ask the agent to roll a 20-sided die. After you get the result, ask the agent to check if it is prime."
}
```

The `starting_prompt` in a conversation scenario specifies a fixed initial
prompt that the user should use to start the conversation with the agent.
Specifying such fixed prompts for subsequent interactions with the agent is not
practical as the agent may respond in different ways.
Instead, the `conversation_plan` provides a guideline for how the rest of the
conversation with the agent should proceed.
An LLM uses this conversation plan, along with the conversation history, to
dynamically generate user prompts until it judges that the conversation is
complete.
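If you prefer to construct scenarios in Python rather than hand-writing JSON, a minimal sketch is shown below. The import path is an assumption inferred from the `conversation_scenarios.py` file linked above, and the field names simply mirror the JSON keys in the example.

```python
# Sketch only: the module path is inferred from the linked source file
# (src/google/adk/evaluation/conversation_scenarios.py) and may differ.
from google.adk.evaluation.conversation_scenarios import ConversationScenario

scenario = ConversationScenario(
    starting_prompt="What can you do for me?",
    conversation_plan=(
        "Ask the agent to roll a 20-sided die. After you get the result, "
        "ask the agent to check if it is prime."
    ),
)

# ADK eval files are backed by Pydantic models, so the scenario should
# serialize back to the same JSON shape shown above.
print(scenario.model_dump_json(indent=2))
```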
## Example: Evaluating the [`hello_world`](https://github.com/google/adk-python/tree/main/contributing/samples/hello_world) agent with conversation scenarios

To add evaluation cases containing conversation scenarios to a new or existing
[`EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py),
you need to first create a list of conversation scenarios to test the agent in.

Try saving the following to
`contributing/samples/hello_world/conversation_scenarios.json`:

```json
{
  "scenarios": [
    {
      "starting_prompt": "What can you do for me?",
      "conversation_plan": "Ask the agent to roll a 20-sided die. After you get the result, ask the agent to check if it is prime."
    },
    {
      "starting_prompt": "Hi, I'm running a tabletop RPG in which prime numbers are bad!",
      "conversation_plan": "Say that you don't care about the value; you just want the agent to tell you if a roll is good or bad. Once the agent agrees, ask it to roll a 6-sided die. Finally, ask the agent to do the same with 2 20-sided dice."
    }
  ]
}
```

You will also need a session input file containing information used during
evaluation.
Try saving the following to
`contributing/samples/hello_world/session_input.json`:

```json
{
  "app_name": "hello_world",
  "user_id": "user"
}
```

Then, you can add the conversation scenarios to an `EvalSet`:

```bash
# (optional) create a new EvalSet
adk eval_set create \
  contributing/samples/hello_world \
  eval_set_with_scenarios

# add conversation scenarios to the EvalSet as new eval cases
adk eval_set add_eval_case \
  contributing/samples/hello_world \
  eval_set_with_scenarios \
  --scenarios_file contributing/samples/hello_world/conversation_scenarios.json \
  --session_input_file contributing/samples/hello_world/session_input.json
```
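Before registering the scenarios, it can help to sanity-check the file. The following is an optional, hedged sketch using only the Python standard library; the path assumes the location suggested above.

```python
import json
from pathlib import Path

# Optional sanity check: confirm every scenario has the two expected fields
# before adding them to the EvalSet with the CLI commands above.
scenarios_path = Path("contributing/samples/hello_world/conversation_scenarios.json")
data = json.loads(scenarios_path.read_text())

for i, scenario in enumerate(data["scenarios"]):
    missing = {"starting_prompt", "conversation_plan"} - scenario.keys()
    if missing:
        raise ValueError(f"Scenario {i} is missing fields: {sorted(missing)}")

print(f"{len(data['scenarios'])} scenarios look well-formed.")
```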
By default, ADK runs evaluations with metrics that require the agent's expected
response to be specified.
Since that is not the case for a dynamic conversation scenario, we will use an
[`EvalConfig`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_config.py)
with some alternate supported metrics.

Try saving the following to
`contributing/samples/hello_world/eval_config.json`:

```json
{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.5,
      "evaluate_intermediate_nl_responses": true
    },
    "safety_v1": {
      "threshold": 0.8
    }
  }
}
```

Finally, you can use the `adk eval` command to run the evaluation:

```bash
adk eval \
  contributing/samples/hello_world \
  --config_file_path contributing/samples/hello_world/eval_config.json \
  eval_set_with_scenarios \
  --print_detailed_results
```

## User simulator configuration

You can override the default user simulator configuration to change the model,
internal model behavior, and the maximum number of user-agent interactions.
The `EvalConfig` below shows the default user simulator configuration:

```json
{
  "criteria": {
    # same as before
  },
  "user_simulator_config": {
    "model": "gemini-2.5-flash",
    "model_configuration": {
      "thinking_config": {
        "include_thoughts": true,
        "thinking_budget": 10240
      }
    },
    "max_allowed_invocations": 20
  }
}
```

* `model`: The model backing the user simulator.
* `model_configuration`: A
  [`GenerateContentConfig`](https://github.com/googleapis/python-genai/blob/6196b1b4251007e33661bb5d7dc27bafee3feefe/google/genai/types.py#L4295)
  which controls the model behavior.
* `max_allowed_invocations`: The maximum number of user-agent interactions allowed
  before the conversation is forcefully terminated. This should be set to be greater
  than the longest reasonable user-agent interaction in your `EvalSet`.
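Since JSON does not allow comments, the `# same as before` placeholder above is illustrative rather than directly loadable. As a minimal, hedged sketch (standard library only; the path assumes the `hello_world` sample layout used in this example), you could generate a complete config file that combines the criteria from earlier with an overridden simulator setting:

```python
import json
from pathlib import Path

# Complete EvalConfig: the criteria from the earlier example plus an explicit
# user simulator configuration. Values mirror the defaults shown above, except
# max_allowed_invocations, which is raised here as an illustrative override.
eval_config = {
    "criteria": {
        "hallucinations_v1": {
            "threshold": 0.5,
            "evaluate_intermediate_nl_responses": True,
        },
        "safety_v1": {"threshold": 0.8},
    },
    "user_simulator_config": {
        "model": "gemini-2.5-flash",
        "model_configuration": {
            "thinking_config": {
                "include_thoughts": True,
                "thinking_budget": 10240,
            }
        },
        # Should exceed the longest reasonable interaction in your EvalSet.
        "max_allowed_invocations": 30,
    },
}

config_path = Path("contributing/samples/hello_world/eval_config.json")
config_path.write_text(json.dumps(eval_config, indent=2))
print(f"Wrote {config_path}")
```

The resulting file can then be passed to `adk eval` via `--config_file_path`, exactly as in the example above.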

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -206,6 +206,7 @@ nav:
   - Evaluation:
     - evaluate/index.md
     - Criteria: evaluate/criteria.md
+    - User Simulation: evaluate/user-sim.md
   - Safety and Security:
     - safety/index.md
   - Components:
