
Commit 35f00f9

Add documentation for User Simulation (#863)

* Add documentation for User Simulation
* Update user-sim.md with version number

Co-authored-by: Joe Fernandez <[email protected]>

1 parent 81d68a5 commit 35f00f9

File tree

4 files changed: +182 -12 lines changed


docs/evaluate/criteria.md

Lines changed: 10 additions & 10 deletions
@@ -7,15 +7,15 @@
 This page outlines the evaluation criteria provided by ADK to assess agent
 performance, including tool use trajectory, response quality, and safety.

-Criterion | Description | Reference-Based | Requires Rubrics | LLM-as-a-Judge
-:--------------------------------------- | :-------------------------------------------------------- | :-------------- | :--------------- | :-------------
-`tool_trajectory_avg_score` | Exact match of tool call trajectory | Yes | No | No
-`response_match_score` | ROUGE-1 similarity to reference response | Yes | No | No
-`final_response_match_v2` | LLM-judged semantic match to reference response | Yes | No | Yes
-`rubric_based_final_response_quality_v1` | LLM-judged final response quality based on custom rubrics | No | Yes | Yes
-`rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality based on custom rubrics | No | Yes | Yes
-`hallucinations_v1` | LLM-judged groundedness of agent response against context | No | No | Yes
-`safety_v1` | Safety/harmlessness of agent response | No | No | Yes
+Criterion | Description | Reference-Based | Requires Rubrics | LLM-as-a-Judge | Supports [User Simulation](./user-sim.md)
+:--------------------------------------- | :-------------------------------------------------------- | :-------------- | :--------------- | :------------- | :----------------------------------------
+`tool_trajectory_avg_score` | Exact match of tool call trajectory | Yes | No | No | No
+`response_match_score` | ROUGE-1 similarity to reference response | Yes | No | No | No
+`final_response_match_v2` | LLM-judged semantic match to reference response | Yes | No | Yes | No
+`rubric_based_final_response_quality_v1` | LLM-judged final response quality based on custom rubrics | No | Yes | Yes | No
+`rubric_based_tool_use_quality_v1` | LLM-judged tool usage quality based on custom rubrics | No | Yes | Yes | No
+`hallucinations_v1` | LLM-judged groundedness of agent response against context | No | No | Yes | Yes
+`safety_v1` | Safety/harmlessness of agent response | No | No | Yes | Yes

 ## tool_trajectory_avg_score

@@ -420,4 +420,4 @@ Example `EvalConfig` entry:
 The criterion returns a score between 0.0 and 1.0. Scores closer to 1.0 indicate
 that the response is safe, while scores closer to 0.0 indicate potential safety
-issues.
+issues.

docs/evaluate/index.md

Lines changed: 18 additions & 2 deletions
@@ -152,7 +152,7 @@ and initial session file can be ignored (or removed.)
 The evalset approach utilizes a dedicated dataset called an "evalset" for evaluating agent-model interactions. Similar to a test file, the evalset contains example interactions. However, an evalset can contain multiple, potentially lengthy sessions, making it ideal for simulating complex, multi-turn conversations. Due to its ability to represent complex sessions, the evalset is well-suited for integration tests. These tests are typically run less frequently than unit tests due to their more extensive nature.

-An evalset file contains multiple "evals," each representing a distinct session. Each eval consists of one or more "turns," which include the user query, expected tool use, expected intermediate agent responses, and a reference response. These fields have the same meaning as they do in the test file approach. Each eval is identified by a unique name. Furthermore, each eval includes an associated initial session state.
+An evalset file contains multiple "evals," each representing a distinct session. Each eval consists of one or more "turns," which include the user query, expected tool use, expected intermediate agent responses, and a reference response. These fields have the same meaning as they do in the test file approach. Alternatively, an eval can define a *conversation scenario* which is used to [dynamically simulate](./user-sim.md) a user interaction with the agent. Each eval is identified by a unique name. Furthermore, each eval includes an associated initial session state.

 Creating evalsets manually can be complex, therefore UI tools are provided to help capture relevant sessions and easily convert them into evals within your evalset. Learn more about using the web UI for evaluation below. Here is an example evalset containing two sessions. The eval set files are backed by a formal Pydantic data model. The two key schema files are
 [Eval Set](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py) and

@@ -313,7 +313,7 @@ Based on who is maintaining the eval set data, there are two routes:
 ### Evaluation Criteria

-ADK provides several built-in criterion for evaluating agent performance, ranging
+ADK provides several built-in criteria for evaluating agent performance, ranging
 from tool trajectory matching to LLM-based response quality assessment. For a
 detailed list of available criteria and guidance on when to use them, please see
 [Evaluation Criteria](./criteria.md).

@@ -374,6 +374,22 @@ Choose criteria based on your evaluation goals:
 * **Check for harmful content:** Use `safety_v1` to ensure that agent
   responses are safe and do not violate safety policies.

+In addition, criteria which require information on expected agent tool use
+and/or responses are not supported in combination with
+[User Simulation](./user-sim.md).
+Currently, only the `hallucinations_v1` and `safety_v1` criteria support such evals.
+
+### User Simulation
+
+When evaluating conversational agents, it is not always practical to use a fixed
+set of user prompts, as the conversation can proceed in unexpected ways.
+For example, if the agent needs the user to supply two values to perform a task,
+it may ask for those values one at a time or both at once.
+To resolve this issue, ADK allows you to test the behavior of the agent in a
+specific *conversation scenario* with user prompts that are dynamically
+generated by an AI model.
+For details on how to set up an eval with user simulation, see
+[User Simulation](./user-sim.md).

 ## How to run Evaluation with the ADK
docs/evaluate/user-sim.md

Lines changed: 153 additions & 0 deletions
@@ -0,0 +1,153 @@
# User Simulation

<div class="language-support-tag">
<span class="lst-supported">Supported in ADK</span><span class="lst-python">Python v1.18.0</span>
</div>

When evaluating conversational agents, it is not always practical to use a fixed
set of user prompts, as the conversation can proceed in unexpected ways.
For example, if the agent needs the user to supply two values to perform a task,
it may ask for those values one at a time or both at once.
To resolve this issue, ADK can dynamically generate user prompts using a
generative AI model.

To use this feature, you must specify a
[`ConversationScenario`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/conversation_scenarios.py)
which dictates the user's goals in their conversation with the agent.
A sample conversation scenario for the
[`hello_world`](https://github.com/google/adk-python/tree/main/contributing/samples/hello_world)
agent is shown below:

```json
{
  "starting_prompt": "What can you do for me?",
  "conversation_plan": "Ask the agent to roll a 20-sided die. After you get the result, ask the agent to check if it is prime."
}
```

The `starting_prompt` in a conversation scenario specifies a fixed initial
prompt that the user should use to start the conversation with the agent.
Specifying such fixed prompts for subsequent interactions with the agent is not
practical as the agent may respond in different ways.
Instead, the `conversation_plan` provides a guideline for how the rest of the
conversation with the agent should proceed.
An LLM uses this conversation plan, along with the conversation history, to
dynamically generate user prompts until it judges that the conversation is
complete.
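If you prefer to construct scenarios in Python rather than hand-writing JSON, a minimal sketch is shown below. The import path is an assumption inferred from the `conversation_scenarios.py` file linked above, and the field names simply mirror the JSON keys in the example.

```python
# Sketch only: the module path is inferred from the linked source file
# (src/google/adk/evaluation/conversation_scenarios.py) and may differ.
from google.adk.evaluation.conversation_scenarios import ConversationScenario

scenario = ConversationScenario(
    starting_prompt="What can you do for me?",
    conversation_plan=(
        "Ask the agent to roll a 20-sided die. After you get the result, "
        "ask the agent to check if it is prime."
    ),
)

# ADK eval files are backed by Pydantic models, so the scenario should
# serialize back to the same JSON shape shown above.
print(scenario.model_dump_json(indent=2))
```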
## Example: Evaluating the [`hello_world`](https://github.com/google/adk-python/tree/main/contributing/samples/hello_world) agent with conversation scenarios

To add evaluation cases containing conversation scenarios to a new or existing
[`EvalSet`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_set.py),
you need to first create a list of conversation scenarios to test the agent in.

Try saving the following to
`contributing/samples/hello_world/conversation_scenarios.json`:

```json
{
  "scenarios": [
    {
      "starting_prompt": "What can you do for me?",
      "conversation_plan": "Ask the agent to roll a 20-sided die. After you get the result, ask the agent to check if it is prime."
    },
    {
      "starting_prompt": "Hi, I'm running a tabletop RPG in which prime numbers are bad!",
      "conversation_plan": "Say that you don't care about the value; you just want the agent to tell you if a roll is good or bad. Once the agent agrees, ask it to roll a 6-sided die. Finally, ask the agent to do the same with 2 20-sided dice."
    }
  ]
}
```

You will also need a session input file containing information used during
evaluation.
Try saving the following to
`contributing/samples/hello_world/session_input.json`:

```json
{
  "app_name": "hello_world",
  "user_id": "user"
}
```

Then, you can add the conversation scenarios to an `EvalSet`:

```bash
# (optional) create a new EvalSet
adk eval_set create \
  contributing/samples/hello_world \
  eval_set_with_scenarios

# add conversation scenarios to the EvalSet as new eval cases
adk eval_set add_eval_case \
  contributing/samples/hello_world \
  eval_set_with_scenarios \
  --scenarios_file contributing/samples/hello_world/conversation_scenarios.json \
  --session_input_file contributing/samples/hello_world/session_input.json
```
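Before registering the scenarios, it can help to sanity-check the file. The following is an optional, hedged sketch using only the Python standard library; the path assumes the location suggested above.

```python
import json
from pathlib import Path

# Optional sanity check: confirm every scenario has the two expected fields
# before adding them to the EvalSet with the CLI commands above.
scenarios_path = Path("contributing/samples/hello_world/conversation_scenarios.json")
data = json.loads(scenarios_path.read_text())

for i, scenario in enumerate(data["scenarios"]):
    missing = {"starting_prompt", "conversation_plan"} - scenario.keys()
    if missing:
        raise ValueError(f"Scenario {i} is missing fields: {sorted(missing)}")

print(f"{len(data['scenarios'])} scenarios look well-formed.")
```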
By default, ADK runs evaluations with metrics that require the agent's expected
response to be specified.
Since that is not the case for a dynamic conversation scenario, we will use an
[`EvalConfig`](https://github.com/google/adk-python/blob/main/src/google/adk/evaluation/eval_config.py)
with some alternate supported metrics.

Try saving the following to
`contributing/samples/hello_world/eval_config.json`:

```json
{
  "criteria": {
    "hallucinations_v1": {
      "threshold": 0.5,
      "evaluate_intermediate_nl_responses": true
    },
    "safety_v1": {
      "threshold": 0.8
    }
  }
}
```

Finally, you can use the `adk eval` command to run the evaluation:

```bash
adk eval \
  contributing/samples/hello_world \
  --config_file_path contributing/samples/hello_world/eval_config.json \
  eval_set_with_scenarios \
  --print_detailed_results
```

## User simulator configuration

You can override the default user simulator configuration to change the model,
internal model behavior, and the maximum number of user-agent interactions.
The `EvalConfig` below shows the default user simulator configuration:

```json
{
  "criteria": {
    # same as before
  },
  "user_simulator_config": {
    "model": "gemini-2.5-flash",
    "model_configuration": {
      "thinking_config": {
        "include_thoughts": true,
        "thinking_budget": 10240
      }
    },
    "max_allowed_invocations": 20
  }
}
```

* `model`: The model backing the user simulator.
* `model_configuration`: A
  [`GenerateContentConfig`](https://github.com/googleapis/python-genai/blob/6196b1b4251007e33661bb5d7dc27bafee3feefe/google/genai/types.py#L4295)
  which controls the model behavior.
* `max_allowed_invocations`: The maximum number of user-agent interactions allowed
  before the conversation is forcefully terminated. This should be set to be greater
  than the longest reasonable user-agent interaction in your `EvalSet`.
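Since JSON does not allow comments, the `# same as before` placeholder above is illustrative rather than directly loadable. As a minimal, hedged sketch (standard library only; the path assumes the `hello_world` sample layout used in this example), you could generate a complete config file that combines the criteria from earlier with an overridden simulator setting:

```python
import json
from pathlib import Path

# Complete EvalConfig: the criteria from the earlier example plus an explicit
# user simulator configuration. Values mirror the defaults shown above, except
# max_allowed_invocations, which is raised here as an illustrative override.
eval_config = {
    "criteria": {
        "hallucinations_v1": {
            "threshold": 0.5,
            "evaluate_intermediate_nl_responses": True,
        },
        "safety_v1": {"threshold": 0.8},
    },
    "user_simulator_config": {
        "model": "gemini-2.5-flash",
        "model_configuration": {
            "thinking_config": {
                "include_thoughts": True,
                "thinking_budget": 10240,
            }
        },
        # Should exceed the longest reasonable interaction in your EvalSet.
        "max_allowed_invocations": 30,
    },
}

config_path = Path("contributing/samples/hello_world/eval_config.json")
config_path.write_text(json.dumps(eval_config, indent=2))
print(f"Wrote {config_path}")
```

The resulting file can then be passed to `adk eval` via `--config_file_path`, exactly as in the example above.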

mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -206,6 +206,7 @@ nav:
   - Evaluation:
     - evaluate/index.md
     - Criteria: evaluate/criteria.md
+    - User Simulation: evaluate/user-sim.md
   - Safety and Security:
     - safety/index.md
   - Components:
