Describe the issue
Currently, multi-turn evals function as follows:
1. A sequence of user turns is predefined for each sample.
2. Each user turn has an expected set of functions to have been called.
3. Multi-turn generation is run as follows:
   a. Feed user turn 1 to the model
   b. Parse the response from the model
   c. Execute the tool call from the parsed response
   d. Update the state
   e. Return to step a for the next turn
   f. Stop when no more turns remain
4. After running generation, eval does the following (the full current flow is sketched in code below):
   1. Check whether the state is valid; if not, give a score of 0 to this entire sample and move to the next sample.
   2. Compare the ground-truth state to the model's resulting state. If any function calls are missing from the model's response, give a score of 0 to this entire sample and move to the next sample.
   3. If the state is valid and matches the ground truth, return to step 1 for the next turn.
   4. If all turns had valid states with no missing function calls, give a score of 1 to this entire sample.
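For concreteness, here is a minimal sketch of the current behavior, combining the generation loop (steps a-f) with the all-or-nothing scoring (eval steps 1-4). All helper and attribute names (`model.generate`, `parse_tool_calls`, `execute_tool_calls`, `state_is_valid`, `missing_function_calls`, the `sample` fields) are hypothetical placeholders for whatever the eval actually uses, not its real API:

```python
def score_sample(model, sample):
    """Current behavior: run every turn, then score the whole sample 0 or 1."""
    state = sample.initial_state()
    history, turn_states = [], []

    # Generation loop (steps a-f)
    for user_turn in sample.user_turns:
        history.append({"role": "user", "content": user_turn})   # a. feed user turn
        response = model.generate(history)                        #    model produces a response
        tool_calls = parse_tool_calls(response)                   # b. parse response
        state = execute_tool_calls(state, tool_calls)             # c./d. execute tool calls, update state
        history.append({"role": "assistant", "content": response})
        turn_states.append(state)                                 # keep per-turn state for eval

    # Eval (steps 1-4): one bad turn zeroes out the entire sample.
    for model_state, expected_state in zip(turn_states, sample.ground_truth_states):
        if not state_is_valid(model_state):
            return 0     # invalid state at any turn -> score 0 for the whole sample
        if missing_function_calls(expected_state, model_state):
            return 0     # any missing call at any turn -> score 0 for the whole sample
    return 1             # every turn valid and complete -> score 1
```

Note that a single `return 0` anywhere wipes out credit for every other turn, which is exactly what the next section objects to.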
What is the issue
There are a few problems with this:
1. Previously correct turns are heavily punished.
   Example: For 4 turns, if the model gets the first 3 turns correct but the last turn wrong, the entire sample is given a score of 0, equivalent to a model getting every turn wrong.
2. Future correct turns are heavily punished.
   Example: If the model gets the first turn wrong but does the right thing in subsequent turns, the state is still considered "invalid" and the sample is given a score of 0, again equivalent to a model getting every turn wrong.
3. The benchmark becomes incredibly sensitive to noise.
   Example: If a particular user turn is ambiguous or has multiple interpretations not covered in the ground truth, the entire sample is counted as wrong even if the model would have succeeded on subsequent turns, or if its future turns are correct with respect to the state it put itself in.
4. The benchmark does not reflect real-world use.
   Example: When a user asks a model to do something and the model does something different or unexpected, the subsequent ground-truth user turn is not the one a real-world user would send.
Proposed Changes
I propose we modify multi-turn evals to instead be single-turn, starting from an established ground-truth state.
Tangibly, this means each multi-turn sample is split into N samples, where N is the number of turns.
For each of these, we define a ground-truth state plus the previous assistant turns. The model is expected to provide the next turn in this state, and nothing else.
For metrics, we can continue with accuracy, but we can now also report "Turn-N" accuracy, which answers "How accurate is my model after N turns?" A rough sketch of this split is given below.
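To make the proposal concrete, here is a minimal sketch of splitting one multi-turn sample into per-turn samples seeded with the ground-truth state and conversation prefix, plus a Turn-N accuracy aggregation. The class, function, and field names (`SingleTurnSample`, `split_multi_turn`, `turn_n_accuracy`, `ground_truth_states`, `ground_truth_history`, `expected_calls`) are all hypothetical, chosen only for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class SingleTurnSample:
    turn_index: int        # which turn of the original sample this covers
    start_state: object    # ground-truth state the model starts from
    history: list          # ground-truth user/assistant turns before this one
    user_turn: str         # the single user turn the model must answer
    expected_calls: list   # function calls expected for this turn only

def split_multi_turn(sample):
    """Split one N-turn sample into N independent single-turn samples."""
    singles = []
    for i in range(len(sample.user_turns)):
        # The model starts from the ground-truth state after turn i-1
        # (the initial state when i == 0), not from its own earlier mistakes.
        start_state = sample.initial_state() if i == 0 else sample.ground_truth_states[i - 1]
        singles.append(SingleTurnSample(
            turn_index=i,
            start_state=start_state,
            history=sample.ground_truth_history[: 2 * i],  # assumes alternating user/assistant messages
            user_turn=sample.user_turns[i],
            expected_calls=sample.expected_calls[i],
        ))
    return singles

def turn_n_accuracy(results):
    """results: iterable of (turn_index, score) pairs, score in {0, 1} per single-turn sample."""
    by_turn = defaultdict(list)
    for turn_index, score in results:
        by_turn[turn_index].append(score)
    # Turn-N accuracy: fraction of samples answered correctly at turn N.
    return {n: sum(scores) / len(scores) for n, scores in sorted(by_turn.items())}
```

With this shape, an error at turn k only costs the turn-k sample, and Turn-N accuracy makes it visible where in a conversation models start to degrade.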