Describe the issue
Currently, multi-turn evals function as follows:
1. A sequence of user turns is predefined for each sample.
2. Each user turn has an expected set of functions to have been called.
3. Multi-turn generation is run as follows:
   a. Feed user turn 1 to the model
   b. Parse the response from the model
   c. Execute the tool call from the parsed response
   d. Update the state
   e. Return to step a for the next turn
   f. Stop when no more turns remain
4. After running generation, eval does the following (the full current flow is sketched in code below):
   1. Check whether the state is valid; if not, give a score of 0 to this entire sample and move to the next sample.
   2. Compare the ground-truth state to the model's resulting state. If any function calls are missing from the model's response, give a score of 0 to this entire sample and move to the next sample.
   3. If the state is valid and matches the ground truth, return to step 1 for the next turn.
   4. If all turns had valid states with no missing function calls, give a score of 1 to this entire sample.
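For concreteness, here is a minimal sketch of the current behavior, combining the generation loop (steps a-f) with the all-or-nothing scoring (eval steps 1-4). All helper and attribute names (`model.generate`, `parse_tool_calls`, `execute_tool_calls`, `state_is_valid`, `missing_function_calls`, the `sample` fields) are hypothetical placeholders for whatever the eval actually uses, not its real API:

```python
def score_sample(model, sample):
    """Current behavior: run every turn, then score the whole sample 0 or 1."""
    state = sample.initial_state()
    history, turn_states = [], []

    # Generation loop (steps a-f)
    for user_turn in sample.user_turns:
        history.append({"role": "user", "content": user_turn})   # a. feed user turn
        response = model.generate(history)                        #    model produces a response
        tool_calls = parse_tool_calls(response)                   # b. parse response
        state = execute_tool_calls(state, tool_calls)             # c./d. execute tool calls, update state
        history.append({"role": "assistant", "content": response})
        turn_states.append(state)                                 # keep per-turn state for eval

    # Eval (steps 1-4): one bad turn zeroes out the entire sample.
    for model_state, expected_state in zip(turn_states, sample.ground_truth_states):
        if not state_is_valid(model_state):
            return 0     # invalid state at any turn -> score 0 for the whole sample
        if missing_function_calls(expected_state, model_state):
            return 0     # any missing call at any turn -> score 0 for the whole sample
    return 1             # every turn valid and complete -> score 1
```

Note that a single `return 0` anywhere wipes out credit for every other turn, which is exactly what the next section objects to.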
What is the issue
There are a few problems with this:
1. Previously correct turns are heavily punished.
   Example: For 4 turns, if the model gets the first 3 turns correct but the last turn wrong, the entire sample is given a score of 0, equivalent to a model getting every turn wrong.
2. Future correct turns are heavily punished.
   Example: If the model gets the first turn wrong but does the right thing in subsequent turns, the state is still considered "invalid" and the sample is given a score of 0, again equivalent to a model getting every turn wrong.
3. The benchmark becomes incredibly sensitive to noise.
   Example: If a particular user turn is ambiguous or has multiple interpretations not covered in the ground truth, the entire sample is counted as wrong even if the model would have succeeded on subsequent turns, or if its future turns are correct with respect to the state it put itself in.
4. The benchmark does not reflect real-world use.
   Example: When a user asks a model to do something and the model does something different or unexpected, the subsequent ground-truth user turn is not the one a real-world user would send.
Proposed Changes
I propose we modify multi-turn evals to instead be single-turn, starting from an established ground-truth state.
Tangibly, this means each multi-turn sample is split into N samples, where N is the number of turns.
For each of these, we define a ground-truth state plus the previous assistant turns. The model is expected to provide the next turn in this state, and nothing else.
For metrics, we can continue with accuracy, but we can now also report "Turn-N" accuracy, which answers "How accurate is my model after N turns?" A rough sketch of this split is given below.
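To make the proposal concrete, here is a minimal sketch of splitting one multi-turn sample into per-turn samples seeded with the ground-truth state and conversation prefix, plus a Turn-N accuracy aggregation. The class, function, and field names (`SingleTurnSample`, `split_multi_turn`, `turn_n_accuracy`, `ground_truth_states`, `ground_truth_history`, `expected_calls`) are all hypothetical, chosen only for illustration:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class SingleTurnSample:
    turn_index: int        # which turn of the original sample this covers
    start_state: object    # ground-truth state the model starts from
    history: list          # ground-truth user/assistant turns before this one
    user_turn: str         # the single user turn the model must answer
    expected_calls: list   # function calls expected for this turn only

def split_multi_turn(sample):
    """Split one N-turn sample into N independent single-turn samples."""
    singles = []
    for i in range(len(sample.user_turns)):
        # The model starts from the ground-truth state after turn i-1
        # (the initial state when i == 0), not from its own earlier mistakes.
        start_state = sample.initial_state() if i == 0 else sample.ground_truth_states[i - 1]
        singles.append(SingleTurnSample(
            turn_index=i,
            start_state=start_state,
            history=sample.ground_truth_history[: 2 * i],  # assumes alternating user/assistant messages
            user_turn=sample.user_turns[i],
            expected_calls=sample.expected_calls[i],
        ))
    return singles

def turn_n_accuracy(results):
    """results: iterable of (turn_index, score) pairs, score in {0, 1} per single-turn sample."""
    by_turn = defaultdict(list)
    for turn_index, score in results:
        by_turn[turn_index].append(score)
    # Turn-N accuracy: fraction of samples answered correctly at turn N.
    return {n: sum(scores) / len(scores) for n, scores in sorted(by_turn.items())}
```

With this shape, an error at turn k only costs the turn-k sample, and Turn-N accuracy makes it visible where in a conversation models start to degrade.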