
[BFCL] Change multi-turn evals to single-turn from "ground truth" state #920

Open

jgreer013 opened this issue Feb 26, 2025 · 0 comments

@jgreer013 (Contributor)
Describe the issue
Currently, multi-turn evals function as follows:

  1. A sequence of user turns is predefined for each sample
  2. Each user turn has an expected set of function calls that should have been made by the end of that turn
  3. Multi-turn generation is run as follows (see the sketch after this list):
    a. Feed user turn 1 to model
    b. Parse response from model
    c. Execute tool call from parsed response
    d. Update state
    e. Return to step a for next turn
    f. Stop when no more turns remain
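
In rough Python (the function and field names here are illustrative stand-ins, not BFCL's actual handler API), the generation loop looks something like this:

```python
def run_multi_turn(sample, model, parse_tool_calls, execute):
    """Illustrative sketch of the current multi-turn generation loop.

    `model`, `parse_tool_calls`, and `execute` are hypothetical callables,
    not the benchmark's real interfaces.
    """
    messages = []
    state = dict(sample["initial_state"])
    per_turn_calls = []
    for user_turn in sample["user_turns"]:
        messages.append({"role": "user", "content": user_turn})  # (a) feed user turn
        response = model(messages)
        tool_calls = parse_tool_calls(response)                  # (b) parse response
        for call in tool_calls:                                  # (c) execute tool call
            state = execute(state, call)                         # (d) update state
        messages.append({"role": "assistant", "content": response})
        per_turn_calls.append(tool_calls)
        # (e)/(f) continue with the next user turn, stop when none remain
    return state, per_turn_calls
```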

After running generation, eval does the following (see the sketch after this list):

  1. Check whether the state is valid; if not, give a score of 0 to the entire sample and move to the next sample.
  2. Compare the ground-truth state to the model's resulting state. If any expected function calls are missing from the model's response, give a score of 0 to the entire sample and move to the next sample.
  3. If both the state and the function calls match the ground truth, return to step 1 for the next turn.
  4. If all turns had valid states with no missing calls, give a score of 1 to the entire sample.
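
A minimal sketch of this all-or-nothing scoring, again using assumed field names rather than the benchmark's actual data structures:

```python
def score_sample(turn_states, gt_states, turn_calls, gt_calls):
    """Illustrative all-or-nothing scoring of one multi-turn sample.

    `turn_states`/`gt_states` are per-turn model and ground-truth states;
    `turn_calls`/`gt_calls` are per-turn function calls. These names are
    assumptions for this sketch.
    """
    for state, gt_state, calls, expected in zip(
        turn_states, gt_states, turn_calls, gt_calls
    ):
        if state != gt_state:                      # 1. invalid state -> whole sample scores 0
            return 0
        if any(c not in calls for c in expected):  # 2. missing calls -> whole sample scores 0
            return 0
        # 3. both valid -> continue with the next turn
    return 1                                       # 4. every turn passed -> sample scores 1
```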

What is the issue
There are a few problems with this:

  • Previously correct turns are heavily punished
    • Example: For 4 turns, if the model gets the first 3 turns correct, but the last turn wrong, the entire sample is given a score of 0, equivalent to a model getting every turn wrong.
  • Future correct turns are heavily punished
    • Example: If the model gets the first turn wrong, but does the right thing in subsequent turns, the state is still considered "invalid" and thus the sample is given a score of 0, again equivalent to a model getting every turn wrong.
  • Benchmark becomes incredibly sensitive to noise
    • Example: If a particular user turn is ambiguous or has multiple interpretations not covered in the ground truth, the model will get the entire sample counted wrong even if it would've succeeded on subsequent turns, or if the future turns are correct with respect to the state it put itself in.
  • Benchmark does not reflect real-world use
    • Example: When a user asks a model to do something, and the model does something different/unexpected, the subsequent ground truth user turn is not the one a real-world user would send.

Proposed Changes
I propose we modify multi-turn evals to instead be single-turn, starting from an established ground-truth state.

Concretely, each multi-turn sample would be split into N samples, where N is the number of turns.

For each of these, we define a ground-truth state plus the previous assistant turns. The model is expected to provide the next turn from this state, and nothing else.
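
As a rough sketch of the split (fields such as gt_states, gt_calls, and gt_assistant_turns are hypothetical, not BFCL's actual schema):

```python
def split_into_single_turn(sample):
    """Hypothetical splitter: one multi-turn sample -> N single-turn samples.

    Each new sample carries the ground-truth state and conversation history
    up to turn i, and asks the model only for turn i's response.
    """
    single_turn_samples = []
    history = []
    for i, user_turn in enumerate(sample["user_turns"]):
        single_turn_samples.append({
            "turn_index": i,
            "initial_state": sample["gt_states"][i],   # ground-truth state before turn i
            "messages": history + [{"role": "user", "content": user_turn}],
            "expected_calls": sample["gt_calls"][i],    # expected calls for turn i only
        })
        # Extend the history with ground-truth turns so later samples are
        # conditioned on correct prior behavior, not the model's own output.
        history = history + [
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": sample["gt_assistant_turns"][i]},
        ]
    return single_turn_samples
```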

For metrics, we can continue with accuracy, but we can now also report "Turn-N" accuracy, which answers the question "How accurate is my model after N turns?"
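
For example, Turn-N accuracy could be aggregated from per-turn scores like this (the (turn_index, score) result format is an assumption for illustration):

```python
from collections import defaultdict

def turn_n_accuracy(results):
    """Aggregate single-turn scores into Turn-N accuracy.

    `results` is assumed to be a list of (turn_index, score) pairs,
    where score is 0 or 1 for one single-turn sample.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for turn_index, score in results:
        totals[turn_index] += 1
        correct[turn_index] += score
    return {n: correct[n] / totals[n] for n in sorted(totals)}

# e.g. turn_n_accuracy([(0, 1), (0, 1), (1, 0), (1, 1)]) -> {0: 1.0, 1: 0.5}
```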
