[Fix] Match verbose Chain-of-Thought in eval #1552
Open
ptonso wants to merge 1 commit into
Problem
exact match
`can_infer_option` currently works well for terse multiple-choice answers, but fails on verbose responses such as:

> The correct answer is B. Here's why... option B is correct because...

This happens because the current heuristic requires the option letter to appear uniquely and near the end of the response.
Why this matters
When extraction fails, VLMEvalKit routes the answer to the GPT judge. In MMBench circular evaluation, each question is evaluated across shuffled option permutations, and the group is correct only if all permutations are correct. This makes repeated judge fallback especially costly, since small extraction errors compound across permutations.
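To make the compounding concrete, here is a minimal sketch of the all-permutations rule. The function names are hypothetical and not part of VLMEvalKit, and the accuracy estimate assumes each permutation is answered correctly independently with the same probability, which is a simplification.

```python
def circular_group_correct(per_permutation_hits):
    """A circular group counts as correct only if every permutation is correct."""
    return all(per_permutation_hits)


def expected_group_accuracy(per_permutation_accuracy, num_permutations):
    """Rough estimate under an independence assumption (illustrative only)."""
    return per_permutation_accuracy ** num_permutations


# One miss out of four shuffled permutations fails the whole group.
print(circular_group_correct([True, True, False, True]))

# Under the independence assumption, 95% per-permutation accuracy over
# four permutations gives roughly 81% group accuracy.
print(round(expected_group_accuracy(0.95, 4), 4))
```

Under this simplified model, even a small per-permutation error rate from extraction or judge fallback reduces group accuracy noticeably.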
Context
This affects verbose models such as Gemma3, where the answer is often stated explicitly near the beginning of a chain-of-thought-style response. As a result, many parseable answers are unnecessarily sent to the GPT judge, producing unstable and unexpectedly low circular accuracy.
Proposed solution
Add a conservative fallback after the existing `can_infer_option` logic fails. The fallback extracts explicit answer declarations such as `answer is B`, `correct answer is B`, and `The correct answer is **B**` using:

`(?i)(?:correct\s+)?answer\s+is\s+\**([ABCD])\**`

The existing heuristic remains unchanged and is still tried first.
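A minimal sketch of how such a fallback could look, using the pattern quoted above. The helper name `extract_declared_option` is hypothetical and is not the function added by this PR; the call into the existing `can_infer_option` logic is omitted because its exact signature is not shown here.

```python
import re

# Pattern as quoted in the description: optional "correct", then "answer is",
# then an option letter optionally wrapped in Markdown bold asterisks.
ANSWER_DECLARATION = re.compile(r"(?i)(?:correct\s+)?answer\s+is\s+\**([ABCD])\**")


def extract_declared_option(response):
    """Return the first explicitly declared option letter, or None.

    Hypothetical helper: intended to run only after the existing
    heuristic has already failed to infer an option.
    """
    match = ANSWER_DECLARATION.search(response)
    if match is None:
        return None
    # Normalize to uppercase, since (?i) also accepts lowercase letters.
    return match.group(1).upper()


verbose = "The correct answer is B. Here's why... option B is correct because..."
print(extract_declared_option(verbose))
print(extract_declared_option("The correct answer is **B**"))
print(extract_declared_option("I am not sure which option applies."))
```

One thing to keep in mind with the pattern as written: it has no word boundary after the captured letter, so a phrase like `answer is Apple` would capture `A`. Whether that matters depends on the response styles seen in practice.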
Results