[Fix] Match verbose Chain-of-Thought in eval #1552

Open
ptonso wants to merge 1 commit into open-compass:main from ptonso:fix/match-verbose-CoT

@ptonso ptonso commented May 21, 2026

Problem

The exact-match heuristic in can_infer_option works well for terse multiple-choice answers, but it fails on verbose responses such as:

The correct answer is B. Here's why... option B is correct because...

This happens because the current heuristic requires the option letter to appear uniquely and near the end of the response.

Why this matters

When extraction fails, VLMEvalKit routes the answer to the GPT judge. In MMBench circular evaluation, each question is evaluated across shuffled option permutations, and the group is correct only if all permutations are correct. This makes repeated judge fallback especially costly, since small extraction errors compound across permutations.
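The all-or-nothing grouping described above can be illustrated with a small sketch. This is not VLMEvalKit's code: `circular_accuracy` and the `(group_id, is_correct)` row layout are hypothetical names chosen for illustration, assuming each row is one shuffled permutation of a question.

```python
def circular_accuracy(rows):
    """Fraction of groups in which every permutation was answered correctly.

    rows: iterable of (group_id, is_correct) pairs, one pair per permutation.
    The row layout is an assumption made for this example.
    """
    groups = {}
    for group_id, is_correct in rows:
        # A group stays correct only while all of its rows are correct.
        groups[group_id] = groups.get(group_id, True) and is_correct
    return sum(groups.values()) / len(groups)


rows = [
    ("q1", True), ("q1", True), ("q1", False),  # one miss fails the whole group
    ("q2", True), ("q2", True),
]
row_accuracy = sum(ok for _, ok in rows) / len(rows)
group_accuracy = circular_accuracy(rows)
```

Here four of five rows are correct, yet only one of two groups counts as correct, which is why a single failed extraction or judge error per group is so costly.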

Context

This affects verbose models such as Gemma3, where the answer is often stated explicitly near the beginning of a chain-of-thought-style response. As a result, many parseable answers are unnecessarily sent to the GPT judge, producing unstable and unexpectedly low circular accuracy.

Proposed solution

Add a conservative fallback that runs only when the existing can_infer_option logic fails to infer an option. The fallback extracts explicit answer declarations such as:

answer is B, correct answer is B, The correct answer is **B**

using:

(?i)(?:correct\s+)?answer\s+is\s+\**([ABCD])\**

The existing heuristic remains unchanged and is still tried first.
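A minimal sketch of the control flow, assuming the existing heuristic can be treated as a callable that returns an option letter or a falsy value. The wrapper name `infer_option_with_fallback` and the `primary` parameter are hypothetical and do not reflect the actual diff; only the pattern is taken from the description above.

```python
import re

# Pattern quoted in the PR description. Because of (?i), the class [ABCD]
# also matches lowercase letters, so the result is upper-cased below.
ANSWER_DECLARATION = re.compile(r"(?i)(?:correct\s+)?answer\s+is\s+\**([ABCD])\**")


def infer_option_with_fallback(response, primary):
    """Try the existing heuristic first, then the explicit-declaration regex.

    primary: hypothetical stand-in for the existing can_infer_option logic;
    assumed to return an option letter, or a falsy value on failure.
    """
    choice = primary(response)
    if choice:
        return choice
    match = ANSWER_DECLARATION.search(response)
    return match.group(1).upper() if match else None


verbose = "The correct answer is B. Here's why... option B is correct because..."
always_fails = lambda response: None  # simulates the existing heuristic failing
result = infer_option_with_fallback(verbose, always_fails)
```

One property worth keeping in mind: the pattern has no word boundary after the captured letter and is case-insensitive, so a phrase like "the answer is a matter of..." would also match. Whether that matters in practice depends on the responses being evaluated.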

Results

| Model | Rows resolved without GPT | Groups still routed to GPT | Circular acc, regex-only (lower bound) |
| --- | --- | --- | --- |
| Gemma3-4B | 14.4% → 97.9% | 1746 → 80 | 5.6% → 65.6% |
| Gemma3-12B | 20.5% → 92.6% | 1646 → 237 | 11.4% → 69.3% |
| Gemma3-27B | 28.1% → 80.4% | 1577 → 560 | 14.6% → 58.6% |
