[Fix] Match verbose Chain-of-Thought in eval by ptonso · Pull Request #1552 · open-compass/VLMEvalKit

ptonso · 2026-05-21T01:54:13Z

Problem

The exact-match heuristic can_infer_option works well for terse multiple-choice answers, but fails on verbose responses such as:

The correct answer is B. Here's why... option B is correct because...

This happens because the current heuristic requires the option letter to appear uniquely and near the end of the response.
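
To make the failure mode concrete, here is a simplified stand-in for that kind of heuristic. It is not the actual can_infer_option code: the function name `toy_infer_option`, the tokenization, and the `tail` window of 5 tokens are all assumptions chosen only to illustrate the "unique and near the end" requirement.

```python
import re

def toy_infer_option(answer, choices=("A", "B", "C", "D"), tail=5):
    """Hypothetical stand-in, NOT the real can_infer_option.

    Accepts a letter only if it is the sole option letter present and its
    first occurrence falls within the last `tail` tokens of the response.
    """
    # Replace punctuation with spaces, then split on whitespace.
    tokens = re.sub(r"[^\w\s]", " ", answer).split()
    present = [c for c in choices if c in tokens]
    if len(present) != 1:
        return None  # zero or several distinct option letters: ambiguous
    letter = present[0]
    if tokens.index(letter) >= len(tokens) - tail:
        return letter
    return None  # the letter was stated too early in the response

terse = "The answer is B."
verbose = "The correct answer is B. Here's why... option B is correct because..."

print(toy_infer_option(terse))    # B
print(toy_infer_option(verbose))  # None
```

In the verbose case the letter is stated early and followed by an explanation, so a position-based check of this kind rejects it even though the answer is unambiguous to a human reader.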

Why this matters

When extraction fails, VLMEvalKit routes the answer to the GPT judge. In MMBench circular evaluation, each question is evaluated across shuffled option permutations, and the group is correct only if all permutations are correct. This makes repeated judge fallback especially costly, since small extraction errors compound across permutations.
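
The all-or-nothing group rule can be sketched as follows. This is an illustrative scorer, not VLMEvalKit's implementation: the function name `circular_accuracy`, the `(group_id, is_correct)` row layout, and the use of four permutations per question are assumptions.

```python
from collections import defaultdict

def circular_accuracy(rows):
    """rows: iterable of (group_id, is_correct), one entry per shuffled permutation.

    A group counts as correct only if every one of its permutations is correct.
    """
    groups = defaultdict(list)
    for group_id, is_correct in rows:
        groups[group_id].append(is_correct)
    if not groups:
        return 0.0
    return sum(all(flags) for flags in groups.values()) / len(groups)

rows = [
    ("q1", True), ("q1", True), ("q1", True), ("q1", True),
    ("q2", True), ("q2", True), ("q2", False), ("q2", True),
]

row_accuracy = sum(ok for _, ok in rows) / len(rows)
print(row_accuracy)             # 0.875
print(circular_accuracy(rows))  # 0.5
```

Seven of eight rows are correct here, yet only one of two groups passes. Under the further assumption that rows fail independently, a per-row success rate of 0.9 over four permutations would give a group success rate of about 0.9 ** 4 ≈ 0.656, which is why a single misjudged row is disproportionately costly.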

Context

This affects verbose models such as Gemma3, where the answer is often stated explicitly near the beginning of a chain-of-thought-style response. As a result, many parseable answers are unnecessarily sent to the GPT judge, producing unstable and unexpectedly low circular accuracy.

Proposed solution

Add a conservative fallback after the existing can_infer_option logic fails. The fallback extracts explicit answer declarations such as:

answer is B, correct answer is B, The correct answer is **B**

using:

(?i)(?:correct\s+)?answer\s+is\s+\**([ABCD])\**

The existing heuristic remains unchanged and is still tried first.
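
A minimal sketch of how such a fallback might be wired, using the regex quoted above. The function names `extract_declared_option` and `infer_with_fallback`, and the `primary` callable standing in for the existing heuristic, are hypothetical and do not reflect the actual diff.

```python
import re

# Regex taken from the description above; everything else is illustrative.
ANSWER_DECL = re.compile(r"(?i)(?:correct\s+)?answer\s+is\s+\**([ABCD])\**")

def extract_declared_option(response):
    """Return the first explicitly declared option letter, or None."""
    match = ANSWER_DECL.search(response)
    return match.group(1).upper() if match else None

def infer_with_fallback(response, primary):
    """Try the existing heuristic (`primary`, hypothetical) first, then the regex."""
    result = primary(response)
    if result:
        return result
    return extract_declared_option(response)

always_fails = lambda response: None  # stand-in for a heuristic that gave up

print(extract_declared_option("The correct answer is **B**"))  # B
print(infer_with_fallback("The correct answer is B. Here's why...", always_fails))  # B
print(extract_declared_option("I would pick the second one"))  # None
```

One property worth checking before relying on this pattern: `(?i)` applies to the whole expression, so `[ABCD]` also matches lowercase letters, and nothing anchors the end of the captured letter. In this sketch, "the answer is a dog" yields "A". A trailing word boundary or a case-sensitive character class might narrow that, at the cost of missing some lowercase declarations.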

Results

Model	Rows resolved without GPT	Groups still routed to GPT	Circular acc, regex-only (lower bound)
Gemma3-4B	14.4% → 97.9%	1746 → 80	5.6% → 65.6%
Gemma3-12B	20.5% → 92.6%	1646 → 237	11.4% → 69.3%
Gemma3-27B	28.1% → 80.4%	1577 → 560	14.6% → 58.6%

[Fix] Match verbose Chain-of-Thought in eval

9bb7053

ptonso wants to merge 1 commit into open-compass:main from ptonso:fix/match-verbose-CoT