Conversation

@Silviase
Collaborator

@Silviase Silviase commented May 16, 2025

  • Add to eval_all.sh
  • Add to src/eval_mm/metrics/__init__.py
  • Add to src/eval_mm/metrics/scorer_registry.py
  • Implement src/eval_mm/metrics/**_scorer.py (optional)
  • Add to src/eval_mm/tasks/__init__.py
  • Add to src/eval_mm/tasks/task_registry.py
  • Implement src/eval_mm/tasks/**.py
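
For readers unfamiliar with the registry pattern the checklist refers to, here is a minimal sketch of how a task ID could be mapped to its class. Everything in it is an assumption for illustration: `TASK_REGISTRY`, `load_task`, the stand-in classes, and the `"cvqa"` key are hypothetical, and the real `task_registry.py` may be organized differently.

```python
from typing import Dict, Type


class Task:
    """Stand-in base class for this sketch only."""


class CCOCR(Task):
    """Stand-in for the task implemented in src/eval_mm/tasks/cc_ocr.py."""


class CVQA(Task):
    """Stand-in for the task implemented in src/eval_mm/tasks/cvqa.py."""


# Hypothetical registry: task ID -> Task class.
TASK_REGISTRY: Dict[str, Type[Task]] = {
    "cc-ocr": CCOCR,
    "cvqa": CVQA,
}


def load_task(task_id: str) -> Task:
    """Instantiate the task registered under task_id, or fail with a clear message."""
    try:
        task_cls = TASK_REGISTRY[task_id]
    except KeyError:
        known = ", ".join(sorted(TASK_REGISTRY))
        raise ValueError(f"Unknown task '{task_id}'. Registered tasks: {known}") from None
    return task_cls()
```

With a layout like this, adding a task means implementing the class, exporting it from `__init__.py`, and adding one entry to the dictionary, which matches the steps in the checklist.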

@Silviase Silviase linked an issue May 16, 2025 that may be closed by this pull request
@Silviase Silviase requested a review from speed1313 May 16, 2025 02:45
@Silviase
Collaborator Author

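I evaluated Qwen2.5-VL-Instruct-7B and it scored about 0.748. With precision around 0.9 and recall around 0.5, is it reasonable to conclude that the model is failing to read some of the finer text? With only 150 images the coverage is not very broad, but it may be enough to see trends.
Please check that this one works as well.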
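
As a reference for the precision/recall discussion, the sketch below shows how F1 relates to precision and recall, and how a macro average (the mean of per-image F1 scores, which is how the `macro_f1` column in the results table is labeled) differs from the F1 of averaged precision and recall. The function names and the per-image values are made up for illustration and are not taken from the repository.

```python
from typing import List, Tuple


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_image: List[Tuple[float, float]]) -> float:
    """Mean of the per-image F1 scores."""
    return sum(f1(p, r) for p, r in per_image) / len(per_image)


# Invented (precision, recall) pairs for two images.
per_image = [(0.9, 0.9), (0.9, 0.1)]

mean_precision = sum(p for p, _ in per_image) / len(per_image)  # 0.9
mean_recall = sum(r for _, r in per_image) / len(per_image)     # 0.5

print(f1(mean_precision, mean_recall))  # F1 of the averaged precision and recall
print(macro_f1(per_image))              # average of the per-image F1 scores
```

The two printed values differ, so averaged precision and recall do not by themselves determine the macro-averaged F1.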

@Silviase
Collaborator Author

I also added the Japanese subset of CVQA.

@speed1313 speed1313 requested a review from Copilot May 17, 2025 15:18
Contributor

Copilot AI left a comment

Pull Request Overview

Adds support for a Japanese subset of the CC-OCR and CVQA tasks into the eval_mm framework, wiring them through the task and metric registries and enabling end-to-end evaluation.

  • Registers CCOCR and CVQA in task and scorer registries
  • Implements CC-OCR and CVQA Task classes with quick local tests
  • Updates eval_all.sh to include the new tasks

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File                                   | Description                                 |
|:---------------------------------------|:--------------------------------------------|
| src/eval_mm/tasks/task_registry.py     | Register new CCOCR and CVQA tasks           |
| src/eval_mm/tasks/cc_ocr.py            | Implement CCOCR Task                        |
| src/eval_mm/tasks/cvqa.py              | Implement CVQA Task                         |
| src/eval_mm/tasks/__init__.py          | Export new task classes                     |
| src/eval_mm/metrics/scorer_registry.py | Register new CCOCRScorer and CVQAScorer     |
| src/eval_mm/metrics/cvqa_scorer.py     | Implement CVQA scorer                       |
| src/eval_mm/metrics/__init__.py        | Export new scorer classes                   |
| eval_all.sh                            | Include CC-OCR and CVQA in evaluation script |
Comments suppressed due to low confidence (3)

src/eval_mm/metrics/scorer_registry.py:29

  • The registry key "mmmu" looks like a typo; it should be "mmmlu" to match the task ID and ensure the MMMLU scorer is correctly registered.
"mmmu": MMMUScorer,

src/eval_mm/tasks/cvqa.py:77

  • [nitpick] The helper function is named test_task, which may collide with similar helpers; rename it to test_cvqa_task for clarity and to avoid conflicts.
def test_task():

src/eval_mm/tasks/cc_ocr.py:53

  • [nitpick] The helper function is named test_task; consider renaming it to test_ccocr_task to clearly associate it with the CCOCR task.
def test_task():

return f"{question}\n{parsed_options}\n\n{MULTI_CHOICE_PROMPT}"


class CVQA(Task):
Copilot AI May 17, 2025

[nitpick] Adding a class docstring to CVQA would help future maintainers understand the dataset source, prompt structure, and filter criteria.

Suggested change
- class CVQA(Task):
+ class CVQA(Task):
+     """
+     CVQA is a task for evaluating visual question answering in the context of a specific dataset.
+     This task uses the `afaji/cvqa` dataset, filtered to include only entries where the `Subset` field
+     matches `('Japanese', 'Japan')`. The dataset includes questions, multiple-choice options, and images.
+     The input prompt is constructed by combining the question and its options, formatted as a multiple-choice
+     question. The user is instructed to select the most appropriate answer by providing the corresponding letter.
+     Key methods:
+     - `_prepare_dataset`: Prepares and filters the dataset, and constructs input prompts.
+     - `doc_to_text`: Extracts the input text for a given document.
+     - `doc_to_visual`: Extracts the associated image(s) for a given document.
+     - `doc_to_id`: Retrieves the unique identifier for a given document.
+     - `doc_to_answer`: Retrieves the correct answer for a given document.
+     """
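
To make the docstring's description concrete, here is a rough sketch of the subset filter and the multiple-choice prompt construction. Only the final f-string mirrors the line quoted from the diff; the wording of `MULTI_CHOICE_PROMPT`, the helpers `parse_options` and `is_japanese_subset`, and the assumption that `Subset` is stored as the string `"('Japanese', 'Japan')"` are hypothetical.

```python
from typing import Dict, List

# Assumed wording; the real MULTI_CHOICE_PROMPT constant is not shown in this thread.
MULTI_CHOICE_PROMPT = "Answer with the letter of the correct option."


def parse_options(options: List[str]) -> str:
    """Label each option with a letter: 'A. ...', 'B. ...', and so on."""
    return "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))


def construct_prompt(question: str, options: List[str]) -> str:
    """Combine the question, the lettered options, and the instruction."""
    parsed_options = parse_options(options)
    return f"{question}\n{parsed_options}\n\n{MULTI_CHOICE_PROMPT}"


def is_japanese_subset(doc: Dict[str, object]) -> bool:
    """Keep only rows of the Japanese subset (assumes Subset is stored as a string)."""
    return doc.get("Subset") == "('Japanese', 'Japan')"


docs = [
    {"Subset": "('Japanese', 'Japan')", "Question": "Q1", "Options": ["x", "y"]},
    {"Subset": "('Korean', 'South Korea')", "Question": "Q2", "Options": ["x", "y"]},
]
japanese_docs = [d for d in docs if is_japanese_subset(d)]
print(construct_prompt(japanese_docs[0]["Question"], japanese_docs[0]["Options"]))
```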

Collaborator

@speed1313 speed1313 left a comment

Please just address the substring-match point.

Collaborator

src/eval_mm/metrics/substring_match_scorer.py
Could you use this instead?
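
For context, a substring-match scorer generally counts a prediction as correct when the reference answer appears somewhere inside it. The sketch below illustrates that idea only; `substring_match` and `accuracy` are hypothetical names, and the actual `substring_match_scorer.py` may normalize or aggregate differently.

```python
from typing import List


def substring_match(prediction: str, answer: str) -> int:
    """Return 1 if the (stripped) answer occurs inside the prediction, else 0."""
    needle = answer.strip()
    if not needle:
        # An empty reference would match everything, so treat it as a miss.
        return 0
    return int(needle in prediction)


def accuracy(scores: List[int]) -> float:
    """Fraction of correct predictions; 0.0 for an empty list."""
    return sum(scores) / len(scores) if scores else 0.0


predictions = ["The answer is B.", "C", "I am not sure."]
answers = ["B", "C", "A"]
scores = [substring_match(p, a) for p, a in zip(predictions, answers)]
print(scores, accuracy(scores))
```

One caveat of this approach with single-letter answers: a reference such as "A" also matches any prediction that happens to contain a capital "A", so verbose outputs can be scored as correct by accident.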

@speed1313
Collaborator

gemma-3-4b-it on cc-ocr

{"question_id": "0", "text": "終点線 (特色・快速)\nSōbu Line\n船橋・津田沼・千葉方面\nfor Funabashi, Tsudanuma & Chiba\n\n快速 20:36\n快速 20:40\n千葉 15両\n津田沼 15両", "answer": "4 総武線(特急·快速) Sōbu Line 船橋·津田沼·千葉方面 for Funabashi, Tsudanuma & Chiba 快速 20:36 千 葉 15両 快速 20:40 津田沼 15両", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.8969696964588428}
{"question_id": "1", "text": "本0限定!!\n鶴ちゃんこ鍋 が\n一人前 ¥500", "answer": "本日限定!! 鶏ちゃんご鍋が 一人前¥500 (税込) ワソコイソ", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.6122448974510619}
{"question_id": "2", "text": "この出口は\n海抜 3.4m\nAbove sea level\n해발\n東京都交通局", "answer": "この出入口は 海抜 Above sea level 海拔 해발 3.4m 東京都交通局", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.9552238800695032}
{"question_id": "3", "text": "BS\nブリヂストン\nスイッチサインを確かめましょう", "answer": "BS ブリヂストン スリップサインを確かめましょう", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.913043477721172}
{"question_id": "4", "text": "コインランドリー\n八千代", "answer": "コインランドリー 八千代 アパ CO*N LM* 年中無体 お湯で洗たく マンション D-1 yDo 1 *い *** ***", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.3999999996654545}
{"question_id": "5", "text": "ありません。", "answer": "2F ポプラ 珈琲 *** 休業 *** *** Barドウム 店名 3月6日(日)まで 2月21日(月)~ 休*いたします。 *コロナウイ*ス感染症の拡大*止のため *の*力要*内容A)に基づ* ポプラ", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.04705882339709341}
{"question_id": "6", "text": "鏡はししわ\nGinbeko\n(The First Old Sake Brewery)\n角/バイキン道場\n(Kakigoriya Akiba)\n土浦の家\nどうもありがとう", "answer": "築地 銀だこ GINDACO 鰻 はしな 角ハイボール 酒場 tghball akaba *ighball Sakaba 圡用うなき うなぎ はしな 九州熱中屋 gragmmo 2F 矢澤 坐うま とうがらい 蓬莱 中国 料理 *** ***", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.4146341458424301}
{"question_id": "7", "text": "金泽ホテルモン\nキリン\n岐京屋\n290", "answer": "龍角散 龍 ダイレクト ばんしゃ*家 *料理 餋老乃龍 黒ホル 情ホル丁 情*ル マサヒ スー*ライ 生中 290 *品 KIRIN KIRIN KIRIN *情熱ホルモソ *ホル 情熱ホルモソ JONETSU HORUMON 酒* 出*丁 *家思い KIRIN キリンビール ラーメン 餃子 岐阜屋 質 *** *広小* *ウラ イニツ商事 *** -3672 ***", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.15483870949028095}
{"question_id": "8", "text": "クリーニング\n10:30から\n会員電話番号\nクリーナー 130円", "answer": "10-17 クリーニンク おしゃれ洗濯 じゃぷじゃぷ 会員価格 カッター 白色(立体) 130円 *** じゃぷじゃぷ *店 *** 10:30* お預かり 年*6* お*し!", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.39999999957584487}
{"question_id": "9", "text": "天谷町 竹馬", "answer": "*たさと 竹馬 天ぶら 京料理 ー* *け* ***", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.27272727235123967}

Several of these look difficult because of the image quality.
{"question_id": "9", "text": "天谷町 竹馬", "answer": "たさと 竹馬 天ぶら 京料理 ー ***", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.27272727235123967}
[Screenshot 2025-05-18 0 42 26]
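
To illustrate why a short but accurate transcription scores low, here is a simplified per-sample F1 based on character overlap. This is not the scorer used in this PR and is not expected to reproduce the `cc-ocr` values above; the function name, the character-level tokenization, and the whitespace handling are all assumptions.

```python
from collections import Counter
from typing import Tuple


def char_overlap_f1(prediction: str, answer: str) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from the multiset overlap of non-space characters."""
    pred = Counter(c for c in prediction if not c.isspace())
    gold = Counter(c for c in answer if not c.isspace())
    overlap = sum((pred & gold).values())  # characters shared by both, counted with multiplicity
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: everything that was transcribed is correct, but part of the sign is missing.
precision, recall, score = char_overlap_f1("コインランドリー", "コインランドリー 八千代")
print(precision, recall, score)
```

In this example precision is 1.0 while recall drops because the second word is missing, which is the same pattern as the samples where the model reads only the largest text on a sign.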

@Silviase
Collaborator Author

Thank you, that is a big help.

@speed1313
Collaborator

llm-jp-vila on cc-ocr
It outputs content beyond the OCR text, so its weak instruction-following ability drags the score down. The question being in English may also be a factor.

{"question_id": "0", "text": "The image shows a train station with a digital display board. The board displays the time of the next train, which is 20:36. The board also shows the destination of the train, which is for Funabashi, Tsudanuma, and Chiba. The board is located above a platform, and there are two cameras mounted on the ceiling.", "answer": "4 総武線(特急·快速) Sōbu Line 船橋·津田沼·千葉方面 for Funabashi, Tsudanuma & Chiba 快速 20:36 千 葉 15両 快速 20:40 津田沼 15両", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.23008849520407934}

sarashina 14b on cc-ocr

{"question_id": "0", "text": " 4. Sōbu Line for Funabashi, Tsudanuma & Chiba. 20:36 Chiba 15. 20:40 Tsudanuma 15.", "answer": "4 総武線(特急·快速) Sōbu Line 船橋·津田沼·千葉方面 for Funabashi, Tsudanuma & Chiba 快速 20:36 千 葉 15両 快速 20:40 津田沼 15両", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.6710526310743593}
{"question_id": "1", "text": " 本日限定!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", "answer": "本日限定!! 鶏ちゃんご鍋が 一人前¥500 (税込) ワソコイソ", "input_text": "Please output only the text content from the image without any additional descriptions or formatting.", "cc-ocr": 0.027874564278065776}

Overall results

| Model                                    | CC-OCR/macro_f1   | CVQA/Acc   |
|:-----------------------------------------|:------------------|:-----------|
| stabilityai/japanese-instructblip-alpha  | 0.0               | 0.24       |
| SakanaAI/Llama-3-EvoVLM-JP-v2            | 0.1               | 0.44       |
| cyberagent/llava-calm2-siglip            | 0.11              | 0.07       |
| llm-jp/llm-jp-3-vila-14b                 | 0.12              | 0.3        |
| sbintuitions/sarashina2-vision-8b        | 0.08              | 0.49       |
| sbintuitions/sarashina2-vision-14b       | 0.23              | 0.56       |
| llava-hf/llava-1.5-7b-hf                 | 0.14              | 0.41       |
| llava-hf/llava-v1.6-mistral-7b-hf        | 0.2               | 0.27       |
| neulab/Pangea-7B-hf                      | 0.19              | 0.48       |
| mistralai/Pixtral-12B-2409               | 0.27              | 0.49       |
| meta-llama/Llama-3.2-11B-Vision-Instruct | 0.28              | 0.52       |
| Efficient-Large-Model/VILA1.5-13b        | 0.21              | 0.44       |
| OpenGVLab/InternVL2-8B                   | 0.37              | 0.48       |
| OpenGVLab/InternVL2-26B                  | 0.39              | 0.5        |
| Qwen/Qwen2.5-VL-7B-Instruct              | <u>0.8</u>        | 0.56       |
| Qwen/Qwen2.5-VL-72B-Instruct             | **0.8**           | <u>0.7</u> |
| google/gemma-3-4b-it                     | 0.55              | 0.49       |
| google/gemma-3-12b-it                    | 0.65              | 0.59       |
| google/gemma-3-27b-it                    | 0.67              | 0.63       |
| microsoft/Phi-4-multimodal-instruct      | 0.42              | 0.37       |
| gpt-4o-2024-11-20                        | 0.64              | **0.8**    |
| turing-motors/Heron-NVILA-Lite-15B       | 0.45              | 0.61       |

@speed1313 speed1313 merged commit 77cc07a into master May 18, 2025
1 check passed
@speed1313 speed1313 deleted the 167-task-ccocr branch May 18, 2025 03:29
Development

Successfully merging this pull request may close these issues.

Add the Japanese subset of CC-OCR

3 participants