Evaluation results of InternVL2_5-2B on GSM8K don't match those in the paper. #951

lynshwoo2022 · 2025-03-11T11:54:34Z

Table 13 of your paper "Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling" reports that InternVL2_5-2B scores about 55 on GSM8K (4-shot), but when I ran the evaluation myself I only got around 37.

I suspect the prompt I use with InternVL2_5-2B is not the most effective one. Could you please share some example prompts that guide InternVL2_5 to produce its best answers?
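
To make the question concrete, here is a minimal sketch of how a 4-shot GSM8K prompt and the matching answer extraction are commonly set up. It is an illustration only, not the paper's evaluation code and not necessarily the script used for the ~37 result: the exemplars are made-up placeholders, the "Question:/Answer:" template and the "The answer is" suffix are assumptions, and `build_prompt`, `extract_answer`, and `is_correct` are hypothetical helper names. The one fact it relies on is that GSM8K gold solutions end with `#### <number>`. Both the template and the extraction rule can plausibly shift the measured accuracy, so they are worth comparing against the authors' setup.

```python
import re

# Placeholder exemplars (made up for illustration, NOT taken from GSM8K
# or from the paper). Each entry: (question, reasoning, final answer).
EXEMPLARS = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?",
     "Tom starts with 3 apples and buys 2 more, so 3 + 2 = 5.", "5"),
    ("A book costs 4 dollars. How much do 6 books cost?",
     "Each book costs 4 dollars, so 6 books cost 6 * 4 = 24.", "24"),
    ("Ann had 10 pens and gave away 7. How many pens are left?",
     "Ann had 10 pens and gave away 7, so 10 - 7 = 3.", "3"),
    ("A car drives 50 km per hour for 3 hours. How far does it go?",
     "The car covers 50 km each hour, so 50 * 3 = 150.", "150"),
]

# A number with optional sign, thousands separators, and decimal part.
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def build_prompt(question, exemplars=EXEMPLARS):
    """Build a few-shot chain-of-thought prompt (assumed template)."""
    parts = [
        f"Question: {q}\nAnswer: {reasoning} The answer is {answer}."
        for q, reasoning, answer in exemplars
    ]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


def extract_answer(text):
    """Return the predicted number as a string, or None if none is found.

    Prefers the number after "The answer is"; otherwise falls back to
    the last number that appears in the text.
    """
    match = re.search(r"The answer is\s*\$?(" + NUMBER + ")", text)
    if match:
        raw = match.group(1)
    else:
        numbers = re.findall(NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace(",", "")


def is_correct(prediction, gold_solution):
    """Compare a model output with a GSM8K gold solution ("... #### 18")."""
    gold = gold_solution.split("####")[-1].strip().replace(",", "")
    predicted = extract_answer(prediction)
    return predicted is not None and float(predicted) == float(gold)
```

Calling `build_prompt(question)` returns the text to send to the model, and `is_correct(model_output, gold_solution)` scores one sample; averaging that over the test set gives the accuracy figure to compare with the paper.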
