Description
Hello,
I have checked Appendix C.3 for the evaluation protocol on PLM-VideoBench. The prompt you give to Llama to evaluate the region captions is written as follows:
"Your task is to compare a given pair of captions and provide a single score indicating how correct the pred is compared to GT, on a scale from 0 to 10. Focus on meaning and context, not exact word matches. Penalize missing and incorrect information, with lower scores for more significant errors. High scores require accurate conveyance of all key GT information. Respond with only the score, starting your response with the number and including no additional text. Output format: [score]."
However, there is no way the LLM can tell which caption is the ground truth and which is the prediction from this input alone, so how can it identify missing information? It looks like the prompt specifies the output format but not the input format. Am I missing something?
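To make my concern concrete, here is a minimal sketch of how I would have expected the judge input to be built, with explicit GT/prediction labels. The prompt text is the one quoted from Appendix C.3; the input template and function names are purely my own guesses, not code from this repo:

```python
JUDGE_PROMPT = (
    "Your task is to compare a given pair of captions and provide a single score "
    "indicating how correct the pred is compared to GT, on a scale from 0 to 10. "
    "... Output format: [score]."
)

def build_judge_input(gt_caption: str, pred_caption: str) -> str:
    # Hypothetical input template: without explicit labels like these, the judge
    # has no way to know which caption is the ground truth and which is the prediction.
    return f"{JUDGE_PROMPT}\n\nGT: {gt_caption}\nPred: {pred_caption}"
```

Is something along these lines what you actually do, or is the input formatted differently?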
Also, could you explain how the per-example 0-10 scores are aggregated to match the numbers in Table 5, e.g. 46.6? Are you summing the scores across examples? If you were averaging, the result should be a number between 0 and 10.
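For instance, the only way I can see a value like 46.6 arising from averaging is if the mean score is rescaled to 0-100. This is purely my guess, and the scores below are placeholder values just to show the arithmetic:

```python
def aggregate(scores: list[int]) -> float:
    # If per-example 0-10 judge scores are averaged, the result stays in [0, 10];
    # multiplying by 10 would put it on a 0-100 scale comparable to Table 5.
    return 10.0 * sum(scores) / len(scores)

# Placeholder scores, not real benchmark outputs.
print(aggregate([5, 4, 6, 3, 5]))  # -> 46.0 (mean 4.6 rescaled to 0-100)
```

Is that the scheme you use, or is the aggregation done differently?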
Thank you,