
Region Captioning Eval on PLM-VideoBench #95


Description

@MSiam

Hello,
I have checked Appendix C.3 for the evaluation protocol on PLM-VideoBench. The prompt you give to Llama to evaluate the region captions is the following:
"Your task is to compare a given pair of captions and provide a single score indicating how correct the pred is compared to GT, on a scale from 0 to 10. Focus on meaning and context, not exact word matches. Penalize missing and incorrect information, with lower scores for more significant errors. High scores require accurate conveyance of all key GT information. Respond with only the score, starting your response with the number and including no additional text. Output format: [score]."
However, from this prompt alone the LLM has no way of knowing which caption is the ground truth and which is the prediction, so how can it identify missing information? It seems the prompt specifies the output format but not the input format. Am I missing something?
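For concreteness, below is a minimal sketch of how I would have expected the judge input to be constructed so the model can tell the two captions apart. The function name `build_judge_messages` and the "GT:"/"pred:" labels are my own guesses, not something from the paper or the repo:

```python
# Sketch (my assumption, not the repo's code) of how the judge input might be
# structured so Llama knows which caption is GT and which is the prediction.
JUDGE_PROMPT = (
    "Your task is to compare a given pair of captions and provide a single score "
    "indicating how correct the pred is compared to GT, on a scale from 0 to 10. ..."
)

def build_judge_messages(gt_caption: str, pred_caption: str) -> list[dict]:
    """Label the two captions explicitly before handing them to the judge."""
    user_input = f"GT: {gt_caption}\npred: {pred_caption}"
    return [
        {"role": "system", "content": JUDGE_PROMPT},
        {"role": "user", "content": user_input},
    ]
```

Is something along these lines what you actually do, or is the pairing conveyed differently?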

Also, can you explain how the 0-10 scores are aggregated to match the 46.6 reported in Table 5? Are you summing the scores across examples? Because if you were averaging, the result should be a number between 0 and 10.
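My guess (not stated in the paper, as far as I can tell) is that the per-example scores are averaged and then rescaled to 0-100, e.g. a mean judge score of 4.66 would be reported as 46.6:

```python
def aggregate_scores(scores: list[float]) -> float:
    """Average the 0-10 judge scores and rescale to 0-100 (my assumed convention)."""
    return 10.0 * sum(scores) / len(scores)

# Example: a mean judge score of 4.66 would be reported as 46.6.
print(round(aggregate_scores([4, 5, 5, 4.6, 4.7]), 1))  # 46.6
```

Could you confirm which aggregation is used?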

Thank you,
