Description
Hello,
I have checked Appendix C.3 for the evaluation protocol on PLM-VideoBench. The prompt you give to Llama to evaluate the region captions is written as follows:
"Your task is to compare a given pair of captions and provide a single score indicating how correct the pred is compared to GT, on a scale from 0 to 10. Focus on meaning and context, not exact word matches. Penalize missing and incorrect information, with lower scores for more significant errors. High scores require accurate conveyance of all key GT information. Respond with only the score, starting your response with the number and including no additional text. Output format: [score]."
However, there is no way the LLM can tell which caption is the ground truth and which is the prediction from this input alone, so how can it identify missing information? It looks like the prompt specifies the output format but not the input format. Am I missing something?
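To make my concern concrete, here is a minimal sketch of how I would have expected the judge input to be built, with explicit GT/prediction labels. The prompt text is the one quoted from Appendix C.3; the input template and function names are purely my own guesses, not code from this repo:

```python
JUDGE_PROMPT = (
    "Your task is to compare a given pair of captions and provide a single score "
    "indicating how correct the pred is compared to GT, on a scale from 0 to 10. "
    "... Output format: [score]."
)

def build_judge_input(gt_caption: str, pred_caption: str) -> str:
    # Hypothetical input template: without explicit labels like these, the judge
    # has no way to know which caption is the ground truth and which is the prediction.
    return f"{JUDGE_PROMPT}\n\nGT: {gt_caption}\nPred: {pred_caption}"
```

Is something along these lines what you actually do, or is the input formatted differently?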
Also, could you explain how the per-example 0-10 scores are aggregated to match the numbers in Table 5, e.g. 46.6? Are you summing the scores across examples? If you were averaging, the result should be a number between 0 and 10.
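For instance, the only way I can see a value like 46.6 arising from averaging is if the mean score is rescaled to 0-100. This is purely my guess, and the scores below are placeholder values just to show the arithmetic:

```python
def aggregate(scores: list[int]) -> float:
    # If per-example 0-10 judge scores are averaged, the result stays in [0, 10];
    # multiplying by 10 would put it on a 0-100 scale comparable to Table 5.
    return 10.0 * sum(scores) / len(scores)

# Placeholder scores, not real benchmark outputs.
print(aggregate([5, 4, 6, 3, 5]))  # -> 46.0 (mean 4.6 rescaled to 0-100)
```

Is that the scheme you use, or is the aggregation done differently?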
Thank you,