Why is the CLIP Score max value around 30? #3142
Why is the CLIP Score max value around 30 instead of 100? I thought that if the similarity range is [0, 1], then multiplying by 100 would give a score range of [0, 100], but even when the image and the text match very well, the CLIP Score is only around 30.
Replies: 1 comment
CLIPScore in torchmetrics is essentially

$$\mathrm{CLIPScore}(I, T) = 100 \cdot \cos(f_{\mathrm{img}}, f_{\mathrm{txt}})$$

using L2-normalized CLIP embeddings $f_{\mathrm{img}}$ and $f_{\mathrm{txt}}$. Cosine similarity reaches 1 only when the image and text embeddings are identical; with separate encoders and real data, well-matched pairs typically land around cosine 0.25–0.35, which translates to scores near 25–35. The factor of 100 is just a scaling convention (inspired by CLIP's logit scale) and doesn't mean "percent correctness." Treat CLIPScore as a relative metric: compare models/captions with the same backbone and settings rather than expecting scores to approach 100.
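
For concreteness, here is a minimal sketch of that computation. It assumes the `openai/clip-vit-base-patch16` weights can be downloaded from the Hugging Face Hub (that model name is just an example, not necessarily the torchmetrics default), and it reproduces the metric by hand as 100 times the cosine similarity of the L2-normalized embeddings:

```python
import torch
from torchmetrics.multimodal.clip_score import CLIPScore
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch16"  # example backbone, an assumption
image = torch.randint(0, 255, (3, 224, 224), dtype=torch.uint8)  # dummy image
caption = "a photo of a cat"

# 1) Score via the torchmetrics metric
metric = CLIPScore(model_name_or_path=model_name)
tm_score = metric(image, caption)

# 2) The same quantity by hand: cosine of L2-normalized embeddings, times 100
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)
inputs = processor(text=[caption], images=[image], return_tensors="pt", padding=True)
with torch.no_grad():
    f_img = model.get_image_features(pixel_values=inputs["pixel_values"])
    f_txt = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
f_img = f_img / f_img.norm(p=2, dim=-1, keepdim=True)
f_txt = f_txt / f_txt.norm(p=2, dim=-1, keepdim=True)
# torchmetrics also clamps negative similarities to 0
manual_score = (100 * (f_img * f_txt).sum(dim=-1)).clamp(min=0)

print(tm_score, manual_score)
```

Both printed values stay well below 100 even for a perfectly fitting caption, because the cosine between image and text embeddings from separate encoders rarely exceeds roughly 0.35 in practice.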