CLIPScore in torchmetrics is essentially `100 * cos(f_img, f_txt)` using L2-normalized CLIP embeddings. Cosine reaches 1 only when the image and text embeddings are identical; with separate encoders and real data, well-matched pairs typically land around cosine 0.25–0.35, which translates to scores near 25–35. The factor of 100 is just a scaling convention (inspired by CLIP's logit scale) and doesn't mean "percent correctness." Treat CLIPScore as a relative metric: compare models/captions with the same backbone and settings rather than expecting scores to approach 100.
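If it helps to see the arithmetic, here is a minimal sketch of that computation written against Hugging Face `transformers` directly. It is not torchmetrics' exact code path: the backbone name is an illustrative choice, and the final clamp to 0 mirrors the metric's documented 0–100 bound.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative backbone choice; any CLIP checkpoint works, but scores are
# only comparable when computed with the same backbone.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

@torch.no_grad()
def clip_score(image, caption):
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)  # L2-normalize image embedding
    txt = txt / txt.norm(dim=-1, keepdim=True)  # L2-normalize text embedding
    # Scaled cosine similarity, floored at 0 (CLIPScore is bounded to [0, 100]).
    return 100 * (img * txt).sum(dim=-1).clamp(min=0)

# Random image as a stand-in; a well-matched real pair typically scores ~25-35.
image = Image.fromarray(np.uint8(np.random.rand(224, 224, 3) * 255))
print(clip_score(image, "a photo of a cat"))
```

For actual evaluation, use `torchmetrics.multimodal.clip_score.CLIPScore` and pass images and captions to it directly; the sketch above is just to show why raw scores top out well below 100 in practice.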
