Is there an interface planned for multimodal embeddings? We'd love to contribute one that accepts interleaved text and images, similar to how Anthropic does content blocks.
We don't have anything planned yet! Given the content blocks example, is the idea that you would accept an interleaved array of text and images, and then generate embeddings from that content? I assume this means the model would be a multimodal embedding model, like CLIP, for example?
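For concreteness, here is a rough sketch of what an interleaved content-block payload could look like. The block shape below loosely mirrors Anthropic-style content blocks and is only an assumption, not a settled schema:

```python
# Hypothetical interleaved content-block payload (shape is an assumption,
# loosely modeled on Anthropic-style content blocks).
content = [
    {"type": "text", "text": "Quarterly revenue chart:"},
    {
        "type": "image",
        "source": {"media_type": "image/png", "data": "<base64-encoded bytes>"},
    },
    {"type": "text", "text": "Revenue grew 12% quarter over quarter."},
]

# A multimodal embedding model (e.g., a CLIP-style model) could then embed
# the blocks jointly or per block, depending on the interface design.
```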
Yup, that's exactly what I'm thinking. The ability to accept content blocks would help a lot with RAG applications as well, since the full retrieved documents could be sent directly to the LLM.
Definitely open to exploring this. If you have a proposal for a multimodal embeddings interface, I'm definitely curious. I also realize the current structure of text-only vectorizers is a bit rigid. A better solution might be to package support for text, image, and multimodal models in a single streamlined interface. Open to suggestions!
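One possible shape for such a streamlined interface, purely as a sketch (the class and method names here, e.g. `BaseMultimodalVectorizer` and `embed`, are hypothetical and not part of the current library):

```python
from typing import Any, Dict, List, Union

# Hypothetical content block types; names are illustrative only.
ContentBlock = Dict[str, Any]  # {"type": "text", ...} or {"type": "image", ...}
Content = Union[str, List[ContentBlock]]


class BaseMultimodalVectorizer:
    """Sketch of a vectorizer that accepts plain text or interleaved
    text/image content blocks behind a single method."""

    def embed(self, content: Content) -> List[float]:
        # Normalize plain strings into a single text block so that
        # existing text-only callers keep working unchanged.
        if isinstance(content, str):
            blocks: List[ContentBlock] = [{"type": "text", "text": content}]
        else:
            blocks = content
        return self._embed_blocks(blocks)

    def _embed_blocks(self, blocks: List[ContentBlock]) -> List[float]:
        # Provider-specific implementation (e.g. a CLIP-style or VoyageAI
        # multimodal model) would go here.
        raise NotImplementedError
```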
@tylerhutcherson I created a proposal on how multimodal embeddings could work and added a reference implementation with VoyageAI. I created a Draft PR: #294