Are there plans to support multimodal prompts rather than text-only ones? This would be very helpful for evaluating in-context learning.