
WIP: MLLaMA-Vision Model Implementation with Candle #2773

Draft · wants to merge 6 commits into main
Conversation

elk-cloner

Note: I'm new to Rust and actively learning while implementing this - feedback on idiomatic Rust patterns is very welcome! 🦀

Implementation Guide

Trying to implement meta-llama/Llama-3.2-11B-Vision by following the model conversion guide for pretrained models.

Current Implementation Status

TODO

  • Vision Model Testing
    • Compare outputs with the reference implementation
  • Language Model Implementation
    • Investigate reusing existing LLaMA model with cross-attention layer
    • Potential future enhancements:
      • KV cache
      • Speculative decoding
      • Continuous batching
  • Language Model Testing
  • MLLaMAConditionalGeneration Implementation
  • End-to-End Model Testing (Vision + Language)
  • Code Refactoring for Idiomatic Rust
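Of the enhancements listed above, the KV cache is a natural first step: during autoregressive decoding, the key/value projections for past tokens are computed once, cached, and reused, so each step only projects the newest token. Below is a minimal, framework-free sketch in plain Rust; the struct and method names are hypothetical illustrations, not the candle API.

```rust
// Minimal KV-cache sketch (hypothetical names, plain Rust, no candle types).
// Keys/values for past tokens are appended once and reused, so each decoding
// step only needs the projections for the newest token.

#[derive(Default)]
struct KvCache {
    keys: Vec<Vec<f32>>,   // one entry per decoded token
    values: Vec<Vec<f32>>,
}

impl KvCache {
    /// Append the newest token's projections and return the full cached
    /// sequence, which the attention computation then runs over.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) -> (&[Vec<f32>], &[Vec<f32>]) {
        self.keys.push(k);
        self.values.push(v);
        (&self.keys, &self.values)
    }

    fn len(&self) -> usize {
        self.keys.len()
    }
}

fn main() {
    let mut cache = KvCache::default();
    for step in 0..3 {
        // In a real model these would come from the k/v projections
        // applied to the hidden state of the new token.
        let (ks, vs) = cache.append(vec![step as f32; 4], vec![step as f32; 4]);
        assert_eq!(ks.len(), step + 1);
        assert_eq!(vs.len(), step + 1);
    }
    println!("cached tokens: {}", cache.len()); // → cached tokens: 3
}
```

In a real implementation the cache would hold candle tensors and be concatenated along the sequence dimension, but the bookkeeping is the same.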

Questions/Discussion Points

  1. Can we leverage the existing LLaMA model implementation by adding a cross-attention layer?
  2. Looking for guidance on making the code more idiomatic Rust
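To make question 1 concrete: in the MLLaMA architecture, cross-attention layers let text hidden states act as queries while the vision encoder's outputs supply the keys and values. Here is a minimal numeric sketch of that pattern in plain Rust; the function names and `Vec<Vec<f32>>` matrices are illustrative assumptions, since real code would use candle tensors and learned projection layers.

```rust
// Scaled dot-product cross-attention sketch (hypothetical, plain Rust).
// Text hidden states supply the queries; vision features supply keys/values.

fn matmul(a: &[Vec<f32>], b: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let (n, k, m) = (a.len(), b.len(), b[0].len());
    let mut out = vec![vec![0.0; m]; n];
    for i in 0..n {
        for p in 0..k {
            for j in 0..m {
                out[i][j] += a[i][p] * b[p][j];
            }
        }
    }
    out
}

fn softmax_rows(m: &mut [Vec<f32>]) {
    for row in m.iter_mut() {
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0;
        for x in row.iter_mut() {
            *x = (*x - max).exp();
            sum += *x;
        }
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
}

/// Cross-attention: softmax(Q K^T / sqrt(d)) V, with Q from text tokens
/// and K, V from vision patches.
fn cross_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = q[0].len() as f32;
    // Transpose K so scores[i][j] = q_i · k_j.
    let kt: Vec<Vec<f32>> = (0..k[0].len())
        .map(|j| k.iter().map(|row| row[j]).collect())
        .collect();
    let mut scores = matmul(q, &kt);
    for row in scores.iter_mut() {
        for x in row.iter_mut() {
            *x /= d.sqrt();
        }
    }
    softmax_rows(&mut scores);
    matmul(&scores, v) // weighted sum of vision values per text token
}

fn main() {
    // 2 text tokens attending over 3 vision patches, hidden dim 4.
    let q = vec![vec![0.1; 4], vec![0.2; 4]];
    let k = vec![vec![0.3; 4], vec![0.1; 4], vec![0.5; 4]];
    let v = vec![vec![1.0; 4], vec![2.0; 4], vec![3.0; 4]];
    let out = cross_attention(&q, &k, &v);
    assert_eq!(out.len(), 2);    // one output row per text token
    assert_eq!(out[0].len(), 4); // same hidden dim as the values
    println!("output rows: {}", out.len());
}
```

If the existing LLaMA implementation is reused, the change would amount to interleaving blocks like this (with learned q/k/v/o projections and gating) between the standard self-attention decoder layers.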

Help Needed

  • Rust best practices and idioms
  • Any insights on model architecture decisions
  • Testing strategies

I appreciate any feedback or guidance, especially around Rust implementations and ML architecture decisions!

@EricLBuehler
Member

@elk-cloner this is super exciting! In case you are interested in a known-working implementation using Candle to help with any bugs, there is this: https://github.com/EricLBuehler/mistral.rs/tree/master/mistralrs-core/src/vision_models/mllama.

@elk-cloner
Author

Thanks a lot, @EricLBuehler. I wasn't aware of that repo. Does it still make sense to have this here? I had a quick look at the link and it seems I'm doing the same thing.

@LaurentMazare
Collaborator

I think it's a good thing to add to candle-transformers if you have the time to do it.
