Replies: 4 comments 1 reply
-
Please read the merge request for micromanaging tensor offload: #11397
-
From what I understand after reading this: https://www.reddit.com/r/LocalLLaMA/comments/1jshwxe/first_results_are_in_llama_4_maverick_17b_active/, this would require creating a model to predict which experts will be loaded.
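As far as I can tell, the idea in the linked thread amounts to speculative expert prefetching: use the router's scores to guess which experts the next token is likely to hit, and start paging those in early. Below is a minimal Python sketch of that policy under my own assumptions; the function names, the `extra` over-fetch parameter, and the NumPy stand-ins are hypothetical, not llama.cpp code.

```python
# Minimal sketch of expert prediction for prefetching (hypothetical names,
# not llama.cpp code): given router logits, guess which experts to page in
# before the MoE layer actually needs them.
import numpy as np

def predict_experts(router_logits: np.ndarray, top_k: int = 2, extra: int = 2) -> list[int]:
    """Return the experts most likely to be activated.

    router_logits : (n_experts,) scores from the gating network.
    top_k         : experts the MoE layer will actually use.
    extra         : additional candidates prefetched speculatively,
                    trading memory for fewer cache misses.
    """
    order = np.argsort(router_logits)[::-1]      # best-scoring experts first
    return order[: top_k + extra].tolist()

# Example: 8 experts, top-2 routing, 2 speculative spares.
logits = np.random.randn(8)
print(predict_experts(logits))                   # e.g. [5, 1, 7, 3]
```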
-
The merge request has been closed, so I will copy what I said there: would it be possible to have llama.cpp load only some experts from disk to RAM or VRAM, or from RAM to VRAM, on demand? It would come at the cost of latency after the prompt is sent to the model. I am not sure if this is similar, but would it also be possible to keep several instances of experts, or of the most used tensors, on each compute device to increase inference speed for common queries, and also to separate each expert into commonly used and rarely used neurons (hot and cold neurons, respectively), as PowerInfer and PowerInfer-2 do? Would it also be possible to shard the model to achieve tensor parallelism across different types of devices, such as CPUs together with GPUs, using the hot and cold neuron approach, on any kind of AI model?
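For the hot/cold neuron part of this comment, here is a rough Python sketch of the policy PowerInfer describes: rank FFN neurons by how often they fire, keep the hot fraction on the fast device, and compute only that subset on the common path. All names and the NumPy stand-ins are illustrative assumptions, not anything llama.cpp implements today.

```python
# PowerInfer-style hot/cold neuron split, as a toy NumPy sketch.
import numpy as np

def split_hot_cold(activation_counts: np.ndarray, hot_fraction: float = 0.2):
    """Partition FFN neuron indices into 'hot' (frequently firing) and 'cold'
    (rarely firing) sets based on observed activation counts."""
    n_hot = max(1, int(len(activation_counts) * hot_fraction))
    order = np.argsort(activation_counts)[::-1]   # most frequently active first
    return order[:n_hot], order[n_hot:]           # (hot_ids, cold_ids)

def ffn_forward_hot_only(x, w_up, w_down, hot_ids):
    """Compute only the hot subset of neurons (kept on the fast device).
    Cold rows would be fetched from the slower device on demand; they are
    simply skipped in this sketch."""
    h_hot = np.maximum(w_up[hot_ids] @ x, 0.0)    # ReLU over the hot rows only
    return w_down[:, hot_ids] @ h_hot

counts = np.random.poisson(3, size=1024)          # stand-in activation statistics
hot, cold = split_hot_cold(counts)
x = np.random.randn(512)
w_up, w_down = np.random.randn(1024, 512), np.random.randn(512, 1024)
print(ffn_forward_hot_only(x, w_up, w_down, hot).shape)   # -> (512,)
```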
-
I found something interesting on MoE multi-GPU optimization: https://arxiv.org/pdf/2504.02263 and https://github.com/wassemgtk/MegaScale-Infer-Prototyp
-
Could DeepSeek R1 and other MoE models be made to load only a few of the experts (the roughly 37B active parameters) from disk into RAM and/or VRAM instead of the entire model? It would cause some latency after the prompt is sent to the model, just before token generation starts, while the experts are loaded.
Would it also be possible to load only the needed experts into VRAM from RAM?
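As an illustration of what on-demand expert loading could look like at the policy level, here is a small LRU-cache sketch in Python: experts are fetched from the slow tier (disk or host RAM) only on a miss, which is where the latency mentioned above would show up. The loader, capacity, and tensor shapes are made up for the example; llama.cpp already memory-maps model files, so this shows only a caching policy, not its API.

```python
# Sketch of on-demand expert loading with a small LRU cache (hypothetical
# names; not llama.cpp code). Misses pay the disk/PCIe cost, hits are free.
from collections import OrderedDict
import numpy as np

class ExpertCache:
    def __init__(self, capacity: int, loader):
        self.capacity = capacity          # experts that fit in fast memory (VRAM/RAM)
        self.loader = loader              # callable: expert_id -> weights (slow path)
        self.cache: OrderedDict[int, np.ndarray] = OrderedDict()

    def get(self, expert_id: int) -> np.ndarray:
        if expert_id in self.cache:                  # hit: mark most recently used
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        weights = self.loader(expert_id)             # miss: pay the load latency here
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:          # evict least recently used expert
            self.cache.popitem(last=False)
        return weights

# Fake loader standing in for "read this expert's tensors from the file on disk".
fake_loader = lambda eid: np.random.randn(64, 64)
cache = ExpertCache(capacity=4, loader=fake_loader)
for eid in [0, 1, 0, 2, 3, 4, 0]:                    # only misses touch the slow path
    _ = cache.get(eid)
print(list(cache.cache))                             # experts currently resident
```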