Replies: 4 comments 1 reply
-
Please read the merge request for micromanaging tensor offload: #11397
-
From what I understand after reading this: https://www.reddit.com/r/LocalLLaMA/comments/1jshwxe/first_results_are_in_llama_4_maverick_17b_active/, this would require creating a model to predict which experts will be loaded.
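As far as I can tell, the idea in the linked thread amounts to speculative expert prefetching: use the router's scores to guess which experts the next token is likely to hit, and start paging those in early. Below is a minimal Python sketch of that policy under my own assumptions; the function names, the `extra` over-fetch parameter, and the NumPy stand-ins are hypothetical, not llama.cpp code.

```python
# Minimal sketch of expert prediction for prefetching (hypothetical names,
# not llama.cpp code): given router logits, guess which experts to page in
# before the MoE layer actually needs them.
import numpy as np

def predict_experts(router_logits: np.ndarray, top_k: int = 2, extra: int = 2) -> list[int]:
    """Return the experts most likely to be activated.

    router_logits : (n_experts,) scores from the gating network.
    top_k         : experts the MoE layer will actually use.
    extra         : additional candidates prefetched speculatively,
                    trading memory for fewer cache misses.
    """
    order = np.argsort(router_logits)[::-1]      # best-scoring experts first
    return order[: top_k + extra].tolist()

# Example: 8 experts, top-2 routing, 2 speculative spares.
logits = np.random.randn(8)
print(predict_experts(logits))                   # e.g. [5, 1, 7, 3]
```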
-
The merge request has been closed, so I will copy what I said there: would it be possible to have llama.cpp load only some experts from disk to RAM or VRAM, or from RAM to VRAM, on demand? It would come at the cost of latency after the prompt is sent to the model. I am not sure if this is similar, but would it also be possible to keep several instances of experts, or of the most used tensors, on each compute device to increase inference speed for common queries, and also to separate each expert into commonly used and rarely used neurons (hot and cold neurons, respectively), as PowerInfer and PowerInfer-2 do? Would it also be possible to shard the model to achieve tensor parallelism across different types of devices, such as CPUs together with GPUs, using the hot and cold neuron approach, on any kind of AI model?
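For the hot/cold neuron part of this comment, here is a rough Python sketch of the policy PowerInfer describes: rank FFN neurons by how often they fire, keep the hot fraction on the fast device, and compute only that subset on the common path. All names and the NumPy stand-ins are illustrative assumptions, not anything llama.cpp implements today.

```python
# PowerInfer-style hot/cold neuron split, as a toy NumPy sketch.
import numpy as np

def split_hot_cold(activation_counts: np.ndarray, hot_fraction: float = 0.2):
    """Partition FFN neuron indices into 'hot' (frequently firing) and 'cold'
    (rarely firing) sets based on observed activation counts."""
    n_hot = max(1, int(len(activation_counts) * hot_fraction))
    order = np.argsort(activation_counts)[::-1]   # most frequently active first
    return order[:n_hot], order[n_hot:]           # (hot_ids, cold_ids)

def ffn_forward_hot_only(x, w_up, w_down, hot_ids):
    """Compute only the hot subset of neurons (kept on the fast device).
    Cold rows would be fetched from the slower device on demand; they are
    simply skipped in this sketch."""
    h_hot = np.maximum(w_up[hot_ids] @ x, 0.0)    # ReLU over the hot rows only
    return w_down[:, hot_ids] @ h_hot

counts = np.random.poisson(3, size=1024)          # stand-in activation statistics
hot, cold = split_hot_cold(counts)
x = np.random.randn(512)
w_up, w_down = np.random.randn(1024, 512), np.random.randn(512, 1024)
print(ffn_forward_hot_only(x, w_up, w_down, hot).shape)   # -> (512,)
```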
-
I found something interesting on MoE multi-GPU optimization: https://arxiv.org/pdf/2504.02263 and https://github.com/wassemgtk/MegaScale-Infer-Prototyp
-
Could DeepSeek R1 and other MoE models be made to load only a few of the experts (the roughly 37B active parameters) from disk into RAM and/or VRAM instead of the entire model? It would cause some latency after the prompt is sent to the model, just before token generation starts, while the experts are loaded.
Would it also be possible to load only the needed experts into VRAM from RAM?
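As an illustration of what on-demand expert loading could look like at the policy level, here is a small LRU-cache sketch in Python: experts are fetched from the slow tier (disk or host RAM) only on a miss, which is where the latency mentioned above would show up. The loader, capacity, and tensor shapes are made up for the example; llama.cpp already memory-maps model files, so this shows only a caching policy, not its API.

```python
# Sketch of on-demand expert loading with a small LRU cache (hypothetical
# names; not llama.cpp code). Misses pay the disk/PCIe cost, hits are free.
from collections import OrderedDict
import numpy as np

class ExpertCache:
    def __init__(self, capacity: int, loader):
        self.capacity = capacity          # experts that fit in fast memory (VRAM/RAM)
        self.loader = loader              # callable: expert_id -> weights (slow path)
        self.cache: OrderedDict[int, np.ndarray] = OrderedDict()

    def get(self, expert_id: int) -> np.ndarray:
        if expert_id in self.cache:                  # hit: mark most recently used
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        weights = self.loader(expert_id)             # miss: pay the load latency here
        self.cache[expert_id] = weights
        if len(self.cache) > self.capacity:          # evict least recently used expert
            self.cache.popitem(last=False)
        return weights

# Fake loader standing in for "read this expert's tensors from the file on disk".
fake_loader = lambda eid: np.random.randn(64, 64)
cache = ExpertCache(capacity=4, loader=fake_loader)
for eid in [0, 1, 0, 2, 3, 4, 0]:                    # only misses touch the slow path
    _ = cache.get(eid)
print(list(cache.cache))                             # experts currently resident
```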