Replies: 3 comments 2 replies
- I can see two ways to do this, but neither is that appealing:
  Neither option seems to fit with the existing codebase very well, though. :/
  - I tried to outline how the second option could be done in this post:
- Does this mean that DeepSeek V3.2 support is coming to llama.cpp Soon™?
- DeepSeek's new sparse-attention model enables efficient very-long-context inference. What would be an acceptable design for the lightning indexer?
  For reference, this is the attention mechanism design from the tech report,
  and here is the Hugging Face reference implementation: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/inference/model.py#L435
  Technical report here
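  For readers new to the mechanism: per the tech report, the lightning indexer gives each query token a score for every cached token as a weighted sum, over a few small indexer heads, of ReLU(q·k); the main attention then attends only to the top-k scoring tokens. A minimal NumPy sketch of that scoring and selection step, assuming toy shapes and hypothetical function names of my own (not taken from the reference implementation):

  ```python
  import numpy as np

  def lightning_index_scores(q_idx, k_idx, w_idx):
      """Index scores for one query token against all cached tokens.

      Hypothetical shapes, following the tech report's description:
        q_idx: (H_idx, D_idx)  indexer queries, one per indexer head
        k_idx: (S, D_idx)      indexer keys, one per cached token
        w_idx: (H_idx,)        per-head weights for this query token
      Returns a (S,) score per cached token:
        I[s] = sum_j w[j] * relu(q[j] . k[s])
      """
      logits = q_idx @ k_idx.T                 # (H_idx, S) dot products
      return w_idx @ np.maximum(logits, 0.0)  # weighted sum over heads

  def select_top_k(scores, k):
      """Indices of the k highest-scoring tokens; the main attention
      restricts its key/value set to these."""
      k = min(k, scores.shape[0])
      return np.argpartition(scores, -k)[-k:]

  # Toy usage: 4 indexer heads, head dim 8, 32 cached tokens, keep 8.
  rng = np.random.default_rng(0)
  scores = lightning_index_scores(rng.normal(size=(4, 8)),
                                  rng.normal(size=(32, 8)),
                                  rng.normal(size=4))
  selected = select_top_k(scores, 8)
  ```

  The point of the design is that scoring is cheap (a few tiny heads, no softmax), so it can run over the whole context while the expensive main attention only touches the selected top-k tokens.
  
  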