-
Yes, I saw your comment, and I'd like to see that code eventually. I'm curious about SSD in `llama.cpp`.
This reminds me that both of these things (state checkpoints and a separate KV cache) have been implemented for a while in the Jamba PR (#7531), although I need to update it with the latest changes on `master`.

So you're working on a hybrid Mamba-2 model? Interesting!

By "recurrent state caching", do you mean your implementation of SSD always starts from a blank state? (I think it's done that way in HF `transformers`.) I'm assuming you meant state checkpoints as in Jamba; a lot of the paragraphs below assume your implementation can take existing states as input.
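(To illustrate what I mean by state checkpoints: roughly, snapshot the recurrent state every so often, so that rewinding a sequence only needs to recompute from the nearest checkpoint instead of from scratch. A minimal sketch of the idea, not the actual code from #7531:)

```cpp
#include <vector>

// Minimal sketch of state checkpoints (hypothetical names, not the code
// from the Jamba PR): keep snapshots of the recurrent state at past token
// positions so a sequence can be rewound cheaply.
struct state_checkpoint {
    int n_past;               // token position at which the state was captured
    std::vector<float> state; // copy of the recurrent state at that position
};

// Find the most recent checkpoint at or before `target_pos`, or -1 if none.
int nearest_checkpoint(const std::vector<state_checkpoint> &cps, int target_pos) {
    int best = -1;
    for (int i = 0; i < (int) cps.size(); i++) {
        if (cps[i].n_past <= target_pos && (best < 0 || cps[i].n_past > cps[best].n_past)) {
            best = i;
        }
    }
    return best;
}
```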
Actually, I think we're taking different approaches here, which might not conflict much, except that the model graph needs to keep the same structure between runs (this will be important when porting this to GPU, because it avoids having to re-synchronize the graph).

I was initially focusing on optimizing the recurrence for text generation, and I haven't really put much time into implementing SSD yet, so we're alright on this point. I think SSD should be good for long contexts. Your approach should also not conflict with the CUDA implementation of the SSM scan in #9186 (not yet adapted to Mamba-2), although it does add new operators to implement.

I'm actually curious about how semistructured matrices perform at small batch sizes. If it's slower than the recurrence, at how many tokens per batch does it become faster? (If it's always faster, the choice would be so much easier.)

One problem I've noticed with the recurrence in Mamba-2 is that the states are so big that copying them takes a non-negligible amount of time when generating text. On my laptop, with a pure Mamba-2 model, this takes at least 10% of the CPU time (excluding the actual SSM scan). It's not necessarily specific to how the SSM is calculated, but to how the states are prepared at each iteration. I'm currently trying to figure out a way to avoid copies when they're not necessary. I don't know whether this will also be easy to apply to SSD, because that depends on how it's implemented.
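To make the cost concrete, here's a rough sketch of the kind of per-iteration preparation I mean (hypothetical names and layout, not the actual llama.cpp code):

```cpp
#include <cstring>
#include <vector>

// Hypothetical layout, not the actual llama.cpp code. For a Mamba-2 layer
// the SSM state is d_inner x d_state floats per sequence; with e.g.
// d_inner = 4096 and d_state = 128 that is 2 MiB in f32 *per layer*, so
// copying every layer's state at every decode step adds up quickly.
struct ssm_state_cache {
    int d_inner = 4096;
    int d_state = 128;
    std::vector<float> data; // n_seqs * d_inner * d_state
};

// Naive per-iteration preparation: gather the states of the sequences in
// the current batch into a contiguous scratch buffer before the scan (and
// scatter them back afterwards) -- two full copies per layer per step.
void gather_states(const ssm_state_cache &cache, const std::vector<int> &seq_ids,
                   std::vector<float> &scratch) {
    const size_t per_seq = (size_t) cache.d_inner * cache.d_state;
    scratch.resize(seq_ids.size() * per_seq);
    for (size_t i = 0; i < seq_ids.size(); i++) {
        std::memcpy(scratch.data() + i * per_seq,
                    cache.data.data() + (size_t) seq_ids[i] * per_seq,
                    per_seq * sizeof(float));
    }
}

// Copy-free alternative for the common text-generation case (one sequence
// per batch): give the scan a view directly into the cache instead.
float * state_view(ssm_state_cache &cache, int seq_id) {
    return cache.data.data() + (size_t) seq_id * cache.d_inner * cache.d_state;
}
```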
I'd like to see your actual branch; I suggest you push it to a fork of `llama.cpp`.

If there's somehow a way to keep the graph structure constant and always use the fastest implementation for the batch size, that would be great, but it might require other changes in `llama.cpp`.
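I'm imagining something like the following (entirely hypothetical; the `build_*` functions and threshold are placeholders, not real llama.cpp APIs), where the difficulty is precisely that the two branches produce differently-shaped graphs:

```cpp
// Hypothetical dispatch between the two equivalent SSD formulations;
// `ssd_threshold` would come from measuring the crossover point.
void build_recurrent_scan();          // sequential scan, wins at small batches
void build_ssd_block_decomposition(); // matmul-heavy, wins at large batches

constexpr int ssd_threshold = 32;     // tokens per sequence, to be measured

void build_ssm_layer(int n_tokens_per_seq) {
    if (n_tokens_per_seq >= ssd_threshold) {
        build_ssd_block_decomposition();
    } else {
        build_recurrent_scan();
    }
}
```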
-
I will follow up more on this very soon; I'm expecting to have everything ready by the end of the second week of September, or earlier.
That would be an interesting experiment; the work of designing a mechanism that selects which implementation to use seems useful.
I saw your work on Jamba! It helped me understand several things about ggml that were essential in getting the SSD implementation together. Same with your existing Mamba-1 implementation.
Yes! There are several models in the series. Here is a small one: https://huggingface.co/Zyphra/Zamba2-1.2B
I realized I've miscommunicated here; because of my background, I think of the states described by a recurrence relation over sequential forward passes as recurrent state, but I understand that's not the canonical way of thinking about it. I'm referring specifically to what are usually called the KV cache and the ssm/conv cache.
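To be concrete, the recurrence I have in mind is the usual discretized SSM update,

$$
h_t = \bar{A}_t \, h_{t-1} + \bar{B}_t \, x_t, \qquad y_t = C_t \, h_t,
$$

where $h_t$ is the state carried between forward passes: the ssm cache stores this $h_t$, the conv cache stores the sliding window of the short convolution, and the KV cache plays the analogous role for the attention layers.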
Thank you for mentioning this; yes, there are some parts that are a bit problematic as a result. Regarding kernels and GPU support: the current implementation is fairly rough and has very little of either.
I don't have profiling ready yet, but once I do I can probably answer this to some degree.
Yes, I think the first version will unfortunately be very inefficient... In the current implementation, the preparation of the various states is extensive, and likely involves more copying than even naively necessary. I'll have a branch available for review relatively soon; it might be, oh, a week before the entire model is ready, though. For clarification, the SSD implementation is complete, but the model implementation is not yet finished.
-
@compilade I have a working implementation now; I added a generic/flexible caching component to llama.cpp for tensors with a static shape, and there is no prefill, to circumvent the issues with the static cgraph. Otherwise, it seems to work well; there were quite a few small issues, and I need to clean up the implementation quite a lot, but it's more or less possible to review it, though I doubt it's quite ready for contribution.
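For a rough idea of the shape of such a component (a simplified, hypothetical sketch, not the actual code from the branch):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of a generic cache for tensors whose shape is fixed
// at model load time (e.g. Mamba-2 SSM states, conv states). Keyed by name
// so hybrid models can register whatever state tensors they need.
struct static_tensor_cache {
    struct entry {
        std::vector<int64_t> shape; // fixed for the lifetime of the cache
        std::vector<float>   data;  // backing buffer
    };
    std::map<std::string, entry> entries;

    void register_tensor(const std::string &name, std::vector<int64_t> shape) {
        int64_t n = 1;
        for (int64_t d : shape) n *= d;
        entries[name] = entry{std::move(shape), std::vector<float>((size_t) n)};
    }

    float * get(const std::string &name) { return entries.at(name).data.data(); }
};
```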
-
@compilade
I mentioned this on one of your MRs; we have an implementation of Mamba-2 as part of our model. It includes quite a lot of changes in ggml (adding expf, prefix sum, softplus, and ssd_segsum). There are some untested parts (e.g. recurrent state caching), though I expect a somewhat hacky implementation will work. In our model we need a more flexible state cache (for both KV and other recurrent states).
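For context on `ssd_segsum`: in the SSD formulation, the 1-semiseparable decay mask is built from a segment sum over the log-decay terms, which reduces to a difference of prefix sums (presumably why both ops appear together in that list). A reference sketch of the math, not our ggml kernel:

```cpp
#include <limits>
#include <vector>

// Reference (unvectorized) segment sum as used in SSD:
// out[i][j] = x[j+1] + ... + x[i] for i >= j (0 on the diagonal),
// and -inf above the diagonal, so that exp(out) is lower-triangular.
std::vector<std::vector<float>> segsum(const std::vector<float> &x) {
    const size_t T = x.size();
    const float NEG_INF = -std::numeric_limits<float>::infinity();

    // prefix sums: cum[i] = x[0] + ... + x[i-1]
    std::vector<float> cum(T + 1, 0.0f);
    for (size_t i = 0; i < T; i++) cum[i + 1] = cum[i] + x[i];

    std::vector<std::vector<float>> out(T, std::vector<float>(T, NEG_INF));
    for (size_t i = 0; i < T; i++) {
        for (size_t j = 0; j <= i; j++) {
            out[i][j] = cum[i + 1] - cum[j + 1]; // sum over k in (j, i]
        }
    }
    return out;
}
```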
I'm discussing with people here about trying to get it ready for contribution, but I suspect you're already working on this, and our fork is rather different; thoughts?
I'm comfortable with just pasting the specific and relevant dense parts here too, if they're a useful resource for your ongoing work, but if you'd prefer, and it doesn't conflict with your (stellar) effort on SSM/SSD models, we can toss up an MR instead.