-
Yes, I saw your comment, and I'd like to see that code eventually. I'm curious about SSD in `llama.cpp`.
This reminds me that both of these things (state checkpoints and a separate KV cache) have been implemented for a while in the Jamba PR (#7531), although I need to update it with the latest changes on `master`.

So you're working on a hybrid Mamba-2 model? Interesting!

By "recurrent state caching", do you mean your implementation of SSD always starts from a blank state? (I think it's done that way in HF `transformers`.) I'm assuming you meant state checkpoints as in Jamba; a lot of the paragraphs below assume your implementation can take existing states as input.
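(To illustrate what I mean by state checkpoints: roughly, snapshot the recurrent state every so often, so that rewinding a sequence only needs to recompute from the nearest checkpoint instead of from scratch. A minimal sketch of the idea, not the actual code from #7531:)

```cpp
#include <vector>

// Minimal sketch of state checkpoints (hypothetical names, not the code
// from the Jamba PR): keep snapshots of the recurrent state at past token
// positions so a sequence can be rewound cheaply.
struct state_checkpoint {
    int n_past;               // token position at which the state was captured
    std::vector<float> state; // copy of the recurrent state at that position
};

// Find the most recent checkpoint at or before `target_pos`, or -1 if none.
int nearest_checkpoint(const std::vector<state_checkpoint> &cps, int target_pos) {
    int best = -1;
    for (int i = 0; i < (int) cps.size(); i++) {
        if (cps[i].n_past <= target_pos && (best < 0 || cps[i].n_past > cps[best].n_past)) {
            best = i;
        }
    }
    return best;
}
```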
Actually, I think we're taking different approaches here, which might not conflict much, except that the model graph needs to keep the same structure between runs (this will be important when porting this to GPU, because it avoids having to re-synchronize the graph).

I was initially focusing on optimizing the recurrence for text generation, and I haven't really put much time into implementing SSD yet, so we're alright on this point. I think SSD should be good for long contexts. Your approach should also not conflict with the CUDA implementation of the SSM scan in #9186 (not yet adapted to Mamba-2), although it does add new operators to implement.

I'm actually curious about how semistructured matrices perform at small batch sizes. If it's slower than the recurrence, at how many tokens per batch does it become faster? (If it's always faster, the choice would be so much easier.)

One problem I've noticed with the recurrence in Mamba-2 is that the states are so big that copying them takes a non-negligible amount of time when generating text. On my laptop, with a pure Mamba-2 model, this takes at least 10% of the CPU time (excluding the actual SSM scan). It's not necessarily specific to how the SSM is calculated, but to how the states are prepared at each iteration. I'm currently trying to figure out a way to avoid copies when they're not necessary. I don't know whether this will also be easy to apply to SSD, because that depends on how it's implemented.
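To make the cost concrete, here's a rough sketch of the kind of per-iteration preparation I mean (hypothetical names and layout, not the actual llama.cpp code):

```cpp
#include <cstring>
#include <vector>

// Hypothetical layout, not the actual llama.cpp code. For a Mamba-2 layer
// the SSM state is d_inner x d_state floats per sequence; with e.g.
// d_inner = 4096 and d_state = 128 that is 2 MiB in f32 *per layer*, so
// copying every layer's state at every decode step adds up quickly.
struct ssm_state_cache {
    int d_inner = 4096;
    int d_state = 128;
    std::vector<float> data; // n_seqs * d_inner * d_state
};

// Naive per-iteration preparation: gather the states of the sequences in
// the current batch into a contiguous scratch buffer before the scan (and
// scatter them back afterwards) -- two full copies per layer per step.
void gather_states(const ssm_state_cache &cache, const std::vector<int> &seq_ids,
                   std::vector<float> &scratch) {
    const size_t per_seq = (size_t) cache.d_inner * cache.d_state;
    scratch.resize(seq_ids.size() * per_seq);
    for (size_t i = 0; i < seq_ids.size(); i++) {
        std::memcpy(scratch.data() + i * per_seq,
                    cache.data.data() + (size_t) seq_ids[i] * per_seq,
                    per_seq * sizeof(float));
    }
}

// Copy-free alternative for the common text-generation case (one sequence
// per batch): give the scan a view directly into the cache instead.
float * state_view(ssm_state_cache &cache, int seq_id) {
    return cache.data.data() + (size_t) seq_id * cache.d_inner * cache.d_state;
}
```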
I'd like to see your actual branch; I suggest you push it to a fork of `llama.cpp`.

If there's somehow a way to keep the graph structure constant and always use the fastest implementation for the batch size, that would be great, but it might require other changes in `llama.cpp`.
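I'm imagining something like the following (entirely hypothetical; the `build_*` functions and threshold are placeholders, not real llama.cpp APIs), where the difficulty is precisely that the two branches produce differently-shaped graphs:

```cpp
// Hypothetical dispatch between the two equivalent SSD formulations;
// `ssd_threshold` would come from measuring the crossover point.
void build_recurrent_scan();          // sequential scan, wins at small batches
void build_ssd_block_decomposition(); // matmul-heavy, wins at large batches

constexpr int ssd_threshold = 32;     // tokens per sequence, to be measured

void build_ssm_layer(int n_tokens_per_seq) {
    if (n_tokens_per_seq >= ssd_threshold) {
        build_ssd_block_decomposition();
    } else {
        build_recurrent_scan();
    }
}
```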
-
I will follow up more on this very soon; I'm expecting to have everything ready by the end of the second week of September, or earlier.
That would be an interesting experiment; the work of designing a mechanism that selects which implementation to use seems useful.
I saw your work on Jamba! It helped me understand several things about ggml that were essential in getting the SSD implementation together. Same with your existing Mamba-1 implementation.
Yes! There are several models in the series. Here is a small one: https://huggingface.co/Zyphra/Zamba2-1.2B
I realized I've miscommunicated here; because of my background, I think of the states described by a recurrence relation over sequential forward passes as recurrent state, but I understand that's not the canonical way of thinking about it. I'm referring specifically to what are usually called the KV cache and the ssm/conv cache.
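To be concrete, the recurrence I have in mind is the usual discretized SSM update,

$$
h_t = \bar{A}_t \, h_{t-1} + \bar{B}_t \, x_t, \qquad y_t = C_t \, h_t,
$$

where $h_t$ is the state carried between forward passes: the ssm cache stores this $h_t$, the conv cache stores the sliding window of the short convolution, and the KV cache plays the analogous role for the attention layers.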
Thank you for mentioning this; yes, there are some parts that are a bit problematic as a result. Regarding kernels and GPU support: the current implementation is fairly rough and has very little of either.
I don't have profiling ready yet, but once I do I can probably answer this to some degree.
Yes, I think the first version will unfortunately be very inefficient... In the current implementation, the preparation of the various states is extensive, and likely involves more copying than even naively necessary. I'll have a branch available for review relatively soon; it might be, oh, a week before the entire model is ready, though. For clarification, the SSD implementation is complete, but the model implementation is not yet finished.
-
@compilade I have a working implementation now; I added a generic/flexible caching component to llama.cpp for tensors with a static shape, and there is no prefill, to circumvent the issues with the static cgraph. Otherwise, it seems to work well; there were quite a few small issues, and I need to clean up the implementation quite a lot, but it's more or less possible to review it, though I doubt it's quite ready for contribution.
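For a rough idea of the shape of such a component (a simplified, hypothetical sketch, not the actual code from the branch):

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of a generic cache for tensors whose shape is fixed
// at model load time (e.g. Mamba-2 SSM states, conv states). Keyed by name
// so hybrid models can register whatever state tensors they need.
struct static_tensor_cache {
    struct entry {
        std::vector<int64_t> shape; // fixed for the lifetime of the cache
        std::vector<float>   data;  // backing buffer
    };
    std::map<std::string, entry> entries;

    void register_tensor(const std::string &name, std::vector<int64_t> shape) {
        int64_t n = 1;
        for (int64_t d : shape) n *= d;
        entries[name] = entry{std::move(shape), std::vector<float>((size_t) n)};
    }

    float * get(const std::string &name) { return entries.at(name).data.data(); }
};
```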
-
@compilade
I mentioned this on one of your MRs; we have an implementation of Mamba-2 as part of our model. It includes quite a lot of changes in ggml (adding expf, prefix sum, softplus, and ssd_segsum). There are some untested parts (e.g. recurrent state caching), though I expect a somewhat hacky implementation will work. In our model we need a more flexible state cache (for both KV and other recurrent states).
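For context on `ssd_segsum`: in the SSD formulation, the 1-semiseparable decay mask is built from a segment sum over the log-decay terms, which reduces to a difference of prefix sums (presumably why both ops appear together in that list). A reference sketch of the math, not our ggml kernel:

```cpp
#include <limits>
#include <vector>

// Reference (unvectorized) segment sum as used in SSD:
// out[i][j] = x[j+1] + ... + x[i] for i >= j (0 on the diagonal),
// and -inf above the diagonal, so that exp(out) is lower-triangular.
std::vector<std::vector<float>> segsum(const std::vector<float> &x) {
    const size_t T = x.size();
    const float NEG_INF = -std::numeric_limits<float>::infinity();

    // prefix sums: cum[i] = x[0] + ... + x[i-1]
    std::vector<float> cum(T + 1, 0.0f);
    for (size_t i = 0; i < T; i++) cum[i + 1] = cum[i] + x[i];

    std::vector<std::vector<float>> out(T, std::vector<float>(T, NEG_INF));
    for (size_t i = 0; i < T; i++) {
        for (size_t j = 0; j <= i; j++) {
            out[i][j] = cum[i + 1] - cum[j + 1]; // sum over k in (j, i]
        }
    }
    return out;
}
```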
I'm discussing with people here about trying to get it ready for contribution, but I suspect you're already working on this, and our fork is rather different; thoughts?
I'm comfortable with just pasting the specific and relevant dense parts here too, if they're a useful resource for your ongoing work, but if you'd prefer, and it doesn't conflict with your (stellar) effort on SSM/SSD models, we can toss up an MR instead.