Support for KV caching and batched inference #1934
Conversation
Hey, great work @mseeger . Can we decouple things a lot, though? Some initial thoughts:
Again, super good stuff in the PR! I think there are a few things to split out and consider individually, and then maybe we can have a video call about the core KVCache things, wdyt? Thanks for taking the initiative for better KV caching!
Hello, sure, we can have a call. I am in the Central European (Germany) time zone.
My impression was that batched generation is not really there. But if it is, I don't ask to change it. One thing is important, though: KV caches really work by filling positions sequentially. So, you filled positions
Also, the implementation right now allows you to send in KV cache objects from the start. If you do not do that, it will create them by default. Note that prefill here means that I can do a single pass and the cache can take it all, without having to evict anything. It does not mean that this will encode even the shortest prompt in the batch: if prompts are longer than the max prefill length, you need to do it sequentially in chunks. Maybe there is an easier way; we can discuss.
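For illustration, a minimal sketch of sequential chunked prefill. The helper `prefill_in_chunks` and the `model(chunk, input_pos=...)` call signature are assumptions for the sake of the example, not this PR's actual API:

```python
import torch

def prefill_in_chunks(model, prompt: torch.Tensor, max_prefill_length: int) -> torch.Tensor:
    # prompt: (batch, prompt_len) token ids. Positions must be filled strictly
    # left to right, so the prompt is fed in consecutive slices that never exceed
    # what the cache can absorb in a single pass.
    T = prompt.size(1)
    logits = None
    for start in range(0, T, max_prefill_length):
        chunk = prompt[:, start : start + max_prefill_length]
        # Hypothetical call: each pass appends the chunk's keys/values to the caches.
        logits = model(chunk, input_pos=start)
    # The logits at the last position of the final chunk predict the first new token.
    return logits
```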
It is annoying that I cannot show you the KV cache code I have. But in a talk, I could explain why a few things are the way they are. Of course, I am not on top of other constraints you guys have.
You may ask why this is needed. We could arrange things so that the very first call to the model, with the full prompt, does the prefill; after that, you'd call the model step by step for the new tokens.
This I could do. That would indeed be a little simpler.
@t-vi Let me know what the next steps here should be. If I understand correctly, I could:
Hi, so I think we should try to break things down. We could either start with the core caching itself and see how to integrate it with minimal changes, or first see what the deal is with batching and prefill.
Hello @t-vi , let me try to break things down. Changes are these:
If I understand you correctly, you complain about 2., especially the automatic creation of the default cache when nothing is passed in, and the change of
Would that be what you prefer?
As for 1. and 3., in the end they go together, but I can try to split them into two. I'd first do 1., keeping the generation code in place, which would however not work for batches and would not properly support the sequential processing of prompts. Doing 3. first is not really sensible, because it requires things from 1. What do you think?
Note that with DeepSeek (I am involved in trying to bring this to Hugging Face), there is a lot of movement now not to ignore KV caching in the future. They even released a paper on how they can train with large contexts.
OK, I did 2), as far as I understand it. I'd work on 1) once I find time.
No idea why all these tests are failing. Tests work for me locally. |
@t-vi Maybe I can change your mind about first keeping the current generation code in place and only contributing the KV cache support? This is quite a bit of extra work for me, and my new code has a number of improvements. In particular, the current code does not really do batch generation; it is marked with several TODOs and is not used. If we could have a chat, I'd appreciate that.
Your CI system seems to be broken still. |
Out of curiosity: why do you object to batch prompts being a list of tensors? In the end, they can have wildly different lengths, and there is not much you can do about that (sure, if you get lots of requests, you can maybe cluster them, but doing this too much delays requests, and so increases latency). Also, you really don't want to push PAD tokens into models just because a prompt in a batch happens to be shorter than others. The model, not being trained on this, would certainly get confused. And since you need to start token-by-token forwards for generation anyway, you really gain nothing by padding prompts. I always thought of this as some kind of TensorFlow artefact from when all tensors had to be allocated up front, etc. But I thought we had overcome this with PyTorch.
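To make the padding concern concrete, a toy example with made-up token ids (pad id 0):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three prompts of very different lengths (3, 1, and 6 tokens).
prompts = [
    torch.tensor([11, 12, 13]),
    torch.tensor([21]),
    torch.tensor([31, 32, 33, 34, 35, 36]),
]
padded = pad_sequence(prompts, batch_first=True, padding_value=0)
print(padded.shape)  # torch.Size([3, 6]): 8 of the 18 positions are PAD tokens
```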
Hey, sorry, I am totally swamped; I still want to have a video call to chat.
Because lists are a lot less nice to work with in various setups, passing to kernels, CUDA graphs, etc. For somewhat homogeneous sequence lengths, padding works fine. We are using it in production, so I'm doubting claims that it does not work. It does have limitations with inhomogeneous sequence lengths, which we want to support. But the proper way to support this is packed sequences, i.e. pass in flat tensors. This is hugely more flexible. It needs FlexAttention or some such (https://pytorch.org/docs/stable/nn.attention.flex_attention.html) to make it work efficiently in stock PyTorch.
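For illustration, a minimal sketch of the packed-sequence idea with FlexAttention (`flex_attention` and `create_block_mask` from the linked PyTorch docs, assuming torch >= 2.6). The sequence lengths, shapes, and the document-mask construction below are made up for the example, not code from this PR:

```python
import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

# Three prompts of different lengths, concatenated ("packed") into one flat
# sequence instead of being padded to a common length.
seq_lens = torch.tensor([60, 20, 48])
doc_ids = torch.repeat_interleave(torch.arange(len(seq_lens)), seq_lens)  # (128,)
T = int(seq_lens.sum())

def document_causal(b, h, q_idx, kv_idx):
    # Tokens may only attend causally and within their own document.
    return (doc_ids[q_idx] == doc_ids[kv_idx]) & (q_idx >= kv_idx)

block_mask = create_block_mask(document_causal, B=None, H=None,
                               Q_LEN=T, KV_LEN=T, device="cpu")

# Toy projections: batch 1, 4 heads, head dim 16, all prompts in one flat sequence.
q = torch.randn(1, 4, T, 16)
k = torch.randn(1, 4, T, 16)
v = torch.randn(1, 4, T, 16)
out = flex_attention(q, k, v, block_mask=block_mask)  # (1, 4, T, 16)
```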
Let me know when a good time is. I am in the European time zone.
After our call, I think I understand better what you mean. Something like an abstraction in multi-head attention, where the inputs are the keys, values, and queries for the current input chunk, all of the same size, which are then bundled:
This makes a lot of sense, and is quite elegant.
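Roughly, the bundling could look like the following minimal sketch. The names (`DenseKVCache`, `attend_with_cache`, `update`) are hypothetical illustrations, not the PR's actual interface:

```python
import torch
import torch.nn.functional as F

class DenseKVCache:
    """Baseline cache: keeps every past key/value, never evicts."""

    def __init__(self):
        self.k = None  # (batch, heads, cached_len, head_dim)
        self.v = None

    def update(self, k_new: torch.Tensor, v_new: torch.Tensor):
        # k_new, v_new: (batch, heads, chunk_len, head_dim) for the current chunk.
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

def attend_with_cache(q, k, v, cache: DenseKVCache):
    # q, k, v: (batch, heads, chunk_len, head_dim) for the current chunk only;
    # the cache decides what is stored and what the chunk attends over.
    k_all, v_all = cache.update(k, v)
    # Causal mask omitted for brevity; for multi-token chunks you would pass a
    # mask offset by the cached length.
    return F.scaled_dot_product_attention(q, k_all, v_all)
```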
@mseeger should be fixed now, thank you for your patience :)
As discussed with @t-vi , I'll refactor this as stated in the comment above. Makes total sense |
OK, I've taken out the batched inference code. Still working on fixing the tests (and need to refactor speculative decoding), but this is essentially it. |
@t-vi, it would be great to get some feedback on this one before I spend time on fixing tests for code which I need to change afterwards anyway.
BTW: this could be a real differentiating feature of the library. If you know another open source library that focuses on KV caching, and which you'd like to integrate with instead, please let me know.
@t-vi, @Borda: Any chance there will be some progress here? I recognize this is a big PR. On the other hand, decent support for selective (sparse) KV caching could be a real differentiator. I made quite some progress, also on fine-tuning with long contexts. I am trying to get approval to open source this. In my team, we are starting to use it.
One thing still missing is good support for batch inference without excessive padding.
@t-vi , @Borda: Any sign of life on this PR? A few things:
If things go well, I'd love to publish this at a conference with a deadline in the fall, and it would help if everything was open sourced and in the library by then. I don't know if your customers struggle with fine-tuning on long context widths. We certainly do. The library I am writing is a solution to that.
@mseeger, so is this ready for review?
@Borda I can work on making the tests pass. Just after a call with @t-vi, he asked for a different abstraction for KV caches, and I changed it accordingly, so I was hoping for more comments on whether what I am doing here is the right thing. The PR also grew quite large over time, since I added small changes I need to make gradient computations work. I'll go over it and check how it can be split into several ones.
For me, it'd be most important if you could comment on whether the KV cache abstraction is good, including factoring out the multi-head attention code.
OK, this PR contains the following parts:
In the library I am writing, there are a number of additional, more powerful KV caches, such as H2O and quantization-aware H2O. I am also working on fine-tuning in the presence of KV caches. The abstraction I propose here enables all of that. If these changes are not made, I'd have to copy and change quite a bit of your code. This would be hard to maintain, and would run the risk that KV caches are implemented differently at a later point, and then things really diverge. As I said in the comments above, I found KV caching to be super important to make large-context inference work on a moderate GPU budget, which should be of interest to your customers as well.
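For illustration, a hedged sketch of the kind of extra hook such caches need (hypothetical names, not this PR's API): an H2O-style cache must be told the attention weights of each step so it can track which cached positions are "heavy hitters" and later evict the least-attended ones.

```python
import torch

class KVCacheWithScores:
    """Sketch: maintain per-slot statistics (accumulated attention mass)
    that an eviction policy such as H2O can consult when the cache is full."""

    def __init__(self):
        self.scores = None  # (batch, heads, cached_len)

    def update_scores(self, attn_weights: torch.Tensor):
        # attn_weights: (batch, heads, q_len, cached_len), the softmax output
        # of the current step over all cached positions.
        step_mass = attn_weights.sum(dim=2)
        self.scores = step_mass if self.scores is None else self.scores + step_mass
```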
I could now work to fix the broken tests. But I would also like to get some input, related to the comment "Can we decouple things a lot, though?" at the very top here. I could break this into two, first the "small things" and "refactoring MHA", then the "KV cache abstraction". This is more work for me, but I'd be OK doing it if it means getting it merged. While "refactoring MHA" has its own benefits, it is of course motivated by the latter goal.
I am continuing this work in #2061 |
Adds an abstraction for key-value caches and implements batched inference.
I am also adding two baseline KV caches, the default one from before (all KV are stored) and a last-recent one.
The abstraction contains methods not used by these baselines, but they are required to implement more advanced KV caches such as Heavy Hitter Oracle (H2O).
I have implemented some of these, but I may not be allowed to contribute them here (working for a company). I'll see what I can do.
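For illustration only, a rough sketch of what a last-recent (sliding-window) cache could look like under this kind of abstraction; the class name and `update` method are assumptions, not this PR's code:

```python
import torch

class LastRecentKVCache:
    """Fixed-budget cache that keeps only the most recent cache_length keys/values,
    overwriting the oldest slots ring-buffer style."""

    def __init__(self, batch, heads, cache_length, head_dim, dtype=torch.float32):
        self.cache_length = cache_length
        self.k = torch.zeros(batch, heads, cache_length, head_dim, dtype=dtype)
        self.v = torch.zeros(batch, heads, cache_length, head_dim, dtype=dtype)
        self.next_pos = 0  # next slot to write (modulo cache_length)
        self.filled = 0    # how many slots hold valid entries

    def update(self, k_new, v_new):
        # k_new, v_new: (batch, heads, num_new, head_dim); positions arrive in order.
        num_new = k_new.size(2)
        for i in range(num_new):
            slot = (self.next_pos + i) % self.cache_length
            self.k[:, :, slot] = k_new[:, :, i]
            self.v[:, :, slot] = v_new[:, :, i]
        self.next_pos = (self.next_pos + num_new) % self.cache_length
        self.filled = min(self.filled + num_new, self.cache_length)
        # Return only the valid slots. Assuming positional information (e.g. RoPE)
        # is already applied to the keys, attention is permutation-invariant over
        # the returned slots, so their order in the buffer does not matter.
        return self.k[:, :, : self.filled], self.v[:, :, : self.filled]
```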