lmdeploy: support kernel block size #4421
Conversation
@grimoire Please evaluate the necessity of introducing the parameter.
Another motivation for this pull request is that I hope to enable shared GPU memory for the linear attention cache and the full attention's key-value cache in qwen3-next. If the recurrent_cache can be stored within a single block, then linear attention cache management can be performed with the help of the block manager. However, I encountered an issue: the current block size is limited to a maximum of 64, so the recurrent_cache requires 8 consecutive blocks for storage, which would make block management very complex. After this PR, I can set the block size to a large value so a single recurrent_cache fits in one block, while the kernel still uses a block size of 64.
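For illustration, a small sketch of the arithmetic involved (the sizes here are hypothetical, chosen only to match the 8-blocks-of-64 ratio mentioned above): with a large management block size, one recurrent_cache fits in a single managed block, and each managed block id expands into 8 consecutive kernel-block ids.

```python
import numpy as np

# Hypothetical sizes: the block manager uses large blocks so one
# recurrent_cache fits in a single block, while attention kernels
# keep the 64-token block limit.
cache_block_size = 512
kernel_block_size = 64
kernel_blocks_per_kv = cache_block_size // kernel_block_size  # 8

# Each managed block id expands into 8 consecutive kernel block ids.
block_table = np.array([3, 7])  # block ids handed out by the block manager
kernel_table = (block_table[:, None] * kernel_blocks_per_kv
                + np.arange(kernel_blocks_per_kv)).reshape(-1)
print(kernel_table)
# [24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63]
```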
head_size = model_config.head_dim
shape = cls._get_key_block_shape_impl(
    model_config,
    block_size=cache_config.block_size,
In init:
    self.block_size = cache_config.block_size

In allocate_custom_cache:
    custom_shape = self.get_custom_cache_shape_impl(
        num_layers=num_layers,
        num_blocks=self.num_gpu_blocks,
        block_size=self.block_size,
        shape=shape,
    )
I'm not quite sure what this means. Could you please explain it more clearly?
lmdeploy/pytorch/paging/scheduler.py (Outdated)

    block_offsets = []
    for seq in seqs:
        block_offset = self.block_manager.get_block_table(seq)
        block_offset = block_offset.repeat(self.kernel_blocks_per_kv) * self.kernel_blocks_per_kv + np.tile(
repeat and tile each allocate new memory for their output. Try to do this with broadcasting instead.
And we can do this in inputs_maker after tensorlizing. Computing the new block table in one big ndarray/tensor should be better than looping.
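A sketch of the broadcast-based alternative the review suggests (the names kernel_blocks_per_kv and block_offset follow the diff above; the values are made up for illustration):

```python
import numpy as np

kernel_blocks_per_kv = 4
block_offset = np.array([2, 5, 9])

# repeat/tile version from the diff: materializes two full-size temporaries
with_tile = (block_offset.repeat(kernel_blocks_per_kv) * kernel_blocks_per_kv
             + np.tile(np.arange(kernel_blocks_per_kv), len(block_offset)))

# broadcast version: the per-block offsets are added via broadcasting,
# so only the final output array is materialized
with_broadcast = (block_offset[:, None] * kernel_blocks_per_kv
                  + np.arange(kernel_blocks_per_kv)).reshape(-1)

assert (with_tile == with_broadcast).all()
print(with_broadcast)
# [ 8  9 10 11 20 21 22 23 36 37 38 39]
```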
Cool, this is also on our development roadmap, and I'm glad to see community developers getting involved. However, since this feature touches many core modules, we're cautious about adding it to the engine right after integrating Qwen3-next and Qwen3.5; each large feature like this could become future technical debt. If you submit changes in this area, we can't guarantee a timely review or merge.
Force-pushed from 9ea365d to a8657d3
-    off_len = len(off)
-    out[idx, :off_len] = off
+    off_len = len(off) * kernel_blocks_per_kv
+    out[idx, :off_len] = (off[:, None] * kernel_blocks_per_kv + kernel_block_arange).reshape(-1)
_tensorlize_block_offsets consumes significant CPU resources; avoid expanding the blocks unless necessary. Additionally, we should be able to perform the expansion directly on out after the loop. Repeatedly evaluating eval frames inside the loop introduces substantial overhead from the extra Python function calls.
Also, I'd prefer adding a separate function after _tensorlize_block_offsets to handle this, rather than modifying _tensorlize_block_offsets itself.
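A hedged sketch of what such a separate, post-loop function could look like (the name _expand_block_offsets and the assumption about padding are mine, not from the PR): once _tensorlize_block_offsets has produced the padded [batch, max_blocks] table, the expansion can run once over the whole array.

```python
import numpy as np

def _expand_block_offsets(out: np.ndarray, kernel_blocks_per_kv: int) -> np.ndarray:
    """Expand management-block ids into kernel-block ids in a single
    vectorized step over the whole padded table.

    Assumes padded entries are ignored downstream via the per-sequence
    lengths, so expanding them is harmless.
    """
    batch = out.shape[0]
    # [batch, max_blocks, 1] * k + [k] -> [batch, max_blocks, k]
    expanded = out[:, :, None] * kernel_blocks_per_kv + np.arange(kernel_blocks_per_kv)
    return expanded.reshape(batch, -1)

table = np.array([[1, 2], [3, 0]])
print(_expand_block_offsets(table, 2))
# [[2 3 4 5]
#  [6 7 0 1]]
```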
There is another "get block table" behaviour in create_model_inputs_long_context that does not use _tensorlize_block_offsets.
Could you review this again? @grimoire

Hi @lvhan028 , this PR was approved a while ago but hasn't been merged yet. Is there anything else needed on my end, or is it ready to go?
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and more likely to receive feedback. If you do not understand some items, don't worry: just make the pull request and seek help from maintainers.
Motivation
Currently, using a large block size (e.g., >=128) for the KV cache triggers a Triton kernel shared memory overflow. However, users may require the flexibility to set an arbitrary block size to control the granularity of the Prefix Cache and Block Manager.
This PR decouples the block size used for cache management from the block size used for kernel execution. It introduces two configurable parameters: --cache-block-seq-len for the cache-management block size and --kernel-block-size for the kernel block size.
Modification
Before PR:
lmdeploy serve api_server ../Qwen3-Next-80B-A3B-Thinking/ --backend pytorch --max-batch-size 16 --tp 4 --cache-block-seq-len 128
Traceback
After PR:
lmdeploy serve api_server ../Qwen3-Next-80B-A3B-Thinking/ --backend pytorch --max-batch-size 16 --tp 4 --cache-block-seq-len 128 --kernel-block-size 32
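The two sizes are only consistent if the management block size divides evenly into kernel blocks. A minimal illustrative check (this helper is mine, not the PR's actual validation code):

```python
def kernel_blocks_per_cache_block(cache_block_seq_len: int, kernel_block_size: int) -> int:
    """Illustrative check: the cache-management block size must be a
    multiple of the kernel block size so each managed block maps to a
    whole number of kernel blocks."""
    if cache_block_seq_len % kernel_block_size != 0:
        raise ValueError("cache-block-seq-len must be a multiple of kernel-block-size")
    return cache_block_seq_len // kernel_block_size

# Matches the command above: 128-token managed blocks, 32-token kernel blocks.
print(kernel_blocks_per_cache_block(128, 32))  # -> 4
```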