- 
                Notifications
    You must be signed in to change notification settings 
- Fork 530
【main】kvpool sync load #3653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
【main】kvpool sync load #3653
Conversation
Signed-off-by: fems14 <[email protected]>
| 👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge: 
 If CI fails, you can run linting and testing checks locally according Contributing and Testing. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a synchronous loading path for the KV cache in the Mooncake engine, controlled by a new load_async configuration flag. The changes are logical and correctly implement the synchronous behavior. However, there are two significant instances of code duplication introduced. The synchronous loading logic is copied from the asynchronous thread handler, and the prepare_value helper method is duplicated from another class. My review includes suggestions to refactor this duplicated code to improve maintainability.
| else: | ||
| if self.m_store.config.use_ascend_direct: | ||
| addr_list = [] | ||
| size_list = [] | ||
| key_list = [] | ||
| blockIds = [] | ||
| for start, end, key in self.token_database.process_tokens( | ||
| tokens, token_mask): | ||
| addr, size, block_id = self.prepare_value( | ||
| start, end, request.block_ids) | ||
| key_list.append(key.to_string()) | ||
| addr_list.append(addr) | ||
| size_list.append(size) | ||
| blockIds.append(block_id) | ||
| self.m_store.get_batch(key_list, addr_list, size_list, blockIds) | ||
| else: | ||
| for start, end, key in self.token_database.process_tokens( | ||
| tokens, token_mask): | ||
| addr, size, _ = self.prepare_value(start, end, request.block_ids) | ||
| self.m_store.get(key, addr, size) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new block for synchronous loading is almost identical to the logic in KVCacheStoreRecvingThread._handle_request. This code duplication can lead to maintenance issues where a bug fix or change in one place might be missed in the other.
To improve maintainability, I recommend extracting this logic into a new private helper method within the MooncakeEngine class, for example, _load_kv_sync. The synchronous path can then call this method.
Ideally, KVCacheStoreRecvingThread should also be refactored to use this new helper method to completely eliminate the duplication. This would likely involve passing the MooncakeEngine instance to the thread's constructor.
| def prepare_value(self, start: int, end: int, block_ids: list[int]): | ||
| addr_list = [] | ||
| size_list = [] | ||
| block_id = block_ids[start // self.block_size] | ||
| for index, base_addr in enumerate(self.kv_caches_base_addr): | ||
| block_len = (self.block_len[index % 2] | ||
| if self.use_mla else self.block_len[0]) | ||
|  | ||
| addr = base_addr + block_id * block_len | ||
| length = int(block_len / self.block_size * (end - start)) | ||
| addr_list.append(addr) | ||
| size_list.append(length) | ||
| return addr_list, size_list, block_id | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prepare_value method is a direct copy of the method with the same name in KVTransferThread from kv_transfer.py. Duplicating code like this makes the codebase harder to maintain.
It would be better to refactor this into a common utility function that can be called from both MooncakeEngine and KVTransferThread. This function would take necessary context (like block_size, kv_caches_base_addr, block_len, and use_mla) as arguments, ensuring the logic is defined in a single place.
| please update the README as well https://github.com/vllm-project/vllm-ascend/tree/main/examples/disaggregated_prefill_v1 | 
Signed-off-by: fems14 <[email protected]>
Signed-off-by: fems14 <[email protected]>
What this PR does / why we need it?
Does this PR introduce any user-facing change?
How was this patch tested?