[Perf] Delete redundant operations in model_runner and forward_context #3677
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the codebase by moving some checks into utility functions and removing redundant operations. The changes in `model_runner_v1.py` correctly fix a bug where stale data was being used and remove redundant code, improving correctness and clarity. However, the new utility function `has_layer_idx` in `utils.py` introduces a critical bug due to a flawed caching mechanism. I've provided a comment with a suggested fix for this issue.
```python
def has_layer_idx(model_instance: torch.nn.Module) -> bool:
    global _HAS_LAYER_IDX
    if _HAS_LAYER_IDX is None:
        _HAS_LAYER_IDX = model_instance is not None and \
            hasattr(model_instance, "model") and \
            hasattr(model_instance.model, "start_layer")
    return _HAS_LAYER_IDX
```
The current implementation of `has_layer_idx` uses a global variable `_HAS_LAYER_IDX` to cache its result. This caching mechanism is flawed because the function's outcome depends on the `model_instance` argument, which is not guaranteed to be the same across all calls.
For instance, `set_ascend_forward_context` can be invoked with `model_instance=None` (e.g., from `kv_connector_no_forward`), which would cause `_HAS_LAYER_IDX` to be cached as `False`. Any subsequent calls, even with a valid `model_instance`, would then incorrectly return `False`, preventing features that rely on `layer_idx` from being enabled.
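To make the failure mode concrete, here is a minimal, hypothetical reproduction; the `SimpleNamespace` objects are stand-ins for real model instances, and the two calls only mimic the code paths mentioned above:

```python
from types import SimpleNamespace

_HAS_LAYER_IDX = None  # module-level cache, as in the current patch

def has_layer_idx(model_instance) -> bool:
    global _HAS_LAYER_IDX
    if _HAS_LAYER_IDX is None:
        _HAS_LAYER_IDX = model_instance is not None and \
            hasattr(model_instance, "model") and \
            hasattr(model_instance.model, "start_layer")
    return _HAS_LAYER_IDX

# First call mimics the kv_connector_no_forward path, which passes None.
assert has_layer_idx(None) is False            # False gets cached

# A later call with a model that does expose model.start_layer is answered
# from the stale cache instead of being re-evaluated.
fake_model = SimpleNamespace(model=SimpleNamespace(start_layer=0))
assert has_layer_idx(fake_model) is False      # should be True
```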
Since this check is inexpensive, I recommend removing the caching mechanism to fix this bug. The global variable `_HAS_LAYER_IDX` should also be removed.
```python
def has_layer_idx(model_instance: torch.nn.Module) -> bool:
    return (model_instance is not None and hasattr(model_instance, "model") and
            hasattr(model_instance.model, "start_layer"))
```
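For comparison, the same call sequence behaves as expected once the cache is removed (again a sketch with stand-in objects, with the type hint dropped so the snippet runs without torch):

```python
from types import SimpleNamespace

def has_layer_idx(model_instance) -> bool:
    return (model_instance is not None and hasattr(model_instance, "model")
            and hasattr(model_instance.model, "start_layer"))

fake_model = SimpleNamespace(model=SimpleNamespace(start_layer=0))
assert has_layer_idx(None) is False        # nothing is cached
assert has_layer_idx(fake_model) is True   # a valid model is now detected
```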
Force-pushed from 6547eb7 to 7dc6a33.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from ce910ca to 35b0878.
Force-pushed from 35b0878 to 804c357.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: realliujiaxu <[email protected]>
Signed-off-by: realliujiaxu <[email protected]>
Force-pushed from 804c357 to 453e3ff.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
Remove redundant operations from `model_runner` and `forward_context`. This optimization can significantly reduce the idle time (bubble) before decoding when running models with small parameter counts (e.g., Qwen/Qwen2.5-0.5B). Testing on 800I A2, the bubble is reduced from 3.8 ms to 2.8 ms:

Before
After

Does this PR introduce any user-facing change?
No
How was this patch tested?