[bugfix] Prevent overwriting drafters' lm_head and embed_tokens #4134
Conversation
Signed-off-by: 01267596 <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to prevent overwriting the lm_head and embed_tokens of a drafter model if they are already initialized, which is particularly important for some EAGLE3 drafters. The changes correctly add checks for has_own_embed_tokens and has_own_lm_head flags before sharing these layers from the target model. However, I've identified a critical regression in the logic for handling lm_head for SpecDcodeType.EAGLE. The check that ensures the target model has an lm_head attribute before it's accessed has been removed, which will likely cause an AttributeError at runtime. My review includes a specific code suggestion to fix this issue.
The diff hunk under review:

    if hasattr(model, "lm_head"):
        logger.info("Loading EAGLE LM head weights from the target model.")
        if supports_multimodal(model):
            self.model.lm_head = model.get_language_model().lm_head
        else:
            self.model.lm_head = model.lm_head
The logic for sharing lm_head for SpecDcodeType.EAGLE has been changed in a way that introduces a potential AttributeError. The assignments to self.model.lm_head are no longer guarded by hasattr(model, "lm_head"). If the target model does not have an lm_head attribute, this will cause a crash. The original logic was safer and should be restored.
Suggested change (keep the hasattr guard around the assignments):

    if hasattr(model, "lm_head"):
        logger.info("Loading EAGLE LM head weights from the target model.")
        if supports_multimodal(model):
            self.model.lm_head = model.get_language_model().lm_head
        else:
            self.model.lm_head = model.lm_head
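For illustration only, here is a minimal self-contained sketch of the intended behavior, assuming the PR adds a has_own_lm_head flag to the drafter; the class names and the share_lm_head helper below are hypothetical and not the actual vLLM Ascend code:

    import torch.nn as nn

    class TargetModel(nn.Module):
        # Hypothetical stand-in for the target model; it may or may not expose lm_head.
        def __init__(self):
            super().__init__()
            self.lm_head = nn.Linear(16, 32, bias=False)

    class Drafter(nn.Module):
        # Hypothetical stand-in for an EAGLE3 drafter that may ship its own lm_head.
        def __init__(self, has_own_lm_head: bool):
            super().__init__()
            self.has_own_lm_head = has_own_lm_head
            self.lm_head = nn.Linear(16, 32, bias=False) if has_own_lm_head else None

    def share_lm_head(drafter: Drafter, target: nn.Module) -> None:
        # Borrow the target's lm_head only if the drafter did not load its own,
        # and only if the target actually exposes one (avoids AttributeError).
        if not drafter.has_own_lm_head and hasattr(target, "lm_head"):
            drafter.lm_head = target.lm_head

    drafter = Drafter(has_own_lm_head=True)
    share_lm_head(drafter, TargetModel())  # drafter keeps its own lm_head

The same guarded pattern would apply to embed_tokens via the has_own_embed_tokens flag mentioned in the review above.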
What this PR does / why we need it?
Some EAGLE3 drafters ship their own lm_head and/or embed_tokens layers, but the existing codebase ignores this and overwrites them with the target model's layers. Keeping the drafter's own layers can greatly improve acceptance rates. Refer to the corresponding PR in vLLM: vllm-project/vllm#27737
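As a hedged illustration of the idea (the helper below is hypothetical, not the code added by this PR), the ownership flags can be derived from the parameter names the drafter checkpoint actually provides, so the target model's layers are only shared in as a fallback:

    def detect_own_layers(loaded_weight_names):
        # loaded_weight_names: parameter names found in the drafter checkpoint.
        has_own_lm_head = any(name.startswith("lm_head.") for name in loaded_weight_names)
        has_own_embed_tokens = any("embed_tokens." in name for name in loaded_weight_names)
        return has_own_lm_head, has_own_embed_tokens

    # Example: an EAGLE3 drafter that ships its own lm_head but not its own embeddings.
    print(detect_own_layers(["lm_head.weight", "layers.0.self_attn.q_proj.weight"]))
    # -> (True, False)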
Does this PR introduce any user-facing change?
How was this patch tested?
export CUDA_VISIBLE_DEVICES=0
export TP=1
export MODEL_PATH=/model/Llama-3.1-8B-Instruct
export MODEL_NAME=Llama-3.1-8B-Instruct
export PORT=10133
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.85 --max-model-len 32768 --trust-remote-code --seed 42 --speculative_config '{"method":"eagle3","model":"/model/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":5,"draft_tensor_parallel_size":1}'
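A quick way to confirm the server responds with the drafter enabled is a request to the standard OpenAI-compatible completions endpoint. The snippet below is a sketch using only the Python standard library; the port and served model name are taken from the commands above and are not part of the original test log:

    import json
    import urllib.request

    # Smoke test against the server started above (port and served model name
    # come from the export commands in this section).
    payload = {
        "model": "Llama-3.1-8B-Instruct",
        "prompt": "Hello, my name is",
        "max_tokens": 32,
    }
    req = urllib.request.Request(
        "http://localhost:10133/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["text"])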