Conversation

@HF-001 (Contributor) commented Nov 12, 2025

What this PR does / why we need it?

Some EAGLE3 drafters ship their own lm_head and/or embed_tokens layers, but the existing codebase ignores this and always overwrites them with the target model's layers. Preserving the drafter's own layers can greatly improve acceptance rates. See the corresponding PR in vLLM: vllm-project/vllm#27737
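
A rough sketch of the underlying idea (illustrative only; the helper name and the weight-name matching below are assumptions, not code from this patch): when loading the drafter checkpoint, record whether its weights already include embed_tokens / lm_head entries, and only fall back to the target model's layers when they do not.

# Illustrative sketch, not the code in this PR. Assumes the drafter
# checkpoint's weight names are available as an iterable of strings;
# exact name prefixes vary between checkpoints.
def detect_own_layers(drafter_weight_names):
    has_own_embed_tokens = any("embed_tokens." in name for name in drafter_weight_names)
    has_own_lm_head = any("lm_head." in name for name in drafter_weight_names)
    return has_own_embed_tokens, has_own_lm_head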

Does this PR introduce any user-facing change?

How was this patch tested?

export CUDA_VISIBLE_DEVICES=0
export TP=1
export MODEL_PATH=/model/Llama-3.1-8B-Instruct
export MODEL_NAME=Llama-3.1-8B-Instruct
export PORT=10133
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port ${PORT} --dtype bfloat16 --model ${MODEL_PATH} --served-model-name ${MODEL_NAME} --tensor-parallel-size ${TP} --gpu-memory-utilization 0.85 --max-model-len 32768 --trust-remote-code --seed 42 --speculative_config '{"method":"eagle3","model":"/model/EAGLE3-LLaMA3.1-Instruct-8B","num_speculative_tokens":5,"draft_tensor_parallel_size":1}'
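
Once the server is up, a quick smoke test against the endpoint (port and served model name taken from the command above; the prompt is arbitrary, and acceptance statistics should then show up in the server-side speculative-decoding metrics):

# Minimal request against the OpenAI-compatible server started above.
# Assumes PORT=10133 and served model name "Llama-3.1-8B-Instruct" as in the command.
import requests

resp = requests.post(
    "http://localhost:10133/v1/completions",
    json={
        "model": "Llama-3.1-8B-Instruct",
        "prompt": "Explain speculative decoding in one sentence.",
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])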

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot left a comment


Code Review

This pull request aims to prevent overwriting the lm_head and embed_tokens of a drafter model if they are already initialized, which is particularly important for some EAGLE3 drafters. The changes correctly add checks for has_own_embed_tokens and has_own_lm_head flags before sharing these layers from the target model. However, I've identified a critical regression in the logic for handling lm_head for SpecDcodeType.EAGLE. The check that ensures the target model has an lm_head attribute before it's accessed has been removed, which will likely cause an AttributeError at runtime. My review includes a specific code suggestion to fix this issue.
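
Concretely, keeping the PR's new flag while restoring the original attribute check would look roughly like this (a sketch of the intent, not the literal diff; has_own_lm_head is the flag referenced above):

# Sketch of the intended guard for sharing the target model's lm_head;
# not the literal diff from this PR.
if not has_own_lm_head and hasattr(model, "lm_head"):
    logger.info("Loading EAGLE LM head weights from the target model.")
    if supports_multimodal(model):
        self.model.lm_head = model.get_language_model().lm_head
    else:
        self.model.lm_head = model.lm_head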

Comment on lines 145 to 150
if hasattr(model, "lm_head"):
    logger.info("Loading EAGLE LM head weights from the target model.")
if supports_multimodal(model):
    self.model.lm_head = model.get_language_model().lm_head
else:
    self.model.lm_head = model.lm_head

critical

The logic for sharing lm_head for SpecDcodeType.EAGLE has been changed in a way that introduces a potential AttributeError. The assignments to self.model.lm_head are no longer guarded by hasattr(model, "lm_head"). If the target model does not have an lm_head attribute, this will cause a crash. The original logic was safer and should be restored.

Suggested change

Current code (lines 145 to 150):

if hasattr(model, "lm_head"):
    logger.info("Loading EAGLE LM head weights from the target model.")
if supports_multimodal(model):
    self.model.lm_head = model.get_language_model().lm_head
else:
    self.model.lm_head = model.lm_head

Suggested:

if hasattr(model, "lm_head"):
    logger.info("Loading EAGLE LM head weights from the target model.")
    if supports_multimodal(model):
        self.model.lm_head = model.get_language_model().lm_head
    else:
        self.model.lm_head = model.lm_head

