
Conversation


@offline893 offline893 commented Oct 23, 2025

What this PR does / why we need it?

Check that all expert maps are consistent when running with multiple instances.

Does this PR introduce any user-facing change?

How was this patch tested?

Tested with Qwen 235B on two A3 nodes:
Case 1: the master has an expert map, the slave has none.
Case 2: the master has an expert map, the slave has an incorrect one.
Case 3: the master has an expert map, the slave has a correct one.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a check to ensure all expert maps are consistent across multiple instances, which is a crucial bug fix for distributed MoE setups. The implementation in expert_load_balancer.py has a critical issue where it references a non-existent attribute self.tensor_data, which will cause a runtime error. I've provided a suggestion to fix this by using self.expert_map_tensor and correctly comparing the maps. I also noticed a small typo in the PR title ('muilty' should be 'multi').

Comment on lines +102 to +117
```python
def check_expert_map_tensor(self):
    if dist.is_initialized():
        try:
            rank = dist.get_rank()
            world_size = dist.get_world_size()
            all_expert_maps = [None for _ in range(world_size)]
            dist.all_gather_object(all_expert_maps, self.tensor_data)
            for rank_id, expert_map_tensor in enumerate(all_expert_maps):
                if self.tensor_data != expert_map_tensor:
                    raise ValueError(
                        f"The expert map of rank{rank} is not equal to rank{rank_id}"
                    )
            return True
        except Exception as e:
            raise ValueError(
                f"The expert maps of all ranks are inconsistency: {e}")
```

critical

This method has a critical bug: it references self.tensor_data, which is not an attribute of the ExpertLoadBalancer class. This will cause an AttributeError at runtime. Based on the class's __init__ method, you probably meant to use self.expert_map_tensor.

Additionally, comparing tensors with != is not suitable for checking equality in a boolean context, as it performs an element-wise comparison and returns a boolean tensor.
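A minimal sketch of that pitfall (assuming torch tensors, as in the class; the values are illustrative):

```python
import torch

a = torch.tensor([0, 1, 2])
b = torch.tensor([0, 1, 3])

# `!=` compares element-wise and returns a boolean tensor, not a bool:
print(a != b)  # tensor([False, False,  True])

# Using that result in an `if` raises:
#   RuntimeError: Boolean value of Tensor with more than one element is ambiguous

# torch.equal returns a single bool, suitable for control flow:
print(torch.equal(a, b))  # False
```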

To fix this, I suggest converting the tensor to a list before gathering and comparing the lists. This approach seems to align with your use of dist.all_gather_object.

I've also corrected a typo ('inconsistency' -> 'inconsistent') and improved f-string formatting for better readability.

Suggested change

```diff
 def check_expert_map_tensor(self):
     if dist.is_initialized():
         try:
             rank = dist.get_rank()
             world_size = dist.get_world_size()
             all_expert_maps = [None for _ in range(world_size)]
-            dist.all_gather_object(all_expert_maps, self.tensor_data)
-            for rank_id, expert_map_tensor in enumerate(all_expert_maps):
-                if self.tensor_data != expert_map_tensor:
+            current_map_list = self.expert_map_tensor.tolist()
+            dist.all_gather_object(all_expert_maps, current_map_list)
+            for rank_id, other_map_list in enumerate(all_expert_maps):
+                if current_map_list != other_map_list:
                     raise ValueError(
-                        f"The expert map of rank{rank} is not equal to rank{rank_id}"
+                        f"The expert map of rank {rank} is not equal to rank {rank_id}"
                     )
             return True
         except Exception as e:
             raise ValueError(
-                f"The expert maps of all ranks are inconsistency: {e}")
+                f"The expert maps of all ranks are inconsistent: {e}")
```

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@yiz-liu yiz-liu merged commit 4e21b15 into vllm-project:v0.11.0-dev Oct 24, 2025
16 checks passed