
Conversation


@offline893 offline893 commented Oct 23, 2025

What this PR does / why we need it?

Check that all expert maps are consistent when running with multiple instances.

Does this PR introduce any user-facing change?

How was this patch tested?

Tested with Qwen 235B on two A3 nodes:
Case 1: the master has an expert map, the slave has none.
Case 2: the master has an expert map, the slave has an incorrect one.
Case 3: the master has an expert map, the slave has a correct one.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a check to ensure all expert maps are consistent across multiple instances, which is a crucial bug fix for distributed MoE setups. The implementation in expert_load_balancer.py has a critical issue where it references a non-existent attribute self.tensor_data, which will cause a runtime error. I've provided a suggestion to fix this by using self.expert_map_tensor and correctly comparing the maps. I also noticed a small typo in the PR title ('muilty' should be 'multi').

Comment on lines +102 to +117
```python
def check_expert_map_tensor(self):
    if dist.is_initialized():
        try:
            rank = dist.get_rank()
            world_size = dist.get_world_size()
            all_expert_maps = [None for _ in range(world_size)]
            dist.all_gather_object(all_expert_maps, self.tensor_data)
            for rank_id, expert_map_tensor in enumerate(all_expert_maps):
                if self.tensor_data != expert_map_tensor:
                    raise ValueError(
                        f"The expert map of rank{rank} is not equal to rank{rank_id}"
                    )
            return True
        except Exception as e:
            raise ValueError(
                f"The expert maps of all ranks are inconsistency: {e}")
```

critical

This method has a critical bug: it references self.tensor_data, which is not an attribute of the ExpertLoadBalancer class. This will cause an AttributeError at runtime. Based on the class's __init__ method, you probably meant to use self.expert_map_tensor.

Additionally, comparing tensors with != is not suitable for checking equality in a boolean context, as it performs an element-wise comparison and returns a boolean tensor.
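A minimal sketch of that pitfall (assuming torch tensors, as in the class; the values are illustrative):

```python
import torch

a = torch.tensor([0, 1, 2])
b = torch.tensor([0, 1, 3])

# `!=` compares element-wise and returns a boolean tensor, not a bool:
print(a != b)  # tensor([False, False,  True])

# Using that result in an `if` raises:
#   RuntimeError: Boolean value of Tensor with more than one element is ambiguous

# torch.equal returns a single bool, suitable for control flow:
print(torch.equal(a, b))  # False
```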

To fix this, I suggest converting the tensor to a list before gathering and comparing the lists. This approach seems to align with your use of dist.all_gather_object.

I've also corrected a typo ('inconsistency' -> 'inconsistent') and improved f-string formatting for better readability.

Suggested change

```diff
 def check_expert_map_tensor(self):
     if dist.is_initialized():
         try:
             rank = dist.get_rank()
             world_size = dist.get_world_size()
             all_expert_maps = [None for _ in range(world_size)]
-            dist.all_gather_object(all_expert_maps, self.tensor_data)
-            for rank_id, expert_map_tensor in enumerate(all_expert_maps):
-                if self.tensor_data != expert_map_tensor:
+            current_map_list = self.expert_map_tensor.tolist()
+            dist.all_gather_object(all_expert_maps, current_map_list)
+            for rank_id, other_map_list in enumerate(all_expert_maps):
+                if current_map_list != other_map_list:
                     raise ValueError(
-                        f"The expert map of rank{rank} is not equal to rank{rank_id}"
+                        f"The expert map of rank {rank} is not equal to rank {rank_id}"
                     )
             return True
         except Exception as e:
             raise ValueError(
-                f"The expert maps of all ranks are inconsistency: {e}")
+                f"The expert maps of all ranks are inconsistent: {e}")
```

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message and fill in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@yiz-liu yiz-liu merged commit 4e21b15 into vllm-project:v0.11.0-dev Oct 24, 2025
16 checks passed