[BugFix]Check all expert maps when using muilty instance. #3662
Conversation
Signed-off-by: offline0806 <[email protected]>
Code Review
This pull request introduces a check to ensure all expert maps are consistent across multiple instances, which is a crucial bug fix for distributed MoE setups. The implementation in expert_load_balancer.py has a critical issue where it references a non-existent attribute self.tensor_data, which will cause a runtime error. I've provided a suggestion to fix this by using self.expert_map_tensor and correctly comparing the maps. I also noticed a small typo in the PR title ('muilty' should be 'multi').
    def check_expert_map_tensor(self):
        if dist.is_initialized():
            try:
                rank = dist.get_rank()
                world_size = dist.get_world_size()
                all_expert_maps = [None for _ in range(world_size)]
                dist.all_gather_object(all_expert_maps, self.tensor_data)
                for rank_id, expert_map_tensor in enumerate(all_expert_maps):
                    if self.tensor_data != expert_map_tensor:
                        raise ValueError(
                            f"The expert map of rank{rank} is not equal to rank{rank_id}"
                        )
                return True
            except Exception as e:
                raise ValueError(
                    f"The expert maps of all ranks are inconsistency: {e}")
This method has a critical bug: it references self.tensor_data, which is not an attribute of the ExpertLoadBalancer class. This will cause an AttributeError at runtime. Based on the class's __init__ method, you probably meant to use self.expert_map_tensor.
Additionally, comparing tensors with != is not suitable for checking equality in a boolean context, as it performs an element-wise comparison and returns a boolean tensor.
To fix this, I suggest converting the tensor to a list before gathering and comparing the lists. This approach seems to align with your use of dist.all_gather_object.
I've also corrected a typo ('inconsistency' -> 'inconsistent') and improved f-string formatting for better readability.
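The logic the suggestion arrives at can be sketched without torch.distributed: once each rank's expert map has been converted to a plain Python list and gathered, comparing lists with `!=` yields a single boolean, so the per-rank check is straightforward. In this sketch the helper name is illustrative, and the pre-gathered list of per-rank maps stands in for the result of `dist.all_gather_object`:

```python
# Minimal sketch of the consistency check, assuming the expert maps have
# already been gathered as plain lists (stand-in for dist.all_gather_object).

def check_gathered_expert_maps(rank, all_expert_maps):
    """Raise if any rank's expert map differs from this rank's map."""
    current_map = all_expert_maps[rank]
    for rank_id, other_map in enumerate(all_expert_maps):
        # List comparison returns a single bool, unlike element-wise
        # tensor comparison, which returns a boolean tensor.
        if current_map != other_map:
            raise ValueError(
                f"The expert map of rank {rank} is not equal to rank {rank_id}")
    return True

# Consistent maps pass on every rank.
assert check_gathered_expert_maps(0, [[0, 1, 2, 3], [0, 1, 2, 3]])
```

With real tensors, the same single-boolean comparison could also be done with `torch.equal`, but converting to lists keeps the comparison consistent with what `all_gather_object` transports.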
    def check_expert_map_tensor(self):
        if dist.is_initialized():
            try:
                rank = dist.get_rank()
                world_size = dist.get_world_size()
                all_expert_maps = [None for _ in range(world_size)]
                current_map_list = self.expert_map_tensor.tolist()
                dist.all_gather_object(all_expert_maps, current_map_list)
                for rank_id, other_map_list in enumerate(all_expert_maps):
                    if current_map_list != other_map_list:
                        raise ValueError(
                            f"The expert map of rank {rank} is not equal to rank {rank_id}"
                        )
                return True
            except Exception as e:
                raise ValueError(
                    f"The expert maps of all ranks are inconsistent: {e}")
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Signed-off-by: offline0806 <[email protected]>
What this PR does / why we need it?
Check that all expert maps are consistent across ranks when using multiple instances.
Does this PR introduce any user-facing change?
How was this patch tested?
Qwen 235B on double A3.
case1: master has an expert map, slave has no expert map.
case2: master has an expert map, slave has a wrong expert map.
case3: master has an expert map, slave has the correct expert map.
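The three scenarios above can be sketched as assertions against a list-based consistency predicate; plain lists stand in for the gathered expert-map tensors, and the helper name is illustrative rather than part of the PR:

```python
# Sketch of the three test scenarios, assuming expert maps arrive as the
# gathered per-rank list (plain lists stand in for tensors).

def maps_consistent(all_expert_maps):
    """True only if every rank reports the same expert map."""
    first = all_expert_maps[0]
    return all(m == first for m in all_expert_maps)

master_map = [0, 1, 2, 3]

# case1: master has an expert map, slave has none.
assert not maps_consistent([master_map, None])

# case2: slave has a wrong expert map.
assert not maps_consistent([master_map, [3, 2, 1, 0]])

# case3: slave has the correct expert map.
assert maps_consistent([master_map, [0, 1, 2, 3]])
```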