Fix/preflight handle local routes in route_to_master#566
Open
alexsu52 wants to merge 1 commit intoAMD-AGI:mainfrom
Open
Fix/preflight handle local routes in route_to_master#566alexsu52 wants to merge 1 commit intoAMD-AGI:mainfrom
alexsu52 wants to merge 1 commit intoAMD-AGI:mainfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This pull request fixes a warning that appears on master hosts in distributed training setups by detecting local routes and resolving the correct physical network interface instead of defaulting to the loopback interface.
Changes:
- Added
_device_for_local_ip()helper function to look up the network interface associated with a given IP address - Modified
route_to_master()to detect local routes (whenip route getreturns "local") and resolve the actual physical interface using the new helper function - Updated docstring to document the new local route handling behavior
Comments suppressed due to low confidence (5)
primus/tools/preflight/network/network_probe.py:104
- The
_device_for_local_ipfunction callsip -o addr showwithout specifying IPv4 only (using-4flag). This means it will return both IPv4 and IPv6 addresses. If there are IPv6 addresses in the output containing the IPv4 address as a substring, this could lead to incorrect matches. Consider adding the-4flag to match the pattern used inlist_ipv4_addrs()which uses["ip", "-o", "-4", "addr", "show"].
["ip", "-o", "addr", "show"],
primus/tools/preflight/network/network_probe.py:120
- The newly introduced helper function
_device_for_local_ipis duplicating functionality that already exists inlist_ipv4_addrs()on lines 71-97. Thelist_ipv4_addrs()function already parsesip addr showoutput and returns a mapping of interface names to IP addresses. Instead of creating a new function with potentially buggy substring matching, consider reusinglist_ipv4_addrs()which has proper regex parsing and already handles the same command output. This would avoid code duplication and leverage existing, more robust parsing logic.
def _device_for_local_ip(host_ip: str) -> Optional[str]:
"""Resolve the interface that has the given local IP from `ip -o addr show`."""
try:
r = subprocess.run(
["ip", "-o", "addr", "show"],
check=False,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
timeout=2,
)
if r.returncode != 0:
return None
for line in r.stdout.strip().splitlines():
if host_ip in line:
parts = line.split()
if len(parts) > 1:
return parts[1]
return None
except Exception:
return None
primus/tools/preflight/network/network_probe.py:178
- When a local route is detected (line 160), the code tries to look up the device but doesn't update the
src_ipvalue. For local routes, thesrc_ipextracted from the route output might be the loopback address or might not be set correctly. Consider whethersrc_ipshould also be updated to reflect the actual IP address on the resolved interface, similar to howdevis updated. This ensures consistency in the returned data structure.
if s.startswith("local "):
try:
host_ip = socket.gethostbyname(socket.gethostname())
except Exception as e:
return {
"ok": False,
"master_addr": master_addr,
"master_ip": master_ip,
"error": f"resolve local IP failed: {e}",
}
dev = _device_for_local_ip(host_ip) or dev
return {
"ok": True,
"master_addr": master_addr,
"master_ip": master_ip,
"dev": dev,
"src_ip": src_ip,
primus/tools/preflight/network/network_probe.py:114
- The string matching logic using
if host_ip in lineis unsafe and can produce false positives. For example, if the host IP is "10.0.1.2" and there's another IP like "10.0.1.20" or "10.0.1.200" in the output, it will incorrectly match. This should use a regex to match the exact IP address as a complete token, similar to howlist_ipv4_addrs()does it with the patternr"inet\s+(\d+\.\d+\.\d+\.\d+)/". Consider matching the IP with proper boundaries or using the existinglist_ipv4_addrs()function which already parses this information correctly.
if host_ip in line:
primus/tools/preflight/network/network_probe.py:171
- There's a logic issue with how the device lookup is called. The function looks up the interface for
host_ip(fromsocket.gethostbyname(socket.gethostname())), but ideally it should be looking up the interface formaster_ipsince that's the address we're trying to route to. In the local route case,master_ipis the same as this host's IP, so the lookup should usemaster_ipinstead of deriving a separatehost_ip. This ensures consistency with the routing decision.
if s.startswith("local "):
try:
host_ip = socket.gethostbyname(socket.gethostname())
except Exception as e:
return {
"ok": False,
"master_addr": master_addr,
"master_ip": master_ip,
"error": f"resolve local IP failed: {e}",
}
dev = _device_for_local_ip(host_ip) or dev
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes:
Reason for changes:
Warning suppression for master host:
[Primus:Preflight] WARN: Socket IFNAME does not match route-to-master interface (may hang init_process_group)