
Fix/preflight handle local routes in route_to_master #566

Open
alexsu52 wants to merge 1 commit into AMD-AGI:main from alexsu52:fix/preflight/network


@alexsu52
Contributor

Changes:

  • Detect when ip route get returns a local route (i.e., when the node is the master).
  • In that case, look up the physical interface that holds the local IP using ip addr show, rather than accepting the device from the route output, which is typically lo (loopback).

Reason for changes:

Suppresses the following spurious warning on the master host:
[Primus:Preflight] WARN: Socket IFNAME does not match route-to-master interface (may hang init_process_group)

Copilot AI review requested due to automatic review settings February 26, 2026 10:36
Contributor

Copilot AI left a comment


Pull request overview

This pull request fixes a warning that appears on master hosts in distributed training setups by detecting local routes and resolving the correct physical network interface instead of defaulting to the loopback interface.

Changes:

  • Added _device_for_local_ip() helper function to look up the network interface associated with a given IP address
  • Modified route_to_master() to detect local routes (when ip route get returns "local") and resolve the actual physical interface using the new helper function
  • Updated docstring to document the new local route handling behavior
Comments suppressed due to low confidence (5)

primus/tools/preflight/network/network_probe.py:104

  • The _device_for_local_ip function calls ip -o addr show without specifying IPv4 only (using -4 flag). This means it will return both IPv4 and IPv6 addresses. If there are IPv6 addresses in the output containing the IPv4 address as a substring, this could lead to incorrect matches. Consider adding the -4 flag to match the pattern used in list_ipv4_addrs() which uses ["ip", "-o", "-4", "addr", "show"].
            ["ip", "-o", "addr", "show"],
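
A minimal sketch of the concern above, not code from the PR. The sample text is a contrived `ip -o addr show` listing for a hypothetical host, and `ipv4_only` and `first_substring_match` are illustrative names. It filters records by address family in Python, which approximates what adding `-4` to the command would do, and shows how the substring test can pick an IPv6 record when the filter is absent.

```python
from typing import Iterable, List, Optional

# Contrived `ip -o addr show` output for a hypothetical host (not from the PR).
SAMPLE = (
    "2: tun0    inet6 ::ffff:10.0.1.2/128 scope global \\       valid_lft forever preferred_lft forever\n"
    "3: eth0    inet 10.0.1.2/24 brd 10.0.1.255 scope global eth0\\       valid_lft forever preferred_lft forever\n"
)


def ipv4_only(lines: Iterable[str]) -> List[str]:
    """Keep only IPv4 records, approximating `ip -o -4 addr show`."""
    kept = []
    for line in lines:
        parts = line.split()
        # Field layout: "<index>:", "<ifname>", "<family>", "<addr>/<prefix>", ...
        if len(parts) > 3 and parts[2] == "inet":
            kept.append(line)
    return kept


def first_substring_match(lines: Iterable[str], ip: str) -> Optional[str]:
    """Same substring test as the code under review, kept for comparison."""
    for line in lines:
        if ip in line:
            parts = line.split()
            if len(parts) > 1:
                return parts[1]
    return None


lines = SAMPLE.splitlines()
print(first_substring_match(lines, "10.0.1.2"))             # matches the inet6 record first
print(first_substring_match(ipv4_only(lines), "10.0.1.2"))  # only IPv4 records are considered
```

Passing `-4` to `ip` itself is simpler than filtering afterwards; the Python filter is used here only so the example runs without the `ip` binary.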

primus/tools/preflight/network/network_probe.py:120

  • The newly introduced helper function _device_for_local_ip is duplicating functionality that already exists in list_ipv4_addrs() on lines 71-97. The list_ipv4_addrs() function already parses ip addr show output and returns a mapping of interface names to IP addresses. Instead of creating a new function with potentially buggy substring matching, consider reusing list_ipv4_addrs() which has proper regex parsing and already handles the same command output. This would avoid code duplication and leverage existing, more robust parsing logic.
def _device_for_local_ip(host_ip: str) -> Optional[str]:
    """Resolve the interface that has the given local IP from `ip -o addr show`."""
    try:
        r = subprocess.run(
            ["ip", "-o", "addr", "show"],
            check=False,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            timeout=2,
        )
        if r.returncode != 0:
            return None
        for line in r.stdout.strip().splitlines():
            if host_ip in line:
                parts = line.split()
                if len(parts) > 1:
                    return parts[1]
        return None
    except Exception:
        return None
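
The reuse suggested above might look roughly like the sketch below. It assumes that `list_ipv4_addrs()` returns a mapping shaped like `Dict[str, List[str]]` (interface name to IPv4 addresses); the actual return type is not shown in this review, so the mapping is passed in as a parameter, and `device_for_ip` is a hypothetical name.

```python
from typing import Dict, List, Optional


def device_for_ip(addrs_by_iface: Dict[str, List[str]], ip: str) -> Optional[str]:
    """Return the interface whose address list contains exactly `ip`.

    `addrs_by_iface` is assumed to be what list_ipv4_addrs() returns;
    that shape is an assumption, not confirmed by the PR.
    """
    for ifname, addrs in addrs_by_iface.items():
        # List membership compares whole addresses, so "10.0.1.2"
        # does not match "10.0.1.20".
        if ip in addrs:
            return ifname
    return None


# Hypothetical mapping for illustration only.
example = {"lo": ["127.0.0.1"], "eth0": ["10.0.1.20"], "eth1": ["10.0.1.2"]}
print(device_for_ip(example, "10.0.1.2"))
```

Because the comparison is made on parsed addresses rather than raw lines, no second `ip` invocation or ad hoc line parsing is needed.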

primus/tools/preflight/network/network_probe.py:178

  • When a local route is detected (line 160), the code tries to look up the device but doesn't update the src_ip value. For local routes, the src_ip extracted from the route output might be the loopback address or might not be set correctly. Consider whether src_ip should also be updated to reflect the actual IP address on the resolved interface, similar to how dev is updated. This ensures consistency in the returned data structure.
        if s.startswith("local "):
            try:
                host_ip = socket.gethostbyname(socket.gethostname())
            except Exception as e:
                return {
                    "ok": False,
                    "master_addr": master_addr,
                    "master_ip": master_ip,
                    "error": f"resolve local IP failed: {e}",
                }

            dev = _device_for_local_ip(host_ip) or dev

        return {
            "ok": True,
            "master_addr": master_addr,
            "master_ip": master_ip,
            "dev": dev,
            "src_ip": src_ip,
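
One way to keep `dev` and `src_ip` consistent, sketched under assumptions: the interface mapping has the shape `Dict[str, List[str]]`, and `local_dev_and_src` is a hypothetical helper that is not part of the PR. When the master IP is found on an interface, both fields are taken from that entry; otherwise the values parsed from the route output are kept.

```python
from typing import Dict, List, Optional, Tuple


def local_dev_and_src(
    addrs_by_iface: Dict[str, List[str]],
    master_ip: str,
    route_dev: Optional[str],
    route_src: Optional[str],
) -> Tuple[Optional[str], Optional[str]]:
    """Pick (dev, src_ip) for a local route.

    If an interface holds master_ip, report that interface and master_ip
    together. Otherwise fall back to what the route output said.
    """
    for ifname, addrs in addrs_by_iface.items():
        if master_ip in addrs:
            return ifname, master_ip
    return route_dev, route_src


# Hypothetical values for illustration only.
addrs = {"lo": ["127.0.0.1"], "eth0": ["10.0.1.2"]}
print(local_dev_and_src(addrs, "10.0.1.2", "lo", "127.0.0.1"))
```

Whether `src_ip` should change at all is the open question raised in the comment; this only shows what updating both fields in one step could look like.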

primus/tools/preflight/network/network_probe.py:114

  • The string matching logic using if host_ip in line is unsafe and can produce false positives. For example, if the host IP is "10.0.1.2" and there's another IP like "10.0.1.20" or "10.0.1.200" in the output, it will incorrectly match. This should use a regex to match the exact IP address as a complete token, similar to how list_ipv4_addrs() does it with the pattern r"inet\s+(\d+\.\d+\.\d+\.\d+)/". Consider matching the IP with proper boundaries or using the existing list_ipv4_addrs() function which already parses this information correctly.
            if host_ip in line:
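
A sketch of exact-address matching using the regex quoted in the comment above. The sample listing is illustrative for a hypothetical host, and `device_for_exact_ip` is a hypothetical name. The address is extracted as a whole token and compared for equality, so "10.0.1.2" no longer matches a line carrying "10.0.1.20".

```python
import re
from typing import Optional

# Pattern quoted in the review as the one used by list_ipv4_addrs().
_INET_RE = re.compile(r"inet\s+(\d+\.\d+\.\d+\.\d+)/")

# Illustrative `ip -o -4 addr show` output (hypothetical host, not from the PR).
SAMPLE_ADDRS = (
    "2: eth0    inet 10.0.1.20/24 brd 10.0.1.255 scope global eth0\\       valid_lft forever preferred_lft forever\n"
    "3: eth1    inet 10.0.1.2/24 brd 10.0.1.255 scope global eth1\\       valid_lft forever preferred_lft forever\n"
)


def device_for_exact_ip(addr_show_output: str, ip: str) -> Optional[str]:
    """Return the interface name of the record whose address equals `ip`."""
    for line in addr_show_output.splitlines():
        m = _INET_RE.search(line)
        # Compare the captured address as a whole, not as a substring.
        if m and m.group(1) == ip:
            parts = line.split()
            if len(parts) > 1:
                return parts[1]
    return None


print(device_for_exact_ip(SAMPLE_ADDRS, "10.0.1.2"))
```

With the original `if host_ip in line` test, the same input would return eth0, because "10.0.1.2" is a substring of "10.0.1.20".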

primus/tools/preflight/network/network_probe.py:171

  • There's a logic issue with how the device lookup is called. The function looks up the interface for host_ip (from socket.gethostbyname(socket.gethostname())), but ideally it should be looking up the interface for master_ip since that's the address we're trying to route to. In the local route case, master_ip is the same as this host's IP, so the lookup should use master_ip instead of deriving a separate host_ip. This ensures consistency with the routing decision.
        if s.startswith("local "):
            try:
                host_ip = socket.gethostbyname(socket.gethostname())
            except Exception as e:
                return {
                    "ok": False,
                    "master_addr": master_addr,
                    "master_ip": master_ip,
                    "error": f"resolve local IP failed: {e}",
                }

            dev = _device_for_local_ip(host_ip) or dev
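
A sketch of the suggested simplification, with the lookup injected as a callable so it can be exercised without running `ip`. `resolve_dev` and its parameters are hypothetical names. Keying the lookup on `master_ip` removes the `socket.gethostbyname(socket.gethostname())` call, and with it the extra failure path.

```python
from typing import Callable, Optional


def resolve_dev(
    route_line: str,
    master_ip: str,
    route_dev: Optional[str],
    lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Choose the device to report for the route to the master.

    For a local route, master_ip is an address of this host, so it is used
    directly as the lookup key. If the lookup finds nothing, the device
    parsed from the route output is kept.
    """
    if route_line.startswith("local "):
        return lookup(master_ip) or route_dev
    return route_dev


# Hypothetical address-to-interface table standing in for the real lookup.
table = {"10.0.1.2": "eth0"}
print(resolve_dev("local 10.0.1.2 dev lo src 10.0.1.2 uid 0", "10.0.1.2", "lo", table.get))
```

In the PR itself, `lookup` would correspond to `_device_for_local_ip` or an equivalent built on `list_ipv4_addrs()`.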
