fix: prefer IPv4 non-link-local in _get_host_ip for Docker (#743) #778

Open
lonexreb wants to merge 1 commit into meta-pytorch:main from lonexreb:fix/743-prefer-ipv4-host-detection

@lonexreb lonexreb commented Jun 6, 2026

Summary

Fixes #743.

In containers, socket.gethostname() + DNS resolution often surfaces an IPv6 link-local address (fe80::...) first. c10d / TCPStore cannot use those, producing:

[c10d] The IPv4 network addresses of (fe80::222:48ff:fe49:ba90, 51391) cannot be retrieved
       (gai error: -9 - Address family for hostname not supported).
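
As a quick illustration of why the address in that log line is a problem, the standard-library `ipaddress` module classifies it as link-local. This is only a sketch; `is_link_local` is a hypothetical helper name and is not taken from the PR's diff.

```python
import ipaddress


def is_link_local(addr: str) -> bool:
    """Return True for fe80::/10 (IPv6) and 169.254.0.0/16 (IPv4) addresses."""
    # Drop an IPv6 zone id such as "%eth0" so parsing works on any Python 3 version.
    bare = addr.split("%", 1)[0]
    return ipaddress.ip_address(bare).is_link_local


print(is_link_local("fe80::222:48ff:fe49:ba90"))  # address from the c10d error above
print(is_link_local("169.254.10.20"))             # IPv4 link-local
print(is_link_local("10.0.0.5"))                  # ordinary private IPv4
```

The first two calls report link-local addresses and the third does not, which is the distinction the fix relies on.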

Replace the unguarded get_ipaddr(hostname, 0) call with an explicit getaddrinfo scan that:

  1. Prefers IPv4 non-link-local
  2. Then IPv6 non-link-local (skipping fe80::/10)
  3. Skips IPv4 link-local (169.254.0.0/16)
  4. Falls back to get_ipaddr only when nothing usable surfaces or resolution itself fails

VLLM_HOST_IP override is preserved.
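
The selection order described above can be sketched roughly as follows. This is not the code in the diff: `pick_host_ip`, `get_host_ip`, and the `fallback` callable are hypothetical names, and `fallback` stands in for Monarch's `get_ipaddr`, whose exact signature is not reproduced here.

```python
import ipaddress
import os
import socket
from typing import Callable, Iterable, Optional


def _usable(addr: str) -> bool:
    """True if addr parses as an IP address and is not link-local."""
    try:
        ip = ipaddress.ip_address(addr.split("%", 1)[0])  # strip any zone id
    except ValueError:
        return False
    return not ip.is_link_local


def pick_host_ip(infos: Iterable[tuple]) -> Optional[str]:
    """Scan getaddrinfo-style tuples: IPv4 first, then IPv6, never link-local."""
    ipv6_candidate = None
    for family, _type, _proto, _canon, sockaddr in infos:
        addr = sockaddr[0]
        if not _usable(addr):
            continue  # skips fe80::/10 and 169.254.0.0/16
        if family == socket.AF_INET:
            return addr  # any usable IPv4 address wins immediately
        if family == socket.AF_INET6 and ipv6_candidate is None:
            ipv6_candidate = addr  # remember the first usable IPv6 address
    return ipv6_candidate


def get_host_ip(fallback: Callable[[str], str]) -> str:
    """Env override, then the scan, then the fallback resolver (an assumption)."""
    override = os.environ.get("VLLM_HOST_IP")
    if override:
        return override
    hostname = socket.gethostname()
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return fallback(hostname)  # resolution itself failed
    return pick_host_ip(infos) or fallback(hostname)  # nothing usable surfaced
```

Keeping the scan in a pure function that takes the `getaddrinfo` result makes each preference rule testable without touching real DNS.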

  • 1 function changed in src/forge/actors/vllm/v1/monarch_executor.py (+27/-3)
  • New unit tests: tests/unit_tests/test_get_host_ip.py (6 cases)

Test plan

  • VLLM_HOST_IP env override still honored
  • IPv4 wins over IPv6 link-local (the #743 regression)
  • IPv4 link-local (169.254/16) skipped
  • IPv6 link-local skipped, global IPv6 preferred when no IPv4 present
  • Falls back to Monarch's get_ipaddr when every candidate is link-local
  • Falls back to Monarch's get_ipaddr on gaierror
  • Manual: confirm python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml no longer needs VLLM_HOST_IP=127.0.0.1 workaround inside Docker (cannot run locally — please verify in container CI)
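
For readers who want to see how such cases can be exercised without real DNS, here is a minimal sketch using `unittest.mock`. It is not the contents of tests/unit_tests/test_get_host_ip.py; `resolve` is a deliberately compact, hypothetical stand-in for the function under test, and `fake_info` is a hypothetical helper.

```python
import os
import socket
from unittest import mock


def resolve(fallback):
    """Compact stand-in for the function under test (not the PR's code)."""
    if os.environ.get("VLLM_HOST_IP"):
        return os.environ["VLLM_HOST_IP"]
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None)
    except socket.gaierror:
        return fallback()
    # Keep only IPv4 addresses outside 169.254.0.0/16.
    v4 = [i[4][0] for i in infos
          if i[0] == socket.AF_INET and not i[4][0].startswith("169.254.")]
    return v4[0] if v4 else fallback()


def fake_info(family, addr):
    """Build one getaddrinfo-style tuple for a fake resolver result."""
    return (family, socket.SOCK_STREAM, 0, "", (addr, 0))


def test_env_override():
    with mock.patch.dict(os.environ, {"VLLM_HOST_IP": "192.0.2.7"}):
        assert resolve(lambda: "fallback") == "192.0.2.7"


def test_ipv4_beats_link_local_ipv6():
    infos = [fake_info(socket.AF_INET6, "fe80::1"),
             fake_info(socket.AF_INET, "10.0.0.5")]
    with mock.patch.dict(os.environ, {}, clear=True), \
            mock.patch("socket.getaddrinfo", return_value=infos):
        assert resolve(lambda: "fallback") == "10.0.0.5"


def test_gaierror_falls_back():
    err = socket.gaierror(-9, "Address family for hostname not supported")
    with mock.patch.dict(os.environ, {}, clear=True), \
            mock.patch("socket.getaddrinfo", side_effect=err):
        assert resolve(lambda: "fallback") == "fallback"
```

`mock.patch.dict(os.environ, {}, clear=True)` ensures a `VLLM_HOST_IP` set in the developer's shell cannot leak into the cases that exercise the scan.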

Notes

The reporter (@insop) offered to open a PR; happy to defer to theirs if they prefer. The behavior matches what they proposed in the issue.

…rch#743)

In containers, hostname resolution often returns an IPv6 link-local
address (fe80::...) which c10d / TCPStore cannot use, producing:

  [c10d] The IPv4 network addresses of (fe80::..., 51391) cannot be
         retrieved (gai error: -9 - Address family for hostname not
         supported).

Replace the unguarded get_ipaddr(hostname, 0) call with an explicit
getaddrinfo scan that prefers IPv4 non-link-local, then IPv6 non-link-
local, falling back to get_ipaddr only when nothing usable surfaces or
resolution fails outright. VLLM_HOST_IP override is preserved.

Test plan: tests/unit_tests/test_get_host_ip.py
- VLLM_HOST_IP override honored
- IPv4 wins over IPv6 link-local (regression for meta-pytorch#743)
- IPv4 link-local (169.254/16) skipped
- IPv6 link-local skipped, global IPv6 preferred when IPv4 absent
- Falls back to Monarch resolver when only link-local addrs exist
- Falls back to Monarch resolver on gaierror
@meta-cla meta-cla bot added the CLA Signed label Jun 6, 2026
Development

Successfully merging this pull request may close these issues.

Networking error in Docker due to host IP detection (workaround: set VLLM_HOST_IP)