Add DNS-AID service discovery for Monarch actor pools #772

Open
IngmarVG-IB wants to merge 1 commit into meta-pytorch:main from IngmarVG-IB:feat/dns-aid-service-discovery

@IngmarVG-IB

Summary

  • Monarch actor pools can now self-register DNS-AID SVCB records on startup, enabling peer discovery via DNS rather than hard-coded coordinator addresses
  • New dns_aid.py helper module with publish_service/unpublish_service/discover_peers wrappers around dns-aid
  • Fully opt-in: controlled by DNS_AID_ENABLED env var (default: false) + per-service DnsAidConfig
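
The opt-in check described above could look roughly like the following. This is a minimal sketch, not the PR's code: only `DNS_AID_ENABLED`, `DnsAidConfig`, `enabled`, and `port` come from the description; the helper names `env_flag_enabled` and `dns_aid_active` and the set of accepted truthy strings are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class DnsAidConfig:
    # Only `enabled` and `port` are named in the PR; the real dataclass
    # may carry more fields.
    enabled: bool = False
    port: Optional[int] = None


def env_flag_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Global switch: unset, or any non-truthy string, means off.

    The accepted spellings are an assumption; the PR only shows `true`.
    """
    value = environ.get("DNS_AID_ENABLED", "false")
    return value.strip().lower() in {"1", "true", "yes"}


def dns_aid_active(
    cfg: Optional[DnsAidConfig],
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Both the env var and the per-service config must opt in."""
    return bool(env_flag_enabled(environ) and cfg is not None and cfg.enabled)
```

Passing the environment as a parameter keeps the check easy to test without mutating `os.environ`.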

Motivation

In multi-cluster or hybrid deployments, services need a way to find each other without static configuration. DNS-based service discovery via SVCB records is lightweight and infrastructure-agnostic. See #771 for discussion.

Changes

| File | Change |
| --- | --- |
| pyproject.toml | dns-aid optional dependency |
| src/forge/env.py | DNS_AID_ENABLED env var |
| src/forge/types.py | DnsAidConfig dataclass + field on ServiceConfig |
| src/forge/controller/dns_aid.py | New: publish/unpublish/discover with dual-guard, retry, lazy import |
| src/forge/controller/actor.py | as_service() calls publish_service() after init |
| src/forge/controller/provisioner.py | shutdown_all_allocations() calls unpublish_service() |
| src/forge/controller/service/interface.py | _dns_aid_cfg attribute on ServiceInterface |
| tests/unit_tests/test_dns_aid.py | New: 19 unit tests with mocked dns_aid calls |
| docs/source/dns_aid.md | New: usage guide |
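
For readers unfamiliar with the lazy-import pattern mentioned for `dns_aid.py`, one common way to get a cached optional import is `functools.lru_cache` around `importlib.import_module`. The sketch below is illustrative only; the helper name `load_optional` and the warning text are hypothetical, and the PR may implement the cache differently.

```python
import importlib
import logging
from functools import lru_cache
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


@lru_cache(maxsize=None)
def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import an optional module once, caching both success and failure.

    Because the None result is cached too, the warning is logged only on
    the first failed attempt for a given module name.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        logger.warning("%s is not installed; skipping DNS-AID calls", module_name)
        return None
```

A caller would then do something like `mod = load_optional("dns_aid")` (assuming the package is importable under that name) and return early when `mod` is `None`.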

Design decisions

  • Best-effort: DNS failures never block service startup or shutdown. TTL-based expiry (default 30s) handles crash cleanup.
  • Dual enable guard: Both DNS_AID_ENABLED=true env var and DnsAidConfig(enabled=True) must be set. Env var acts as global kill switch.
  • Explicit port: DnsAidConfig.port must be set by the user (e.g. load balancer/gateway port) since Monarch services use actor RPC with no auto-detected listener port.
  • Cached lazy import: dns-aid is imported lazily on first use with a cached result, so the missing-package warning only fires once.
  • Exponential backoff: discover_peers() retries with backoff (max 5 attempts) to handle startup races. retry_on_empty param controls whether empty-but-successful responses trigger retries.
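
The retry behaviour in the last bullet can be sketched as below. This is an assumption-laden illustration rather than the PR's `discover_peers()`: the function name `discover_with_backoff`, the injected `lookup` and `sleep` callables, and the 0.5 s base delay are all hypothetical; only the five-attempt cap and the `retry_on_empty` flag come from the description.

```python
import time
from typing import Callable, List, Optional


def discover_with_backoff(
    lookup: Callable[[], List[str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,  # assumed value; not stated in the PR
    retry_on_empty: bool = True,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call `lookup` up to `max_attempts` times, doubling the delay each time.

    Best-effort: returns [] instead of raising once attempts are exhausted.
    """
    for attempt in range(max_attempts):
        peers: Optional[List[str]]
        try:
            peers = lookup()
        except Exception:
            peers = None  # a failed lookup is always eligible for retry
        if peers:
            return peers
        if peers is not None and not retry_on_empty:
            return peers  # empty but successful, and retries were declined
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5, 1, 2, 4 seconds by default
    return []
```

Setting `retry_on_empty=False` suits callers that treat "no peers yet" as a valid answer, while the default suits startup races where peers may simply not have published yet.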

Test plan

  • pytest tests/unit_tests/test_dns_aid.py -v — all 19 tests pass
  • flake8 --config=.flake8 on all changed files — clean
  • ufmt check on all changed files — formatted
  • Verify existing tests still pass (pytest tests/unit_tests/ -v)
  • Verify no DNS-AID calls when DNS_AID_ENABLED is unset
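
The last check in the test plan could be exercised with `unittest.mock` along these lines. The sketch is self-contained and does not use the PR's real module: `publish_if_enabled` and `backend.publish` are hypothetical stand-ins for the wrapper and the underlying dns-aid call.

```python
import os
from unittest import mock


def publish_if_enabled(backend, name: str, port: int) -> bool:
    """Hypothetical stand-in for a publish wrapper guarded by the env var."""
    if os.environ.get("DNS_AID_ENABLED", "false").lower() != "true":
        return False
    backend.publish(name, port)
    return True


def test_no_calls_when_env_unset():
    backend = mock.Mock()
    # clear=True removes DNS_AID_ENABLED even if the developer's shell sets it
    with mock.patch.dict(os.environ, {}, clear=True):
        assert publish_if_enabled(backend, "trainer", 8080) is False
    backend.publish.assert_not_called()


def test_publishes_when_env_set():
    backend = mock.Mock()
    with mock.patch.dict(os.environ, {"DNS_AID_ENABLED": "true"}):
        assert publish_if_enabled(backend, "trainer", 8080) is True
    backend.publish.assert_called_once_with("trainer", 8080)
```

`mock.patch.dict` restores the original environment on exit, so the two tests do not leak state into each other.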

Closes #771
Depends-on: dns-aid (pip install "dns-aid>=0.12.0")

Monarch actor pools (generator, trainer, evaluator, replay_buffer) can now
self-register DNS-AID SVCB ServiceMode records on startup, enabling peer
discovery via DNS rather than hard-coded coordinator addresses.

- Add dns_aid.py helper module with publish/unpublish/discover wrappers
- ForgeActor.as_service() calls publish_service() after listener is bound
- Provisioner.shutdown_all_allocations() calls unpublish_service() (best-effort)
- Controlled by DNS_AID_ENABLED env var (default: false)
- dns-aid added as optional dependency: pip install "forge[dns-aid]"
- Unit tests with mocked dns_aid calls
- Documentation in docs/source/dns_aid.md

Closes: meta-pytorch#771
Depends-on: infobloxopen/dns-aid-core (pip: dns-aid)

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Mar 30, 2026
Development

Successfully merging this pull request may close these issues.

feat: DNS-based service discovery for Monarch actor pools
