Add DNS-AID service discovery for Monarch actor pools #772
Open
IngmarVG-IB wants to merge 1 commit into
Monarch actor pools (generator, trainer, evaluator, replay_buffer) can now self-register DNS-AID SVCB ServiceMode records on startup, enabling peer discovery via DNS rather than hard-coded coordinator addresses.

- Add dns_aid.py helper module with publish/unpublish/discover wrappers
- ForgeActor.as_service() calls publish_service() after listener is bound
- Provisioner.shutdown_all_allocations() calls unpublish_service() (best-effort)
- Controlled by DNS_AID_ENABLED env var (default: false)
- dns-aid added as optional dependency: pip install forge[dns-aid]
- Unit tests with mocked dns_aid calls
- Documentation in docs/source/dns_aid.md

Closes: meta-pytorch#771
Depends-on: infobloxopen/dns-aid-core (pip: dns-aid)
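The best-effort unpublish on shutdown mentioned in the commit message could be sketched roughly as follows. This is an illustration only, not the code in this PR: `unpublish_all` and its `unpublish` callback are hypothetical names, and the real `unpublish_service()` signature may differ.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("dns_aid_sketch")


def unpublish_all(names: Iterable[str], unpublish: Callable[[str], None]) -> List[str]:
    """Best-effort teardown: try every service name, log failures, never raise.

    `unpublish` is a hypothetical callback standing in for whatever removes
    one service's DNS record. Returns the names that could not be removed.
    """
    failed = []
    for name in names:
        try:
            unpublish(name)
        except Exception:
            # A DNS failure must not block the rest of the shutdown path.
            logger.warning("failed to unpublish %s", name, exc_info=True)
            failed.append(name)
    return failed
```

Returning the failed names instead of raising keeps shutdown moving while still letting the caller report stale records.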
Summary
- `dns_aid.py` helper module with `publish_service` / `unpublish_service` / `discover_peers` wrappers around dns-aid
- `DNS_AID_ENABLED` env var (default: `false`) + per-service `DnsAidConfig`

Motivation
In multi-cluster or hybrid deployments, services need a way to find each other without static configuration. DNS-based service discovery via SVCB records is lightweight and infrastructure-agnostic. See #771 for discussion.
Changes
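| File | Change |
| --- | --- |
| `pyproject.toml` | `dns-aid` optional dependency |
| `src/forge/env.py` | `DNS_AID_ENABLED` env var |
| `src/forge/types.py` | `DnsAidConfig` dataclass + field on `ServiceConfig` |
| `src/forge/controller/dns_aid.py` | Helper module with publish/unpublish/discover wrappers |
| `src/forge/controller/actor.py` | `as_service()` calls `publish_service()` after init |
| `src/forge/controller/provisioner.py` | `shutdown_all_allocations()` calls `unpublish_service()` |
| `src/forge/controller/service/interface.py` | `_dns_aid_cfg` attribute on `ServiceInterface` |
| `tests/unit_tests/test_dns_aid.py` | Unit tests with mocked dns_aid calls |
| `docs/source/dns_aid.md` | Documentation |

Design decisions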
- The `DNS_AID_ENABLED=true` env var and `DnsAidConfig(enabled=True)` must both be set. The env var acts as a global kill switch.
- `DnsAidConfig.port` must be set by the user (e.g. load balancer/gateway port), since Monarch services use actor RPC with no auto-detected listener port.
- `dns-aid` is imported lazily on first use with a cached result, so the missing-package warning only fires once.
- `discover_peers()` retries with backoff (max 5 attempts) to handle startup races. The `retry_on_empty` param controls whether empty-but-successful responses trigger retries.

Test plan
- `pytest tests/unit_tests/test_dns_aid.py -v`: all 19 tests pass
- `flake8 --config=.flake8` on all changed files: clean
- `ufmt check` on all changed files: formatted
- `pytest tests/unit_tests/ -v`
- `DNS_AID_ENABLED` is unset

Closes #771
Depends-on: dns-aid (`pip install "dns-aid>=0.12.0"`)