Master cluster join#68576
Open
dwoz wants to merge 15 commits intosaltstack:3008.xfrom
Open
Conversation
twangboy
previously approved these changes
Apr 21, 2026
Implement automatic discovery and joining for master cluster nodes. Masters can now join an existing cluster without pre-configuring all peer keys, enabling dynamic cluster scaling. Changes: - Add peer discovery protocol with signed messages and token-based replay protection - Implement secure cluster join workflow with cluster_secret validation - Add automatic synchronization of cluster, peer, and minion keys - Support dynamic pusher creation for newly discovered peers - Refactor key generation to support in-memory key operations - Add PrivateKeyString and PublicKeyString classes for key handling - Allow cluster_id configuration with cluster_secret for autoscale The join protocol: 1. New master sends signed discover message to bootstrap peer 2. Bootstrap peer responds with cluster public key 3. New master validates and sends encrypted cluster_secret 4. Bootstrap peer validates secret and replies with cluster keys 5. All cluster members notified of new peer 6. New master starts normal operation with full cluster state
Make the three-master cluster scenario (test_cluster_key_rotation) reliable by addressing several stale APIs and an IPC event-bus reconnect storm that caused the minion start event to be dropped. salt/transport/tcp.py - PublishClient.recv(timeout=...) no longer closes and reconnects the subscriber stream on every idle timeout. It now drains buffered messages first, waits with a bounded read_bytes timeout, and only reconnects on a genuine StreamClosedError. This was the root cause of the master_event_pub.ipc reconnect-every-5-seconds loop that was dropping events between disconnect and reconnect. salt/crypt.py - PublicKey.__init__ now accepts PEM bytes/str directly (matching PrivateKey and what BaseKey.from_file / from_str / MasterKeys.fetch actually pass in). Legacy callers that still pass a filesystem path continue to work via a simple PEM-header heuristic. salt/channel/server.py - Treat cluster_secret as optional: statically-configured peers do not need a shared secret (it is only required when a new node wants to dynamically join). - handle_pool_publish: fix the join-reply construction that treated a Path as a PublicKey, typo'd payload, and referenced an undefined local. It now loads the joiner's pub key, normalizes token / aes / cluster-key bytes, and encrypts correctly. - handle_pool_publish: do not overwrite peer .pub files that are already on disk (statically-configured peers have them as 0400); log a warning on mismatch instead. - handle_pool_publish: wrap cross-peer publishes in try/except so a peer that has not started yet no longer aborts the whole handler. - handle_pool_publish signature accepts only `payload` to match how TCPPuller invokes it. - send_aes_key_event: use PublicKey(str(peer_pub)) which is the correct path-based call site (from_file was incorrectly used previously). - Load master RSA key via PrivateKey.from_file(path) rather than the bytes-only PrivateKey(path) constructor. Made-with: Cursor
- salt/channel/server.py: import errno (referenced at factory() OSError handler for EADDRINUSE fallback) and apply black formatting. - salt/crypt.py: remove the duplicate PublicKey class definition that shadowed the canonical bytes-accepting class at line 386. All encrypt/verify/decrypt methods now live inside that single class, and PublicKeyString/PrivateKeyString follow after. - salt/channel/server.py (send_aes_key_event): use PublicKey.from_file(peer_pub) now that the canonical PublicKey takes PEM bytes (which is what BaseKey.from_file reads off disk). Made-with: Cursor
…t-called Addresses lint-salt CI failures: - salt/channel/server.py: remove unused `shutil` import (W0611). - salt/crypt.py: add missing `getpass` import used by _write_private and _write_public (E0602). - salt/crypt.py: suppress super-init-not-called for PrivateKeyString and PublicKeyString (W0231); they intentionally bypass the base __init__ since they load from a string instead of bytes. Made-with: Cursor
…create=False
Two unrelated but co-located crashes in the cluster init path:
salt/channel/server.py: MasterPubServerChannel.factory imported
salt.transport.tcp conditionally inside the ``if opts.get("cluster_id")``
branch. That made ``salt`` a function-local name, so on non-cluster masters
the ``else`` branch's ``salt.transport.ipc_publish_server("master", opts)``
raised UnboundLocalError before the import ever ran, preventing the master
from starting. Hoist the import to module scope.
salt/crypt.py: MasterKeys.__init__ had a dead block that tried to seed the
cluster key early via ``self.__get_keys(name="cluster", pki_dir=...)``. That
method does not exist and ``pki_dir`` is not a valid kwarg of
``find_or_create_keys``; it only crashed now because a test exercises the
path with autocreate=False. Remove the block (cluster key setup is already
handled correctly in _setup_keys() once master keys exist), and guard
check_master_shared_pub against storing a ``None`` master_pub when the
master key has not been written yet.
Made-with: Cursor
tests/pytests/unit/conftest.py: ``mocked_tcp_pub_client`` built an ``asyncio.Future()`` at fixture setup time. When a preceding test in the same session calls ``asyncio.set_event_loop(None)`` (a common teardown), the default-loop policy's ``_set_called`` flag stays True and the next fixture setup hits ``RuntimeError: There is no current event loop in thread 'MainThread'``. This caused 9 setup errors in unit 2 across tests/pytests/unit/modules/test_beacons.py and state/test_state.py. Use ``AsyncMock(return_value=True)`` for ``transport.connect`` so the fixture no longer depends on a current or default event loop. tests/pytests/functional/master/test_event_publisher.py: the test's ``leak_threshold`` was a fixed 150 MB, which is smaller than the ~263 MB baseline the EventPublisher inherits on CI after the loader tests have populated the forked-from parent. Restore the original relative threshold (``baseline + baseline * 0.5``) -- the test is checking for a leak, not absolute RSS, so it should scale with whatever the baseline happens to be on a given runner. Made-with: Cursor
Pre-commit's black hook rejected ``#and`` (missing space after ``#``) on an inline commented-out clause in Master.verify_environment. Made-with: Cursor
…ter masters
``_publish_daemon`` only initialized ``self.pushers = []`` inside the
``if self.opts.get('cluster_id')`` branch, but ``publish_payload``
iterates ``self.pushers`` on every event regardless of cluster mode.
Non-cluster masters therefore raised ``AttributeError: 'MasterPubServerChannel'
object has no attribute 'pushers'`` out of the publish coroutine on
every event. The preceding ``asyncio.create_task(...)`` for the local
publish had already been scheduled, so most events still reached
subscribers and most tests passed -- but events occasionally races
poorly and the subscriber never gets the return, leaving
``LocalClient.get_iter_returns`` to spin until its ~90s timeout.
The original baseline commit (78feacf) initialized ``self.pushers``
unconditionally at the top of ``_publish_daemon``; that got moved into
the cluster branch during later refactoring. Restore the unconditional
init so non-cluster masters also have an empty list to iterate.
Also drop a leftover debug ``log.warning("SEND AES KEY EVENT ...")``
with ``traceback.format_stack`` that was spamming every master's logs.
Fixes the Debian 13 integration zeromq 1 timeouts in CI run 24641186096:
- tests/integration/loader/test_ext_modules.py::LoaderOverridesTest::test_overridden_internal
- tests/integration/modules/test_cp.py::CPModuleTest::test_get_file_str_local
- tests/integration/modules/test_cp.py::CPModuleTest::test_get_url_file_no_dest
- tests/integration/modules/test_mine.py::MineTest::test_mine_delete
- tests/integration/modules/test_mine.py::MineTest::test_mine_flush
- tests/integration/modules/test_mine.py::MineTest::test_send
- tests/integration/modules/test_status.py::StatusModuleTest::test_status_procs
- tests/integration/modules/test_sysctl.py::SysctlModuleTest::test_show
- tests/integration/modules/test_sysctl.py::SysctlModuleTest::test_show_linux
- tests/integration/modules/test_test.py::TestModuleTest::test_get_opts
Made-with: Cursor
``fileclient.Client.get_url`` called ``ftp.connect(host, port)`` with no timeout. Python's ``ftplib.FTP.connect`` then calls ``socket.create_connection`` with no timeout, which walks ``getaddrinfo`` results in order and blocks until each address's TCP SYN exchange exhausts kernel retransmits (~2+ minutes) before falling through to the next result. ``ftp.freebsd.org`` now publishes both ``A`` and ``AAAA`` records. On hosts with a v6 address but no working upstream route to ``2001:5a8:601:4b::/48`` the SYN is silently dropped and the v6 attempt hangs indefinitely, so ``tests/integration/modules/test_cp.py`` ``::CPModuleTest::test_get_url_ftp`` -- and any real minion doing an ``ftp://`` ``cp.get_url`` against a dual-stack host -- stalls well past any reasonable job timeout. Pass an explicit timeout (configurable via the new ``fileserver_ftp_timeout`` opt, default 30s) so an unreachable address-family falls through to the next ``getaddrinfo`` result promptly and ``cp.get_url`` either succeeds via IPv4 or returns a clear ``MinionError`` rather than hanging the minion process. Locally, ``test_get_url_ftp`` and ``test_get_url_https`` now both pass in ~45s on a dual-stack box where they previously hung for the full test deadline. Made-with: Cursor
OptsDict.pop() previously returned the value for keys stored in the
parent chain but left them visible to subsequent iteration and ``in``
checks, breaking the ``dict.pop`` contract. salt.cache.mysql_cache
relies on ``opts.pop("mysql.table_name", ...)`` actually deleting the
key so the follow-up ``for k in opts: if k.startswith("mysql.")`` loop
does not re-emit it as a ``table_name`` kwarg to
``MySQLdb.connect``, which surfaced in CI as
``TypeError: Connection.__init__() got an unexpected keyword argument
'table_name'`` for every ``test_mysql.py::test_caching[...]``
parametrization. Mask parent-chain pops with the existing ``_DELETED``
sentinel (same mechanism ``__delitem__`` uses).
Also replace the fragile ``/proc/<pid>/fd`` count-delta assertion in
``test_subprocess_list_fds`` with a direct check on the
``Popen.sentinel`` fd's ``/proc/<pid>/fd/<sentinel>`` symlink target.
The count-delta approach was masking the pipe allocation on
long-running pytest workers (CI saw ``assert 706 == (706 + 2)``)
whenever unrelated GC/finalizer activity closed two fds between the
two measurements; the sentinel-target check is robust to that noise
while still catching real ``SubprocessList.cleanup`` regressions.
Made-with: Cursor
Adds operator-facing documentation for the master cluster dynamic-join flow that landed in this branch, along with a small hardening of the cluster-pub pinning code it relies on. Docs: - Add a ``Dynamic Join`` section to the master-cluster tutorial that walks through the discover -> discover-reply -> join -> join-reply / join-notify handshake, a minimal joining-master config, security considerations for ``cluster_secret``, and how to decommission a peer. This documents the flow exercised by the new ``test_fourth_master_joins_existing_cluster`` scenario test. - Add conf_master entries for ``cluster_secret`` and ``cluster_pub_fingerprint`` in the master configuration reference. cluster_pub fingerprint: - The existing cluster-pub pinning code used SHA-1, was misnamed ``cluster_pub_signature`` (it is a digest, not a signature), was missing from ``VALID_OPTS`` / ``DEFAULT_MASTER_OPTS``, and had a typo (``clsuter_pub_signature``) in the comparison that would ``KeyError`` the handler whenever an operator actually enabled it. That option is dead on arrival today but the no-shared-filesystem topology on our roadmap needs something like it. - Rename to ``cluster_pub_fingerprint`` and switch to SHA-256 with a constant-time compare via ``hmac.compare_digest``. Accept either ``str`` or ``bytes`` for the PEM input. The comparison is case-insensitive but requires the full 64-char hex digest -- truncated prefixes are rejected. - Drop the ``"No cluster signature provided, trusting ..."`` warning that fired on every discover-reply in normal operation. The mismatch path now logs which peer sent the bad reply so operators can trace it. - Register both ``cluster_pub_fingerprint`` and ``cluster_secret`` in ``VALID_OPTS`` / ``DEFAULT_MASTER_OPTS``; ``cluster_secret`` was already referenced by ``config/__init__.py`` and ``channel/server.py`` but was never declared in the schema. Tests: - Add ``TestClusterPubFingerprint`` (7 unit tests) covering: unset / ``None`` / empty option (TOFU), matching digest (including mixed-case), mismatched digest, ``bytes`` PEM input, explicit SHA-1 regression guard, and truncated-prefix rejection. Made-with: Cursor
Collapse the verbose Dynamic Join tutorial sections into a single intro + config + security-notes + peer-removal block, and shorten the cluster_secret / cluster_pub_fingerprint conf entries to a paragraph each. Net -131 lines, same content coverage. Made-with: Cursor
The ``cluster/peer/join`` handler carried a block of commented code that was meant to populate a ``peers`` / ``minions`` dictionary into the join-reply payload -- the start of a no-shared-filesystem bootstrap. It referenced a non-existent ``reply`` variable and paired with a consumer (``cluster/peer/join-reply`` handler) that does not yet unpack the signed envelope, so it would not compile or work if re-enabled. Replace the dead lines with a short XXX noting the gap for whoever picks up the no-shared-filesystem work. Made-with: Cursor
…ments Extend PoolRoutingChannel for inline clear auth and post_fork crypticle handling consistent with ReqServerChannel, and tighten routing guards. Bump warn_until codes from 3008 to 3009 on transport shims and related utilities. Refresh static CI and packaging requirement pins. Update tests for salt-run version parity, nxos module API changes, multimaster failover timing, netapi conftest, and minor lint fixes. Made-with: Cursor
…uple A trailing comma after .format() wrapped the deprecation text in a tuple, breaking tests that match the exception message when RAISE_DEPRECATIONS_RUNTIME_ERRORS=1.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
What issues does this PR fix or reference?
Fixes
Previous Behavior
Remove this section if not relevant
New Behavior
Remove this section if not relevant
Merge requirements satisfied?
[NOTICE] Bug fixes or features added to Salt require tests.
Commits signed with GPG?
Yes/No