
[wip] Master cluster join#68576

Open
dwoz wants to merge 13 commits into saltstack:master from dwoz:auto_scale

Conversation

@dwoz
Contributor

@dwoz dwoz commented Jan 2, 2026

What does this PR do?

What issues does this PR fix or reference?

Fixes

Previous Behavior

Remove this section if not relevant

New Behavior

Remove this section if not relevant

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

Commits signed with GPG?

Yes/No

@dwoz dwoz requested a review from a team as a code owner January 2, 2026 22:04
dwoz and others added 13 commits April 19, 2026 20:21
Implement automatic discovery and joining for master cluster nodes.
Masters can now join an existing cluster without pre-configuring all
peer keys, enabling dynamic cluster scaling.

Changes:
- Add peer discovery protocol with signed messages and token-based
  replay protection
- Implement secure cluster join workflow with cluster_secret validation
- Add automatic synchronization of cluster, peer, and minion keys
- Support dynamic pusher creation for newly discovered peers
- Refactor key generation to support in-memory key operations
- Add PrivateKeyString and PublicKeyString classes for key handling
- Allow cluster_id configuration with cluster_secret for autoscale

The join protocol:
1. New master sends signed discover message to bootstrap peer
2. Bootstrap peer responds with cluster public key
3. New master validates and sends encrypted cluster_secret
4. Bootstrap peer validates secret and replies with cluster keys
5. All cluster members notified of new peer
6. New master starts normal operation with full cluster state
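
A minimal sketch of the token-based replay protection that step 1 relies on, assuming an HMAC over a one-time token plus timestamp; the function names and wire format here are illustrative assumptions, not Salt's actual implementation:

```python
# Hypothetical discover-message signing/validation; NOT salt's real code.
import hashlib
import hmac
import os
import time

def make_discover(cluster_secret: bytes) -> dict:
    # Step 1: the joining master signs a discover message carrying a
    # one-time token so replies cannot be replayed later.
    token = os.urandom(16).hex()
    ts = int(time.time())
    body = f"discover:{token}:{ts}".encode()
    sig = hmac.new(cluster_secret, body, hashlib.sha256).hexdigest()
    return {"token": token, "ts": ts, "sig": sig}

def verify_discover(cluster_secret: bytes, msg: dict, seen: set, max_age=60) -> bool:
    # The bootstrap peer checks the signature, freshness, and that the
    # token has not been seen before (replay protection).
    body = f"discover:{msg['token']}:{msg['ts']}".encode()
    want = hmac.new(cluster_secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(want, msg["sig"]):
        return False
    if msg["token"] in seen or int(time.time()) - msg["ts"] > max_age:
        return False
    seen.add(msg["token"])
    return True
```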
Make the three-master cluster scenario (test_cluster_key_rotation) reliable
by addressing several stale APIs and an IPC event-bus reconnect storm that
caused the minion start event to be dropped.

salt/transport/tcp.py
- PublishClient.recv(timeout=...) no longer closes and reconnects the
  subscriber stream on every idle timeout. It now drains buffered messages
  first, waits with a bounded read_bytes timeout, and only reconnects on a
  genuine StreamClosedError. This was the root cause of the
  master_event_pub.ipc reconnect-every-5-seconds loop that was dropping
  events between disconnect and reconnect.
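
The drain-first-then-bounded-wait shape can be sketched generically, with an asyncio.Queue standing in for the subscriber stream (BufferedClient is an illustrative stand-in, not PublishClient):

```python
import asyncio

class BufferedClient:
    """Stand-in subscriber; an asyncio.Queue plays the role of the stream."""

    def __init__(self):
        self.buffered = asyncio.Queue()  # messages already parsed off the wire
        self.stream_closed = False

    async def recv(self, timeout=None):
        # 1) Drain any already-buffered message before waiting on the stream.
        if not self.buffered.empty():
            return self.buffered.get_nowait()
        if self.stream_closed:
            raise ConnectionError("reconnect only on a genuine stream close")
        # 2) Bounded wait; an idle timeout returns None and keeps the
        #    connection instead of tearing it down and reconnecting.
        try:
            return await asyncio.wait_for(self.buffered.get(), timeout)
        except asyncio.TimeoutError:
            return None

async def demo():
    client = BufferedClient()
    await client.buffered.put("msg1")
    first = await client.recv(timeout=0.05)  # drained from the buffer
    idle = await client.recv(timeout=0.05)   # idle timeout -> None, no reconnect
    return first, idle
```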

salt/crypt.py
- PublicKey.__init__ now accepts PEM bytes/str directly (matching
  PrivateKey and what BaseKey.from_file / from_str / MasterKeys.fetch
  actually pass in). Legacy callers that still pass a filesystem path
  continue to work via a simple PEM-header heuristic.
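
The heuristic can be sketched as follows (load_public_pem is an illustrative stand-in for the constructor logic, not the salt.crypt code):

```python
import os

def load_public_pem(key) -> bytes:
    # Accept PEM bytes/str directly; fall back to treating the argument
    # as a filesystem path for legacy callers.
    if isinstance(key, str):
        key = key.encode()
    if key.lstrip().startswith(b"-----BEGIN"):
        return key  # already PEM material
    with open(os.fsdecode(key), "rb") as fh:  # legacy: path on disk
        return fh.read()
```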

salt/channel/server.py
- Treat cluster_secret as optional: statically-configured peers do not
  need a shared secret (it is only required when a new node wants to
  dynamically join).
- handle_pool_publish: fix the join-reply construction that treated a
  Path as a PublicKey, typo'd payload, and referenced an undefined
  local. It now loads the joiner's pub key, normalizes token / aes /
  cluster-key bytes, and encrypts correctly.
- handle_pool_publish: do not overwrite peer .pub files that are already
  on disk (statically-configured peers have them as 0400); log a warning
  on mismatch instead.
- handle_pool_publish: wrap cross-peer publishes in try/except so a peer
  that has not started yet no longer aborts the whole handler.
- handle_pool_publish signature accepts only `payload` to match how
  TCPPuller invokes it.
- send_aes_key_event: use PublicKey(str(peer_pub)) which is the correct
  path-based call site (from_file was incorrectly used previously).
- Load master RSA key via PrivateKey.from_file(path) rather than the
  bytes-only PrivateKey(path) constructor.
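
The try/except wrapping of cross-peer publishes has roughly this shape (publish_to_peers is an illustrative stand-in for the loop inside handle_pool_publish):

```python
import logging

log = logging.getLogger(__name__)

def publish_to_peers(peers, payload):
    """Publish to every peer; a down peer must not abort the handler."""
    delivered = []
    for peer in peers:
        try:
            delivered.append(peer.publish(payload))
        except Exception as exc:  # pylint: disable=broad-except
            # A peer that has not started yet: log and keep going.
            log.warning("publish to %r failed: %s", peer, exc)
    return delivered
```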

Made-with: Cursor
- salt/channel/server.py: import errno (referenced at factory() OSError
  handler for EADDRINUSE fallback) and apply black formatting.
- salt/crypt.py: remove the duplicate PublicKey class definition that
  shadowed the canonical bytes-accepting class at line 386. All
  encrypt/verify/decrypt methods now live inside that single class,
  and PublicKeyString/PrivateKeyString follow after.
- salt/channel/server.py (send_aes_key_event): use
  PublicKey.from_file(peer_pub) now that the canonical PublicKey takes
  PEM bytes (which is what BaseKey.from_file reads off disk).

Made-with: Cursor
…t-called

Addresses lint-salt CI failures:
- salt/channel/server.py: remove unused `shutil` import (W0611).
- salt/crypt.py: add missing `getpass` import used by _write_private and
  _write_public (E0602).
- salt/crypt.py: suppress super-init-not-called for PrivateKeyString and
  PublicKeyString (W0231); they intentionally bypass the base __init__
  since they load from a string instead of bytes.
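
The suppression pattern looks like this in miniature (class bodies here are illustrative, only the names mirror the commit):

```python
class PrivateKey:
    def __init__(self, pem_bytes: bytes):
        self.pem = pem_bytes

class PrivateKeyString(PrivateKey):
    # pylint: disable=super-init-not-called
    def __init__(self, pem_str: str):
        # Intentionally bypasses PrivateKey.__init__: this variant loads
        # from a string instead of bytes, so W0231 is a false positive.
        self.pem = pem_str.encode()
```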

Made-with: Cursor
…create=False

Two unrelated but co-located crashes in the cluster init path:

salt/channel/server.py: MasterPubServerChannel.factory imported
salt.transport.tcp conditionally inside the ``if opts.get("cluster_id")``
branch. That made ``salt`` a function-local name, so on non-cluster masters
the ``else`` branch's ``salt.transport.ipc_publish_server("master", opts)``
raised UnboundLocalError before the import ever ran, preventing the master
from starting. Hoist the import to module scope.
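
The failure mode reproduces in a few lines: a branch-local import makes the name local to the whole function at compile time, so the other branch crashes before the import ever runs (json here is just a stand-in module):

```python
def factory(cluster):
    if cluster:
        import json  # function-local: "json" is now local to factory()
        return json.dumps({"mode": "cluster"})
    else:
        # Raises UnboundLocalError: "json" is local but unbound on this path,
        # even if a module-level "json" exists.
        return json.dumps({"mode": "single"})
```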

salt/crypt.py: MasterKeys.__init__ had a dead block that tried to seed the
cluster key early via ``self.__get_keys(name="cluster", pki_dir=...)``. That
method does not exist and ``pki_dir`` is not a valid kwarg of
``find_or_create_keys``; it only crashed now because a test exercises the
path with autocreate=False. Remove the block (cluster key setup is already
handled correctly in _setup_keys() once master keys exist), and guard
check_master_shared_pub against storing a ``None`` master_pub when the
master key has not been written yet.

Made-with: Cursor
…ressure

salt/transport/tcp.py: PublishClient.recv used ``asyncio.wait_for`` to apply
a timeout to an in-flight ``stream.read_bytes(...)``. On timeout the inner
coroutine was cancelled, which left Tornado's IOStream._read_future set --
Tornado does not reset it on external cancellation. The next recv() call
then hit ``assert self._read_future is None, "Already reading"`` in
_start_read(). This broke test_minion_manager_async_stop and, in
functional tests, caused test_minion_send_req_async to hang for 90s
instead of honoring its 10s timeout.

Rework recv() to keep a persistent per-client read task and wait on it
with ``asyncio.wait`` (which does not cancel on timeout) or
``asyncio.shield`` (for the no-timeout branch). On timeout, the in-flight
read is left running so the next recv() picks it up and no message is
dropped. Clean up the task in close(). The existing non-blocking
``timeout=0`` semantics (peek + at most one read) are preserved.
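
The key property, that ``asyncio.wait`` leaves the task running on timeout, can be sketched with a stand-in client (PeekClient and slow_read are illustrative, not the tcp.py code):

```python
import asyncio

class PeekClient:
    """Stand-in for the persistent read-task pattern; not Salt's real class."""

    def __init__(self, read_coro_factory):
        self._read = read_coro_factory  # stands in for stream.read_bytes(...)
        self._task = None

    async def recv(self, timeout=None):
        if self._task is None or self._task.done():
            self._task = asyncio.ensure_future(self._read())
        # asyncio.wait does NOT cancel the task on timeout, so the in-flight
        # read survives into the next recv() and no message is dropped.
        done, _pending = await asyncio.wait({self._task}, timeout=timeout)
        if self._task in done:
            msg, self._task = self._task.result(), None
            return msg
        return None

async def demo():
    async def slow_read():
        await asyncio.sleep(0.2)  # a read that outlives the first timeout
        return b"payload"

    client = PeekClient(slow_read)
    first = await client.recv(timeout=0.05)  # times out; read kept in flight
    second = await client.recv(timeout=2)    # the same read completes here
    return first, second
```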

salt/transport/ipc.py: restore the backpressure check in
IPCMessagePublisher.publish(): skip ``spawn_callback(self._write, ...)``
when the stream is already writing. Without it, pending write coroutines
pile up in the event loop for slow/non-consuming subscribers, inflating
EventPublisher RSS under high-frequency event firing (this is the fix
from commit ed4b309 that was lost off this branch). Locally this
drops test_publisher_mem peak from 120 MB to 90 MB with ~0.5 MB growth
over a 60s publish run.
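
The restored guard is essentially a one-line check before scheduling; sketched here with stand-in classes (FakeStream/Publisher are illustrative, and the writing() predicate mirrors Tornado's IOStream method of that name):

```python
class FakeStream:
    def __init__(self):
        self.pending = []
        self.busy = False

    def writing(self):
        # Tornado's IOStream exposes a similar writing() predicate.
        return self.busy

class Publisher:
    def __init__(self, streams):
        self.streams = streams

    def publish(self, msg):
        for stream in self.streams:
            if stream.writing():
                continue  # backpressure: skip the extra spawn_callback
            stream.pending.append(msg)
```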

Made-with: Cursor
tests/pytests/unit/conftest.py: ``mocked_tcp_pub_client`` built an
``asyncio.Future()`` at fixture setup time. When a preceding test in the
same session calls ``asyncio.set_event_loop(None)`` (a common teardown),
the default-loop policy's ``_set_called`` flag stays True and the next
fixture setup hits ``RuntimeError: There is no current event loop in
thread 'MainThread'``. This caused 9 setup errors in unit 2 across
tests/pytests/unit/modules/test_beacons.py and state/test_state.py.
Use ``AsyncMock(return_value=True)`` for ``transport.connect`` so the
fixture no longer depends on a current or default event loop.
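
The shape of the fix, with fixture plumbing omitted: ``AsyncMock(return_value=True)`` yields an awaitable without ever touching the current event loop, unlike constructing an ``asyncio.Future()`` at setup time.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock

def make_mocked_pub_client():
    client = MagicMock()
    # Safe even when no current/default event loop exists at setup time:
    client.transport.connect = AsyncMock(return_value=True)
    return client

async def demo():
    client = make_mocked_pub_client()
    return await client.transport.connect()
```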

tests/pytests/functional/master/test_event_publisher.py: the test's
``leak_threshold`` was a fixed 150 MB, which is smaller than the
~263 MB baseline the EventPublisher inherits on CI after the loader
tests have populated the forked-from parent. Restore the original
relative threshold (``baseline + baseline * 0.5``) -- the test is
checking for a leak, not absolute RSS, so it should scale with whatever
the baseline happens to be on a given runner.
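
The restored threshold as plain arithmetic, to make the scaling explicit:

```python
def leak_threshold(baseline_mb):
    # Relative, not absolute: tolerate 50% growth over whatever RSS
    # baseline the forked EventPublisher inherits on a given runner.
    return baseline_mb + baseline_mb * 0.5
```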

Made-with: Cursor
Pre-commit's black hook rejected ``#and`` (missing space after ``#``) on
an inline commented-out clause in Master.verify_environment.

Made-with: Cursor
…ter masters

``_publish_daemon`` only initialized ``self.pushers = []`` inside the
``if self.opts.get('cluster_id')`` branch, but ``publish_payload``
iterates ``self.pushers`` on every event regardless of cluster mode.
Non-cluster masters therefore raised ``AttributeError: 'MasterPubServerChannel'
object has no attribute 'pushers'`` out of the publish coroutine on
every event. The preceding ``asyncio.create_task(...)`` for the local
publish had already been scheduled, so most events still reached
subscribers and most tests passed -- but the race occasionally went
poorly and the subscriber never got the return, leaving
``LocalClient.get_iter_returns`` to spin until its ~90s timeout.

The original baseline commit (78feacf) initialized ``self.pushers``
unconditionally at the top of ``_publish_daemon``; that got moved into
the cluster branch during later refactoring. Restore the unconditional
init so non-cluster masters also have an empty list to iterate.
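
In miniature (the class and method names mirror the commit, the bodies are illustrative stand-ins):

```python
class MasterPubServerChannel:
    def _publish_daemon(self, cluster_id=None):
        # Unconditional init, as in the baseline: publish_payload iterates
        # self.pushers on every event regardless of cluster mode.
        self.pushers = []
        if cluster_id:
            self.pushers.append(f"pusher:{cluster_id}")

    def publish_payload(self, payload):
        # No AttributeError on non-cluster masters: the list is just empty.
        return [(pusher, payload) for pusher in self.pushers]
```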

Also drop a leftover debug ``log.warning("SEND AES KEY EVENT ...")``
with ``traceback.format_stack`` that was spamming every master's logs.

Fixes the Debian 13 integration zeromq 1 timeouts in CI run 24641186096:
  - tests/integration/loader/test_ext_modules.py::LoaderOverridesTest::test_overridden_internal
  - tests/integration/modules/test_cp.py::CPModuleTest::test_get_file_str_local
  - tests/integration/modules/test_cp.py::CPModuleTest::test_get_url_file_no_dest
  - tests/integration/modules/test_mine.py::MineTest::test_mine_delete
  - tests/integration/modules/test_mine.py::MineTest::test_mine_flush
  - tests/integration/modules/test_mine.py::MineTest::test_send
  - tests/integration/modules/test_status.py::StatusModuleTest::test_status_procs
  - tests/integration/modules/test_sysctl.py::SysctlModuleTest::test_show
  - tests/integration/modules/test_sysctl.py::SysctlModuleTest::test_show_linux
  - tests/integration/modules/test_test.py::TestModuleTest::test_get_opts

Made-with: Cursor
``PublishClient.recv(timeout=0)`` gated the read on a raw-socket
``selectors.DefaultSelector`` peek. That misses data that Tornado has
already pulled off the kernel socket into ``IOStream._read_buffer``:
``read_bytes`` would return immediately, but the kernel-level select
says "not readable" because the socket has already been drained. Every
subsequent ``recv(timeout=0)`` then returned ``None`` even though the
event was sitting right there, ready to be decoded.
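
The failure mode can be demonstrated self-contained with a socketpair: once a userspace reader drains the kernel buffer (as Tornado does into ``IOStream._read_buffer``), a selector peek reports "not readable" even though a message is sitting in userspace.

```python
import selectors
import socket

def demo():
    a, b = socket.socketpair()
    b.sendall(b"event")
    buffered = a.recv(4096)  # Tornado-style: pre-buffer off the kernel socket
    sel = selectors.DefaultSelector()
    sel.register(a, selectors.EVENT_READ)
    ready = sel.select(timeout=0)  # kernel buffer now empty -> nothing ready
    sel.close()
    a.close()
    b.close()
    return buffered, ready
```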

``LocalClient.get_returns_no_block`` polls with ``recv(timeout=0)``, so
the symptom was job return events intermittently never reaching
``LocalClient``, leaving ``get_iter_returns`` to spin until its ~90s
deadline. Most tests still passed (the event made it through before
Tornado had a chance to pre-buffer) but the ones that raced poorly --
e.g. ``tests/integration/modules/test_sysctl.py::SysctlModuleTest::test_show``
and the other ``LocalClient``-driven integration tests in CI run
24641186096 / job 72046902967 -- hung for the full 90s.

Drop the selector peek in the non-blocking path. Ensure a persistent
read task is in flight, give the ioloop up to 10ms to satisfy it, and
leave the task running across ``recv`` boundaries so cancellation can't
corrupt ``IOStream._read_future`` (the original reason for the task
rewrite). The read path itself -- ``_read_into_unpacker`` +
``_ensure_read_task`` -- is unchanged and already safe against the
AssertionError: Already reading issue that motivated the earlier
refactor.

Verified locally:
  - all 10 originally-failing integration tests in job 72046902967
    (test_sysctl/test_cp/test_mine/test_status/test_test/test_ext_modules)
    now pass under --run-slow
  - tests/pytests/unit/{channel,transport,client,crypt,utils/event}:
    278 passed / 211 skipped
  - tests/pytests/functional/{channel,master,transport/{tcp,ipc,zeromq,ws}}:
    37 passed / 53 skipped / 7 xfailed
  - tests/pytests/scenarios/{cluster,reauth,transport}: 2 passed / 1 skipped
  - tests/pytests/integration/{client,events,minion,master,grains,modules}:
    50 passed / 12 skipped
  - black/isort clean; pylint -E clean vs baseline on modified file

Made-with: Cursor
``fileclient.Client.get_url`` called ``ftp.connect(host, port)`` with
no timeout. Python's ``ftplib.FTP.connect`` then calls
``socket.create_connection`` with no timeout, which walks
``getaddrinfo`` results in order and blocks until each address's TCP
SYN exchange exhausts kernel retransmits (~2+ minutes) before falling
through to the next result.

``ftp.freebsd.org`` now publishes both ``A`` and ``AAAA`` records. On
hosts with a v6 address but no working upstream route to
``2001:5a8:601:4b::/48`` the SYN is silently dropped and the v6
attempt hangs indefinitely, so ``tests/integration/modules/test_cp.py``
``::CPModuleTest::test_get_url_ftp`` -- and any real minion doing an
``ftp://`` ``cp.get_url`` against a dual-stack host -- stalls well
past any reasonable job timeout.

Pass an explicit timeout (configurable via the new
``fileserver_ftp_timeout`` opt, default 30s) so an unreachable
address-family falls through to the next ``getaddrinfo`` result
promptly and ``cp.get_url`` either succeeds via IPv4 or returns a
clear ``MinionError`` rather than hanging the minion process.
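
Sketched below; the ``fileserver_ftp_timeout`` opt name and 30s default come from this commit, while ``fetch_via_ftp`` is an illustrative stand-in for the relevant part of ``fileclient.Client.get_url``:

```python
import ftplib

def ftp_timeout(opts):
    # Default 30s; an unreachable address family now fails within this
    # window so socket.create_connection falls through to the next
    # getaddrinfo result (e.g. IPv4 after a black-holed IPv6 attempt).
    return opts.get("fileserver_ftp_timeout", 30)

def fetch_via_ftp(opts, host, port=21):
    ftp = ftplib.FTP()
    # ftplib.FTP.connect forwards timeout to socket.create_connection.
    ftp.connect(host, port, timeout=ftp_timeout(opts))
    return ftp
```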

Locally, ``test_get_url_ftp`` and ``test_get_url_https`` now both
pass in ~45s on a dual-stack box where they previously hung for the
full test deadline.

Made-with: Cursor
The ``selectors.DefaultSelector()`` peek in
``PublishClient.recv(timeout=0)`` was removed in
bbce5b3, leaving the module-level ``import selectors`` as
the only remaining reference. Remove it to satisfy pylint W0611
(unused-import).

Made-with: Cursor
f95f354 dropped ``PublishClient.recv(timeout=0)``'s
``selectors.DefaultSelector`` peek -- it missed data that Tornado had
already pulled into ``IOStream._read_buffer``, making
``LocalClient.get_returns_no_block`` miss job returns. 149f41d
then dropped the ``import selectors`` that had become unused.

``test_recv_timeout_zero`` still patched
``salt.transport.tcp.selectors.DefaultSelector`` and asserted
``register``/``unregister`` calls on it, so after the rebase it now
fails with ``ModuleNotFoundError: No module named
'salt.transport.tcp.selectors'``.

Rewrite the test to exercise the new contract: ``recv(timeout=0)``
must return ``None`` when nothing is buffered and the in-flight read
task does not complete within its short non-blocking wait, and must
NOT cancel that task (cancelling corrupts
``tornado.iostream.IOStream._read_future``). The replacement stubs
``stream.read_bytes`` with a future that never completes, asserts
``recv(timeout=0)`` returns ``None``, and verifies the persistent
``_read_task`` is still alive afterwards.
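
The contract can be checked against a stand-in client (FakeClient here is illustrative, not ``salt.transport.tcp.PublishClient``):

```python
import asyncio

class FakeClient:
    def __init__(self, read_future):
        self._read_future = read_future
        self._read_task = None

    async def recv(self, timeout=None):
        if self._read_task is None:
            self._read_task = asyncio.ensure_future(self._read_future)
        # Non-blocking path: give the loop a short window; never cancel.
        wait_for = 0.01 if timeout == 0 else timeout
        done, _ = await asyncio.wait({self._read_task}, timeout=wait_for)
        return self._read_task.result() if self._read_task in done else None

async def run_check():
    loop = asyncio.get_running_loop()
    never = loop.create_future()  # stubbed read_bytes that never completes
    client = FakeClient(never)
    result = await client.recv(timeout=0)
    # The in-flight read must survive: neither done nor cancelled.
    alive = not client._read_task.done() and not client._read_task.cancelled()
    never.cancel()  # test cleanup only
    return result, alive
```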

Made-with: Cursor

Labels

test:full Run the full test suite
