[wip] tuneable worker pools #68532

Open
dwoz wants to merge 8 commits into saltstack:master from dwoz:tunnable-mworkers

Conversation


@dwoz dwoz commented Dec 13, 2025

What does this PR do?

What issues does this PR fix or reference?

Fixes

Previous Behavior

Remove this section if not relevant

New Behavior

Remove this section if not relevant

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

Commits signed with GPG?

Yes/No

@dwoz dwoz requested a review from a team as a code owner December 13, 2025 22:48
@dwoz dwoz added the test:full Run the full test suite label Dec 13, 2025
dwoz added 8 commits April 18, 2026 23:08
This commit introduces the Worker Pool Routing architecture, allowing
the Salt Master to route incoming requests to dedicated worker pools via IPC.
This significantly improves master scaling and isolation for different workloads.

Key architectural changes and fixes:
- Implement `PoolRoutingChannel` as the core router, replacing the legacy
  routing mechanism when worker pools are enabled.
- Add native Unix Domain Socket (IPC) support for TCP and WebSocket transports,
  eliminating the need for TCP IPC fallbacks.
- Unify payload decryption and command extraction logic with `RequestRouter`.
- Enforce `minimum_auth_version` in the routing layer to prevent downgrade attacks.

Stability and compatibility fixes:
- Fix Windows multiprocessing compatibility by resolving `RequestServer` and
  `PublishServer` pickling/unpickling errors during process spawning.
- Resolve recursive routing loops in `MWorker` processes.
- Increase SSL handshake timeouts in TCP and WS transports to prevent
  spurious CI failures under load.
- Fix test isolation issues (e.g., `salt_minion_2` key cleanup).
- Update test fixtures to explicitly disable worker pools where the
  internal `ReqServerChannel` methods are being tested directly.

Made-with: Cursor
The OPTIMIZED_WORKER_POOLS preset and its gating option
worker_pools_optimized were never wired up to real usage and the preset
itself is incomplete (missing commands like _auth, _minion_event, and
_file_envs, and no '*' catchall). Rather than fix a preset nobody
enables, drop it entirely. Users who want named pools should define
their own worker_pools with an explicit catchall or worker_pool_default.

Made-with: Cursor
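
As context for the catchall recommendation above, here is a hypothetical sketch of what a user-defined pool mapping could look like. Only the option names `worker_pools` and `worker_pool_default` come from this commit message; the pool names and the `size`/`commands` keys are illustrative assumptions, not the PR's actual schema.

```python
# Hypothetical sketch -- not the exact schema from this PR. Only the option
# names worker_pools / worker_pool_default appear in the commit message; the
# pool names and the size/commands keys are illustrative assumptions.
master_opts = {
    "worker_pools": {
        "auth": {"size": 4, "commands": ["_auth"]},
        "general": {"size": 8, "commands": ["*"]},  # explicit catchall
    },
    # Alternatively, route unmatched commands via a default pool:
    # "worker_pool_default": "general",
}

def pool_for(cmd, opts):
    # Route a command to the first pool that lists it explicitly,
    # falling back to a '*' catchall, then to worker_pool_default.
    pools = opts["worker_pools"]
    for name, cfg in pools.items():
        if cmd in cfg["commands"]:
            return name
    for name, cfg in pools.items():
        if "*" in cfg["commands"]:
            return name
    return opts.get("worker_pool_default")

print(pool_for("_auth", master_opts))       # auth
print(pool_for("_file_envs", master_opts))  # general (via catchall)
```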
Clarify in docstrings that ReqServerChannel.factory() returns one of two
mutually exclusive channel implementations, and that _auth runs exactly
once regardless of which path is active:

- Non-pooled path (worker_pools_enabled=False): ReqServerChannel.handle_message()
  intercepts _auth inline and handles it without involving a worker.
- Pooled path (worker_pools_enabled=True): PoolRoutingChannel routes _auth
  like any other command to the mapped pool, where a worker dispatches it
  via ClearFuncs._auth. No inline interception happens here.

Made-with: Cursor
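
The mutually exclusive factory behavior described above can be sketched as follows. This is a simplified illustration of the shape of the dispatch, not Salt's actual implementation; the string return values are placeholders for the real handling paths.

```python
# Simplified sketch (not Salt's actual code) of a factory returning one of
# two mutually exclusive channel implementations based on an opts flag.

class ReqServerChannel:
    """Non-pooled path: _auth is intercepted inline, no worker involved."""

    def handle_message(self, cmd):
        if cmd == "_auth":
            return "handled inline"
        return "dispatched to a worker"

class PoolRoutingChannel:
    """Pooled path: every command, including _auth, is routed to a pool."""

    def handle_message(self, cmd):
        return f"routed {cmd} to its mapped pool"

def factory(opts):
    # Exactly one implementation is active for a given master config,
    # so _auth runs exactly once regardless of which path is chosen.
    if opts.get("worker_pools_enabled"):
        return PoolRoutingChannel()
    return ReqServerChannel()

print(type(factory({"worker_pools_enabled": True})).__name__)  # PoolRoutingChannel
print(type(factory({})).__name__)                              # ReqServerChannel
```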
Two tests in tests/pytests/functional/channel/test_worker_pool_starvation.py
demonstrate the value of pool routing:

- test_auth_starved_without_routing (xfail strict): with worker_pools
  disabled, a slow ext pillar saturates all workers and a new minion's
  _auth request times out.
- test_auth_not_starved_with_routing: with a dedicated 'auth' pool
  mapped to _auth, the same saturation scenario does not block auth and
  the new minion authenticates within the timeout.

Made-with: Cursor
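
The starvation scenario those two tests exercise can be modeled in miniature with thread pools. This is a toy analogy, not Salt code: `slow_pillar` stands in for a slow ext pillar render and `auth` for a new minion's `_auth` request.

```python
# Toy model (not Salt code) of worker-pool starvation: with one shared pool,
# slow work blocks auth; with a dedicated auth pool, auth stays responsive.
import concurrent.futures as cf
import time

def slow_pillar():          # stands in for a slow ext_pillar render
    time.sleep(0.5)

def auth():                 # stands in for a minion _auth request
    return "authenticated"

# Shared pool: slow work saturates all workers, so auth queues behind it.
shared = cf.ThreadPoolExecutor(max_workers=2)
for _ in range(2):
    shared.submit(slow_pillar)
t0 = time.monotonic()
assert shared.submit(auth).result() == "authenticated"
shared_wait = time.monotonic() - t0      # ~0.5s: auth was starved

# Dedicated pools: auth has reserved capacity and completes immediately.
work = cf.ThreadPoolExecutor(max_workers=2)
auth_pool = cf.ThreadPoolExecutor(max_workers=1)
for _ in range(2):
    work.submit(slow_pillar)
t0 = time.monotonic()
assert auth_pool.submit(auth).result() == "authenticated"
dedicated_wait = time.monotonic() - t0   # well under the slow-task duration

print(f"shared: {shared_wait:.2f}s, dedicated: {dedicated_wait:.2f}s")
```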
…ressure

salt/transport/tcp.py: PublishClient.recv used ``asyncio.wait_for`` to apply
a timeout to an in-flight ``stream.read_bytes(...)``. On timeout the inner
coroutine was cancelled, which left Tornado's IOStream._read_future set --
Tornado does not reset it on external cancellation. The next recv() call
then hit ``assert self._read_future is None, "Already reading"`` in
_start_read(). This broke test_minion_manager_async_stop and, in
functional tests, caused test_minion_send_req_async to hang for 90s
instead of honoring its 10s timeout.

Rework recv() to keep a persistent per-client read task and wait on it
with ``asyncio.wait`` (which does not cancel on timeout) or
``asyncio.shield`` (for the no-timeout branch). On timeout, the in-flight
read is left running so the next recv() picks it up and no message is
dropped. Clean up the task in close(). The existing non-blocking
``timeout=0`` semantics (peek + at most one read) are preserved.
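
The key asyncio distinction the rework relies on can be shown in isolation: ``asyncio.wait_for`` cancels the awaited task on timeout, while ``asyncio.wait`` leaves it pending, so a later call can still collect its result. A minimal sketch, with ``asyncio.sleep`` standing in for ``stream.read_bytes(...)``:

```python
# Demonstrates why the rework uses asyncio.wait: a timeout leaves the
# in-flight read task running, so the next recv() picks it up and no
# message is dropped.
import asyncio

async def main():
    async def read_once():
        await asyncio.sleep(0.2)   # stands in for stream.read_bytes(...)
        return b"payload"

    task = asyncio.ensure_future(read_once())

    # asyncio.wait with a timeout does NOT cancel the pending task.
    done, pending = await asyncio.wait({task}, timeout=0.05)
    assert task in pending and not task.cancelled()

    # A later "recv" waits on the same in-flight read; no data is lost.
    return await task

result = asyncio.run(main())
print(result)   # b'payload'
```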

salt/transport/ipc.py: restore the backpressure check in
IPCMessagePublisher.publish(): skip ``spawn_callback(self._write, ...)``
when the stream is already writing. Without it, pending write coroutines
pile up in the event loop for slow/non-consuming subscribers, inflating
EventPublisher RSS under high-frequency event firing (this is the fix
from commit ed4b309 that had been lost from this branch). Locally this
drops test_publisher_mem peak from 120 MB to 90 MB with ~0.5 MB growth
over a 60s publish run.
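
The backpressure check amounts to not spawning a second writer while one is already draining the stream. A simplified sketch (hypothetical names; `drain` stands in for the write coroutine completing, `spawned` counts calls that would be `spawn_callback(self._write, ...)`):

```python
# Simplified sketch (not the actual IPCMessagePublisher) of the restored
# backpressure check: while a write is in flight, publish() only queues the
# message instead of scheduling another write coroutine.

class Publisher:
    def __init__(self):
        self.queue = []
        self.writing = False   # stands in for "stream is already writing"
        self.spawned = 0       # writes scheduled on the event loop

    def publish(self, msg):
        self.queue.append(msg)
        if self.writing:
            return             # backpressure: the active writer will drain it
        self.writing = True
        self.spawned += 1      # stands in for spawn_callback(self._write, ...)

    def drain(self):
        # The single active writer eventually flushes the whole queue.
        sent, self.queue = self.queue, []
        self.writing = False
        return sent

p = Publisher()
for i in range(1000):
    p.publish(i)
print(p.spawned)   # 1: one scheduled write for 1000 publishes
```

Without the `self.writing` check, each of the 1000 publishes would schedule its own pending write, which is the coroutine pile-up described above.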

Made-with: Cursor
``PublishClient.recv(timeout=0)`` gated the read on a raw-socket
``selectors.DefaultSelector`` peek. That misses data that Tornado has
already pulled off the kernel socket into ``IOStream._read_buffer``:
``read_bytes`` would return immediately, but the kernel-level select
says "not readable" because the socket has already been drained. Every
subsequent ``recv(timeout=0)`` then returned ``None`` even though the
event was sitting right there, ready to be decoded.

``LocalClient.get_returns_no_block`` polls with ``recv(timeout=0)``, so
the symptom was job return events intermittently never reaching
``LocalClient``, leaving ``get_iter_returns`` to spin until its ~90s
deadline. Most tests still passed (the event made it through before
Tornado had a chance to pre-buffer) but the ones that raced poorly --
e.g. ``tests/integration/modules/test_sysctl.py::SysctlModuleTest::test_show``
and the other ``LocalClient``-driven integration tests in CI run
24641186096 / job 72046902967 -- hung for the full 90s.

Drop the selector peek in the non-blocking path. Ensure a persistent
read task is in flight, give the ioloop up to 10ms to satisfy it, and
leave the task running across ``recv`` boundaries so cancellation can't
corrupt ``IOStream._read_future`` (the original reason for the task
rewrite). The read path itself -- ``_read_into_unpacker`` +
``_ensure_read_task`` -- is unchanged and already safe against the
``AssertionError: Already reading`` issue that motivated the earlier
refactor.

Verified locally:
  - all 10 originally-failing integration tests in job 72046902967
    (test_sysctl/test_cp/test_mine/test_status/test_test/test_ext_modules)
    now pass under --run-slow
  - tests/pytests/unit/{channel,transport,client,crypt,utils/event}:
    278 passed / 211 skipped
  - tests/pytests/functional/{channel,master,transport/{tcp,ipc,zeromq,ws}}:
    37 passed / 53 skipped / 7 xfailed
  - tests/pytests/scenarios/{cluster,reauth,transport}: 2 passed / 1 skipped
  - tests/pytests/integration/{client,events,minion,master,grains,modules}:
    50 passed / 12 skipped
  - black/isort clean; pylint -E clean vs baseline on modified file

Made-with: Cursor
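
The non-blocking contract described above can be sketched independently of Tornado: give the event loop a short grace window to complete the persistent read task, return ``None`` on timeout, and never cancel the task so it can satisfy a later call. The helper name and timings below are illustrative, not Salt's API.

```python
# Sketch (hypothetical helper, not Salt's API) of the recv(timeout=0)
# contract: short non-blocking wait, None on timeout, task left running.
import asyncio

async def recv_nonblocking(task, grace=0.01):
    # Give the ioloop a short window to finish the in-flight read; on
    # timeout return None and leave the task running -- never cancel it.
    done, _pending = await asyncio.wait({task}, timeout=grace)
    return task.result() if task in done else None

async def main():
    async def read():                     # the persistent per-client read task
        await asyncio.sleep(0.05)
        return b"event"

    task = asyncio.ensure_future(read())
    first = await recv_nonblocking(task)   # nothing ready yet
    await asyncio.sleep(0.06)              # the event arrives in the meantime
    second = await recv_nonblocking(task)  # the same task satisfies this call
    return first, second, task.cancelled()

first, second, cancelled = asyncio.run(main())
print(first, second, cancelled)   # None b'event' False
```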
The selector-peek path in PublishClient.recv(timeout=0) was removed in
the previous commit; the `import selectors` is now unused and tripped
pylint W0611.

Made-with: Cursor
The previous commit dropped the ``import selectors`` that was the last
user of ``salt.transport.tcp.selectors``. ``test_recv_timeout_zero``
still patched ``salt.transport.tcp.selectors.DefaultSelector`` and
asserted register/unregister calls on it, so it now fails with
``ModuleNotFoundError: No module named 'salt.transport.tcp.selectors'``
across every Linux ``unit zeromq 4`` CI job.

Rewrite the test to exercise the new contract: ``recv(timeout=0)``
must return ``None`` when nothing is buffered and the in-flight read
task does not complete within its short non-blocking wait, and must
NOT cancel that task (cancelling corrupts
``tornado.iostream.IOStream._read_future``). The replacement stubs
``stream.read_bytes`` with a future that never completes, asserts
``recv(timeout=0)`` returns ``None``, and verifies the persistent
``_read_task`` is still alive afterwards.

Made-with: Cursor
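
A simplified sketch of the contract the rewritten test exercises (this is not the actual Salt test; `FakeClient` and its internals are stand-ins): stub the read with a future that never completes, assert ``recv(timeout=0)`` returns ``None``, and verify the persistent read task is still alive, uncancelled.

```python
# Simplified sketch (not the actual Salt test) of the new test's contract.
import asyncio

class FakeClient:
    def __init__(self):
        self._read_task = None

    async def _read(self):
        await asyncio.Future()   # stands in for read_bytes that never completes

    async def recv(self, timeout=None):
        if self._read_task is None:
            self._read_task = asyncio.ensure_future(self._read())
        if timeout == 0:
            done, _ = await asyncio.wait({self._read_task}, timeout=0.01)
            return self._read_task.result() if done else None

async def main():
    client = FakeClient()
    # Nothing buffered and the read never completes: recv must return None...
    assert await client.recv(timeout=0) is None
    # ...and must NOT cancel the persistent read task.
    assert not client._read_task.cancelled()
    assert not client._read_task.done()
    client._read_task.cancel()   # cleanup before the loop shuts down
    await asyncio.sleep(0)
    return "contract holds"

result = asyncio.run(main())
print(result)
```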
