Distributed cluster cache by dwoz · Pull Request #69179 · saltstack/salt

dwoz · 2026-05-16T08:12:56Z

Add multi-ring Raft consensus and reversible mmap-cache migration on top
of the existing cluster runner work:

* Multi-ring Raft groups. Each cluster data type (jobs, events, …)
  can run on its own Raft group with an independent leader/voter set
  instead of every master holding the full keyspace. New cluster-log
  state machines (RingRegistryStateMachine, RoutingStateMachine)
  track which rings exist and which data types route to them.
* Migration runners. cluster.shed_unowned drops keys this master
  no longer owns under a ring; cluster.collect_from_peers pulls them
  back. Both are operator-paced with a dry-run mode so the broadcast →
  ring-sharded → broadcast round-trip is fully reversible.
* Operator surface. cluster.ring_create, cluster.ring_destroy,
  cluster.route_set, cluster.route_clear, cluster.ring_set. All are
  routed through the publish-channel fan-out so the runner works on
  any cluster master, not just the current leader.
* CI fixes. Two integration test race conditions on slow runners,
  and disabling coverage's dynamic_context = test_function while Salt
  is on Python 3.14 (it forces the slow PyTracer path; see commit message
  on f8aef153a6c for the analysis).

Operator-driven content fan-out for file_roots and pillar_roots. Run on whichever master holds the canonical content; the cluster fans out from there to every peer over the encrypted cluster pub bus. Same transport, chunk format, and apply path as the join-time bulk state-sync.

How it works

1. Runner fires 'cluster/runner/sync_roots' to the local event bus with the requested channel list.
2. MasterPubServerChannel.publish_payload intercepts that tag -- instead of broadcasting as a regular cluster event, it spawns _run_root_sync_to_peers(channels).
3. For each peer pusher:
   * Allocate a fresh session id.
   * Send a 'cluster/peer/sync-roots-begin' event so the receiver pre-registers a state-sync session -- same contract as the join-reply flow, but the receiver's on_complete is a no-op (this is an ad-hoc push, not a Raft-learner bootstrap).
   * Stream the requested channels in the standard state-sync chunk format.
4. Receivers apply chunks via the existing _apply_state_sync_chunk path -- no new install code; the channels are file_roots and pillar_roots and install_root_chunk handles both.

Live smoke verified on a 5-master cluster: added new content to m1's srv/salt (including a nested subdirectory) and srv/pillar after the cluster was up, ran cluster.sync_roots, all 4 peers had the new files within ~8 seconds. Nested directories preserved.

What's NOT in this slice

* No completion feedback -- runner returns immediately with status='fan-out initiated'. Operators tail each peer's log for the 'state-sync ... installed N items' lines to confirm delivery.
* No re-sync trigger -- operator-driven only. No filesystem watcher, no periodic poll.

Tests

* test_sync_roots_rejects_invalid_roots -- input validation
* test_sync_roots_no_cluster_id_is_skip -- non-cluster master returns a structured skip rather than firing a meaningless event
* test_sync_roots_fires_local_event -- happy path: event fires with the resolved channel list
* test_sync_roots_file_only_filters_channels -- channel filter honoured
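
The intercept-and-fan-out flow described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `FakePusher`, the `"chunk"` tag, the payload keys, and the `read_chunks` callable are hypothetical stand-ins, and encryption and the real chunk format are left out. Only the two event tags and the returned status string come from the commit message.

```python
import uuid

SYNC_ROOTS_TAG = "cluster/runner/sync_roots"
BEGIN_TAG = "cluster/peer/sync-roots-begin"


class FakePusher:
    """Hypothetical stand-in for a peer pusher; records what it is sent."""

    def __init__(self, name):
        self.name = name
        self.sent = []

    def send(self, tag, data):
        self.sent.append((tag, data))


def run_root_sync_to_peers(pushers, channels, read_chunks):
    """Push the requested channels to every peer, one session per peer."""
    sessions = {}
    for pusher in pushers:
        session_id = uuid.uuid4().hex  # fresh session id for this peer
        sessions[pusher.name] = session_id
        # The begin event lets the receiver pre-register a session
        # before any chunk arrives.
        pusher.send(BEGIN_TAG, {"session": session_id, "channels": list(channels)})
        for channel in channels:
            for chunk in read_chunks(channel):
                # "chunk" is an illustrative tag, not the real wire format.
                pusher.send(
                    "chunk",
                    {"session": session_id, "channel": channel, "data": chunk},
                )
    return sessions


def publish_payload(tag, data, pushers, read_chunks):
    """Intercept the runner tag; broadcast every other event unchanged."""
    if tag == SYNC_ROOTS_TAG:
        run_root_sync_to_peers(pushers, data["channels"], read_chunks)
        return "fan-out initiated"
    for pusher in pushers:
        pusher.send(tag, data)
    return "broadcast"
```

Because the sketch returns as soon as the sends are queued, it mirrors the "no completion feedback" limitation listed above: the caller learns that the fan-out started, not that any peer installed the content.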

End-to-end integration coverage on the 3-master isolated-FS cluster. Complements the unit tests in tests/pytests/unit/runners/test_cluster_runner.py (event firing, input validation) with full daemon-driven runs that exercise the runner -> master event bus -> publish_payload intercept -> _run_root_sync_to_peers -> peer state-sync chunk install path.

Two scenarios:

* test_isolated_sync_roots_runner_propagates_content
  After the cluster is steady-state, write new content to master_1's file_roots and pillar_roots, run salt-run cluster.sync_roots, poll master_2 and master_3 until both files appear (or 30s elapses), and assert the marker round-trips through encrypted state-sync. Distinct from test_isolated_late_joiner_receives_file_and_pillar_roots which covers JOIN-time bulk sync.
* test_isolated_sync_roots_runner_file_only
  Pins the channels= filter: roots=file syncs only file_roots; the pillar tree on peers remains untouched. Operator escape hatch against accidentally fanning out secret pillar data when only an SLS update is intended.

Pre-existing trivial change: the commented-out warning log line at the top of publish_payload (one of those //log it all// debug crutches) is removed since the function already gets per-event log output via the cluster-event broadcast path and the targeted publish_payload branches.

Diagnosis note for future debuggers

While bringing these tests up I hit a confusing failure where the runner fired the event, master_1's daemon broadcast it as cluster/event/127.0.0.1/cluster/runner/sync_roots, but the intercept never ran. Root cause: salt-factories had cached /tmp/stsuite/scripts/cli_salt_master.py pointing at a *different* worktree (saw a stale CODE_DIR=.../masterbug entry), so the test daemons loaded an older copy of salt that didn't have the publish_payload intercept. Clearing /tmp/stsuite/scripts fixed it. Worth knowing if you see 'feature works in unit test but integration test silently misses it': check the factory cache first.

Each ring becomes its own Raft group with independent leader, log, and voter set. The cluster log carries a RingRegistryStateMachine and a RoutingStateMachine that name rings and route data types to them; per-ring membership and policy live in that ring's own log.

Operator runners drive the lifecycle end to end:

* cluster.ring_create / ring_destroy / ring_set
* cluster.route_set / route_clear / routes / rings
* cluster.shed_unowned / shed_unowned_all / shed_status
* cluster.collect_from_peers
* cluster.migrate_jobs_to_cache

shed and collect reuse the existing sync-roots wire format with a bank: channel prefix so arbitrary cache banks travel over state-sync chunks. The salt_cache returner replaces the implicit local-jobs layout with an explicit cache-backed one (jobs/loads, jobs/minions, jobs/returns/<jid>, jobs/endtimes, jobs/nocache) so the migration has a well-defined bank set to move.

Gate sites at salt/master.py call ring_membership.owns_for(opts, data_type, key); a non-member master no-ops the write and rate-limits a WARN per (data_type, ring, reason) bucket. A delegate-on-miss event gives operators a safety net for asymmetric topologies where traffic can't reach a ring member directly.

Per-ring voter-health watchdog with (group_id, peer_id) cooldown keying keeps each ring's quorum healthy independently. cluster-health.json is now structured per ring.

Sentinel writes (shed-status, collect-status, cluster-health) go through salt.utils.atomicfile.atomic_open so a concurrent reader or doubled fan-out never sees partial JSON. The publish daemon dedups self.pushers by (pull_host, pull_port) to stop the shed fan-out from being delivered twice.
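
The ownership gate, the per-bucket WARN rate limit, and the pusher dedup described above can be sketched as below. This is an illustration under stated assumptions rather than the PR's implementation: the `owns` callable is a simplified stand-in for `ring_membership.owns_for(opts, data_type, key)`, the 60-second interval and the `"not-a-member"` reason are invented for the example, and pushers are modelled as plain dicts.

```python
import logging
import time

log = logging.getLogger(__name__)

# (data_type, ring, reason) -> monotonic time of the last WARN emitted
_last_warn = {}


def warn_rate_limited(data_type, ring, reason, interval=60.0, now=None):
    """Emit at most one WARN per bucket per interval; return True if logged."""
    now = time.monotonic() if now is None else now
    bucket = (data_type, ring, reason)
    last = _last_warn.get(bucket)
    if last is not None and now - last < interval:
        return False
    _last_warn[bucket] = now
    log.warning("skipping %s write: ring=%s reason=%s", data_type, ring, reason)
    return True


def gated_write(owns, data_type, key, ring, write):
    """No-op the write on a non-member master, otherwise perform it."""
    if not owns(data_type, key):
        # "not-a-member" is a hypothetical reason label.
        warn_rate_limited(data_type, ring, "not-a-member")
        return False
    write(key)
    return True


def dedup_pushers(pushers):
    """Keep the first pusher seen for each (pull_host, pull_port) pair."""
    seen = set()
    unique = []
    for pusher in pushers:
        ident = (pusher["pull_host"], pusher["pull_port"])
        if ident not in seen:
            seen.add(ident)
            unique.append(pusher)
    return unique
```

Keying the rate limit on the whole (data_type, ring, reason) tuple means a noisy bucket cannot suppress a warning about a different ring or a different failure reason.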

The five operator-facing runners (cluster.ring_create, ring_destroy, route_set, route_clear, ring_set) propose entries through the cluster Raft log, and that only works on the leader. The runner intercept in MasterPubServerChannel.publish_payload dispatched locally only: when the operator invoked the runner on a non-leader master the propose raised RuntimeError, the exception was swallowed by log.exception, and no fan-out reached the actual leader.

The intercept now also schedules a cluster_aes-encrypted cluster/peer/multi-ring-request event to every peer. Each peer re-dispatches through _handle_multi_ring_runner_event; only the master that is the current Raft leader appends an entry, followers log "not leader" and skip. No double-commit because Raft has one leader at any moment.

A new unit-test file pins the wire shape (one publish per pusher, cluster_aes round-trip, broken-pusher tolerance, no-peers no-op, missing-secret silent skip).

Also raise per-test timeouts on slow tests that were tripping the global 90-second pytest-timeout under coverage tracing on loaded GHA runners. The workflow's automatic --lf rerun absorbs the resulting slow-but-passes case so a transient runner-load spike no longer fails the job:

* test_check_minions_grain_target_10000: 240 s (test budget is 180 s; the global 90 s default fired before the budget assertion could evaluate).
* test_r_state_apply_logical_resource_no_state_module: 180 s (three sequential state.apply runs against dummy resources).
* test_auth_not_starved_with_routing: 180 s (7-daemon saturation test).
* test_orchestration_with_pillar_dot_items: passes _timeout=120 to salt_run_cli.run() so the 30 s factory default doesn't kill the recursive saltutil.runner('pillar.show_pillar') invocation.

Tornado timing flake: test_when_is_timed_out_is_set_before_other_events_are_completed used gather_timeout=0.1 s and event_timeout=0.15 s. The 50 ms ordering window was regularly inverted by GHA scheduler jitter; bumped to 1 s and 2 s respectively, preserving the test's logical invariant.

Windows assertion drift: test_win_user_present_new_password now checks changes.get("passwd") == "XXX-REDACTED-XXX" instead of strict-equal so the new password_changed / lstchg keys surfaced by the recent shadow.info refactor don't fail the test.
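
The leader-only handling of the forwarded multi-ring request described in this commit can be sketched as follows. It is a toy model, not the PR's code: `FakeRaftNode`, `handle_multi_ring_request`, `fan_out`, and the request payload are hypothetical, and the cluster_aes encryption and the real Raft machinery are omitted.

```python
class FakeRaftNode:
    """Hypothetical stand-in for one master's view of the cluster Raft group."""

    def __init__(self, name, is_leader):
        self.name = name
        self.is_leader = is_leader
        self.log = []

    def propose(self, entry):
        # Mirrors the behaviour described above: proposing off-leader raises.
        if not self.is_leader:
            raise RuntimeError("not leader")
        self.log.append(entry)


def handle_multi_ring_request(node, request):
    """Append on the leader; followers skip instead of raising."""
    if not node.is_leader:
        return "skipped: not leader"
    node.propose(request)
    return "proposed"


def fan_out(nodes, request):
    """Deliver the same request to every master, the local one included."""
    return {node.name: handle_multi_ring_request(node, request) for node in nodes}
```

With one leader among the nodes, the request is appended exactly once no matter which master the operator ran the runner on, which is the property the commit relies on to avoid a double-commit.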

test_jobs_migration_round_trip: the assertion order raced the peer-side ``_handle_collect_request``, which streams the requested channels back to the requester sequentially (``jobs/loads`` first, then ``jobs/minions``, ``jobs/endtimes``, ``jobs/nocache``). The old flow waited up to 30 s for ``jobs/loads`` to land then immediately asserted on ``jobs/minions``. That is fine on a fast box where the whole fan-out lands in well under a second, but on a loaded GHA runner ``jobs/minions`` chunks were still in flight when the assertion fired. Wait on the operator-facing sentinel (``cluster-collect-status.json`` ``complete=True``), which is the actual end-of-collect signal, and then check every bank.

test_ring_lifecycle_on_shared_filesystem: three serial ``_wait_until`` polls (60 s baseline + 45 s registry/route + 30 s destroy), each running ``cluster.members`` as a fresh ``salt-run`` subprocess. On a 2-vCPU GHA runner the cumulative wall-clock has been observed at 95–130 s, past the global 90 s pytest-timeout default. Add ``pytest.mark.timeout(360, func_only=True)`` so the test's own predicate timeouts remain the failure signal.
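
Waiting on the collect-status sentinel before asserting on individual banks might look like the sketch below. The helper names `wait_until` and `collect_complete` are hypothetical (the test suite's own ``_wait_until`` may differ), and the sentinel is assumed to be a JSON object with a boolean ``complete`` key, as the commit message suggests.

```python
import json
import time


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate until it is truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def collect_complete(status_path):
    """Return True once the sentinel exists and reports complete=True."""
    try:
        with open(status_path, encoding="utf-8") as fh:
            return json.load(fh).get("complete") is True
    except (FileNotFoundError, json.JSONDecodeError):
        # Not written yet, or unreadable: treat as "not complete".
        return False
```

A test would call `wait_until(lambda: collect_complete(path))` once and only then check every bank, so the per-bank assertions no longer depend on the order in which channels are streamed back.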

``dynamic_context = test_function`` forces coverage.py off the sys.monitoring (sysmon) measurement core, because sysmon does not yet implement dynamic contexts. On Python 3.14 sysmon is the default and is dramatically faster than the PyTracer / CTracer fallbacks. With the coverage pin we currently carry (7.3.1), no CTracer wheel ships for 3.14 either, so dynamic_context drops the test suite all the way down to the pure-Python PyTracer.

That combination (the slow PyTracer plus relenv's runtime wrappers around sysconfig) makes every forked subprocess's ``cov.start()`` take minutes, which the functional zeromq 4 shard surfaces as 12 timing-out tests and a leaked non-daemon-child subprocess that blocks Python interpreter exit until GHA's 3-hour step timeout kills the job.

Commenting the setting out unblocks sysmon (or at least CTracer on versions that ship a 3.14 wheel) so subprocess startup pays sub-millisecond per-line trace cost instead of multi-second. The inline comment in ``.coveragerc`` captures the symptoms and the re-enable condition.

Refs:
coveragepy/coveragepy#2082
https://coverage.readthedocs.io/en/latest/contexts.html
https://coverage.readthedocs.io/en/latest/faq.html

dwoz requested a review from a team as a code owner May 16, 2026 08:12

dwoz added the test:full Run the full test suite label May 16, 2026

dwoz added the test:coverage label May 17, 2026

Daniel A. Wozniak added 6 commits May 21, 2026 01:26

dwoz wants to merge 6 commits into saltstack:3008.x from dwoz:cluster-max-voters-2
