Skip to content

Distributed cluster cache#69179

Open
dwoz wants to merge 6 commits into
saltstack:3008.xfrom
dwoz:cluster-max-voters-2
Open

Distributed cluster cache#69179
dwoz wants to merge 6 commits into
saltstack:3008.xfrom
dwoz:cluster-max-voters-2

Conversation

@dwoz
Copy link
Copy Markdown
Contributor

@dwoz dwoz commented May 16, 2026

Add multi-ring Raft consensus and reversible mmap-cache migration on top
of the existing cluster runner work:

  • Multi-ring Raft groups. Each cluster data type (jobs, events, …)
    can run on its own Raft group with an independent leader/voter set
    instead of every master holding the full keyspace. New cluster-log
    state machines (RingRegistryStateMachine, RoutingStateMachine)
    track which rings exist and which data types route to them.

  • Migration runners. cluster.shed_unowned drops keys this master
    no longer owns under a ring; cluster.collect_from_peers pulls them
    back. Both are operator-paced with a dry-run mode so the broadcast →
    ring-sharded → broadcast round-trip is fully reversible.

  • Operator surface. cluster.ring_create, cluster.ring_destroy,
    cluster.route_set, cluster.route_clear, cluster.ring_set
    routed through the publish-channel fan-out so the runner works on
    any cluster master, not just the current leader.

  • CI fixes. Two integration test race conditions on slow runners,
    and disabling coverage's dynamic_context = test_function while Salt
    is on Python 3.14 (it forces the slow PyTracer path; see commit message
    on f8aef153a6c for the analysis).

@dwoz dwoz requested a review from a team as a code owner May 16, 2026 08:12
@dwoz dwoz added the test:full Run the full test suite label May 16, 2026
Daniel A. Wozniak added 6 commits May 21, 2026 01:26
Operator-driven content fan-out for file_roots and pillar_roots.
Run on whichever master holds the canonical content; the cluster
fans out from there to every peer over the encrypted cluster pub
bus.  Same transport, chunk format, and apply path as the join-time
bulk state-sync.

How it works

1. Runner fires 'cluster/runner/sync_roots' to the local event
   bus with the requested channel list.
2. MasterPubServerChannel.publish_payload intercepts that tag --
   instead of broadcasting as a regular cluster event, it spawns
   _run_root_sync_to_peers(channels).
3. For each peer pusher:
   * Allocate a fresh session id.
   * Send a 'cluster/peer/sync-roots-begin' event so the receiver
     pre-registers a state-sync session -- same contract as the
     join-reply flow, but the receiver's on_complete is a no-op
     (this is an ad-hoc push, not a Raft-learner bootstrap).
   * Stream the requested channels in the standard state-sync chunk
     format.
4. Receivers apply chunks via the existing _apply_state_sync_chunk
   path -- no new install code; the channels are file_roots and
   pillar_roots and install_root_chunk handles both.

Live smoke verified on a 5-master cluster: added new content to m1's
srv/salt (including a nested subdirectory) and srv/pillar after the
cluster was up, ran cluster.sync_roots, all 4 peers had the new
files within ~8 seconds.  Nested directories preserved.

What's NOT in this slice

* No completion feedback -- runner returns immediately with
  status='fan-out initiated'.  Operators tail each peer's log for
  the 'state-sync ... installed N items' lines to confirm delivery.
* No re-sync trigger -- operator-driven only.  No filesystem watcher,
  no periodic poll.

Tests

* test_sync_roots_rejects_invalid_roots -- input validation
* test_sync_roots_no_cluster_id_is_skip -- non-cluster master returns
  a structured skip rather than firing a meaningless event
* test_sync_roots_fires_local_event -- happy path: event fires with
  the resolved channel list
* test_sync_roots_file_only_filters_channels -- channel filter
  honoured
End-to-end integration coverage on the 3-master isolated-FS cluster.
Complements the unit tests in
tests/pytests/unit/runners/test_cluster_runner.py (event firing,
input validation) with full daemon-driven runs that exercise the
runner -> master event bus -> publish_payload intercept ->
_run_root_sync_to_peers -> peer state-sync chunk install path.

Two scenarios:

* test_isolated_sync_roots_runner_propagates_content
  After the cluster is steady-state, write new content to master_1's
  file_roots and pillar_roots, run salt-run cluster.sync_roots, poll
  master_2 and master_3 until both files appear (or 30s elapses),
  and assert the marker round-trips through encrypted state-sync.
  Distinct from test_isolated_late_joiner_receives_file_and_pillar_roots
  which covers JOIN-time bulk sync.

* test_isolated_sync_roots_runner_file_only
  Pins the channels= filter: roots=file syncs only file_roots; the
  pillar tree on peers remains untouched.  Operator escape hatch
  against accidentally fanning out secret pillar data when only an
  SLS update is intended.

Pre-existing trivial change: the commented-out warning log line at
the top of publish_payload (one of those //log it all// debug
crutches) is removed since the function already gets per-event log
output via the cluster-event broadcast path and the targeted
publish_payload branches.

Diagnosis note for future debuggers

While bringing these tests up I hit a confusing failure where the
runner fired the event, master_1's daemon broadcast it as
cluster/event/127.0.0.1/cluster/runner/sync_roots, but the
intercept never ran.  Root cause: salt-factories had cached
/tmp/stsuite/scripts/cli_salt_master.py pointing at a *different*
worktree (saw a stale CODE_DIR=.../masterbug entry), so the test
daemons loaded an older copy of salt that didn't have the
publish_payload intercept.  Clearing /tmp/stsuite/scripts fixed it.
Worth knowing if you see 'feature works in unit test but
integration test silently misses it' — check the factory cache
first.
Each ring becomes its own Raft group with independent leader, log, and
voter set. The cluster log carries a RingRegistryStateMachine and a
RoutingStateMachine that name rings and route data types to them; per-ring
membership and policy live in that ring's own log.

Operator runners drive the lifecycle end to end:

  cluster.ring_create / ring_destroy / ring_set
  cluster.route_set / route_clear / routes / rings
  cluster.shed_unowned / shed_unowned_all / shed_status
  cluster.collect_from_peers
  cluster.migrate_jobs_to_cache

shed and collect reuse the existing sync-roots wire format with a
bank: channel prefix so arbitrary cache banks travel over state-sync
chunks. The salt_cache returner replaces the implicit local-jobs
layout with an explicit cache-backed one (jobs/loads, jobs/minions,
jobs/returns/<jid>, jobs/endtimes, jobs/nocache) so the migration has
a well-defined bank set to move.

Gate sites at salt/master.py call ring_membership.owns_for(opts,
data_type, key); a non-member master no-ops the write and rate-limits
a WARN per (data_type, ring, reason) bucket. A delegate-on-miss event
gives operators a safety net for asymmetric topologies where traffic
can't reach a ring member directly.

Per-ring voter-health watchdog with (group_id, peer_id) cooldown keying
keeps each ring's quorum healthy independently. cluster-health.json is
now structured per ring. Sentinel writes (shed-status, collect-status,
cluster-health) go through salt.utils.atomicfile.atomic_open so a
concurrent reader or doubled fan-out never sees partial JSON. The
publish daemon dedups self.pushers by (pull_host, pull_port) to stop
the shed fan-out from being delivered twice.
The five operator-facing runners (cluster.ring_create, ring_destroy,
route_set, route_clear, ring_set) propose entries through the cluster
Raft log, and that only works on the leader.  The runner intercept in
MasterPubServerChannel.publish_payload dispatched locally only — when
the operator invoked the runner on a non-leader master the propose
raised RuntimeError, the exception was swallowed by log.exception, and
no fan-out reached the actual leader.

The intercept now also schedules a cluster_aes-encrypted
cluster/peer/multi-ring-request event to every peer.  Each peer
re-dispatches through _handle_multi_ring_runner_event; only the master
that is the current Raft leader appends an entry, followers log "not
leader" and skip.  No double-commit because Raft has one leader at any
moment.  A new unit-test file pins the wire shape (one publish per
pusher, cluster_aes round-trip, broken-pusher tolerance, no-peers
no-op, missing-secret silent skip).

Also raise per-test timeouts on slow tests that were tripping the
global 90-second pytest-timeout under coverage tracing on loaded GHA
runners.  The workflow's automatic --lf rerun absorbs the resulting
slow-but-passes case so a transient runner-load spike no longer fails
the job:

  * test_check_minions_grain_target_10000 — 240 s (test budget is
    180 s; the global 90 s default fired before the budget assertion
    could evaluate).
  * test_r_state_apply_logical_resource_no_state_module — 180 s
    (three sequential state.apply runs against dummy resources).
  * test_auth_not_starved_with_routing — 180 s (7-daemon
    saturation test).
  * test_orchestration_with_pillar_dot_items — passes _timeout=120
    to salt_run_cli.run() so the 30 s factory default doesn't kill
    the recursive saltutil.runner('pillar.show_pillar') invocation.

Tornado timing flake: test_when_is_timed_out_is_set_before_other_events_are_completed
used gather_timeout=0.1 s and event_timeout=0.15 s.  The 50 ms ordering
window was regularly inverted by GHA scheduler jitter; bumped to 1 s
and 2 s respectively, preserving the test's logical invariant.

Windows assertion drift: test_win_user_present_new_password now
checks changes.get("passwd") == "XXX-REDACTED-XXX" instead of
strict-equal so the new password_changed / lstchg keys surfaced by
the recent shadow.info refactor don't fail the test.
test_jobs_migration_round_trip: the assertion order raced the
peer-side ``_handle_collect_request``, which streams the requested
channels back to the requester sequentially (``jobs/loads`` first,
then ``jobs/minions``, ``jobs/endtimes``, ``jobs/nocache``).  The old
flow waited up to 30 s for ``jobs/loads`` to land then immediately
asserted on ``jobs/minions`` — fine on a fast box where the whole
fan-out lands in well under a second, but on a loaded GHA runner
``jobs/minions`` chunks were still in flight when the assertion fired.
Wait on the operator-facing sentinel (``cluster-collect-status.json``
``complete=True``) — the actual end-of-collect signal — and then check
every bank.

test_ring_lifecycle_on_shared_filesystem: three serial ``_wait_until``
polls (60 s baseline + 45 s registry/route + 30 s destroy), each
running ``cluster.members`` as a fresh ``salt-run`` subprocess.  On a
2-vCPU GHA runner the cumulative wall-clock has been observed at
95–130 s, past the global 90 s pytest-timeout default.  Add
``pytest.mark.timeout(360, func_only=True)`` so the test's own
predicate timeouts remain the failure signal.
``dynamic_context = test_function`` forces coverage.py off the
sys.monitoring (sysmon) measurement core — sysmon does not yet
implement dynamic contexts.  On Python 3.14 sysmon is the default
and is dramatically faster than the PyTracer / CTracer fallbacks.

With the coverage pin we currently carry (7.3.1), no CTracer wheel
ships for 3.14 either, so dynamic_context drops the test suite all
the way down to the pure-Python PyTracer.  That combination — the
slow PyTracer plus relenv's runtime wrappers around sysconfig —
makes every forked subprocess's ``cov.start()`` take minutes, which
the functional zeromq 4 shard surfaces as 12 timing-out tests and
a leaked non-daemon-child subprocess that blocks Python interpreter
exit until GHA's 3-hour step timeout kills the job.

Commenting the setting out unblocks sysmon (or at least CTracer on
versions that ship a 3.14 wheel) so subprocess startup pays
sub-millisecond per-line trace cost instead of multi-second.  The
inline comment in ``.coveragerc`` captures the symptoms and the
re-enable condition.

Refs:
  coveragepy/coveragepy#2082
  https://coverage.readthedocs.io/en/latest/contexts.html
  https://coverage.readthedocs.io/en/latest/faq.html
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:coverage test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant