WIP: The great config bootstrap fix#30396

Open
WillemKauf wants to merge 14 commits into redpanda-data:dev from WillemKauf:config_bootstrap_fix

Conversation

@WillemKauf
Contributor

No tests yet, opening for CI.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

Bug Fixes

  • Fixes a bug in which stale configuration values could be read by joining or restarting redpanda nodes during the bootstrapping process.

Copilot AI review requested due to automatic review settings May 6, 2026 22:03
@WillemKauf WillemKauf requested a review from a team as a code owner May 6, 2026 22:03
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from 4a97a52 to 4755fa7 Compare May 6, 2026 22:04
Contributor

Copilot AI left a comment


Pull request overview

This PR reworks Redpanda’s startup/bootstrap sequencing to prevent early reads of stale cluster configuration during node join/restart. It introduces a “config readiness” gate for config::shard_local_cfg(), splits kvstore recovery from writer startup, and adds an RPC path to fetch a leader-authoritative controller snapshot for restarting nodes.

Changes:

  • Add a readiness check to config::shard_local_cfg() plus an *_unsafe() escape hatch for early bootstrap.
  • Split kvstore recovery into an explicit phase, and restructure application bootstrap to recover/apply snapshots before marking config “ready”.
  • Add a fetch_controller_snapshot RPC and controller snapshot plumbing to support leader-authoritative bootstrap for restarting nodes.
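The readiness gate in the first bullet can be sketched as a simplified, standalone model. Names like `mark_cfg_ready` and the thrown exception are illustrative stand-ins; the real accessor is a Seastar sharded singleton guarded by a `vassert`:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for Redpanda's cluster configuration object.
struct configuration {
    int log_segment_size{1024};
};

namespace {
configuration cfg_instance;
bool cfg_ready = false; // set by the bootstrap sequence once config is hydrated
} // namespace

// Guarded accessor: enforces that bootstrap marked the config as ready.
configuration& shard_local_cfg() {
    if (!cfg_ready) {
        // The real code uses vassert(); throwing keeps this sketch testable.
        throw std::logic_error("shard_local_cfg() called before config ready");
    }
    return cfg_instance;
}

// Escape hatch for audited early-bootstrap callers that tolerate stale values.
configuration& shard_local_cfg_unsafe() { return cfg_instance; }

// Called by bootstrap after snapshots are recovered and applied.
void mark_cfg_ready() { cfg_ready = true; }
```

Callers that run before bootstrap completes must opt in explicitly via the `_unsafe` accessor, which makes every stale-read site auditable.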

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/v/storage/storage_resources.h Extend ctor signature and add API to update append chunk size during bootstrap.
src/v/storage/storage_resources.cc Plumb append chunk size into storage_resources initialization and add updater method.
src/v/storage/kvstore.h Make recover() a public entry point and add start(recover_t) mode switch.
src/v/storage/kvstore.cc Implement split recovery/start flow; adjust read APIs to require recovery rather than full start.
src/v/storage/api.h Construct kvstore earlier and add recover_kvstore() bootstrap hook before start().
src/v/redpanda/BUILD Add Bazel deps needed by updated application/bootstrap compilation units.
src/v/redpanda/application.h Reshape bootstrap APIs into async helpers for applying snapshots/config and discovery.
src/v/redpanda/application.cc Move cluster-config hydration earlier and ensure feature-table metrics are set up during metrics init.
src/v/redpanda/application_start.cc Use stored cluster_discovery instance and remove passing discovery by reference into runtime start.
src/v/redpanda/application_config.cc Use shard_local_cfg_unsafe() for bootstrap-time kvstore config extraction.
src/v/redpanda/application_bootstrap.cc Rework bootstrap sequence: recover kvstore, apply snapshots, discovery/join, then mark config ready.
src/v/features/feature_table.h Add setup_metrics() API.
src/v/features/feature_table.cc Defer probe/metrics setup from ctor to explicit setup_metrics().
src/v/config/node_config.cc Use shard_local_cfg_unsafe() during node-config YAML loading/ignores.
src/v/config/configuration.h Add readiness flag + document shard_local_cfg() vs shard_local_cfg_unsafe().
src/v/config/configuration.cc Implement shard_local_cfg_unsafe() and guard shard_local_cfg() with readiness vassert.
src/v/cluster/types.h Add serde RPC request/reply types for fetching a controller snapshot.
src/v/cluster/service.h Add controller service RPC endpoint for fetch_controller_snapshot.
src/v/cluster/service.cc Implement RPC handler to delegate snapshot fetch logic to members manager.
src/v/cluster/members_manager.h Declare handler to produce/forward leader-authoritative controller snapshot replies.
src/v/cluster/members_manager.cc Implement snapshot fetch path (forward to leader if needed; leader builds partial join snapshot).
src/v/cluster/controller.json Register new fetch_controller_snapshot RPC in controller protocol schema.
src/v/cluster/controller.cc Adjust controller start wiring after config_manager ctor signature change.
src/v/cluster/controller_stm.h Generalize join snapshot construction via templated backend selection.
src/v/cluster/controller_stm.cc Reorder snapshot application so config manager applies earlier; remove old join snapshot impl.
src/v/cluster/config_manager.h Remove cluster recovery table dependency from config manager constructor/state.
src/v/cluster/config_manager.cc Switch more bootstrap-time reads to shard_local_cfg_unsafe() and adjust replay/bootstrap flow.
src/v/cluster/cluster_discovery.h Update discovery ctor signature and add controller snapshot fetch API (with stale-comment mismatch).
src/v/cluster/cluster_discovery.cc Implement new controller snapshot fetch RPC path and decouple discovery from storage::api.
Comments suppressed due to low confidence (1)

src/v/cluster/config_manager.cc:193

  • config_manager::apply_local() still stages updates to needs_restart properties via set_pending_value(...), but config_manager::start() no longer promotes pending values after STM replay. This means nodes may never apply needs_restart config updates even after a restart (pending values remain invisible indefinitely). Reintroduce a promotion point after replay (as before), or otherwise ensure pending values are promoted at least once during startup after replay completes.
ss::future<> config_manager::start() {
    if (_seen_version == config_version_unset) {
        vlog(clusterlog.trace, "Starting config_manager... (initial)");

        // Background: wait till we have a leader. If this node is leader

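The reviewer's concern can be illustrated with a toy model of pending-value promotion. All names here (`set_pending_value`, `promote_all_pending`, the `property` struct) are simplified stand-ins for the real config_manager machinery:

```cpp
#include <map>
#include <string>

// Toy model of needs_restart properties: updates are staged as "pending"
// and only become active at an explicit promotion point.
struct property {
    std::string active;
    std::string pending; // empty = no pending value
};

std::map<std::string, property> props;

// apply_local() stages a needs_restart update without activating it.
void set_pending_value(const std::string& name, const std::string& v) {
    props[name].pending = v;
}

// Promotion point: if start() never calls this after replay, staged values
// stay invisible forever, which is the bug the review comment describes.
void promote_all_pending() {
    for (auto& [name, p] : props) {
        if (!p.pending.empty()) {
            p.active = p.pending;
            p.pending.clear();
        }
    }
}
```

Without at least one promotion pass during startup after replay, a restarted node never observes updates staged before it went down.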
Comment thread src/v/storage/api.h Outdated
Comment thread src/v/storage/kvstore.cc
Comment on lines +104 to +108
// Flushing background fiber
ssx::spawn_with_gate(_gate, [this] {
    return ss::do_until(
      [this] { return _gate.is_closed(); },
      [this] {
Contributor Author


Meh probably unnecessary

// Accessors for the shard local cluster configuration.
// shard_local_cfg() contains a vassert which strictly enforces that the
// configuration has been marked as ready by the bootstrap process after we have
// successfully preloaded from our node local state, and furthermore, received
Comment thread src/v/cluster/cluster_discovery.h Outdated
Comment on lines +142 to +144
// Static because the call doesn't need any cluster_discovery
// instance state — it dispatches via a one-shot RPC client and
// reads the seed-server list from global node config.
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch 4 times, most recently from 72cd608 to 0140ac7 Compare May 7, 2026 16:10
@vbotbuildovich

This comment was marked as outdated.

@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch 2 times, most recently from 2615bf0 to 5b59c01 Compare May 8, 2026 00:41
WillemKauf added 11 commits May 7, 2026 21:16
Pull config_manager::apply_snapshot out of the ss::when_all parallel
block and run it serially after feature_backend and before
members_manager. Downstream backends (members, topics, plugin,
recovery, security, quota, migrations, cluster_link) may consult
shard_local_cfg during their own apply_snapshot work; sequencing
config_manager first guarantees they see fresh values.
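A rough model of this sequencing, using plain function calls in place of Seastar futures and `ss::when_all` (backend names beyond those in the commit message are illustrative):

```cpp
#include <string>
#include <vector>

// Records the order in which backends apply the controller snapshot.
std::vector<std::string> applied;

void apply_snapshot(const std::string& backend) { applied.push_back(backend); }

void apply_controller_snapshot() {
    // Serial prefix: feature_backend, then config_manager, so every later
    // backend observes fresh values through shard_local_cfg().
    apply_snapshot("feature_backend");
    apply_snapshot("config_manager");
    // The remaining backends may run in parallel; order among them is free.
    for (const auto& b : {"members_manager", "topics", "security"}) {
        apply_snapshot(b);
    }
}
```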
Removes the promote_all_pending() call from config_manager::start().
Pending values that accumulated during the post-bootstrap controller
log replay (i.e. cluster_config_delta_cmd records applied above the
hydrated snapshot offset) now stay pending instead of being promoted
when config_manager::start runs.

This is the strict reading of needs_restart=yes semantics: a property
change becomes visible to a node only after it restarts. shard_local_cfg() is
already hydrated from the controller snapshot before bootstrap
proceeds (see application::bootstrap_cluster_config_view), so any
delta replayed afterward is a change the cluster made *after* the
hydration point and should not retroactively affect the running
node's active values for needs_restart properties.
We will want to be able to use the `kvstore` for read only purposes
early in the bootstrap process to recover important persisted state.

Expose `kvstore::recover()` as a way to allow read-only users access
to its state. Also expose an accessor higher up in `storage::api`.
The `kvstore` should be available after constructor is called for ease
of access.
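A minimal sketch of the recover/start split, assuming a simplified in-memory kvstore in place of the real segment-backed one:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Sketch of the recover()/start() split: recover() replays persisted state
// and makes read-only access valid; start() brings up the rest (in the real
// kvstore, the background flush fiber).
class kvstore {
public:
    enum class recover_t { yes, no };

    // Replays persisted state; after this, read-only access is valid.
    void recover() {
        // (disk replay elided; pretend we recovered one key)
        _db["cluster_uuid"] = "UUID-1234";
        _recovered = true;
    }

    // Full start; may skip recovery if it already ran during bootstrap.
    void start(recover_t r) {
        if (r == recover_t::yes) { recover(); }
        _started = true;
    }

    std::optional<std::string> get(const std::string& key) const {
        assert(_recovered); // reads require recovery, not a full start
        auto it = _db.find(key);
        if (it == _db.end()) { return std::nullopt; }
        return it->second;
    }

private:
    std::map<std::string, std::string> _db;
    bool _recovered{false};
    bool _started{false};
};
```

Early bootstrap code can then call `recover()` to read node/cluster UUIDs, and a later `start(recover_t::no)` avoids replaying twice.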
…covery`

We were only using this in order to access the `node_uuid` and `cluster_uuid`,
which are constant after they are first set.

Simply pass these values directly instead of passing the entire `storage::api`
into the constructor.
We are going to want control over what backends we need to include
in a `controller_stm` snapshot in a future commit. Move this function
to the header and use a variadic template for this purpose.
A new RPC to fetch a controller snapshot, containing the latest in-memory
state of the controller leader's `config_manager` and `feature_backend`.

This RPC will be called unconditionally during bootstrap of a `redpanda`
broker, whether that node is simply restarting or is joining for the first
time. The reason for this being that a local node who is down for a long
time will have only stale configurations and features in its local
cache/snapshots/log, and the configuration and feature managers are two global
pieces of state that should be brought to a consistent view early on in a
`redpanda` node's lifetime.
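Hypothetical shapes for the request/reply types and handler follow; the field names are assumptions, and the real serde-based definitions live in src/v/cluster/types.h with the handler in members_manager:

```cpp
#include <cstdint>
#include <string>

struct fetch_controller_snapshot_request {
    // Hypothetical field: offset of the caller's hydrated local snapshot,
    // letting the leader skip the payload for up-to-date callers.
    int64_t local_snapshot_offset = -1;
};

struct fetch_controller_snapshot_reply {
    int16_t error_code = 0;
    // Serialized partial snapshot (config_manager + feature_backend state).
    std::string snapshot_bytes;
};

// Followers forward to the leader in the real code; this sketch just signals
// an error so the caller can retry or proceed with its stale local state.
fetch_controller_snapshot_reply
handle_fetch(const fetch_controller_snapshot_request&, bool is_leader) {
    fetch_controller_snapshot_reply reply;
    if (!is_leader) {
        reply.error_code = 1; // not_leader: bootstrap tolerates this failure
        return reply;
    }
    reply.snapshot_bytes = "config+features"; // placeholder payload
    return reply;
}
```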
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch 3 times, most recently from 895f588 to 3ec0835 Compare May 8, 2026 02:00
WillemKauf added 2 commits May 7, 2026 22:09
The start-up process for `redpanda` now looks like this:

1. Cluster configs are hydrated from the local `config_cache` and legacy
   `node_cfg_yaml`.
2. `storage::api` is constructed (but not started), and the local `kvstore`
   is recovered.
3. We bootstrap from the `kvstore`:
   1. Apply the local feature table snapshot (potentially stale) from the `kvstore`.
   2. We attempt to load any existing cluster/node UUIDs from the `kvstore`.
4. We perform cluster discovery - if we are a joiner, we will perform our RPC
   to register with the cluster, and collect the `controller_join_snapshot`
   this way. If we are simply a restarter, this is mostly a no-op.
5. We bootstrap our view of the controller - if we were a joiner, we already
   have our required snapshot. If we are a restarter, we utilize the new RPC
   to fetch the controller snapshot from the leader. Unlike the first time
   registration RPC, we do permit this to fail, and will operate with stale
   configs/feature tables during bootstrap. If we have a controller snapshot
   (obtained in either aforementioned way), it is applied to the feature table,
   and the configuration is preloaded.
6. We now have a consistent view with the rest of the cluster of the feature
   table and cluster configuration for the rest of the bootstrapping process.

raft0 replay still happens _after_ the rest of the system is started.
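The sequence above can be traced with a toy model; the phase names are illustrative, and the real flow lives in application_bootstrap.cc:

```cpp
#include <string>
#include <vector>

// Records the bootstrap phases in order for inspection.
std::vector<std::string> trace;
void phase(const std::string& name) { trace.push_back(name); }

void bootstrap(bool is_joiner) {
    phase("hydrate_local_cluster_config"); // config_cache + legacy YAML
    phase("recover_kvstore");              // read-only kvstore replay
    phase("apply_local_feature_snapshot"); // potentially stale
    phase("load_uuids");                   // cluster/node UUIDs from kvstore
    phase(is_joiner ? "join_rpc" : "restart_discovery");
    // Restarters fetch the leader's snapshot; this is allowed to fail, in
    // which case bootstrap proceeds with stale configs/features.
    if (!is_joiner) { phase("fetch_controller_snapshot"); }
    phase("apply_controller_snapshot"); // feature table + config preload
    phase("mark_config_ready");
}
```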

A big caveat with cluster recovery: `needs_restart` properties fetched during
recovery can no longer be applied until the recovered node has restarted.
There is no clear way to avoid this behavior.
And swap any uses of `shard_local_cfg()` which occur in the bootstrap
process (before it is marked as `ready`) over to the unsafe accessor.
These uses of potentially stale values are all audited and deemed safe,
due to minimal or no impact on the behavior of the system post bootstrap.

Two other relevant changes in this commit:
* Pulling out `feature_table::setup_metrics()` into its own function,
  called after the bootstrap process (in order to properly respect
  `shard_local_cfg().disable_{public}_metrics`)
* Updating `storage_resources` to _not_ invoke the `internal::chunks()`
  singleton constructor, which would in turn invoke `shard_local_cfg()`.
We use `storage_resources` for an early replay of the `kvstore` segments,
  so it is fine to use stale configuration values here. However, after
  bootstrapping is complete, it is *probably* important to set the
  `_append_chunk_size` back to the same value as what `internal::chunks()`
  is operating with. We now do so when initializing the storage system in
  `wire_up_bootstrap_services()`.
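A minimal sketch of the deferred chunk-size update described above; the class shape and the default value are illustrative, not the actual storage_resources API:

```cpp
#include <cstddef>

// During early kvstore replay, storage_resources avoids touching the
// internal::chunks() singleton (which would call shard_local_cfg()), so it
// starts from a bootstrap-safe default and is realigned after bootstrap.
class storage_resources {
public:
    explicit storage_resources(std::size_t append_chunk_size = 16384)
      : _append_chunk_size(append_chunk_size) {}

    // Called once bootstrap completes (e.g. from wire_up_bootstrap_services)
    // to match whatever internal::chunks() is actually operating with.
    void update_append_chunk_size(std::size_t sz) { _append_chunk_size = sz; }

    std::size_t append_chunk_size() const { return _append_chunk_size; }

private:
    std::size_t _append_chunk_size;
};
```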
@WillemKauf WillemKauf force-pushed the config_bootstrap_fix branch from 3ec0835 to 2058bd9 Compare May 8, 2026 02:09