compute: sync controller config before create-time setup#37294
Conversation
A replica creates its `ComputeState` when it handles `CreateInstance`, seeding `worker_config` with dyncfg defaults. `handle_create_instance` then immediately runs `apply_worker_config`, but the controller's synced configuration only arrives one command later in the first `UpdateConfiguration`. Both the live command stream and the reconciliation order place `UpdateConfiguration` after `CreateInstance`, and there is no `ComputeState` to apply configuration to before `CreateInstance`. So all create-time setup that reads configuration (lgalloc, the pager, the column-paged batcher, the memory limiter, `ore_overflowing_behavior`, and the introspection dataflows rendered during `CreateInstance`) silently used defaults. Carry the configuration with `CreateInstance` instead. `InstanceConfig` gains an `initial_config` snapshot of the controller's dyncfg with the replica's scoped overrides applied on top. The controller builds it in `specialize_command_for_replica`, the same layer that injects scoped overrides into `UpdateConfiguration`, evaluated fresh on every send and on every replay so it always reflects current values. The replica applies it to `worker_config` before `apply_worker_config`. The snapshot is excluded from `InstanceConfig::compatible_with`, so reconnecting to a running replica does not halt on a configuration that has since changed, and an empty snapshot leaves the defaults in place. Adds unit tests in the compute controller for the snapshot and override merge, and a clusterd-test-driver scenario plus parse test exercising the create-time snapshot plumbing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document at the create-instance call site that the first UpdateConfiguration is still required after folding the dyncfg snapshot into CreateInstance, because it carries the rest of ComputeParameters and syncs the persist config and metrics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
antiguru
left a comment
There was a problem hiding this comment.
Reviewed the change end to end. The fix is correct and well-scoped, with a sound soundness argument. One non-blocking question below.
Correctness
Traced the three paths the soundness argument depends on:
- Fresh startup / replica restart (
server.rs,compute_stateisNone):todo_commands = new_commands, sohandle_create_instanceruns and the newinitial_config.apply(&worker_config)seeds the worker beforeapply_worker_config. The followingUpdateConfigurationre-applies idempotently. ✅ - environmentd reconnect to a running replica (
server.rs:644):CreateInstanceis only compatibility-checked, never pushed totodo_commands, sohandle_create_instancedoes not re-run and the already-syncedworker_configis retained. ✅ compatible_withexclusion:initial_configis destructured to_and excluded, mirroringarrangement_dictionary_compression. Without this a reconnect after any config change wouldhalt!. ✅
Other checks:
- Both the controller
dyncfgand the workerworker_configare built frommz_dyncfgs::all_dyncfgs(), so the dense snapshot fromFrom<&ConfigSet>only contains configs the worker knows. No spurious Sentry errors from the unknown-config path inConfigUpdates::apply. sendstores the base (un-specialized) command in history and specializes per-replica at dispatch;add_replicare-specializes from history. The snapshot always reflects current dyncfg + override, never a stale baked value. The override-wins merge matches the existingUpdateConfigurationsemantics.- The controller's
dyncfgis kept current viaupdate_configuration, so by the timeCreateInstanceis sent the snapshot carries synced values rather than defaults. InstanceConfiguses serde (no protobuf) andConfigUpdatesalready derivesSerialize/Deserialize, so the new field is wire-compatible with no proto changes. All call sites updated.
Tests
Good coverage: three controller unit tests (instance-wide snapshot, override-wins, unchanged UpdateConfiguration merge), a parse test, and an end-to-end clusterd-test-driver scenario.
Question (non-blocking)
Persist config / metrics are not seeded at create time. handle_create_instance applies the snapshot only to worker_config. handle_update_configuration additionally does persist_clients.cfg().apply_from(...) and updates metrics, neither of which rides in CreateInstance (the controller.rs comment acknowledges this). Can you confirm that no create-time setup reads a persist-side dyncfg? Anything fetched from persist_clients.cfg() while handling CreateInstance would still observe defaults. If create-time reads only touch worker_config, this is fine as-is, but it's the one remaining gap in an otherwise complete fix.
Generated by Claude Code
|
Re: persist config / metrics at create time. Verified that no create-time read touches
Every So persist-side dyncfg is synced before anything reads it. Leaving create-time seeding scoped to |
Motivation
A replica creates its
ComputeStatewhen it handlesCreateInstance, seedingworker_configwith dyncfg defaults.handle_create_instancethen immediately runsapply_worker_config, but the controller's synced configuration only arrives one command later, in the firstUpdateConfiguration. Both the live command stream (Instance::runsendsHellothenCreateInstancebefore the firstupdate_configuration) and the reconciliation order (ComputeCommandHistory::reduceplaces the singleUpdateConfigurationafterCreateInstance) put configuration after creation. Reordering is not viable either: there is noComputeState/worker_configto apply configuration to beforeCreateInstance.As a result, every create-time read that depends on configuration silently used defaults: lgalloc, the pager, the column-paged batcher, the memory limiter,
ore_overflowing_behavior, and the introspection dataflows rendered duringCreateInstance. This is also why create-time-frozen scoped flags from #37158 had to be worked around.Description
Carry the configuration with
CreateInstance.InstanceConfiggains aninitial_configsnapshot of the controller's dyncfg with the replica's scoped overrides applied on top. The controller builds it inInstance::specialize_command_for_replica, the same layer that already injects scoped overrides intoUpdateConfiguration, and it has the livedyncfgand override map there. It is evaluated fresh on every send and on everyadd_replicareplay, so the snapshot always reflects current values. The replica applies it toworker_configbeforeapply_worker_config.The snapshot is excluded from
InstanceConfig::compatible_with, like dictionary compression, so reconnecting to a running replica does not halt on configuration that has changed since creation. An empty snapshot leaves the worker at its defaults until the firstUpdateConfiguration, preserving prior behavior.Soundness:
CreateInstance, seedingworker_configfrom the snapshot before create-time setup.environmentdreconnect to a running replica: reconcile only compatibility-checksCreateInstance; the existing already-syncedworker_configis retained and the replayedUpdateConfigurationkeeps it current.UpdateConfigurationstill applies globally and remains hoistable; the snapshot is a redundant create-time seed of the same values, not a new ordering dependency.Tests
Adds unit tests in the compute controller covering the create-time snapshot and the scoped-override merge, and confirming the existing
UpdateConfigurationoverride merge is unchanged. Adds aclusterd-test-driverscenario and a parse test exercising the create-time snapshot plumbing end to end.Checklist