refactor(runtime): extract VirtualClusterManager from KafkaProxy#87
Merged
SamBarker merged 6 commits intofeat/vc-lifecyclefrom Apr 13, 2026
Merged
refactor(runtime): extract VirtualClusterManager from KafkaProxy#87SamBarker merged 6 commits intofeat/vc-lifecyclefrom
SamBarker merged 6 commits intofeat/vc-lifecyclefrom
Conversation
9 tasks
9753fe1 to
88a481e
Compare
9f37c9a to
08273d5
Compare
Uzziee
approved these changes
Apr 13, 2026
Uzziee
left a comment
There was a problem hiding this comment.
Thanks @SamBarker .
Is the plan to add the restartVirtualCluster, removeVirtualCluster, addVirtualCluster methods to this VCM in the hot reload PR ?
Owner
Author
Yes exactly! We may do away with the constructor injection at that point too, but maybe not. |
3bf63c2 to
a44e5b8
Compare
Introduces VirtualClusterManager as the single source of truth for which virtual clusters exist and their current lifecycle state. VCM owns the cluster config tree and lifecycle managers but does not manage networking, endpoint registration, or metrics — those remain with KafkaProxy. The onVirtualClusterStopped callback notifies the owner when a VC reaches terminal Stopped state, allowing proxy-level policy decisions. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause) to VirtualClusterManager. On failure, the VC transitions immediately from Failed to Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with the failure cause. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager. transitionAllToStopped() returns boolean indicating whether all clusters are now stopped, so KafkaProxy doesn't need to poll individual lifecycle managers. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
KafkaProxy now delegates virtual cluster lifecycle management to VirtualClusterManager instead of managing lifecycle managers directly. Removes private transitionAllToDraining/transitionAllToStopped methods and the lifecycleManagers map in favour of VCM. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Move VCM construction from startup() to the constructor so it can be final and non-nullable, eliminating null guards throughout. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Log at WARN with a message that reflects the serve:none policy intent. Add TODO noting the callback should eventually drive proxy shutdown rather than relying on the caller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
08273d5 to
6a56c6d
Compare
SamBarker
added a commit
that referenced
this pull request
Apr 13, 2026
* Add VirtualClusterManager with construction, accessors, and tests Introduces VirtualClusterManager as the single source of truth for which virtual clusters exist and their current lifecycle state. VCM owns the cluster config tree and lifecycle managers but does not manage networking, endpoint registration, or metrics — those remain with KafkaProxy. The onVirtualClusterStopped callback notifies the owner when a VC reaches terminal Stopped state, allowing proxy-level policy decisions. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add per-VC lifecycle transitions with callback Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause) to VirtualClusterManager. On failure, the VC transitions immediately from Failed to Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with the failure cause. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add bulk shutdown transitions and return value Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager. transitionAllToStopped() returns boolean indicating whether all clusters are now stopped, so KafkaProxy doesn't need to poll individual lifecycle managers. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Wire VirtualClusterManager into KafkaProxy KafkaProxy now delegates virtual cluster lifecycle management to VirtualClusterManager instead of managing lifecycle managers directly. Removes private transitionAllToDraining/transitionAllToStopped methods and the lifecycleManagers map in favour of VCM. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Make VirtualClusterManager non-nullable in KafkaProxy Move VCM construction from startup() to the constructor so it can be final and non-nullable, eliminating null guards throughout. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Improve onVirtualClusterStopped callback logging Log at WARN with a message that reflects the serve:none policy intent. Add TODO noting the callback should eventually drive proxy shutdown rather than relying on the caller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> --------- Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker
added a commit
that referenced
this pull request
Apr 14, 2026
* Add VirtualClusterManager with construction, accessors, and tests Introduces VirtualClusterManager as the single source of truth for which virtual clusters exist and their current lifecycle state. VCM owns the cluster config tree and lifecycle managers but does not manage networking, endpoint registration, or metrics — those remain with KafkaProxy. The onVirtualClusterStopped callback notifies the owner when a VC reaches terminal Stopped state, allowing proxy-level policy decisions. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add per-VC lifecycle transitions with callback Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause) to VirtualClusterManager. On failure, the VC transitions immediately from Failed to Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with the failure cause. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add bulk shutdown transitions and return value Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager. transitionAllToStopped() returns boolean indicating whether all clusters are now stopped, so KafkaProxy doesn't need to poll individual lifecycle managers. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Wire VirtualClusterManager into KafkaProxy KafkaProxy now delegates virtual cluster lifecycle management to VirtualClusterManager instead of managing lifecycle managers directly. Removes private transitionAllToDraining/transitionAllToStopped methods and the lifecycleManagers map in favour of VCM. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Make VirtualClusterManager non-nullable in KafkaProxy Move VCM construction from startup() to the constructor so it can be final and non-nullable, eliminating null guards throughout. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Improve onVirtualClusterStopped callback logging Log at WARN with a message that reflects the serve:none policy intent. Add TODO noting the callback should eventually drive proxy shutdown rather than relying on the caller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> --------- Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker
added a commit
that referenced
this pull request
Apr 15, 2026
* Add VirtualClusterManager with construction, accessors, and tests Introduces VirtualClusterManager as the single source of truth for which virtual clusters exist and their current lifecycle state. VCM owns the cluster config tree and lifecycle managers but does not manage networking, endpoint registration, or metrics — those remain with KafkaProxy. The onVirtualClusterStopped callback notifies the owner when a VC reaches terminal Stopped state, allowing proxy-level policy decisions. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add per-VC lifecycle transitions with callback Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause) to VirtualClusterManager. On failure, the VC transitions immediately from Failed to Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with the failure cause. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add bulk shutdown transitions and return value Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager. transitionAllToStopped() returns boolean indicating whether all clusters are now stopped, so KafkaProxy doesn't need to poll individual lifecycle managers. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Wire VirtualClusterManager into KafkaProxy KafkaProxy now delegates virtual cluster lifecycle management to VirtualClusterManager instead of managing lifecycle managers directly. Removes private transitionAllToDraining/transitionAllToStopped methods and the lifecycleManagers map in favour of VCM. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Make VirtualClusterManager non-nullable in KafkaProxy Move VCM construction from startup() to the constructor so it can be final and non-nullable, eliminating null guards throughout. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Improve onVirtualClusterStopped callback logging Log at WARN with a message that reflects the serve:none policy intent. Add TODO noting the callback should eventually drive proxy shutdown rather than relying on the caller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> --------- Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker
added a commit
that referenced
this pull request
Apr 16, 2026
…ious#3640) * Add virtual cluster lifecycle state model Introduce VirtualClusterLifecycleState as a sealed interface with records for each state (Initializing, Serving, Draining, Failed, Stopped). Each state exposes typed transition methods enforcing the state machine from proposal 016. Failed carries its cause, and Stopped retains the prior failure cause for diagnostics. VirtualClusterLifecycleManager wraps the state in an AtomicReference for thread-safe transitions, with structured logging on each transition. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Wire lifecycle manager into KafkaProxy startup and shutdown Create a VirtualClusterLifecycleManager per virtual cluster during startup, transitioning each to Serving on success. On shutdown, beginShutdown() transitions serving clusters to Draining before the Netty graceful shutdown, then completeShutdown() transitions them to Stopped afterwards. Existing fail-fast behaviour is preserved. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Address review feedback on lifecycle manager - Remove beginShutdown/completeShutdown composite methods from VirtualClusterLifecycleManager; KafkaProxy now calls startDraining, drainComplete, and stop directly - Call initializationFailed(cause) from KafkaProxy startup catch block so Failed state carries the real failure cause - Fix race in transition(): use transitionFn.apply(previous) instead of state.get() to avoid observing another thread's update - Log at DEBUG for most transitions, INFO only for initial and terminal states Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Remove unused toInitializing() transitions from Draining and Failed These transitions are valid in the proposal's state machine but have no callers today — reload/recovery is deferred. Removing to avoid premature API commitment; they can be re-added when reload is implemented. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Remove unnecessary throws declaration from test method Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Address review feedback: fix double-apply, log levels, and Initializing→Stopped - Fix transition() to use get()+updateAndGet() instead of re-applying the transition function to derive the new state for logging - Log all transitions at INFO, Failed transitions at WARN - Support Initializing→Stopped transition so VCs are not orphaned if shutdown occurs during startup - Assert failure cause on Stopped state in startup failure test Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Replace AtomicReference with synchronized in VirtualClusterLifecycleManager The AtomicReference approach had a race between state.get() and state.updateAndGet() where another thread could advance the state, causing incorrect log messages. Using synchronized makes the read-transition-log sequence atomic. This is sufficient for the single-shot proxy lifecycle where contention is not a concern. Also adjusts log levels: INFO for significant states (Serving, Failed, Stopped), DEBUG for intermediate transitions (Draining). Closes kroxylicious#3696 Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * refactor(runtime): extract VirtualClusterManager from KafkaProxy (#87) * Add VirtualClusterManager with construction, accessors, and tests Introduces VirtualClusterManager as the single source of truth for which virtual clusters exist and their current lifecycle state. VCM owns the cluster config tree and lifecycle managers but does not manage networking, endpoint registration, or metrics — those remain with KafkaProxy. The onVirtualClusterStopped callback notifies the owner when a VC reaches terminal Stopped state, allowing proxy-level policy decisions. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add per-VC lifecycle transitions with callback Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause) to VirtualClusterManager. On failure, the VC transitions immediately from Failed to Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with the failure cause. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Add bulk shutdown transitions and return value Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager. transitionAllToStopped() returns boolean indicating whether all clusters are now stopped, so KafkaProxy doesn't need to poll individual lifecycle managers. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Wire VirtualClusterManager into KafkaProxy KafkaProxy now delegates virtual cluster lifecycle management to VirtualClusterManager instead of managing lifecycle managers directly. Removes private transitionAllToDraining/transitionAllToStopped methods and the lifecycleManagers map in favour of VCM. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Make VirtualClusterManager non-nullable in KafkaProxy Move VCM construction from startup() to the constructor so it can be final and non-nullable, eliminating null guards throughout. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Improve onVirtualClusterStopped callback logging Log at WARN with a message that reflects the serve:none policy intent. Add TODO noting the callback should eventually drive proxy shutdown rather than relying on the caller. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> --------- Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Address review feedback: naming conventions, javadoc, and logging - Rename getState()/getClusterName() to state()/clusterName() on VirtualClusterLifecycleManager per project accessor conventions - Rename getVirtualClusterModels() to virtualClusterModels() on VirtualClusterManager - Rename terse field 'vcm' to 'virtualClusterManager' in KafkaProxy - Fix transitionAllToDraining() javadoc: clarify that Failed/Stopped clusters are skipped, not transitioned by this method - Fix logging key from 'cause' to 'error' with Throwable::getMessage per logging conventions - Add comment documenting the all-or-nothing startup assumption at the initializationFailed call site Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Address review feedback: lazy logging, argumentSet, and javadoc fixes Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> * Rename transitionAllToStopped to completeDraining The previous name implied all clusters would be transitioned, but the method only completes the draining→stopped transition. The new name better describes the actual behaviour. Assisted-by: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk> --------- Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Type of change
Description
Extracts
VirtualClusterManager(VCM) fromKafkaProxyto own the virtual cluster configuration tree and lifecycle state. VCM is the single source of truth for which virtual clusters exist and their current lifecycle state. It does not manage networking, endpoint registration, or metrics — those remain withKafkaProxy.Key changes:
VirtualClusterManagerwith per-VC lifecycle transitions (initializationSucceeded,initializationFailed) and bulk shutdown transitions (transitionAllToDraining,transitionAllToStopped)onVirtualClusterStoppedcallback notifies KafkaProxy when a VC reaches terminal Stopped stateinitializationFailedchains Failed → Stopped immediately (no recovery path today) and fires the callbacktransitionAllToStoppedreturnsbooleanindicating whether all clusters are now stoppedAdditional Context
Provides a clean foundation for hot reload (kroxylicious#3369) by separating lifecycle ownership from the Netty process container. After this change, KafkaProxy owns event loops, bootstraps, management endpoint, and metrics. VCM owns "what the proxy serves" (models + lifecycle state). The reload path (Serving → Draining → Initializing → Serving) stays entirely within VCM without involving the callback — reload never reaches Stopped.
Builds on kroxylicious#3640 (virtual cluster lifecycle state tracking).
Checklist