Skip to content

refactor(runtime): extract VirtualClusterManager from KafkaProxy#87

Merged
SamBarker merged 6 commits intofeat/vc-lifecyclefrom
feat/vc-manager
Apr 13, 2026
Merged

refactor(runtime): extract VirtualClusterManager from KafkaProxy#87
SamBarker merged 6 commits intofeat/vc-lifecyclefrom
feat/vc-manager

Conversation

@SamBarker
Copy link
Copy Markdown
Owner

Type of change

  • Refactoring

Description

Extracts VirtualClusterManager (VCM) from KafkaProxy to own the virtual cluster configuration tree and lifecycle state. VCM is the single source of truth for which virtual clusters exist and their current lifecycle state. It does not manage networking, endpoint registration, or metrics — those remain with KafkaProxy.

Key changes:

  • New VirtualClusterManager with per-VC lifecycle transitions (initializationSucceeded, initializationFailed) and bulk shutdown transitions (transitionAllToDraining, transitionAllToStopped)
  • onVirtualClusterStopped callback notifies KafkaProxy when a VC reaches terminal Stopped state
  • initializationFailed chains Failed → Stopped immediately (no recovery path today) and fires the callback
  • transitionAllToStopped returns boolean indicating whether all clusters are now stopped
  • KafkaProxy delegates lifecycle management to VCM, removing its private transition methods

Additional Context

Provides a clean foundation for hot reload (kroxylicious#3369) by separating lifecycle ownership from the Netty process container. After this change, KafkaProxy owns event loops, bootstraps, management endpoint, and metrics. VCM owns "what the proxy serves" (models + lifecycle state). The reload path (Serving → Draining → Initializing → Serving) stays entirely within VCM without involving the callback — reload never reaches Stopped.

Builds on kroxylicious#3640 (virtual cluster lifecycle state tracking).

Checklist

  • PR raised from a fork of this repository and made from a branch rather than main.
  • Write tests
  • Update documentation
  • Make sure all unit/integration tests pass
  • Make sure all Sonarcloud warnings are addressed or are justifiably ignored.
  • If applicable to the change, make sure system tests pass.
  • If applicable to the change, trigger the performance test suite. Ensure that any degradations to performance numbers are understood and justified.
  • Ensure the PR references relevant issue(s) so they are closed on merging.
  • For user facing changes, update CHANGELOG.md (remember to include changes affecting the API of the test artefacts too).

Copy link
Copy Markdown

@Uzziee Uzziee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @SamBarker .
Is the plan to add the restartVirtualCluster, removeVirtualCluster, addVirtualCluster methods to this VCM in the hot reload PR ?

@SamBarker
Copy link
Copy Markdown
Owner Author

Thanks @SamBarker .
Is the plan to add the restartVirtualCluster, removeVirtualCluster, addVirtualCluster methods to this VCM in the hot reload PR ?

Yes exactly!

We may do away with the constructor injection at that point too, but maybe not.

Introduces VirtualClusterManager as the single source of truth for which
virtual clusters exist and their current lifecycle state. VCM owns the
cluster config tree and lifecycle managers but does not manage networking,
endpoint registration, or metrics — those remain with KafkaProxy.

The onVirtualClusterStopped callback notifies the owner when a VC reaches
terminal Stopped state, allowing proxy-level policy decisions.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause)
to VirtualClusterManager. On failure, the VC transitions immediately from Failed to
Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with
the failure cause.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager.
transitionAllToStopped() returns boolean indicating whether all clusters are now stopped,
so KafkaProxy doesn't need to poll individual lifecycle managers.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
KafkaProxy now delegates virtual cluster lifecycle management to
VirtualClusterManager instead of managing lifecycle managers directly.
Removes private transitionAllToDraining/transitionAllToStopped methods
and the lifecycleManagers map in favour of VCM.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Move VCM construction from startup() to the constructor so it can be
final and non-nullable, eliminating null guards throughout.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Log at WARN with a message that reflects the serve:none policy intent.
Add TODO noting the callback should eventually drive proxy shutdown
rather than relying on the caller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
@SamBarker SamBarker merged commit f36425c into feat/vc-lifecycle Apr 13, 2026
9 checks passed
@SamBarker SamBarker deleted the feat/vc-manager branch April 13, 2026 10:10
SamBarker added a commit that referenced this pull request Apr 13, 2026
* Add VirtualClusterManager with construction, accessors, and tests

Introduces VirtualClusterManager as the single source of truth for which
virtual clusters exist and their current lifecycle state. VCM owns the
cluster config tree and lifecycle managers but does not manage networking,
endpoint registration, or metrics — those remain with KafkaProxy.

The onVirtualClusterStopped callback notifies the owner when a VC reaches
terminal Stopped state, allowing proxy-level policy decisions.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add per-VC lifecycle transitions with callback

Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause)
to VirtualClusterManager. On failure, the VC transitions immediately from Failed to
Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with
the failure cause.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add bulk shutdown transitions and return value

Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager.
transitionAllToStopped() returns boolean indicating whether all clusters are now stopped,
so KafkaProxy doesn't need to poll individual lifecycle managers.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Wire VirtualClusterManager into KafkaProxy

KafkaProxy now delegates virtual cluster lifecycle management to
VirtualClusterManager instead of managing lifecycle managers directly.
Removes private transitionAllToDraining/transitionAllToStopped methods
and the lifecycleManagers map in favour of VCM.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Make VirtualClusterManager non-nullable in KafkaProxy

Move VCM construction from startup() to the constructor so it can be
final and non-nullable, eliminating null guards throughout.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Improve onVirtualClusterStopped callback logging

Log at WARN with a message that reflects the serve:none policy intent.
Add TODO noting the callback should eventually drive proxy shutdown
rather than relying on the caller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

---------

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit that referenced this pull request Apr 14, 2026
* Add VirtualClusterManager with construction, accessors, and tests

Introduces VirtualClusterManager as the single source of truth for which
virtual clusters exist and their current lifecycle state. VCM owns the
cluster config tree and lifecycle managers but does not manage networking,
endpoint registration, or metrics — those remain with KafkaProxy.

The onVirtualClusterStopped callback notifies the owner when a VC reaches
terminal Stopped state, allowing proxy-level policy decisions.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add per-VC lifecycle transitions with callback

Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause)
to VirtualClusterManager. On failure, the VC transitions immediately from Failed to
Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with
the failure cause.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add bulk shutdown transitions and return value

Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager.
transitionAllToStopped() returns boolean indicating whether all clusters are now stopped,
so KafkaProxy doesn't need to poll individual lifecycle managers.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Wire VirtualClusterManager into KafkaProxy

KafkaProxy now delegates virtual cluster lifecycle management to
VirtualClusterManager instead of managing lifecycle managers directly.
Removes private transitionAllToDraining/transitionAllToStopped methods
and the lifecycleManagers map in favour of VCM.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Make VirtualClusterManager non-nullable in KafkaProxy

Move VCM construction from startup() to the constructor so it can be
final and non-nullable, eliminating null guards throughout.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Improve onVirtualClusterStopped callback logging

Log at WARN with a message that reflects the serve:none policy intent.
Add TODO noting the callback should eventually drive proxy shutdown
rather than relying on the caller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

---------

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit that referenced this pull request Apr 15, 2026
* Add VirtualClusterManager with construction, accessors, and tests

Introduces VirtualClusterManager as the single source of truth for which
virtual clusters exist and their current lifecycle state. VCM owns the
cluster config tree and lifecycle managers but does not manage networking,
endpoint registration, or metrics — those remain with KafkaProxy.

The onVirtualClusterStopped callback notifies the owner when a VC reaches
terminal Stopped state, allowing proxy-level policy decisions.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add per-VC lifecycle transitions with callback

Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause)
to VirtualClusterManager. On failure, the VC transitions immediately from Failed to
Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with
the failure cause.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add bulk shutdown transitions and return value

Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager.
transitionAllToStopped() returns boolean indicating whether all clusters are now stopped,
so KafkaProxy doesn't need to poll individual lifecycle managers.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Wire VirtualClusterManager into KafkaProxy

KafkaProxy now delegates virtual cluster lifecycle management to
VirtualClusterManager instead of managing lifecycle managers directly.
Removes private transitionAllToDraining/transitionAllToStopped methods
and the lifecycleManagers map in favour of VCM.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Make VirtualClusterManager non-nullable in KafkaProxy

Move VCM construction from startup() to the constructor so it can be
final and non-nullable, eliminating null guards throughout.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Improve onVirtualClusterStopped callback logging

Log at WARN with a message that reflects the serve:none policy intent.
Add TODO noting the callback should eventually drive proxy shutdown
rather than relying on the caller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

---------

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit that referenced this pull request Apr 16, 2026
…ious#3640)

* Add virtual cluster lifecycle state model

Introduce VirtualClusterLifecycleState as a sealed interface with
records for each state (Initializing, Serving, Draining, Failed, Stopped).
Each state exposes typed transition methods enforcing the state machine
from proposal 016. Failed carries its cause, and Stopped retains the
prior failure cause for diagnostics.

VirtualClusterLifecycleManager wraps the state in an AtomicReference
for thread-safe transitions, with structured logging on each transition.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Wire lifecycle manager into KafkaProxy startup and shutdown

Create a VirtualClusterLifecycleManager per virtual cluster during
startup, transitioning each to Serving on success. On shutdown,
beginShutdown() transitions serving clusters to Draining before the
Netty graceful shutdown, then completeShutdown() transitions them to
Stopped afterwards. Existing fail-fast behaviour is preserved.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Address review feedback on lifecycle manager

- Remove beginShutdown/completeShutdown composite methods from
  VirtualClusterLifecycleManager; KafkaProxy now calls startDraining,
  drainComplete, and stop directly
- Call initializationFailed(cause) from KafkaProxy startup catch block
  so Failed state carries the real failure cause
- Fix race in transition(): use transitionFn.apply(previous) instead of
  state.get() to avoid observing another thread's update
- Log at DEBUG for most transitions, INFO only for initial and terminal
  states

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Remove unused toInitializing() transitions from Draining and Failed

These transitions are valid in the proposal's state machine but have no
callers today — reload/recovery is deferred. Removing to avoid premature
API commitment; they can be re-added when reload is implemented.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Remove unnecessary throws declaration from test method

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Address review feedback: fix double-apply, log levels, and Initializing→Stopped

- Fix transition() to use get()+updateAndGet() instead of re-applying the
  transition function to derive the new state for logging
- Log all transitions at INFO, Failed transitions at WARN
- Support Initializing→Stopped transition so VCs are not orphaned if
  shutdown occurs during startup
- Assert failure cause on Stopped state in startup failure test

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Replace AtomicReference with synchronized in VirtualClusterLifecycleManager

The AtomicReference approach had a race between state.get() and
state.updateAndGet() where another thread could advance the state,
causing incorrect log messages. Using synchronized makes the
read-transition-log sequence atomic. This is sufficient for the
single-shot proxy lifecycle where contention is not a concern.

Also adjusts log levels: INFO for significant states (Serving, Failed,
Stopped), DEBUG for intermediate transitions (Draining).

Closes kroxylicious#3696

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* refactor(runtime): extract VirtualClusterManager from KafkaProxy (#87)

* Add VirtualClusterManager with construction, accessors, and tests

Introduces VirtualClusterManager as the single source of truth for which
virtual clusters exist and their current lifecycle state. VCM owns the
cluster config tree and lifecycle managers but does not manage networking,
endpoint registration, or metrics — those remain with KafkaProxy.

The onVirtualClusterStopped callback notifies the owner when a VC reaches
terminal Stopped state, allowing proxy-level policy decisions.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add per-VC lifecycle transitions with callback

Adds initializationSucceeded(clusterName) and initializationFailed(clusterName, cause)
to VirtualClusterManager. On failure, the VC transitions immediately from Failed to
Stopped (no recovery path today) and fires the onVirtualClusterStopped callback with
the failure cause.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Add bulk shutdown transitions and return value

Adds transitionAllToDraining() and transitionAllToStopped() to VirtualClusterManager.
transitionAllToStopped() returns boolean indicating whether all clusters are now stopped,
so KafkaProxy doesn't need to poll individual lifecycle managers.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Wire VirtualClusterManager into KafkaProxy

KafkaProxy now delegates virtual cluster lifecycle management to
VirtualClusterManager instead of managing lifecycle managers directly.
Removes private transitionAllToDraining/transitionAllToStopped methods
and the lifecycleManagers map in favour of VCM.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Make VirtualClusterManager non-nullable in KafkaProxy

Move VCM construction from startup() to the constructor so it can be
final and non-nullable, eliminating null guards throughout.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Improve onVirtualClusterStopped callback logging

Log at WARN with a message that reflects the serve:none policy intent.
Add TODO noting the callback should eventually drive proxy shutdown
rather than relying on the caller.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

---------

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Address review feedback: naming conventions, javadoc, and logging

- Rename getState()/getClusterName() to state()/clusterName() on
  VirtualClusterLifecycleManager per project accessor conventions
- Rename getVirtualClusterModels() to virtualClusterModels() on
  VirtualClusterManager
- Rename terse field 'vcm' to 'virtualClusterManager' in KafkaProxy
- Fix transitionAllToDraining() javadoc: clarify that Failed/Stopped
  clusters are skipped, not transitioned by this method
- Fix logging key from 'cause' to 'error' with Throwable::getMessage
  per logging conventions
- Add comment documenting the all-or-nothing startup assumption at
  the initializationFailed call site

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Address review feedback: lazy logging, argumentSet, and javadoc fixes

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

* Rename transitionAllToStopped to completeDraining

The previous name implied all clusters would be transitioned, but the
method only completes the draining→stopped transition. The new name
better describes the actual behaviour.

Assisted-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>

---------

Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants