* fix(catalogd): bind catalog HTTP port lazily; add readiness check
The catalog HTTP server has OnlyServeWhenLeader: true, so only the leader
pod should serve catalog content. Previously, net.Listen was called eagerly
at startup in every pod, so the listen socket was bound on non-leaders even
though http.Serve was never called there; TCP connections queued on the bound
socket without ever being served. With replicas > 1, this made ~50% of catalog
content requests fail silently.
Replace manager.Server with a custom Runnable (catalogServerRunnable) in
serverutil that:
- Binds the catalog port lazily inside Start(), which is only called on the
leader by controller-runtime's leader election machinery.
- Closes a ready channel once the listener is established, and registers a
channel-select readiness check via AddReadyzCheck so non-leader pods fail
the /readyz probe and are excluded from Service endpoints.
This keeps cmd/catalogd/main.go health/readiness setup identical to
cmd/operator-controller/main.go (healthz.Ping for both liveness and
readiness); the catalog-server readiness check is an implementation detail
of serverutil.AddCatalogServerToManager.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(experimental): run catalogd and operator-controller with 2 replicas
The experimental e2e suite uses a 2-node kind cluster, making it a natural
fit to validate HA behaviour. Set replicas=2 for both components in
helm/experimental.yaml so the experimental and experimental-e2e manifests
exercise the multi-replica path end-to-end.
This is safe for operator-controller (no leader-only HTTP servers) and for
catalogd now that the catalog server starts on all pods via
NeedLeaderElection=false, preventing the rolling-update deadlock that would
arise if the server were leader-only.
Also adds a @CatalogdHA experimental e2e scenario that force-deletes the
catalogd leader pod and verifies that a new leader is elected and the catalog
resumes serving. The scenario is gated on a 2-node cluster (detected in
BeforeSuite and reflected in the featureGates map), so it is automatically
skipped in the standard 1-node e2e suite. The experimental e2e timeout is
bumped from 20m to 25m to accommodate leader re-election time (~163s worst
case).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
---------
Signed-off-by: Todd Short <tshort@redhat.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>