Skip to content

🐛 Fix catalogd ha readiness#2674

Open
tmshort wants to merge 2 commits intooperator-framework:mainfrom
tmshort:fix-catalogd-ha-readiness
Open

🐛 Fix catalogd ha readiness#2674
tmshort wants to merge 2 commits intooperator-framework:mainfrom
tmshort:fix-catalogd-ha-readiness

Conversation

@tmshort
Copy link
Copy Markdown
Contributor

@tmshort tmshort commented Apr 24, 2026

Fixes catalogd to permit multiple replicas. Adds tests to the experimental-e2e where there are multiple nodes, and subsequently, multiple replicas. This is part of a fix to avoid OLMv1 becoming unready when a cluster is upgraded.

Related to: OCPBUGS-62517

catalogd's catalog HTTP server previously called net.Listen eagerly at startup on every pod, even non-leaders that never called http.Serve. With replicas > 1 this caused ~50% of catalog requests to queue indefinitely in the kernel accept backlog.

Fix: replace manager.Server with a custom catalogServerRunnable that binds the port lazily inside Start() (only called on the leader) and closes a channel to signal readiness. A /readyz check selects on that channel, so non-leader pods fail the probe and are excluded from Service endpoints. cmd/catalogd/main.go health/readiness setup is now identical to cmd/operator-controller/main.go.

With that fix in place, helm/experimental.yaml is updated to set replicas: 2 for both components so the experimental (2-node kind) e2e suite exercises the multi-replica path. A new @CatalogdHA scenario force-deletes the catalogd leader pod and asserts that a new leader is elected and the catalog resumes serving. The scenario is automatically skipped in the standard 1-node suite (gated via BeforeSuite node-count detection in featureGates). The experimental e2e timeout is bumped from 20m to 25m to accommodate worst-case leader re-election (~163s).

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

The catalog HTTP server has OnlyServeWhenLeader: true, so only the leader
pod should serve catalog content. Previously, net.Listen was called eagerly
at startup for all pods: the listen socket was bound on non-leaders even
though http.Serve was never called, causing TCP connections to queue without
being served. With replicas > 1 this made ~50% of catalog content requests
fail silently.

Replace manager.Server with a custom Runnable (catalogServerRunnable) in
serverutil that:
- Binds the catalog port lazily inside Start(), which is only called on the
  leader by controller-runtime's leader election machinery.
- Closes a ready channel once the listener is established, and registers a
  channel-select readiness check via AddReadyzCheck so non-leader pods fail
  the /readyz probe and are excluded from Service endpoints.

This keeps cmd/catalogd/main.go health/readiness setup identical to
cmd/operator-controller/main.go (healthz.Ping for both liveness and
readiness); the catalog-server readiness check is an implementation detail
of serverutil.AddCatalogServerToManager.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 24, 2026 19:34
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign oceanc80 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from ankitathomas and oceanc80 April 24, 2026 19:34
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 24, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 99db2dc
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69ebcc81ff9f820008ced85b
😎 Deploy Preview https://deploy-preview-2674--olmv1.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes catalogd HA behavior and adds an experimental e2e scenario to exercise multi-replica failover, aiming to prevent catalog serving stalls/unreadiness during upgrades.

Changes:

  • Introduces HA-gated e2e steps + a new @CatalogdHA feature scenario that deletes the catalogd leader pod and waits for leader re-election.
  • Updates experimental manifests/helm values to run catalogd and operator-controller with 2 replicas.
  • Reworks catalogd HTTP server integration to use a custom runnable + readiness check.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
internal/catalogd/serverutil/serverutil.go Replaces controller-runtime manager.Server usage with a custom runnable and a readyz check.
test/e2e/steps/steps.go Registers new Godog steps for HA leader failover scenario.
test/e2e/steps/hooks.go Adds a CatalogdHA feature gate and enables it based on node count.
test/e2e/steps/ha_steps.go Implements steps to force-delete the leader pod and detect a newly elected leader.
test/e2e/features/ha.feature Adds @CatalogdHA scenario validating catalog continues serving after leader disruption.
manifests/experimental.yaml Sets catalogd and operator-controller replicas to 2 for experimental installs.
manifests/experimental-e2e.yaml Sets catalogd and operator-controller replicas to 2 for experimental-e2e installs.
helm/experimental.yaml Sets Helm experimental values to deploy both components with 2 replicas.
Makefile Increases default e2e timeout and bumps experimental-e2e timeout to 25m.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Makefile
Comment thread test/e2e/features/ha.feature Outdated
Comment thread internal/catalogd/serverutil/serverutil.go
Comment thread internal/catalogd/serverutil/serverutil.go
Comment thread test/e2e/steps/hooks.go Outdated
The experimental e2e suite uses a 2-node kind cluster, making it a natural
fit to validate HA behaviour.  Set replicas=2 for both components in
helm/experimental.yaml so the experimental and experimental-e2e manifests
exercise the multi-replica path end-to-end.

This is safe for operator-controller (no leader-only HTTP servers) and for
catalogd now that the catalog server starts on all pods via
NeedLeaderElection=false, preventing the rolling-update deadlock that would
arise if the server were leader-only.

Also adds a @CatalogdHA experimental e2e scenario that force-deletes the
catalogd leader pod and verifies that a new leader is elected and the catalog
resumes serving.  The scenario is gated on a 2-node cluster (detected in
BeforeSuite and reflected in the featureGates map), so it is automatically
skipped in the standard 1-node e2e suite.  The experimental e2e timeout is
bumped from 20m to 25m to accommodate leader re-election time (~163s worst
case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
@tmshort tmshort force-pushed the fix-catalogd-ha-readiness branch from 8c2f948 to 99db2dc Compare April 24, 2026 20:03
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 76.19048% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.00%. Comparing base (7277a46) to head (99db2dc).

Files with missing lines Patch % Lines
internal/catalogd/serverutil/serverutil.go 76.19% 6 Missing and 4 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2674      +/-   ##
==========================================
+ Coverage   67.97%   68.00%   +0.03%     
==========================================
  Files         144      144              
  Lines       10573    10596      +23     
==========================================
+ Hits         7187     7206      +19     
- Misses       2866     2869       +3     
- Partials      520      521       +1     
Flag Coverage Δ
e2e 37.16% <76.19%> (-0.18%) ⬇️
experimental-e2e 53.00% <76.19%> (+0.42%) ⬆️
unit 53.52% <0.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants