
OCPBUGS-78832: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation#7988

Open
wking wants to merge 2 commits into openshift:main from wking:narrowly-scoped-cvo-bootstrap

Conversation


@wking wking commented Mar 17, 2026

What this PR does / why we need it:

The cluster-version operator has a complicated system for deciding whether a given release-image manifest should be managed in the current cluster. Implementing that system here, or even using library-go and remembering to vendor-bump here, both seem like an annoying maintenance load.

We could use the CVO's render command like the standalone installer, but that logic is fairly complicated because it needs to generate all the artifacts necessary for bootstrap MachineConfig rendering, or the production machine-config operator will complain about MachineConfigPools requesting rendered-... MachineConfig that don't exist.

All we actually need out of the bootstrap container are the resources that the cluster-version operator needs to launch and run, which are labeled with the grep target since openshift/cluster-version-operator#1352. That avoids installing anything the cluster doesn't actually need here by mistake. Once the production CVO container starts, it will apply the remaining resources that the cluster actually needs.
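A minimal, self-contained sketch of that annotation-based filtering follows; the directory layout and manifest contents are stand-ins invented for illustration, not the real release payload or bootstrap script:

```shell
# Hypothetical sketch: keep only release-image manifests that opt in to
# HyperShift bootstrap via the include.release.openshift.io/hypershift-bootstrap
# annotation. Paths and manifest bodies below are fabricated stand-ins.
payload_dir=$(mktemp -d)
out_dir=$(mktemp -d)

# One fake manifest that opts in to bootstrap, one that does not.
cat > "$payload_dir/0000_00_cvo-deployment.yaml" <<'EOF'
metadata:
  annotations:
    include.release.openshift.io/hypershift-bootstrap: "true"
EOF
cat > "$payload_dir/0000_50_unrelated.yaml" <<'EOF'
metadata:
  annotations: {}
EOF

# Select manifests carrying the bootstrap annotation.
for f in "$payload_dir"/*.yaml; do
  if grep -q 'include.release.openshift.io/hypershift-bootstrap: "true"' "$f"; then
    echo "including $(basename "$f")"
    cp "$f" "$out_dir/"
  fi
done
# The real init container would then apply the filtered set, e.g.:
#   oc apply -f "$out_dir"
```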

I'm also dropping the openshift-config and openshift-config-managed Namespace creation. They are from a30db71 (#5125), but that commit doesn't explain why they were added or hint at where they lived before (if anywhere). I would expect the cluster-version operator to be able to create those Namespaces from the release-image manifests when they are needed, as with other cluster resources.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode


coderabbitai Bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e53f36a9-6ef3-448f-892f-f01b47321e5f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Mar 17, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and muraee March 17, 2026 18:17
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Mar 17, 2026

openshift-ci Bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci Bot commented Mar 18, 2026

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/verify
Commit: 1a59094
Required: true
Rerun command: /test verify

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wking added 2 commits March 18, 2026 15:55
… include.release.openshift.io/bootstrap-cluster-version-operator annotation

The cluster-version operator has a complicated system for deciding
whether a given release-image manifest should be managed in the
current cluster [1,2].  Implementing that system here, or even using
library-go and remembering to vendor-bump here, both seem like an
annoying maintenance load.

We could use the CVO's render command like the standalone installer
[3,4], but that logic is fairly complicated because it needs to
generate all the artifacts necessary for bootstrap MachineConfig
rendering, or the production machine-config operator will complain
about MachineConfigPools requesting rendered-... MachineConfig that
don't exist.

All we actually need out of the bootstrap container are the resources
that the cluster-version operator needs to launch and run, which are
labeled with the grep target since [5].  That avoids installing
anything the cluster doesn't actually need here by mistake.  Once the
production CVO container starts, it will apply the remaining resources
that the cluster actually needs.

The new "is there a .status.history entry?" guard keeps this loop from
running if we already have a functioning cluster-version operator (we
don't want to be wrestling with the CVO over the state of the
ClusterVersion CRD).  The 'oc apply' (instead of 'oc create') gives us
a clear "all of those exist now" exit code we can use to break out of
the loop during the initial setup (because this init-container needs
to complete before the long-running CVO container can start).
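The loop shape described above can be sketched as follows; get_history_len and apply_manifests are stand-ins for the real 'oc get clusterversion' and 'oc apply' calls, and the succeed-on-the-third-try behavior is simulated so the sketch runs on its own:

```shell
# Sketch of the bootstrap init-container loop, under the assumptions above.
attempts=0
get_history_len() { echo 0; }   # pretend: ClusterVersion has no .status.history yet
apply_manifests() {             # pretend: 'oc apply' succeeds on the third attempt
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

if [ "$(get_history_len)" -gt 0 ]; then
  # A CVO has already run; don't wrestle with it over ClusterVersion state.
  echo "existing CVO detected; skipping bootstrap"
else
  # 'oc apply' exits zero only once everything applied, which breaks the loop
  # and lets the init container complete.
  until apply_manifests; do
    echo "apply incomplete (attempt $attempts); retrying"
  done
  echo "bootstrap complete after $attempts attempts"
fi
```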

I'm also dropping the openshift-config and openshift-config-managed
Namespace creation.  They are from a30db71 (Refactor
cluster-version-operator, 2024-11-18, openshift#5125), but that commit doesn't
explain why they were added or hint at where they lived before (if
anywhere).  I would expect the cluster-version operator to be able to
create those Namespaces from the release-image manifests when they are
needed, as with other cluster resources.

I'm also shifting the ClusterVersion custom resource apply into the
loop, to avoid attempting to apply before the ClusterVersion CRD
exists and to more gracefully recover from temporary API hiccup sorts
of things.

I'm also adding some debugging echos and other output to make it
easier to debug "hey, why is it applying these resources that I didn't
expect it to?" or "... not applying the resources I did expect?".

[1]: https://github.com/openshift/enhancements/blob/2b38513b8661632f08e64f4acc3b856e842f8669/dev-guide/cluster-version-operator/dev/operators.md#manifest-inclusion-annotations
[2]: https://github.com/openshift/library-go/blob/ac826d10cb4081fe3034b027863c08953d95f602/pkg/manifest/manifest.go#L296-L376
[3]: https://github.com/openshift/installer/blob/a300d8c0e9d9d566a85740244a7da74d3d63e23c/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L189-L216
[4]: https://github.com/openshift/cluster-version-operator/blob/eaf28f5165bde27435b0f0c9a69458677034a58d/pkg/payload/render.go
[5]: openshift/cluster-version-operator#1352
…r-version-operator: Regenerate

Regenerate with:

  $ UPDATE=true make test
@wking wking force-pushed the narrowly-scoped-cvo-bootstrap branch from b18cd52 to 87457d8 on March 18, 2026 at 23:10
@wking wking changed the title WIP: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation OCPBUGS-78832: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation Mar 19, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Mar 19, 2026
@openshift-ci-robot

@wking: This pull request references Jira Issue OCPBUGS-78832, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this: (quotes the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 19, 2026
@openshift-bot

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2026
@openshift-bot

Stale PRs rot after 14d of inactivity.

Mark the PR as fresh by commenting /remove-lifecycle rotten.
Rotten PRs close after an additional 7d of inactivity.

If this PR is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2026
@hypershift-jira-solve-ci

Prow Job Failure Analysis: PR #7988

PR: OCPBUGS-78832 — control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation
Repository: openshift/hypershift


Job 1: ci/prow/verify

Build ID: 2034407152830910464
Status: ❌ Failed

Root Cause

Gitlint validation failure on two commits in the PR. The commit messages violate conventional commit rules:

  • CT1 — Missing conventional-commit prefix (e.g., fix:, feat:, chore:)
  • T1 — Title line exceeds 120 characters (144 characters)
  • B1 — Body lines exceed 140 characters (URLs pushing lines over limit)

The make run-gitlint target exits with error code 5, failing the verify step.

Recommendations
  1. Rewrite commit messages to use the conventional commit format required by the repo (e.g., fix(cvo): Consume include.release.openshift.io/hypershift-bootstrap annotation).
  2. Shorten the title line to ≤120 characters. Move detail into the commit body.
  3. Wrap body lines at 140 characters. Use bare URLs on their own line or shorten them if needed.
  4. Run make run-gitlint locally before pushing to catch formatting issues early.
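As a rough standalone illustration of the CT1 and T1 rules, the check below mirrors them in plain shell; the example title and the prefix list are assumptions for illustration, not the repo's actual gitlint configuration:

```shell
# Hypothetical pre-push sanity check for the two title rules cited above.
title='fix(cvo): consume hypershift-bootstrap annotation during CVO bootstrap'

# T1: title line must not exceed 120 characters.
if [ "${#title}" -le 120 ]; then
  echo "T1 ok"
else
  echo "T1 violation (${#title}>120)"
fi

# CT1: title must start with a conventional-commit prefix.
case "$title" in
  fix:*|'fix('*|feat:*|'feat('*|chore:*|'chore('*) echo "CT1 ok" ;;
  *) echo "CT1 violation" ;;
esac
```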
Evidence
Commit 1:
  CT1: Title does not start with a conventional-commit prefix
  T1:  Title exceeds max length (144>120)
  B1:  Body line exceeds max length (>140 chars)

Commit 2:
  CT1: Title does not start with a conventional-commit prefix

make: *** [Makefile:423: run-gitlint] Error 5

Job 2: ci/prow/e2e-azure-self-managed

Build ID: 2034407152449228800
Status: ❌ Failed
Failed Test: TestCreateCluster/ValidateHostedCluster

Root Cause

The CVO (Cluster Version Operator) bootstrap init container is stuck in an infinite loop because the ClusterVersion CRD (config.openshift.io/v1) is never registered with the API server. The full failure chain:

  1. PR changes the annotation-handling logic for include.release.openshift.io/hypershift-bootstrap (previously include.release.openshift.io/bootstrap-cluster-version-operator). This annotation controls which release manifests are included during the CVO bootstrap phase.
  2. The ClusterVersion CRD is no longer included in the set of manifests applied during bootstrap. Without the CRD, the bootstrap script's oc apply of /tmp/clusterversion.json fails repeatedly with:
    error: resource mapping not found for name: "version" namespace: ""
    from "/tmp/clusterversion.json": no matches for kind "ClusterVersion"
    in version "config.openshift.io/v1"
    ensure CRDs are installed first
    
  3. The bootstrap init container never exits — the entire 8,270-line log is this error repeating in an infinite retry loop (no backoff, no timeout).
  4. The main CVO container never starts — it remains in PodInitializing because init containers must complete first.
  5. No cluster operators are reconciled, so all CVO conditions remain Unknown ("Condition not found in the CVO.") and all 10 control-plane deployments have unavailable replicas.
  6. No worker nodes ever join — the test waits 45 minutes for 2 nodes, but 0 appear.
  7. Test times out: Failed to wait for 2 nodes to become ready in 45m0s: context deadline exceeded.

This is a product code regression introduced by the PR, not an infrastructure or flake issue.

Recommendations
  1. Verify the ClusterVersion CRD manifest carries the correct annotation (include.release.openshift.io/hypershift-bootstrap: "true" or whichever value the new code expects) so it is included in the bootstrap manifest set.
  2. Check the annotation-filtering logic in the PR's changes to ensure it doesn't exclude CRDs that the bootstrap script depends on. The old annotation (bootstrap-cluster-version-operator) may have had different inclusion semantics than the new one (hypershift-bootstrap).
  3. Add a bootstrap integration test that validates the ClusterVersion CRD is present in the filtered manifest set before the bootstrap script runs.
  4. Consider adding a timeout or error exit to the bootstrap init container script instead of retrying indefinitely — an infinite loop with no backoff masks the root cause during failure investigation.
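Recommendation 4 can be sketched as a bounded retry with backoff; apply_once stands in for the failing 'oc apply' call, and max_retries is an arbitrary assumption:

```shell
# Sketch: retry a bounded number of times instead of looping forever, so a
# persistent failure surfaces as a container error rather than a stuck init.
max_retries=5
apply_once() { false; }   # simulate an apply that never succeeds
i=0
ok=false
while [ "$i" -lt "$max_retries" ]; do
  if apply_once; then ok=true; break; fi
  i=$((i + 1))
  echo "attempt $i failed; backing off ${i}s"
  # a real script would sleep here, e.g.: sleep "$i"
done
[ "$ok" = true ] || echo "giving up after $max_retries attempts"
```

In the real init container the final branch would exit nonzero, failing the pod visibly instead of masking the root cause behind an 8,000-line retry log.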
Evidence

CVO Pod Status (cvo-pod.yaml):

Init Containers:
  availability-prober:  Completed (exitCode 0, finished 23:54:11Z)
  prepare-payload:      Completed (exitCode 0, finished 23:55:38Z)
  bootstrap:            Running (started 23:55:40Z, NEVER completed)
                        ready: false, started: true

Main Container:
  cluster-version-operator: Waiting — reason: PodInitializing
  
Condition: ContainersNotInitialized
  message: "containers with incomplete status: [bootstrap]"

CVO Bootstrap Log (8,270 lines, single repeating error):

error: the server doesn't have a resource type "clusterversions"
Applying CVO bootstrap manifests...
error: resource mapping not found for name: "version" namespace: ""
  from "/tmp/clusterversion.json": no matches for kind "ClusterVersion"
  in version "config.openshift.io/v1"
ensure CRDs are installed first

(This block repeats from line 1 through line 8,270 without ever breaking out.)

Test Output (build-log.txt):

util.go:573: Failed to wait for 2 nodes to become ready in 45m0s:
  context deadline exceeded

Cluster State at Failure:

  • All CVO conditions: Unknown — "Condition not found in the CVO."
  • 10 deployments with unavailable replicas (kube-apiserver, etcd, ignition-server, etc.)
  • 0 of 2 expected worker nodes joined

prepare-payload log: Empty (completed successfully but produced no diagnostic output).


Artifacts:

  • Test artifacts: .work/prow-job-analyze-test-failure/2034407152449228800/logs/
  • Verify artifacts: .work/prow-job-analyze-test-failure/2034407152830910464/logs/

✅ Analysis complete.

