
OCPBUGS-78832: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation#7988

Open
wking wants to merge 2 commits into openshift:main from wking:narrowly-scoped-cvo-bootstrap

Conversation


@wking wking commented Mar 17, 2026

What this PR does / why we need it:

The cluster-version operator has a complicated system for deciding whether a given release-image manifest should be managed in the current cluster. Implementing that system here, or even using library-go and remembering to vendor-bump here, both seem like an annoying maintenance load.

We could use the CVO's render command like the standalone installer, but that logic is fairly complicated because it needs to generate all the artifacts necessary for bootstrap MachineConfig rendering, or the production machine-config operator will complain about MachineConfigPools requesting rendered-... MachineConfig that don't exist.

All we actually need out of the bootstrap container are the resources that the cluster-version operator needs to launch and run, which are labeled with the grep target since openshift/cluster-version-operator#1352. That avoids installing anything the cluster doesn't actually need here by mistake. Once the production CVO container starts, it will apply the remaining resources that the cluster actually needs.
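A minimal, self-contained sketch of that annotation-based filtering follows; the directory layout and manifest contents are stand-ins invented for illustration, not the real release payload or bootstrap script:

```shell
# Hypothetical sketch: keep only release-image manifests that opt in to
# HyperShift bootstrap via the include.release.openshift.io/hypershift-bootstrap
# annotation. Paths and manifest bodies below are fabricated stand-ins.
payload_dir=$(mktemp -d)
out_dir=$(mktemp -d)

# One fake manifest that opts in to bootstrap, one that does not.
cat > "$payload_dir/0000_00_cvo-deployment.yaml" <<'EOF'
metadata:
  annotations:
    include.release.openshift.io/hypershift-bootstrap: "true"
EOF
cat > "$payload_dir/0000_50_unrelated.yaml" <<'EOF'
metadata:
  annotations: {}
EOF

# Select manifests carrying the bootstrap annotation.
for f in "$payload_dir"/*.yaml; do
  if grep -q 'include.release.openshift.io/hypershift-bootstrap: "true"' "$f"; then
    echo "including $(basename "$f")"
    cp "$f" "$out_dir/"
  fi
done
# The real init container would then apply the filtered set, e.g.:
#   oc apply -f "$out_dir"
```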

I'm also dropping the openshift-config and openshift-config-managed Namespace creation. They are from a30db71 (#5125), but that commit doesn't explain why they were added or hint at where they lived before (if anywhere). I would expect the cluster-version operator to be able to create those Namespaces from the release-image manifests when they are needed, as with other cluster resources.

Which issue(s) this PR fixes:

Fixes

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: LGTM mode


coderabbitai Bot commented Mar 17, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e53f36a9-6ef3-448f-892f-f01b47321e5f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-area labels Mar 17, 2026
@openshift-ci openshift-ci Bot requested review from devguyio and muraee March 17, 2026 18:17
@openshift-ci openshift-ci Bot added the area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release label Mar 17, 2026

openshift-ci Bot commented Mar 17, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign jparrill for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci Bot commented Mar 18, 2026

@wking: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/verify
Commit: 1a59094
Required: true
Rerun command: /test verify

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

wking added 2 commits March 18, 2026 15:55
… include.release.openshift.io/bootstrap-cluster-version-operator annotation

The cluster-version operator has a complicated system for deciding
whether a given release-image manifest should be managed in the
current cluster [1,2].  Implementing that system here, or even using
library-go and remembering to vendor-bump here, both seem like an
annoying maintenance load.

We could use the CVO's render command like the standalone installer
[3,4], but that logic is fairly complicated because it needs to
generate all the artifacts necessary for bootstrap MachineConfig
rendering, or the production machine-config operator will complain
about MachineConfigPools requesting rendered-... MachineConfig that
don't exist.

All we actually need out of the bootstrap container are the resources
that the cluster-version operator needs to launch and run, which are
labeled with the grep target since [5].  That avoids installing
anything the cluster doesn't actually need here by mistake.  Once the
production CVO container starts, it will apply the remaining resources
that the cluster actually needs.

The new "is there a .status.history entry?" guard keeps this loop from
running if we already have a functioning cluster-version operator (we
don't want to be wrestling with the CVO over the state of the
ClusterVersion CRD).  The 'oc apply' (instead of 'oc create') gives us
a clear "all of those exist now" exit code we can use to break out of
the loop during the initial setup (because this init-container needs
to complete before the long-running CVO container can start).
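The loop shape described above can be sketched as follows; get_history_len and apply_manifests are stand-ins for the real 'oc get clusterversion' and 'oc apply' calls, and the succeed-on-the-third-try behavior is simulated so the sketch runs on its own:

```shell
# Sketch of the bootstrap init-container loop, under the assumptions above.
attempts=0
get_history_len() { echo 0; }   # pretend: ClusterVersion has no .status.history yet
apply_manifests() {             # pretend: 'oc apply' succeeds on the third attempt
  attempts=$((attempts + 1))
  [ "$attempts" -ge 3 ]
}

if [ "$(get_history_len)" -gt 0 ]; then
  # A CVO has already run; don't wrestle with it over ClusterVersion state.
  echo "existing CVO detected; skipping bootstrap"
else
  # 'oc apply' exits zero only once everything applied, which breaks the loop
  # and lets the init container complete.
  until apply_manifests; do
    echo "apply incomplete (attempt $attempts); retrying"
  done
  echo "bootstrap complete after $attempts attempts"
fi
```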

I'm also dropping the openshift-config and openshift-config-managed
Namespace creation.  They are from a30db71 (Refactor
cluster-version-operator, 2024-11-18, openshift#5125), but that commit doesn't
explain why they were added or hint at where they lived before (if
anywhere).  I would expect the cluster-version operator to be able to
create those Namespaces from the release-image manifests when they are
needed, as with other cluster resources.

I'm also shifting the ClusterVersion custom resource apply into the
loop, to avoid attempting to apply before the ClusterVersion CRD
exists and to more gracefully recover from temporary API hiccup sorts
of things.

I'm also adding some debugging echos and other output to make it
easier to debug "hey, why is it applying these resources that I didn't
expect it to?" or "... not applying the resources I did expect?".

[1]: https://github.com/openshift/enhancements/blob/2b38513b8661632f08e64f4acc3b856e842f8669/dev-guide/cluster-version-operator/dev/operators.md#manifest-inclusion-annotations
[2]: https://github.com/openshift/library-go/blob/ac826d10cb4081fe3034b027863c08953d95f602/pkg/manifest/manifest.go#L296-L376
[3]: https://github.com/openshift/installer/blob/a300d8c0e9d9d566a85740244a7da74d3d63e23c/data/data/bootstrap/files/usr/local/bin/bootkube.sh.template#L189-L216
[4]: https://github.com/openshift/cluster-version-operator/blob/eaf28f5165bde27435b0f0c9a69458677034a58d/pkg/payload/render.go
[5]: openshift/cluster-version-operator#1352
…r-version-operator: Regenerate

Regenerate with:

  $ UPDATE=true make test
@wking wking force-pushed the narrowly-scoped-cvo-bootstrap branch from b18cd52 to 87457d8 on March 18, 2026 at 23:10
@wking wking changed the title WIP: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation OCPBUGS-78832: control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation Mar 19, 2026
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2026
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. labels Mar 19, 2026
@openshift-ci-robot

@wking: This pull request references Jira Issue OCPBUGS-78832, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this: (quotes the PR description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 19, 2026
@openshift-bot

Stale PRs are closed after 21d of inactivity.

If this PR is still relevant, comment to refresh it or remove the stale label.
Mark the PR as fresh by commenting /remove-lifecycle stale.

If this PR is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci Bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 18, 2026
@openshift-bot

Stale PRs rot after 14d of inactivity.

Mark the PR as fresh by commenting /remove-lifecycle rotten.
Rotten PRs close after an additional 7d of inactivity.

If this PR is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci Bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 2, 2026
@hypershift-jira-solve-ci

Prow Job Failure Analysis: PR #7988

PR: OCPBUGS-78832 — control-plane-operator/controllers/hostedcontrolplane/v2/cvo: Consume include.release.openshift.io/hypershift-bootstrap annotation
Repository: openshift/hypershift


Job 1: ci/prow/verify

Build ID: 2034407152830910464
Status: ❌ Failed

Root Cause

Gitlint validation failure on two commits in the PR. The commit messages violate conventional commit rules:

  • CT1 — Missing conventional-commit prefix (e.g., fix:, feat:, chore:)
  • T1 — Title line exceeds 120 characters (144 characters)
  • B1 — Body lines exceed 140 characters (URLs pushing lines over limit)

The make run-gitlint target exits with error code 5, failing the verify step.

Recommendations
  1. Rewrite commit messages to use the conventional commit format required by the repo (e.g., fix(cvo): Consume include.release.openshift.io/hypershift-bootstrap annotation).
  2. Shorten the title line to ≤120 characters. Move detail into the commit body.
  3. Wrap body lines at 140 characters. Use bare URLs on their own line or shorten them if needed.
  4. Run make run-gitlint locally before pushing to catch formatting issues early.
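As a rough standalone illustration of the CT1 and T1 rules, the check below mirrors them in plain shell; the example title and the prefix list are assumptions for illustration, not the repo's actual gitlint configuration:

```shell
# Hypothetical pre-push sanity check for the two title rules cited above.
title='fix(cvo): consume hypershift-bootstrap annotation during CVO bootstrap'

# T1: title line must not exceed 120 characters.
if [ "${#title}" -le 120 ]; then
  echo "T1 ok"
else
  echo "T1 violation (${#title}>120)"
fi

# CT1: title must start with a conventional-commit prefix.
case "$title" in
  fix:*|'fix('*|feat:*|'feat('*|chore:*|'chore('*) echo "CT1 ok" ;;
  *) echo "CT1 violation" ;;
esac
```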
Evidence
Commit 1:
  CT1: Title does not start with a conventional-commit prefix
  T1:  Title exceeds max length (144>120)
  B1:  Body line exceeds max length (>140 chars)

Commit 2:
  CT1: Title does not start with a conventional-commit prefix

make: *** [Makefile:423: run-gitlint] Error 5

Job 2: ci/prow/e2e-azure-self-managed

Build ID: 2034407152449228800
Status: ❌ Failed
Failed Test: TestCreateCluster/ValidateHostedCluster

Root Cause

The CVO (Cluster Version Operator) bootstrap init container is stuck in an infinite loop because the ClusterVersion CRD (config.openshift.io/v1) is never registered with the API server. The full failure chain:

  1. PR changes the annotation-handling logic for include.release.openshift.io/hypershift-bootstrap (previously include.release.openshift.io/bootstrap-cluster-version-operator). This annotation controls which release manifests are included during the CVO bootstrap phase.
  2. The ClusterVersion CRD is no longer included in the set of manifests applied during bootstrap. Without the CRD, the bootstrap script's oc apply of /tmp/clusterversion.json fails repeatedly with:
    error: resource mapping not found for name: "version" namespace: ""
    from "/tmp/clusterversion.json": no matches for kind "ClusterVersion"
    in version "config.openshift.io/v1"
    ensure CRDs are installed first
    
  3. The bootstrap init container never exits — the entire 8,270-line log is this error repeating in an infinite retry loop (no backoff, no timeout).
  4. The main CVO container never starts — it remains in PodInitializing because init containers must complete first.
  5. No cluster operators are reconciled, so all CVO conditions remain Unknown ("Condition not found in the CVO.") and all 10 control-plane deployments have unavailable replicas.
  6. No worker nodes ever join — the test waits 45 minutes for 2 nodes, but 0 appear.
  7. Test times out: Failed to wait for 2 nodes to become ready in 45m0s: context deadline exceeded.

This is a product code regression introduced by the PR, not an infrastructure or flake issue.

Recommendations
  1. Verify the ClusterVersion CRD manifest carries the correct annotation (include.release.openshift.io/hypershift-bootstrap: "true" or whichever value the new code expects) so it is included in the bootstrap manifest set.
  2. Check the annotation-filtering logic in the PR's changes to ensure it doesn't exclude CRDs that the bootstrap script depends on. The old annotation (bootstrap-cluster-version-operator) may have had different inclusion semantics than the new one (hypershift-bootstrap).
  3. Add a bootstrap integration test that validates the ClusterVersion CRD is present in the filtered manifest set before the bootstrap script runs.
  4. Consider adding a timeout or error exit to the bootstrap init container script instead of retrying indefinitely — an infinite loop with no backoff masks the root cause during failure investigation.
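Recommendation 4 can be sketched as a bounded retry with backoff; apply_once stands in for the failing 'oc apply' call, and max_retries is an arbitrary assumption:

```shell
# Sketch: retry a bounded number of times instead of looping forever, so a
# persistent failure surfaces as a container error rather than a stuck init.
max_retries=5
apply_once() { false; }   # simulate an apply that never succeeds
i=0
ok=false
while [ "$i" -lt "$max_retries" ]; do
  if apply_once; then ok=true; break; fi
  i=$((i + 1))
  echo "attempt $i failed; backing off ${i}s"
  # a real script would sleep here, e.g.: sleep "$i"
done
[ "$ok" = true ] || echo "giving up after $max_retries attempts"
```

In the real init container the final branch would exit nonzero, failing the pod visibly instead of masking the root cause behind an 8,000-line retry log.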
Evidence

CVO Pod Status (cvo-pod.yaml):

Init Containers:
  availability-prober:  Completed (exitCode 0, finished 23:54:11Z)
  prepare-payload:      Completed (exitCode 0, finished 23:55:38Z)
  bootstrap:            Running (started 23:55:40Z, NEVER completed)
                        ready: false, started: true

Main Container:
  cluster-version-operator: Waiting — reason: PodInitializing
  
Condition: ContainersNotInitialized
  message: "containers with incomplete status: [bootstrap]"

CVO Bootstrap Log (8,270 lines, single repeating error):

error: the server doesn't have a resource type "clusterversions"
Applying CVO bootstrap manifests...
error: resource mapping not found for name: "version" namespace: ""
  from "/tmp/clusterversion.json": no matches for kind "ClusterVersion"
  in version "config.openshift.io/v1"
ensure CRDs are installed first

(This block repeats from line 1 through line 8,270 without ever breaking out.)

Test Output (build-log.txt):

util.go:573: Failed to wait for 2 nodes to become ready in 45m0s:
  context deadline exceeded

Cluster State at Failure:

  • All CVO conditions: Unknown — "Condition not found in the CVO."
  • 10 deployments with unavailable replicas (kube-apiserver, etcd, ignition-server, etc.)
  • 0 of 2 expected worker nodes joined

prepare-payload log: Empty (completed successfully but produced no diagnostic output).


Artifacts:

  • Test artifacts: .work/prow-job-analyze-test-failure/2034407152449228800/logs/
  • Verify artifacts: .work/prow-job-analyze-test-failure/2034407152830910464/logs/

✅ Analysis complete.

