V0.31 pre upgrade clean #1097

Merged
openshift-merge-bot[bot] merged 1 commit into project-codeflare:v0.31 from pawelpaszki:v0.31-pre-upgrade-clean
Jun 15, 2026

@pawelpaszki pawelpaszki commented Jun 3, 2026

Contributor

Issue link

RHOAIENG-63111 — RHOAI 2.25 pre-upgrade: enable vanilla pre_upgrade tests and integrate migration script.

Parent epic: RHOAIENG-63109 (Ray upgrade qualification 2.25 → 3.x). Program: RHAISTRAT-1519.

Closes RHOAIENG-63111


What changes have been made

This PR adds the 2.25 pre-upgrade qualification path for codeflare-sdk. Post-upgrade migration (2→3 normalize/recreate) remains on the v0.36 line; this branch is intentionally pre_upgrade only.

Base branch: v0.31 (origin/v0.31, release 0.31.5) — includes upstream Ray 2.52.1 and cryptography 46.0.5 (not the older v0.31.1-test-fix dependency line).

Enable vanilla pre-upgrade tests

  • Container test image (images/tests/): Dockerfile, entrypoint.sh, run-tests.sh, RBAC manifest for the test user.
    • Default marker: pre_upgrade (rejects -m post_upgrade on this image).
    • Cluster auth detection (htpasswd/LDAP vs BYOIDC) aligned with test credentials in the env file.
  • Upgrade tests (tests/upgrade/):
    • 01_raycluster_sdk_upgrade_test.py — seed RayCluster mnist in test-ns-rayupgrade with Kueue objects (namespace label, LocalQueue, etc.).
    • 02_dashboard_ui_upgrade_test.py — Workload Metrics UI check (optional path when seed succeeds).
    • conftest.py — UI fixtures; session ordering hooks.
  • UI support (tests/ui/): page objects and login flow for OpenShift OAuth / BYOIDC dashboard tests.
  • tests/e2e/support.py: namespace creation (409 tolerate, Kueue managed label), shared helpers for upgrade tests.
  • pyproject.toml / poetry.lock: selenium, webdriver-manager; pytest markers pre_upgrade, post_upgrade, ui.
  • Makefile: build-test-image / push-test-image for quay.io/opendatahub/codeflare-sdk-tests.
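
The "409 tolerate" behaviour mentioned for tests/e2e/support.py can be sketched roughly as follows. This is an illustration only, not the helper in this PR: `ApiError`, `ensure_namespace`, the injected `create` callable, and the label key/value are all hypothetical stand-ins chosen so the sketch runs without a cluster.

```python
from typing import Callable

# Assumption: placeholder label; the real key/value used by support.py may differ.
KUEUE_MANAGED_LABEL = {"kueue.openshift.io/managed": "true"}


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes client error carrying an HTTP status."""

    def __init__(self, status: int):
        super().__init__(f"API error {status}")
        self.status = status


def namespace_body(name: str) -> dict:
    """Build a Namespace manifest carrying the (assumed) Kueue managed label."""
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": dict(KUEUE_MANAGED_LABEL)},
    }


def ensure_namespace(name: str, create: Callable[[dict], None]) -> bool:
    """Return True if the namespace was created, False if it already existed (409)."""
    try:
        create(namespace_body(name))
        return True
    except ApiError as err:
        if err.status == 409:  # AlreadyExists: tolerate so reruns are idempotent
            return False
        raise  # any other status is a real failure
```

Tolerating only 409 keeps reruns of the seed step idempotent while still surfacing permission or connectivity errors.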

Migration script in pre-upgrade finalize

  • Vendored scripts/migration/ray_cluster_migration.py from rhoai-upgrade-helpers (source of truth on main; keep in sync).
  • 03_ray_migration_pre_upgrade_finalize_test.py — last step in pre_upgrade suite:
    • Runs ray_cluster_migration.py pre-upgrade for the qualification cluster.
    • Asserts codeflare: Removed on DataScienceCluster (mandatory before 2→3 OLM upgrade).
    • Asserts pre-upgrade artifacts (e.g. Ray-owned OpenShift Routes removed when a cluster exists).
  • migration_support.py, constants.py, tests/upgrade/README.md — wrappers, shared IDs, operator docs.
  • run-tests.sh / Dockerfile: copy migration script into the test image.
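
As a rough illustration of the "codeflare: Removed" assertion in the finalize step, the check reduces to reading one field from the DataScienceCluster object. The sketch below works on a plain dict; the `spec.components.<name>.managementState` path and both function names are assumptions for illustration, not code from this PR.

```python
from typing import Optional


def component_state(dsc: dict, component: str) -> Optional[str]:
    """Read managementState for one component; None if any level is missing.

    Assumption: the DSC stores it under spec.components.<component>.
    """
    return (
        dsc.get("spec", {})
        .get("components", {})
        .get(component, {})
        .get("managementState")
    )


def assert_codeflare_removed(dsc: dict) -> None:
    """Raise AssertionError unless codeflare is marked Removed on the DSC."""
    state = component_state(dsc, "codeflare")
    if state != "Removed":
        raise AssertionError(f"expected codeflare Removed, got {state!r}")
```

In a real test the dict would come from whatever client fetches the DataScienceCluster; that lookup is left out here on purpose.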

Pre-upgrade flow (single pytest session, then stop)

pytest -m pre_upgrade
  01  Seed RayCluster + Kueue (mnist / test-ns-rayupgrade)
  02  Dashboard UI (Workload Metrics) — optional if seed failed
  03  Migration pre-upgrade finalize (script + DSC codeflare Removed)
→ RHOAI OLM upgrade (outside this repo; stable-2.25 → target channel)
→ post_upgrade on v0.36* branch / image (separate PR)

Out of scope (this PR)

  • post_upgrade tests and 2→3 CR normalization (v0.36-2.25-3.x-post-upgrade / RHOAIENG-63109 post phase).
  • Jenkins pipeline changes (consumers only run the published test image with -m pre_upgrade).
  • Installing operators or changing DSC beyond what the migration script does during finalize.

Verification steps

What was done to verify (author)

Verified on RHOAI 2.25.x OpenShift clusters (QE-style: htpasswd/LDAP, rhoai-catalog-dev FBC, stable-2.25 subscription).

| Step | Result |
| --- | --- |
| Branch rebased cleanly on origin/v0.31 (2 commits only; no stray 2.25 maintenance commits). | OK |
| Test image built from v0.31-pre-upgrade-clean (make build-test-image). | OK |
| pytest tests/upgrade/ -m pre_upgrade in container with cluster env file + kubeconfig. | Passing (3 tests; migration finalize + seed + UI as applicable). |
| Migration script output: pre-upgrade checks, codeflare: Removed, backup under /tmp/rhoai-upgrade-backup/ray, Ray-owned Routes removed when cluster present. | OK |
| Dependencies: Ray 2.52.1, cryptography 46.0.5 (via v0.31.5 base; no need for v0.31.1-test-fix). | OK |

Known flake (documented, not blocking): on a fresh cluster with Ray images still pulling, KubeRay may recreate OpenShift Routes after the script deletes them but before enableIngress: false is applied. Reruns typically pass. See upgrades/script_fix_suggestion.md in the qualification notes repo.


How to re-verify (reviewers / CI)

A. Cluster prerequisites (Testops / lab — not installed by these tests)

Before running pre_upgrade:

| Prerequisite | Expected |
| --- | --- |
| RHOAI operator | 2.25.x (e.g. rhods-operator.2.25.7), subscription stable-2.25 (or stable). |
| DSC | ray: Managed, kueue: Unmanaged (+ RHBoK / openshift-kueue-operator if using Kueue in tests). |
| DSC for 2→3 path | Per upgrade guide before OLM: e.g. codeflare may start Managed; finalize sets Removed. |
| Catalog | FBC that provides 2.25 install channel (and later 3.x upgrade channel in a separate step). |
| Credentials | Env file with OCP_ADMIN_USER_*, TEST_USER_*, ODH_DASHBOARD_URL (or equivalent); see images/tests/run-tests.sh comments. |
| Same cluster | Pre and post phases use the same cluster/kubeconfig across Jenkins or manual runs. |
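
A quick sanity check of the credentials prerequisite could look like the sketch below. It only verifies that something non-empty exists for each variable family named above; the function name is hypothetical, the exact suffixes behind the `*` are deliberately left open, and an "equivalent" of ODH_DASHBOARD_URL would need its own handling.

```python
from typing import List, Mapping

# Names taken from the prerequisites above; the "*" suffixes are left open on purpose.
REQUIRED_PREFIXES = ("OCP_ADMIN_USER_", "TEST_USER_")
REQUIRED_KEYS = ("ODH_DASHBOARD_URL",)


def missing_credentials(env: Mapping[str, str]) -> List[str]:
    """Return the required keys or prefixes that have no non-empty value in env."""
    missing = []
    for prefix in REQUIRED_PREFIXES:
        # At least one variable with this prefix must be set to a non-empty value.
        if not any(k.startswith(prefix) and v for k, v in env.items()):
            missing.append(prefix + "*")
    for key in REQUIRED_KEYS:
        if not env.get(key):
            missing.append(key)
    return missing
```

Calling `missing_credentials(os.environ)` before starting the run would flag an incomplete env file early instead of failing midway through the seed test.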

B. Build the test image (this PR branch)

git fetch origin
git checkout v0.31-pre-upgrade-clean   # or PR head branch

cd codeflare-sdk
E2E_TEST_IMAGE_VERSION=test make build-test-image
# Image: quay.io/opendatahub/codeflare-sdk-tests:test

Use linux/amd64 if building on Apple Silicon (Makefile already sets platform).

C. Run pre-upgrade tests against RHOAI 2.25

# Env file: cluster URL, users/passwords, optional BYOIDC_* overrides
podman run --rm --platform linux/amd64 --pull=never \
  -v "${KUBECONFIG:-$HOME/.kube/config}:/codeflare-sdk/tests/.kube/config:ro" \
  -v "$(pwd)/tests/results:/codeflare-sdk/tests/results:Z" \
  --env-file /path/to/cluster-env-file \
  quay.io/opendatahub/codeflare-sdk-tests:test

Equivalent:

pytest tests/upgrade/ -m pre_upgrade -v

Pass criteria:

  • All collected pre_upgrade tests pass.
  • 03 finalize: DataScienceCluster codeflare.managementState: Removed.
  • If 01 seeded a cluster: backup exists under $RHOAI_UPGRADE_BACKUP_DIR/ray (default /tmp/rhoai-upgrade-backup/ray inside the runner).
  • Container exits non-zero if invoked with -m post_upgrade (guard for wrong phase image).
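
The wrong-phase guard in the last criterion is implemented in the image's shell scripts; the logic can be sketched in Python as below. Function names and the exit code value are assumptions for illustration, and the substring check is a simplification (it would also reject an expression such as "not post_upgrade").

```python
import sys
from typing import Sequence


def requested_marker(argv: Sequence[str], default: str = "pre_upgrade") -> str:
    """Return the pytest -m expression found in argv, or the image default."""
    args = list(argv)
    for i, arg in enumerate(args):
        if arg == "-m" and i + 1 < len(args):
            return args[i + 1]  # form: -m EXPR
        if arg.startswith("-m") and len(arg) > 2:
            return arg[2:]  # form: -mEXPR
    return default


def guard_phase(argv: Sequence[str]) -> int:
    """Return a non-zero exit code when a post_upgrade run is requested."""
    marker = requested_marker(argv)
    if "post_upgrade" in marker:  # simplification: plain substring match
        print("this image only runs pre_upgrade tests", file=sys.stderr)
        return 2  # assumed value; the source only says "non-zero"
    return 0
```

Failing fast here means a pipeline that points at the wrong phase image stops before touching the cluster.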

Optional manual script run (same cluster, after tests or standalone):

export PYTHONWARNINGS='ignore::urllib3.exceptions.InsecureRequestWarning'
python scripts/migration/ray_cluster_migration.py pre-upgrade \
  --namespace test-ns-rayupgrade --cluster mnist

D. After pre_upgrade — do not re-run seed without revert

Once finalize has run, codeflare is Removed on the DSC. To re-run pre_upgrade on the same 2.25 cluster without OLM upgrade:

  1. Delete namespace test-ns-rayupgrade (and Kueue objects if needed).
  2. Patch DSC codeflare back to Managed.
  3. Re-run -m pre_upgrade.
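
For step 2, the patch body could be generated as in this sketch. The `spec.components.<name>.managementState` layout and the helper name are assumptions for illustration; check them against the DataScienceCluster on your cluster before use.

```python
import json

# States mentioned in this PR description; the operator may accept others.
ALLOWED_STATES = {"Managed", "Unmanaged", "Removed"}


def management_state_patch(component: str, state: str) -> str:
    """Build a JSON merge-patch body that sets one DSC component's managementState."""
    if state not in ALLOWED_STATES:
        raise ValueError(f"unsupported managementState: {state!r}")
    patch = {"spec": {"components": {component: {"managementState": state}}}}
    return json.dumps(patch)
```

The resulting string from `management_state_patch("codeflare", "Managed")` could then be supplied as the `-p` argument of `oc patch` with `--type merge` against the cluster's DataScienceCluster resource.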

Otherwise proceed to RHOAI OLM upgrade (channel per your target, e.g. support-required-upgrade → 3.3 or beta → 3.5 EA) and then post_upgrade on the v0.36 test image/branch.

E. Upgrade verification (required context — separate from this PR)

This PR only qualifies pre state on 2.25. Full 2→3 qualification additionally requires:

  1. OLM upgrade on the same cluster (orchestrated outside codeflare-sdk).
  2. pytest -m post_upgrade using the v0.36 branch image (migration post-upgrade, job submit, UI).

Document those results under RHOAIENG-63109 / post-upgrade PR; not gating merge of this pre-upgrade PR.


Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

Manual tests performed: containerized pytest -m pre_upgrade on RHOAI 2.25.7 cluster(s); migration script integration; DSC codeflare: Removed assertion.


@openshift-ci openshift-ci Bot requested review from chipspeak and dimakis June 3, 2026 10:36
@codecov

codecov Bot commented Jun 3, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.47%. Comparing base (fafc912) to head (1648314).
⚠️ Report is 1 commits behind head on v0.31.

Additional details and impacted files
@@           Coverage Diff           @@
##            v0.31    #1097   +/-   ##
=======================================
  Coverage   92.47%   92.47%           
=======================================
  Files          25       25           
  Lines        1395     1395           
=======================================
  Hits         1290     1290           
  Misses        105      105           


@pawelpaszki

Contributor Author

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 4, 2026
@pawelpaszki

Contributor Author

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 12, 2026
pass # Will fall through to namespace search

# If cluster-wide search didn't find anything, try namespace-specific search
if not httproute:

Contributor

Shouldn't the Ray cluster's namespace be included here, since the HTTPRoute is created in that namespace?


gateway_ref = parent_refs[0]
gateway_name = gateway_ref.get("name")
gateway_namespace = gateway_ref.get("namespace")

Contributor

Suggested change
- gateway_namespace = gateway_ref.get("namespace")
+ gateway_namespace = gateway_ref.get("namespace") or httproute.get("metadata", {}).get("namespace")

parentRefs[].namespace is optional in the Gateway API spec; when it is omitted, it defaults to the HTTPRoute's own namespace. The current code returns None when gateway_namespace is falsy, so any HTTPRoute that does not explicitly set namespace in its parentRef is silently skipped and the dashboard URL is lost.
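
A minimal sketch of the defaulting rule described in this comment, written against plain dicts. `resolve_gateway_ref` is a hypothetical helper for illustration, not a function from the migration script.

```python
from typing import Optional, Tuple


def resolve_gateway_ref(httproute: dict) -> Optional[Tuple[str, str]]:
    """Return (gateway_name, gateway_namespace) for the first parentRef, or None.

    When the parentRef omits namespace, fall back to the HTTPRoute's own
    namespace, which is the Gateway API default for that field.
    """
    parent_refs = httproute.get("spec", {}).get("parentRefs") or []
    if not parent_refs:
        return None
    gateway_ref = parent_refs[0]
    gateway_name = gateway_ref.get("name")
    gateway_namespace = gateway_ref.get("namespace") or httproute.get(
        "metadata", {}
    ).get("namespace")
    if not gateway_name or not gateway_namespace:
        return None
    return gateway_name, gateway_namespace
```

With this fallback, a route that relies on the default is resolved to its own namespace rather than being dropped.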

Comment on lines +2794 to +2797
pre_upgrade(
    cluster_name=args.cluster_name,
    namespace=args.namespace,
)

Contributor

Suggested change
- pre_upgrade(
-     cluster_name=args.cluster_name,
-     namespace=args.namespace,
- )
+ backup_files = pre_upgrade(
+     cluster_name=args.cluster_name,
+     namespace=args.namespace,
+ )
+ if not backup_files:
+     sys.exit(1)

Contributor

When required pre-flight checks fail, pre_upgrade() returns [] (line 1760) instead of raising. main() calls pre_upgrade() here but does not check the return value, so the process exits 0. Anyone running ray_cluster_migration.py pre-upgrade && proceed-with-upgrade will proceed even though the pre-flight checks failed. Fix: either have pre_upgrade() raise SystemExit(1) on required failures, or check the return value in main() and call sys.exit(1).
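
The second option (checking the return value) can be illustrated with the self-contained sketch below. The `pre_upgrade` here is a hypothetical stand-in that only mimics the "empty list on failure" contract described in this comment, and the backup file name is a placeholder.

```python
import sys
from typing import List


def pre_upgrade(cluster_name: str, namespace: str, preflight_ok: bool = True) -> List[str]:
    """Hypothetical stand-in: returns backup paths, or [] when pre-flight fails."""
    if not preflight_ok:
        return []
    return ["backup-placeholder.yaml"]  # placeholder, not a real artifact name


def main(preflight_ok: bool = True) -> int:
    """Map an empty result to a non-zero exit code so shell chains stop."""
    backup_files = pre_upgrade("mnist", "test-ns-rayupgrade", preflight_ok)
    if not backup_files:
        print("pre-upgrade checks failed; aborting", file=sys.stderr)
        return 1
    return 0
```

Wiring this up as `sys.exit(main())` makes a `pre-upgrade && next-step` chain stop on failure, which is the behaviour the comment asks for.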

@pawelpaszki pawelpaszki force-pushed the v0.31-pre-upgrade-clean branch from a77f7e3 to 1648314 Compare June 15, 2026 09:14
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2026
@openshift-ci

openshift-ci Bot commented Jun 15, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kryanbeane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2026
@openshift-merge-bot openshift-merge-bot Bot merged commit 1cf9741 into project-codeflare:v0.31 Jun 15, 2026
10 checks passed
@pawelpaszki pawelpaszki deleted the v0.31-pre-upgrade-clean branch June 15, 2026 12:16