V0.31 pre upgrade clean #1097
Codecov Report
✅ All modified and coverable lines are covered by tests.

    @@          Coverage Diff          @@
    ##            v0.31    #1097   +/-   ##
    =======================================
      Coverage   92.47%   92.47%
    =======================================
      Files          25       25
      Lines        1395     1395
    =======================================
      Hits         1290     1290
      Misses        105      105
/hold

/hold cancel
            pass  # Will fall through to namespace search

    # If cluster-wide search didn't find anything, try namespace-specific search
    if not httproute:
Shouldn't the RayCluster's namespace be included here, since the HTTPRoute is created in that namespace?
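The reviewer's point can be illustrated with a small sketch of a namespace-specific fallback that always includes the RayCluster's namespace. It assumes a hypothetical `list_httproutes(namespace)` callable (for example, a thin wrapper around the Kubernetes client) and a hypothetical `matches(route)` predicate; neither name comes from the PR. The cluster's own namespace is searched first, then any other candidate namespaces, skipping blanks and duplicates.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Route = Dict[str, Any]


def namespaces_to_search(cluster_namespace: str, extra_namespaces: Iterable[str] = ()) -> List[str]:
    """Return the RayCluster's namespace first, then the extras, skipping blanks and duplicates."""
    ordered: List[str] = []
    for ns in [cluster_namespace, *extra_namespaces]:
        if ns and ns not in ordered:
            ordered.append(ns)
    return ordered


def find_httproute(
    list_httproutes: Callable[[str], List[Route]],  # hypothetical: lists HTTPRoutes in one namespace
    matches: Callable[[Route], bool],               # hypothetical: identifies the cluster's route
    cluster_namespace: str,
    extra_namespaces: Iterable[str] = (),
) -> Optional[Route]:
    """Namespace-specific fallback search that always covers the RayCluster's namespace."""
    for ns in namespaces_to_search(cluster_namespace, extra_namespaces):
        for route in list_httproutes(ns):
            if matches(route):
                return route
    return None
```

Searching the cluster's own namespace first means the common case (route created alongside the RayCluster) needs a single list call.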
    gateway_ref = parent_refs[0]
    gateway_name = gateway_ref.get("name")
    gateway_namespace = gateway_ref.get("namespace")
Suggested change:

    - gateway_namespace = gateway_ref.get("namespace")
    + gateway_namespace = gateway_ref.get("namespace") or httproute.get("metadata", {}).get("namespace")
`parentRefs[].namespace` is optional in the Gateway API spec; when it is omitted, it defaults to the HTTPRoute's own namespace. But the code returns `None` whenever `gateway_namespace` is falsy, so any HTTPRoute that doesn't explicitly set `namespace` in its parentRef is silently skipped and the dashboard URL is lost.
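A minimal sketch of the suggested fallback, applied to an HTTPRoute represented as a plain dict (the shape returned by the Kubernetes custom-objects client). The function name `resolve_parent_gateway` is hypothetical; only the `gateway_namespace` expression mirrors the suggested change.

```python
from typing import Any, Dict, Optional, Tuple


def resolve_parent_gateway(httproute: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (gateway_name, gateway_namespace) for the route's first parentRef, or None."""
    parent_refs = httproute.get("spec", {}).get("parentRefs") or []
    if not parent_refs:
        return None

    gateway_ref = parent_refs[0]
    gateway_name = gateway_ref.get("name")
    # Gateway API: parentRefs[].namespace is optional and defaults to the route's own namespace.
    gateway_namespace = gateway_ref.get("namespace") or httproute.get("metadata", {}).get("namespace")

    if not gateway_name or not gateway_namespace:
        return None
    return gateway_name, gateway_namespace
```

Applying the default before the `None` check means only routes that genuinely lack a gateway reference are skipped.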
    pre_upgrade(
        cluster_name=args.cluster_name,
        namespace=args.namespace,
    )
Suggested change:

    - pre_upgrade(
    -     cluster_name=args.cluster_name,
    -     namespace=args.namespace,
    - )
    + backup_files = pre_upgrade(
    +     cluster_name=args.cluster_name,
    +     namespace=args.namespace,
    + )
    + if not backup_files:
    +     sys.exit(1)
When required pre-flight checks fail, `pre_upgrade()` returns `[]` (line 1760) instead of raising. `main()` calls `pre_upgrade()` here but doesn't check the return value, so the process exits 0. Anyone running `ray_cluster_migration.py pre-upgrade && proceed-with-upgrade` will proceed even when pre-flight checks failed. Fix: either have `pre_upgrade()` raise `SystemExit(1)` on required failures, or check the return value in `main()` and call `sys.exit(1)`.
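The first option (raising from inside `pre_upgrade()`) could look roughly like this. `PreflightCheck` and `run_preflight` are hypothetical names, not part of `ray_cluster_migration.py`; the sketch only shows the exit-code behavior: failed optional checks are reported as warnings, and any failed required check raises `SystemExit(1)` so a `pre-upgrade && ...` chain stops.

```python
import sys
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreflightCheck:
    name: str
    required: bool
    run: Callable[[], bool]  # returns True when the check passes


def run_preflight(checks: List[PreflightCheck]) -> List[str]:
    """Run every check; return names of failed optional checks, exit(1) on any required failure."""
    failed_required: List[str] = []
    failed_optional: List[str] = []
    for check in checks:
        if check.run():
            continue
        (failed_required if check.required else failed_optional).append(check.name)

    for name in failed_optional:
        print(f"WARNING: optional pre-flight check failed: {name}", file=sys.stderr)
    if failed_required:
        print(f"ERROR: required pre-flight checks failed: {', '.join(failed_required)}", file=sys.stderr)
        raise SystemExit(1)  # non-zero exit stops `pre-upgrade && ...` chains
    return failed_optional
```

Raising at the failure site also leaves an empty return value free to mean something else (for example, that there was nothing to back up), which a plain `if not backup_files` check in `main()` cannot distinguish. Whether the script uses `[]` that way is an assumption here.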
Branch updated from a77f7e3 to 1648314.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kryanbeane

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Merged commit 1cf9741 into project-codeflare:v0.31.
Issue link
RHOAIENG-63111: RHOAI 2.25 pre-upgrade, enable vanilla `pre_upgrade` tests and integrate the migration script.

Parent epic: RHOAIENG-63109 (Ray upgrade qualification 2.25 → 3.x). Program: RHAISTRAT-1519.
Closes RHOAIENG-63111
What changes have been made
This PR adds the 2.25 pre-upgrade qualification path for codeflare-sdk. Post-upgrade migration (2→3 normalize/recreate) remains on the `v0.36` line; this branch is intentionally `pre_upgrade` only.

Base branch: `v0.31` (`origin/v0.31`, release 0.31.5), which includes upstream Ray 2.52.1 and cryptography 46.0.5 (not the older `v0.31.1-test-fix` dependency line).

Enable vanilla pre-upgrade tests

- `images/tests/`: Dockerfile, `entrypoint.sh`, `run-tests.sh`, RBAC manifest for the test user.
- `pre_upgrade` phase only (rejects `-m post_upgrade` on this image).
- `tests/upgrade/`:
  - `01_raycluster_sdk_upgrade_test.py`: seeds RayCluster `mnist` in `test-ns-rayupgrade` with Kueue objects (namespace label, LocalQueue, etc.).
  - `02_dashboard_ui_upgrade_test.py`: Workload Metrics UI check (optional path when the seed succeeds).
  - `conftest.py`: UI fixtures; session ordering hooks.
- `tests/ui/`: page objects and login flow for OpenShift OAuth / BYOIDC dashboard tests.
- `tests/e2e/support.py`: namespace creation (409 tolerated, Kueue managed label), shared helpers for upgrade tests.
- `pyproject.toml` / `poetry.lock`: `selenium`, `webdriver-manager`; pytest markers `pre_upgrade`, `post_upgrade`, `ui`.
- `Makefile`: `build-test-image` / `push-test-image` for `quay.io/opendatahub/codeflare-sdk-tests`.

Migration script in pre-upgrade finalize
- `scripts/migration/ray_cluster_migration.py` from rhoai-upgrade-helpers (source of truth on `main`; keep in sync).
- `03_ray_migration_pre_upgrade_finalize_test.py`: last step in the `pre_upgrade` suite:
  - Runs `ray_cluster_migration.py pre-upgrade` for the qualification cluster.
  - Sets `codeflare: Removed` on the `DataScienceCluster` (mandatory before the 2→3 OLM upgrade).
- `migration_support.py`, `constants.py`, `tests/upgrade/README.md`: wrappers, shared IDs, operator docs.
- `run-tests.sh` / Dockerfile: copy the migration script into the test image.

Pre-upgrade flow (single pytest session, then stop)
Out of scope (this PR)
- `post_upgrade` tests and 2→3 CR normalization (`v0.36-2.25-3.x-post-upgrade` / RHOAIENG-63109 post phase).
- … (`-m pre_upgrade`).

Verification steps
What was done to verify (author)
- Verified on RHOAI 2.25.x OpenShift clusters (QE-style: htpasswd/LDAP, `rhoai-catalog-dev` FBC, `stable-2.25` subscription).
- `origin/v0.31` (2 commits only; no stray 2.25 maintenance commits).
- `v0.31-pre-upgrade-clean` (`make build-test-image`).
- `pytest tests/upgrade/ -m pre_upgrade` in a container with the cluster env file + kubeconfig.
- `codeflare: Removed`, backup under `/tmp/rhoai-upgrade-backup/ray`, Ray-owned Routes removed when a cluster is present.
- … `v0.31.5` base; no need for `v0.31.1-test-fix`.

Known flake (documented, not blocking): on a fresh cluster where Ray images are still pulling, KubeRay may recreate the OpenShift Routes after the script deletes them but before `enableIngress: false` is applied. Reruns typically pass. See `upgrades/script_fix_suggestion.md` in the qualification notes repo.

How to re-verify (reviewers / CI)
A. Cluster prerequisites (Testops / lab — not installed by these tests)
Before running `pre_upgrade`:

- … (`rhods-operator.2.25.7`), subscription `stable-2.25` (or `stable`).
- `ray: Managed`, `kueue: Unmanaged` (+ RHBoK / openshift-kueue-operator if using Kueue in tests).
- `codeflare` may start Managed; finalize sets it to Removed.
- `OCP_ADMIN_USER_*`, `TEST_USER_*`, `ODH_DASHBOARD_URL` (or equivalent); see the comments in `images/tests/run-tests.sh`.

B. Build the test image (this PR branch)
Use `linux/amd64` if building on Apple Silicon (the `Makefile` already sets the platform).

C. Run pre-upgrade tests against RHOAI 2.25
Equivalent:
Pass criteria:
- All `pre_upgrade` tests pass.
- `03` finalize: `DataScienceCluster` has `codeflare.managementState: Removed`.
- When `01` seeded a cluster: a backup exists under `$RHOAI_UPGRADE_BACKUP_DIR/ray` (default `/tmp/rhoai-upgrade-backup/ray` inside the runner).
- The image rejects `-m post_upgrade` (guard against a wrong-phase image).

Optional manual script run (same cluster, after tests or standalone):
D. After `pre_upgrade`, do not re-run the seed without reverting

Once finalize has run, `codeflare` is Removed on the DSC. To re-run `pre_upgrade` on the same 2.25 cluster without an OLM upgrade:

- Clean up `test-ns-rayupgrade` (and Kueue objects if needed).
- Set `codeflare` back to `Managed`.
- Re-run with `-m pre_upgrade`.

Otherwise, proceed to the RHOAI OLM upgrade (channel per your target, e.g. `support-required-upgrade` → 3.3 or `beta` → 3.5 EA), then run `post_upgrade` on the v0.36 test image/branch.

E. Upgrade verification (required context, separate from this PR)
This PR only qualifies the pre-upgrade state on 2.25. Full 2→3 qualification additionally requires:
- `pytest -m post_upgrade` using the v0.36 branch image (migration post-upgrade, job submit, UI).

Document those results under RHOAIENG-63109 / the post-upgrade PR; they do not gate merging this pre-upgrade PR.
Checks
Manual tests performed: containerized `pytest -m pre_upgrade` on RHOAI 2.25.7 cluster(s); migration script integration; DSC `codeflare: Removed` assertion.
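The DSC `codeflare: Removed` assertion reduces to reading a single field. A minimal sketch, assuming the `DataScienceCluster` has already been fetched as a dict (for example through the Kubernetes `CustomObjectsApi`) and that the component state lives at `spec.components.codeflare.managementState`; both helper names are hypothetical and not taken from the test suite.

```python
from typing import Any, Dict


def codeflare_management_state(dsc: Dict[str, Any]) -> str:
    """Return spec.components.codeflare.managementState, or "" if any level is missing."""
    components = dsc.get("spec", {}).get("components", {})
    return components.get("codeflare", {}).get("managementState", "")


def assert_codeflare_removed(dsc: Dict[str, Any]) -> None:
    """Fail unless the codeflare component has been set to Removed."""
    state = codeflare_management_state(dsc)
    assert state == "Removed", f"expected codeflare managementState 'Removed', got {state!r}"
```

Treating a missing field as an empty string keeps the check strict: a DSC without the component block fails the assertion instead of passing silently.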