test(RHOAIENG-57445): added E2E test for RayCluster Autoscaling#1077
kryanbeane wants to merge 4 commits
[APPROVALNOTIFIER] This PR is NOT APPROVED
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##             main    #1077     +/-   ##
==========================================
+ Coverage   96.61%   96.63%   +0.01%
==========================================
  Files          23       23
  Lines        2306     2316     +10
==========================================
+ Hits         2228     2238     +10
  Misses         78       78
/hold
@kryanbeane I am getting resource issues on a standard PSI cluster. One of the workers is stuck pending (for this step: ). Also, because of the race condition mentioned in another commit, the test got stuck on this: and eventually timed out:
…esource pressure Co-authored-by: Cursor <cursoragent@cursor.com>
b940332 to 477d4fd
…rmat Add num-cpus to test YAML templates to match the new rayStartParams field from _cpu_limit_to_num_cpus. Also apply ruff format to build_ray_cluster.py.
Cover all input formats: integer, millicore strings, and whole-number strings to satisfy codecov patch coverage.
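The three input formats listed in the commit message above (integer, millicore string, whole-number string) can be illustrated with a minimal sketch. This is not the SDK's `_cpu_limit_to_num_cpus`; the function name `cpu_limit_to_num_cpus_sketch` is hypothetical, and the rounding rule (round up, never below 1) is an assumption that the PR text does not confirm.

```python
import math
from typing import Union


def cpu_limit_to_num_cpus_sketch(cpu_limit: Union[int, str]) -> int:
    """Hypothetical stand-in for a CPU-limit to num-cpus conversion.

    Accepts an integer (2), a millicore string ("500m"), or a
    whole-number string ("4").
    """
    if isinstance(cpu_limit, int):
        cores = float(cpu_limit)
    else:
        text = cpu_limit.strip()
        if text.endswith("m"):
            # Kubernetes millicores: "1500m" means 1.5 cores.
            cores = int(text[:-1]) / 1000
        else:
            cores = float(text)
    # Assumption: round partial cores up and report at least one CPU,
    # since a whole number is wanted for the num-cpus start parameter.
    return max(1, math.ceil(cores))
```

If the real helper floors instead of rounding up, or rejects fractional limits, only the last line of the sketch would differ.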
ee06873 to 289c265
Issue link
https://redhat.atlassian.net/browse/RHOAIENG-57445
What changes have been made
Add E2E tests for Ray in-tree autoscaling (non-Kueue path):
- tests/e2e/autoscaling_raycluster_sdk_kind_test.py: KinD lifecycle test (scale up + scale down)
- tests/e2e/autoscaling_raycluster_sdk_oauth_test.py: OpenShift/OAuth lifecycle test (scale up + scale down)
- tests/e2e/autoscaling_load.py: Ray workload script that creates CPU-bound tasks to trigger autoscaling
- tests/e2e/support.py: Added wait_for_worker_count() and run_autoscaling_load_in_head_pod() helpers

Both tests create an autoscaling-enabled RayCluster without Kueue resources, verify scale-up under load, then verify scale-down after idle timeout.
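A helper such as wait_for_worker_count() usually comes down to polling with a deadline. The sketch below shows that pattern only; it is not the code in tests/e2e/support.py. The name `wait_for_count_sketch`, its parameters, and the idea of passing in a `get_count` callable (which in a real test would query the cluster for running worker pods) are all assumptions made for illustration.

```python
import time
from typing import Callable


def wait_for_count_sketch(
    get_count: Callable[[], int],
    predicate: Callable[[int], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> int:
    """Poll get_count() until predicate(count) holds or the timeout expires.

    Hypothetical helper: get_count stands in for whatever reads the
    current number of Ray worker pods.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        count = get_count()
        if predicate(count):
            return count
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"worker count was {count} when the {timeout_s}s timeout expired"
            )
        time.sleep(interval_s)
```

Under these assumptions, the scale-up check would pass a predicate like `lambda n: n >= 2` and the scale-down check `lambda n: n == 1`. Taking the count source as a callable keeps the polling logic testable without a cluster.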
Verification steps
KinD:
    kubectl config use-context kind-kind
    cd tests/e2e
    poetry run pytest -vv -s autoscaling_raycluster_sdk_kind_test.py -m kind
OpenShift:
Expected: cluster scales from min_workers=1 to >=2 under load, then back to 1 after idle timeout.
Checks