fix: 3.3 tests #1007

Merged
openshift-merge-bot[bot] merged 1 commit into project-codeflare:3.3 from pawelpaszki:3.3-test-fix
Feb 12, 2026
@pawelpaszki

Issue link

What changes have been made

Updated dependencies to fix test failures on the 3.3 branch.

Verification steps

Before the fix:

tests/e2e/mnist_raycluster_sdk_oauth_test.py::TestRayClusterSDKOauth::test_mnist_ray_cluster_sdk_auth creating Kueue resources ...
'test-resource-flavor-tm3w9' created!
'test-cluster-queue-uh9sx' created
'test-local-queue-9ikmd' created in namespace 'test-ns-ldlit'
Insecure request warnings have been disabled
Warning: TLS verification has been disabled - Endpoint checks will be bypassed
Written to: /root/.codeflare/resources/mnist.yaml
Written to: /root/.codeflare/resources/mnist.yaml
Ray Cluster: 'mnist' has successfully been applied. For optimal resource management, you should delete this Ray Cluster when no longer in use.
Waiting for client TLS configuration to be available...
Client TLS configuration ready
Cluster 'mnist' is ready. Use cluster.details() to see the status.
                     🚀 CodeFlare Cluster Status 🚀                     
                                                                        
 ╭────────────────────────────────────────────────────────────────────╮ 
 │   Name                                                             │ 
 │   mnist                                              Inactive ❌   │ 
 │                                                                    │ 
 │   URI: ray://mnist-head-svc.test-ns-ldlit.svc:10001                │ 
 │                                                                    │ 
 │   Dashboard🔗                                                      │ 
 │                                                                    │ 
 ╰────────────────────────────────────────────────────────────────────╯ 
Waiting for requested resources to be set up...
Requested cluster is up and running!
Dashboard is ready!
                    🚀 CodeFlare Cluster Status 🚀                    
                                                                      
 ╭──────────────────────────────────────────────────────────────────╮ 
 │   Name                                                           │ 
 │   mnist                                              Active ✅   │ 
 │                                                                  │ 
 │   URI: ray://mnist-head-svc.test-ns-ldlit.svc:10001              │ 
 │                                                                  │ 
 │   Dashboard🔗                                                    │ 
 │                                                                  │ 
 ╰──────────────────────────────────────────────────────────────────╯ 
                    🚀 CodeFlare Cluster Details 🚀                   
                                                                      
 ╭──────────────────────────────────────────────────────────────────╮ 
 │   Name                                                           │ 
 │   mnist                                              Active ✅   │ 
 │                                                                  │ 
 │   URI: ray://mnist-head-svc.test-ns-ldlit.svc:10001              │ 
 │                                                                  │ 
 │   Dashboard🔗                                                    │ 
 │                                                                  │ 
 │                       Cluster Resources                          │ 
 │   ╭── Workers ──╮  ╭───────── Worker specs(each) ─────────╮      │ 
 │   │  # Workers  │  │  Memory      CPU         GPU         │      │ 
 │   │             │  │                                      │      │ 
 │   │  1          │  │  6G~8G       1~1         0           │      │ 
 │   │             │  │                                      │      │ 
 │   ╰─────────────╯  ╰──────────────────────────────────────╯      │ 
 ╰──────────────────────────────────────────────────────────────────╯ 
Verified: No jobs exist from the previous unauthenticated submission attempt.
2026-02-11 16:50:23,936	INFO dashboard_sdk.py:355 -- Uploading package gcs://_ray_pkg_4f6d28abbabb58dd.zip.
2026-02-11 16:50:23,939	INFO packaging.py:588 -- Creating a file package for local module './tests/e2e/'.
Submitted job with ID: raysubmit_wEv8f3kvTkzb8ej7
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
RUNNING
2026-02-11 16:50:26,782	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_wEv8f3kvTkzb8ej7: python mnist.py
prior to running the trainer
MASTER_ADDR: is  None
MASTER_PORT: is  None
ACCELERATOR: is  None
STORAGE_BUCKET_EXISTS:  False



GROUP:  1
LOCAL:  1
Traceback (most recent call last):
  File "/tmp/ray/session_2026-02-11_16-49-51_649279_1/runtime_resources/working_dir_files/_ray_pkg_4f6d28abbabb58dd/mnist.py", line 249, in <module>
    trainer = Trainer(
              ^^^^^^^^
  File "/tmp/ray/session_2026-02-11_16-49-51_649279_1/runtime_resources/pip/2cec51fff32fcf50498d4dcebb1cf3d67976be07/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
TypeError: Trainer.__init__() got an unexpected keyword argument 'replace_sampler_ddp'

Job has completed: 'FAILED'
FAILED
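
The traceback above points at the `replace_sampler_ddp` keyword, which PyTorch Lightning 2.0 renamed to `use_distributed_sampler`. The diff for this PR is not shown on this page, so the following is only a sketch of one way to guard a script against such a rename, not the change that was merged. `adapt_kwargs`, `RENAMED_KWARGS`, and `fake_trainer` are hypothetical names; `fake_trainer` stands in for a 2.x-style constructor so the example runs without Lightning installed.

```python
import inspect

# Assumption: mapping of keyword arguments renamed in PyTorch Lightning 2.0.
RENAMED_KWARGS = {"replace_sampler_ddp": "use_distributed_sampler"}


def adapt_kwargs(fn, kwargs, renamed=RENAMED_KWARGS):
    """Return a copy of kwargs that fn accepts, translating renamed keys."""
    params = inspect.signature(fn).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    adapted = {}
    for key, value in kwargs.items():
        if key in params or accepts_var_kw:
            # fn takes this keyword directly (or swallows it via **kwargs).
            adapted[key] = value
        elif renamed.get(key) in params:
            # Old name is unknown, but its replacement is accepted.
            adapted[renamed[key]] = value
        else:
            raise TypeError(f"{fn.__name__}() does not accept {key!r}")
    return adapted


# Hypothetical stand-in for a 2.x-style Trainer constructor.
def fake_trainer(max_epochs=1, use_distributed_sampler=True):
    return {
        "max_epochs": max_epochs,
        "use_distributed_sampler": use_distributed_sampler,
    }


old_style = {"max_epochs": 3, "replace_sampler_ddp": False}
print(fake_trainer(**adapt_kwargs(fake_trainer, old_style)))
```

In a real script the simpler option is usually to rename the argument at the call site; a translation layer like this is mainly useful when the same script has to run against more than one Lightning version.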

After the fix:

tests/e2e/mnist_raycluster_sdk_oauth_test.py::TestRayClusterSDKOauth::test_mnist_ray_cluster_sdk_auth creating Kueue resources ...
'test-resource-flavor-u8dya' created!
'test-cluster-queue-viwv9' created
'test-local-queue-7tmil' created in namespace 'test-ns-usp8a'
Insecure request warnings have been disabled
Warning: TLS verification has been disabled - Endpoint checks will be bypassed
Written to: /root/.codeflare/resources/mnist.yaml
Written to: /root/.codeflare/resources/mnist.yaml
Ray Cluster: 'mnist' has successfully been applied. For optimal resource management, you should delete this Ray Cluster when no longer in use.
Waiting for client TLS configuration to be available...
Client TLS configuration ready
Cluster 'mnist' is ready. Use cluster.details() to see the status.
                     🚀 CodeFlare Cluster Status 🚀                     
                                                                        
 ╭────────────────────────────────────────────────────────────────────╮ 
 │   Name                                                             │ 
 │   mnist                                              Inactive ❌   │ 
 │                                                                    │ 
 │   URI: ray://mnist-head-svc.test-ns-usp8a.svc:10001                │ 
 │                                                                    │ 
 │   Dashboard🔗                                                      │ 
 │                                                                    │ 
 ╰────────────────────────────────────────────────────────────────────╯ 
Waiting for requested resources to be set up...
Requested cluster is up and running!
Dashboard is ready!
                    🚀 CodeFlare Cluster Status 🚀                    
                                                                      
 ╭──────────────────────────────────────────────────────────────────╮ 
 │   Name                                                           │ 
 │   mnist                                              Active ✅   │ 
 │                                                                  │ 
 │   URI: ray://mnist-head-svc.test-ns-usp8a.svc:10001              │ 
 │                                                                  │ 
 │   Dashboard🔗                                                    │ 
 │                                                                  │ 
 ╰──────────────────────────────────────────────────────────────────╯ 
                    🚀 CodeFlare Cluster Details 🚀                   
                                                                      
 ╭──────────────────────────────────────────────────────────────────╮ 
 │   Name                                                           │ 
 │   mnist                                              Active ✅   │ 
 │                                                                  │ 
 │   URI: ray://mnist-head-svc.test-ns-usp8a.svc:10001              │ 
 │                                                                  │ 
 │   Dashboard🔗                                                    │ 
 │                                                                  │ 
 │                       Cluster Resources                          │ 
 │   ╭── Workers ──╮  ╭───────── Worker specs(each) ─────────╮      │ 
 │   │  # Workers  │  │  Memory      CPU         GPU         │      │ 
 │   │             │  │                                      │      │ 
 │   │  1          │  │  6G~8G       1~1         0           │      │ 
 │   │             │  │                                      │      │ 
 │   ╰─────────────╯  ╰──────────────────────────────────────╯      │ 
 ╰──────────────────────────────────────────────────────────────────╯ 
Verified: No jobs exist from the previous unauthenticated submission attempt.
2026-02-11 16:59:12,829	INFO dashboard_sdk.py:355 -- Uploading package gcs://_ray_pkg_f4f95e18089c79b9.zip.
2026-02-11 16:59:12,831	INFO packaging.py:588 -- Creating a file package for local module './tests/e2e/'.
Submitted job with ID: raysubmit_ENkW2xHSU4dtDrsd
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
RUNNING
RUNNING
RUNNING
2026-02-11 16:59:15,696	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_ENkW2xHSU4dtDrsd: python mnist.py
prior to running the trainer
MASTER_ADDR: is  None
MASTER_PORT: is  None
ACCELERATOR: is  None
STORAGE_BUCKET_EXISTS:  False



GROUP:  1
LOCAL:  1
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Downloading MNIST dataset...
Using default MNIST mirror reference to download datasets...

  0%|          | 0.00/9.91M [00:00<?, ?B/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 136MB/s]

  0%|          | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 53.5MB/s]

  0%|          | 0.00/1.65M [00:00<?, ?B/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 233MB/s]

  0%|          | 0.00/4.54k [00:00<?, ?B/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 19.6MB/s]

  | Name          | Type               | Params | Mode 
-------------------------------------------------------------
0 | model         | Sequential         | 55.1 K | train
1 | val_accuracy  | MulticlassAccuracy | 0      | train
2 | test_accuracy | MulticlassAccuracy | 0      | train
-------------------------------------------------------------
55.1 K    Trainable params
0         Non-trainable params
55.1 K    Total params
0.220     Total estimated model params size (MB)
11        Modules in train mode
0         Modules in eval mode

Sanity Checking: |          | 0/? [00:00<?, ?it/s]/tmp/ray/session_2026-02-11_16-58-41_332977_1/runtime_resources/pip/2cec51fff32fcf50498d4dcebb1cf3d67976be07/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

Sanity Checking:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 74.57it/s]/tmp/ray/session_2026-02-11_16-58-41_332977_1/runtime_resources/pip/2cec51fff32fcf50498d4dcebb1cf3d67976be07/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

                                                                           
/tmp/ray/session_2026-02-11_16-58-41_332977_1/runtime_resources/pip/2cec51fff32fcf50498d4dcebb1cf3d67976be07/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/tmp/ray/session_2026-02-11_16-58-41_332977_1/runtime_resources/pip/2cec51fff32fcf50498d4dcebb1cf3d67976be07/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py:298: The number of training batches (16) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.

Training: |          | 0/? [00:00<?, ?it/s]
Training:   0%|          | 0/16 [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s] 
Epoch 0: 100%|██████████| 16/16 [00:00<00:00, 69.32it/s]
Epoch 0: 100%|██████████| 16/16 [00:00<00:00, 69.26it/s, v_num=0]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 76.02it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 74.85it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 74.42it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 75.07it/s]
Epoch 0: 100%|██████████| 16/16 [00:01<00:00, 12.31it/s, v_num=0, val_loss=2.210, val_acc=0.349]
Epoch 0: 100%|██████████| 16/16 [00:01<00:00, 12.31it/s, v_num=0, val_loss=2.210, val_acc=0.349]
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.210, val_acc=0.349]         
Epoch 1:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.210, val_acc=0.349]
Epoch 1: 100%|██████████| 16/16 [00:00<00:00, 67.33it/s, v_num=0, val_loss=2.210, val_acc=0.349]
Epoch 1: 100%|██████████| 16/16 [00:00<00:00, 67.27it/s, v_num=0, val_loss=2.210, val_acc=0.349]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 76.63it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 74.84it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 74.41it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 74.72it/s]
Epoch 1: 100%|██████████| 16/16 [00:01<00:00, 12.21it/s, v_num=0, val_loss=2.070, val_acc=0.480]
Epoch 1: 100%|██████████| 16/16 [00:01<00:00, 12.20it/s, v_num=0, val_loss=2.070, val_acc=0.480]
Epoch 1:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.070, val_acc=0.480]         
Epoch 2:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.070, val_acc=0.480]
Epoch 2: 100%|██████████| 16/16 [00:00<00:00, 70.03it/s, v_num=0, val_loss=2.070, val_acc=0.480]
Epoch 2: 100%|██████████| 16/16 [00:00<00:00, 69.96it/s, v_num=0, val_loss=2.070, val_acc=0.480]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 77.81it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 76.54it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 75.56it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 75.86it/s]
Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 12.44it/s, v_num=0, val_loss=1.890, val_acc=0.521]
Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 12.44it/s, v_num=0, val_loss=1.890, val_acc=0.521]`Trainer.fit` stopped: `max_epochs=3` reached.

Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 12.40it/s, v_num=0, val_loss=1.890, val_acc=0.521]

Job has completed: 'SUCCEEDED'
Warning: TLS verification has been disabled - Endpoint checks will be bypassed
Yaml resources loaded for mnist
                    🚀 CodeFlare Cluster Details 🚀                   
                                                                      
 ╭──────────────────────────────────────────────────────────────────╮ 
 │   Name                                                           │ 
 │   mnist                                              Active ✅   │ 
 │                                                                  │ 
 │   URI: ray://mnist-head-svc.test-ns-usp8a.svc:10001              │ 
 │                                                                  │ 
 │   Dashboard🔗                                                    │ 
 │                                                                  │ 
 │                       Cluster Resources                          │ 
 │   ╭── Workers ──╮  ╭───────── Worker specs(each) ─────────╮      │ 
 │   │  # Workers  │  │  Memory      CPU         GPU         │      │ 
 │   │             │  │                                      │      │ 
 │   │  1          │  │  6G~8G       1~1         0           │      │ 
 │   │             │  │                                      │      │ 
 │   ╰─────────────╯  ╰──────────────────────────────────────╯      │ 
 ╰──────────────────────────────────────────────────────────────────╯ 
2026-02-11 17:01:21,122	INFO dashboard_sdk.py:402 -- Package gcs://_ray_pkg_f4f95e18089c79b9.zip already exists, skipping upload.
Submitted job with ID: raysubmit_e871gxeeD81RVXsR
List of Jobs: [JobDetails(type=<JobType.SUBMISSION: 'SUBMISSION'>, job_id=None, submission_id='raysubmit_e871gxeeD81RVXsR', driver_info=None, status=<JobStatus.PENDING: 'PENDING'>, entrypoint='python mnist.py', message='Job has not started yet. It may be waiting for resources (CPUs, GPUs, memory, custom resources) to become available. It may be waiting for the runtime environment to be set up.', error_type=None, start_time=1770829281441, end_time=None, metadata={}, runtime_env={'working_dir': 'gcs://_ray_pkg_f4f95e18089c79b9.zip', 'pip': {'packages': ['--extra-index-url https://download.pytorch.org/whl/cu118', 'torch==2.7.1+cu118', 'torchvision==0.22.1+cu118', 'pytorch_lightning==2.4.0', 'torchmetrics==1.8.2', 'minio'], 'pip_check': False}, 'env_vars': {'PIP_INDEX_URL': 'https://pypi.org/simple/', 'PIP_TRUSTED_HOST': 'pypi.org'}, '_ray_commit': '4ebdc0abe5e5a551625fe7f87053c7e668a6ff74'}, driver_agent_http_address=None, driver_node_id=None, driver_exit_code=None)]
Ray Cluster: 'mnist' has successfully been deleted
PASSED
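
The job details in the log above list the pip packages that the Ray runtime environment installs. As a rough sketch of how one might check which of those are pinned to an exact version, the snippet below splits the list into pinned and unpinned entries. The package strings are copied from the log; `split_pins` is a hypothetical helper, not part of the CodeFlare SDK or Ray.

```python
# Package list copied from the runtime_env shown in the job details above.
PACKAGES = [
    "--extra-index-url https://download.pytorch.org/whl/cu118",
    "torch==2.7.1+cu118",
    "torchvision==0.22.1+cu118",
    "pytorch_lightning==2.4.0",
    "torchmetrics==1.8.2",
    "minio",
]


def split_pins(packages):
    """Split pip requirement strings into pinned versions and unpinned names."""
    pinned, unpinned = {}, []
    for entry in packages:
        if entry.startswith("-"):
            # pip option such as --extra-index-url, not a package.
            continue
        name, sep, version = entry.partition("==")
        if sep:
            pinned[name] = version
        else:
            unpinned.append(name)
    return pinned, unpinned


pinned, unpinned = split_pins(PACKAGES)
print(pinned)
print(unpinned)
```

An unpinned entry resolves to whatever version is current when the runtime environment is built, which is one way a previously passing test can start failing without any code change.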

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) on Feb 12, 2026
openshift-ci Bot commented Feb 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kryanbeane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Feb 12, 2026
openshift-merge-bot Bot merged commit 9d01e0d into project-codeflare:3.3 on Feb 12, 2026
6 checks passed
2 participants