RHOAIENG-48973: fix notebook tests and add required auto param for trainer accelerator by pawelpaszki · Pull Request #1001 · project-codeflare/codeflare-sdk

pawelpaszki · 2026-02-09T07:55:09Z

Issue link

RHOAIENG-48973

What changes have been made

RHOAIENG-48973: fix notebook tests and add required auto param for trainer accelerator

Verification steps

Notebook tests: all should pass.
Required accelerator param:

Tests that submit a job without the accelerator parameter (the upgrade and MNIST job-submit tests) fail, e.g.

tests/upgrade/01_raycluster_sdk_upgrade_test.py::TestMnistJobSubmit::test_mnist_job_submission Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 08:58:20,342 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 08:58:20,343 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 08:58:20,346 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Using kube-authkit auto-detection for authentication
Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 08:58:20,346 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 08:58:20,346 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 08:58:20,349 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Yaml resources loaded for mnist
No BYOIDC OIDC providers found in cluster Authentication resource
No BYOIDC OIDC providers found in cluster Authentication resource
Using legacy authentication for Ray Dashboard job submission...
2026-02-10 08:58:27,993	INFO dashboard_sdk.py:355 -- Uploading package gcs://_ray_pkg_d9977c9a7d4c9bd8.zip.
2026-02-10 08:58:27,997	INFO packaging.py:588 -- Creating a file package for local module './tests/e2e/'.
Submitted job with ID: raysubmit_2QzNnqbaezMfH6cZ
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
RUNNING
RUNNING
2026-02-10 08:58:31,233	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_2QzNnqbaezMfH6cZ: python mnist.py
prior to running the trainer
MASTER_ADDR: is  None
MASTER_PORT: is  None
ACCELERATOR: is  None
STORAGE_BUCKET_EXISTS:  False



GROUP:  1
LOCAL:  1
Traceback (most recent call last):
  File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/working_dir_files/_ray_pkg_d9977c9a7d4c9bd8/mnist.py", line 245, in <module>
    trainer = Trainer(
              ^^^^^^^^
  File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
    return fn(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 417, in __init__
    self._accelerator_connector = _AcceleratorConnector(
                                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 129, in __init__
    self._check_config_and_set_final_flags(
  File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 204, in _check_config_and_set_final_flags
    raise ValueError(
ValueError: You selected an invalid accelerator name: `accelerator=None`. Available names are: auto, cuda, tpu, cpu, mps.

Job has completed: 'FAILED'
FAILED
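
The traceback shows `Trainer` rejecting `accelerator=None`. A minimal sketch of a guard against that, assuming the script reads the value from an `ACCELERATOR` environment variable (the log prints `ACCELERATOR: is  None`, but the variable name and the helper `resolve_accelerator` are assumptions, not code from this PR):

```python
import os


def resolve_accelerator(env=None):
    """Return the configured accelerator name, or "auto" when none is set.

    Hypothetical helper: the variable name ACCELERATOR is an assumption.
    Any other value is passed through so Lightning can validate it.
    """
    env = os.environ if env is None else env
    raw = env.get("ACCELERATOR")
    # Treat an unset, empty, or literal "None" value as "not provided".
    if raw is None or raw.strip().lower() in ("", "none"):
        return "auto"
    return raw.strip()
```

The result would then be passed as `Trainer(accelerator=resolve_accelerator(), ...)`, so the constructor never receives `None`.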

With the accelerator parameter provided, the tests pass:

tests/upgrade/01_raycluster_sdk_upgrade_test.py::TestMnistJobSubmit::test_mnist_job_submission Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 09:09:10,920 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 09:09:10,921 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 09:09:10,924 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Using kube-authkit auto-detection for authentication
Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 09:09:10,925 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 09:09:10,925 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 09:09:10,928 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Yaml resources loaded for mnist
No BYOIDC OIDC providers found in cluster Authentication resource
No BYOIDC OIDC providers found in cluster Authentication resource
Using legacy authentication for Ray Dashboard job submission...
2026-02-10 09:09:18,760	INFO dashboard_sdk.py:355 -- Uploading package gcs://_ray_pkg_be21486f12aefa83.zip.
2026-02-10 09:09:18,763	INFO packaging.py:588 -- Creating a file package for local module './tests/e2e/'.
Submitted job with ID: raysubmit_FjnqSyCQ43zMrMxM
PENDING
RUNNING
RUNNING
2026-02-10 09:09:21,594	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_FjnqSyCQ43zMrMxM: python mnist.py
prior to running the trainer
MASTER_ADDR: is  None
MASTER_PORT: is  None
ACCELERATOR: is  auto
STORAGE_BUCKET_EXISTS:  False



GROUP:  1
LOCAL:  1
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Downloading MNIST dataset...
Using default MNIST mirror reference to download datasets...

  0%|          | 0.00/9.91M [00:00<?, ?B/s]
  1%|          | 98.3k/9.91M [00:00<00:11, 854kB/s]
  4%|▎         | 360k/9.91M [00:00<00:05, 1.68MB/s]
 15%|█▍        | 1.44M/9.91M [00:00<00:01, 5.51MB/s]
 59%|█████▉    | 5.83M/9.91M [00:00<00:00, 18.7MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 18.8MB/s]

  0%|          | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 497kB/s]

  0%|          | 0.00/1.65M [00:00<?, ?B/s]
  6%|▌         | 98.3k/1.65M [00:00<00:01, 842kB/s]
 14%|█▍        | 229k/1.65M [00:00<00:01, 1.01MB/s]
 44%|████▎     | 721k/1.65M [00:00<00:00, 2.47MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 4.03MB/s]

  0%|          | 0.00/4.54k [00:00<?, ?B/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 15.3MB/s]
┏━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃   ┃ Name          ┃ Type               ┃ Params ┃ Mode  ┃ FLOPs ┃
┡━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ 0 │ model         │ Sequential         │ 55.1 K │ train │     0 │
│ 1 │ val_accuracy  │ MulticlassAccuracy │      0 │ train │     0 │
│ 2 │ test_accuracy │ MulticlassAccuracy │      0 │ train │     0 │
└───┴───────────────┴────────────────────┴────────┴───────┴───────┘
Trainable params: 55.1 K                                                        
Non-trainable params: 0                                                         
Total params: 55.1 K                                                            
Total estimated model params size (MB): 0                                       
Modules in train mode: 11                                                       
Modules in eval mode: 0                                                         
Total FLOPs: 0                                                                  

Sanity Checking: |          | 0/? [00:00<?, ?it/s]/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:434: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

Sanity Checking: |          | 0/? [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 68.34it/s]/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

                                                                           
/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:434: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py:317: The number of training batches (16) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.

Training: |          | 0/? [00:00<?, ?it/s]
Training: |          | 0/? [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s]
Epoch 0: 100%|██████████| 16/16 [00:00<00:00, 67.05it/s]
Epoch 0: 100%|██████████| 16/16 [00:00<00:00, 67.00it/s, v_num=0]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 78.90it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 76.37it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 75.45it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 73.54it/s]
Epoch 0: 100%|██████████| 16/16 [00:01<00:00, 12.02it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 0: 100%|██████████| 16/16 [00:01<00:00, 12.01it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.180, val_acc=0.347]         
Epoch 1:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 1: 100%|██████████| 16/16 [00:00<00:00, 67.51it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 1: 100%|██████████| 16/16 [00:00<00:00, 67.44it/s, v_num=0, val_loss=2.180, val_acc=0.347]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 68.90it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 70.67it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 70.99it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 71.94it/s]
Epoch 1: 100%|██████████| 16/16 [00:01<00:00, 11.84it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 1: 100%|██████████| 16/16 [00:01<00:00, 11.84it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 1:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.000, val_acc=0.423]         
Epoch 2:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 2: 100%|██████████| 16/16 [00:00<00:00, 63.55it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 2: 100%|██████████| 16/16 [00:00<00:00, 63.49it/s, v_num=0, val_loss=2.000, val_acc=0.423]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 75.61it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 73.06it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 73.20it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 73.65it/s]
Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 11.92it/s, v_num=0, val_loss=1.790, val_acc=0.536]
Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 11.91it/s, v_num=0, val_loss=1.790, val_acc=0.536]`Trainer.fit` stopped: `max_epochs=3` reached.

Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 11.88it/s, v_num=0, val_loss=1.790, val_acc=0.536]

Job has completed: 'SUCCEEDED'
PASSED
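
One way a test could supply the value is through the job's runtime environment. The sketch below only builds the dictionary; the helper name `build_runtime_env` and the `ACCELERATOR` variable name are assumptions, while the `working_dir` value is the local module path shown in the log:

```python
def build_runtime_env(working_dir="./tests/e2e/", accelerator="auto"):
    """Build a Ray-style runtime_env dict that sets the accelerator.

    Hypothetical helper: assumes the job script reads ACCELERATOR
    from its environment. Ray expects env_vars values to be strings.
    """
    return {
        "working_dir": working_dir,
        "env_vars": {"ACCELERATOR": str(accelerator)},
    }


runtime_env = build_runtime_env()
```

A dictionary of this shape can be passed as the `runtime_env` argument of `JobSubmissionClient.submit_job` in `ray.job_submission`.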

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- Testing is not required for this change

codecov · 2026-02-09T07:57:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 95.97%. Comparing base (3153fe9) to head (48302fe).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1001      +/-   ##
==========================================
+ Coverage   95.96%   95.97%   +0.01%     
==========================================
  Files          23       23              
  Lines        2203     2211       +8     
==========================================
+ Hits         2114     2122       +8     
  Misses         89       89


pawelpaszki · 2026-02-09T16:36:49Z

/retest

…ainer accelerator

openshift-ci-robot · 2026-02-10T10:48:21Z

@pawelpaszki: This pull request references RHOAIENG-48973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Issue link

RHOAIENG-48973

What changes have been made

fix: widget notebook

Verification steps

Checks

I've made sure the tests are passing.

Testing Strategy

Unit tests

Manual tests

Testing is not required for this change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-02-10T10:51:35Z

@pawelpaszki: This pull request references RHOAIENG-48973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.


openshift-ci-robot · 2026-02-10T10:51:39Z

@pawelpaszki: This pull request references RHOAIENG-48973 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Issue link

RHOAIENG-48973

What changes have been made

RHOAIENG-48973: fix notebook tests and add required auto param for trainer accelerator

Verification steps

for notebook tests - all should pass
for accelerator required param:

tests that require submission of a job without accelerator parameter provided will fail (upgrade and mnist job submit), e.g.

tests/upgrade/01_raycluster_sdk_upgrade_test.py::TestMnistJobSubmit::test_mnist_job_submission Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 08:58:20,342 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 08:58:20,343 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 08:58:20,346 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Using kube-authkit auto-detection for authentication
Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 08:58:20,346 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 08:58:20,346 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 08:58:20,349 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Yaml resources loaded for mnist
No BYOIDC OIDC providers found in cluster Authentication resource
No BYOIDC OIDC providers found in cluster Authentication resource
Using legacy authentication for Ray Dashboard job submission...
2026-02-10 08:58:27,993	INFO dashboard_sdk.py:355 -- Uploading package gcs://_ray_pkg_d9977c9a7d4c9bd8.zip.
2026-02-10 08:58:27,997	INFO packaging.py:588 -- Creating a file package for local module './tests/e2e/'.
Submitted job with ID: raysubmit_2QzNnqbaezMfH6cZ
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
PENDING
RUNNING
RUNNING
2026-02-10 08:58:31,233	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_2QzNnqbaezMfH6cZ: python mnist.py
prior to running the trainer
MASTER_ADDR: is  None
MASTER_PORT: is  None
ACCELERATOR: is  None
STORAGE_BUCKET_EXISTS:  False



GROUP:  1
LOCAL:  1
Traceback (most recent call last):
 File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/working_dir_files/_ray_pkg_d9977c9a7d4c9bd8/mnist.py", line 245, in <module>
   trainer = Trainer(
             ^^^^^^^^
 File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/utilities/argparse.py", line 70, in insert_env_defaults
   return fn(self, **kwargs)
          ^^^^^^^^^^^^^^^^^^
 File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/trainer.py", line 417, in __init__
   self._accelerator_connector = _AcceleratorConnector(
                                 ^^^^^^^^^^^^^^^^^^^^^^
 File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 129, in __init__
   self._check_config_and_set_final_flags(
 File "/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py", line 204, in _check_config_and_set_final_flags
   raise ValueError(
ValueError: You selected an invalid accelerator name: `accelerator=None`. Available names are: auto, cuda, tpu, cpu, mps.

Job has completed: 'FAILED'
FAILED

With the accelerator parameter provided, the tests pass:

tests/upgrade/01_raycluster_sdk_upgrade_test.py::TestMnistJobSubmit::test_mnist_job_submission Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 09:09:10,920 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 09:09:10,921 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 09:09:10,924 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Using kube-authkit auto-detection for authentication
Using kube-authkit auto-detection for authentication
Setting up authentication with kube-authkit auto-detection...
2026-02-10 09:09:10,925 Using authentication strategy: Kubernetes KubeConfig (/tmp/kubeconfig-1)
2026-02-10 09:09:10,925 Authenticating using kubeconfig: /tmp/kubeconfig-1
2026-02-10 09:09:10,928 Successfully authenticated using kubeconfig
✅ Successfully set up authentication with kube-authkit
Yaml resources loaded for mnist
No BYOIDC OIDC providers found in cluster Authentication resource
No BYOIDC OIDC providers found in cluster Authentication resource
Using legacy authentication for Ray Dashboard job submission...
2026-02-10 09:09:18,760	INFO dashboard_sdk.py:355 -- Uploading package gcs://_ray_pkg_be21486f12aefa83.zip.
2026-02-10 09:09:18,763	INFO packaging.py:588 -- Creating a file package for local module './tests/e2e/'.
Submitted job with ID: raysubmit_FjnqSyCQ43zMrMxM
PENDING
RUNNING
RUNNING
2026-02-10 09:09:21,594	INFO job_manager.py:568 -- Runtime env is setting up.
Running entrypoint for job raysubmit_FjnqSyCQ43zMrMxM: python mnist.py
prior to running the trainer
MASTER_ADDR: is  None
MASTER_PORT: is  None
ACCELERATOR: is  auto
STORAGE_BUCKET_EXISTS:  False



GROUP:  1
LOCAL:  1
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------

Downloading MNIST dataset...
Using default MNIST mirror reference to download datasets...

 0%|          | 0.00/9.91M [00:00<?, ?B/s]
 1%|          | 98.3k/9.91M [00:00<00:11, 854kB/s]
 4%|▎         | 360k/9.91M [00:00<00:05, 1.68MB/s]
15%|█▍        | 1.44M/9.91M [00:00<00:01, 5.51MB/s]
59%|█████▉    | 5.83M/9.91M [00:00<00:00, 18.7MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 18.8MB/s]

 0%|          | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 497kB/s]

 0%|          | 0.00/1.65M [00:00<?, ?B/s]
 6%|▌         | 98.3k/1.65M [00:00<00:01, 842kB/s]
14%|█▍        | 229k/1.65M [00:00<00:01, 1.01MB/s]
44%|████▎     | 721k/1.65M [00:00<00:00, 2.47MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 4.03MB/s]

 0%|          | 0.00/4.54k [00:00<?, ?B/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 15.3MB/s]
┏━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃   ┃ Name          ┃ Type               ┃ Params ┃ Mode  ┃ FLOPs ┃
┡━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ 0 │ model         │ Sequential         │ 55.1 K │ train │     0 │
│ 1 │ val_accuracy  │ MulticlassAccuracy │      0 │ train │     0 │
│ 2 │ test_accuracy │ MulticlassAccuracy │      0 │ train │     0 │
└───┴───────────────┴────────────────────┴────────┴───────┴───────┘
Trainable params: 55.1 K                                                        
Non-trainable params: 0                                                         
Total params: 55.1 K                                                            
Total estimated model params size (MB): 0                                       
Modules in train mode: 11                                                       
Modules in eval mode: 0                                                         
Total FLOPs: 0                                                                  

Sanity Checking: |          | 0/? [00:00<?, ?it/s]/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:434: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.

Sanity Checking: |          | 0/? [00:00<?, ?it/s]
Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 68.34it/s]/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.

                                                                          
/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:434: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/tmp/ray/session_2026-02-10_08-49-19_805412_1/runtime_resources/pip/890d6800adb59f0d7b5775a3f71b22405ddd42a1/virtualenv/lib64/python3.12/site-packages/pytorch_lightning/loops/fit_loop.py:317: The number of training batches (16) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.

Training: |          | 0/? [00:00<?, ?it/s]
Training: |          | 0/? [00:00<?, ?it/s]
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s]
Epoch 0: 100%|██████████| 16/16 [00:00<00:00, 67.05it/s]
Epoch 0: 100%|██████████| 16/16 [00:00<00:00, 67.00it/s, v_num=0]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 78.90it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 76.37it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 75.45it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 73.54it/s]
Epoch 0: 100%|██████████| 16/16 [00:01<00:00, 12.02it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 0: 100%|██████████| 16/16 [00:01<00:00, 12.01it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 0:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.180, val_acc=0.347]         
Epoch 1:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 1: 100%|██████████| 16/16 [00:00<00:00, 67.51it/s, v_num=0, val_loss=2.180, val_acc=0.347]
Epoch 1: 100%|██████████| 16/16 [00:00<00:00, 67.44it/s, v_num=0, val_loss=2.180, val_acc=0.347]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 68.90it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 70.67it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 70.99it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 71.94it/s]
Epoch 1: 100%|██████████| 16/16 [00:01<00:00, 11.84it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 1: 100%|██████████| 16/16 [00:01<00:00, 11.84it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 1:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.000, val_acc=0.423]         
Epoch 2:   0%|          | 0/16 [00:00<?, ?it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 2: 100%|██████████| 16/16 [00:00<00:00, 63.55it/s, v_num=0, val_loss=2.000, val_acc=0.423]
Epoch 2: 100%|██████████| 16/16 [00:00<00:00, 63.49it/s, v_num=0, val_loss=2.000, val_acc=0.423]

Validation: |          | 0/? [00:00<?, ?it/s]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/79 [00:00<?, ?it/s]
Validation DataLoader 0:  25%|██▌       | 20/79 [00:00<00:00, 75.61it/s]
Validation DataLoader 0:  51%|█████     | 40/79 [00:00<00:00, 73.06it/s]
Validation DataLoader 0:  76%|███████▌  | 60/79 [00:00<00:00, 73.20it/s]
Validation DataLoader 0: 100%|██████████| 79/79 [00:01<00:00, 73.65it/s]
Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 11.92it/s, v_num=0, val_loss=1.790, val_acc=0.536]
Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 11.91it/s, v_num=0, val_loss=1.790, val_acc=0.536]`Trainer.fit` stopped: `max_epochs=3` reached.

Epoch 2: 100%|██████████| 16/16 [00:01<00:00, 11.88it/s, v_num=0, val_loss=1.790, val_acc=0.536]

Job has completed: 'SUCCEEDED'
PASSED
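
The two runs above differ only in the value logged as `ACCELERATOR` (`None` vs `auto`). As a rough sketch of the idea, not the actual change in this PR: a training script could default to `auto` when nothing is supplied, so that `Trainer` never receives `accelerator=None`. The helper name `resolve_accelerator` is hypothetical, and reading an `ACCELERATOR` environment variable is an assumption inferred from the log line, since `mnist.py` itself is not shown here.

```python
import os


def resolve_accelerator(env=None, default="auto"):
    """Return the accelerator name to hand to the trainer.

    Hypothetical helper: falls back to `default` when the (assumed)
    ACCELERATOR environment variable is unset or blank.
    """
    env = os.environ if env is None else env
    value = env.get("ACCELERATOR")  # assumed variable name, inferred from the log
    if value is None or not value.strip():
        return default
    return value.strip().lower()


if __name__ == "__main__":
    # Mirrors the log line format seen in the job output above.
    print("ACCELERATOR: is ", resolve_accelerator())
```

The result would then be passed as `Trainer(accelerator=resolve_accelerator(), ...)`; whether the fix belongs in the script or in the job's environment is a design choice the logs alone do not settle.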

Checks

I've made sure the tests are passing.
Testing Strategy
- Unit tests
- Manual tests
- Testing is not required for this change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

chipspeak

LGTM Pawel, cheers for this!

openshift-ci · 2026-02-10T14:34:40Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chipspeak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [chipspeak]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot requested review from chipspeak and laurafitzgerald February 9, 2026 07:55

pawelpaszki added the test-guided-notebooks (Run PR check to verify Guided notebooks), test-ui-notebooks (Run PR check to verify UI notebooks), and test-additional-notebooks labels Feb 9, 2026

pawelpaszki force-pushed the RHOAIENG-48973 branch from f996c92 to 3699117 February 9, 2026 09:29

pawelpaszki force-pushed the RHOAIENG-48973 branch from 53ee70e to 8d5a23d February 9, 2026 16:42

RHOAIENG-48973: fix notebook tests and add required auto param for trainer accelerator

48302fe

pawelpaszki force-pushed the RHOAIENG-48973 branch from 23ea241 to 48302fe February 10, 2026 10:21

pawelpaszki changed the title ~~fix: widget notebook~~ RHOAIENG-48973: fix notebook tests and add required auto param for trainer accelerator Feb 10, 2026

openshift-ci-robot added the jira/valid-reference label Feb 10, 2026

chipspeak approved these changes Feb 10, 2026

openshift-ci Bot assigned chipspeak Feb 10, 2026

openshift-ci Bot added the lgtm label (indicates that a PR is ready to be merged) Feb 10, 2026

openshift-ci Bot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Feb 10, 2026

openshift-merge-bot Bot merged commit e3aecbb into project-codeflare:main Feb 10, 2026
26 of 34 checks passed

3 participants