RHOAIENG-57679: add checkpointing guided notebook #1064

Open
pawelpaszki wants to merge 3 commits into project-codeflare:main from pawelpaszki:RHOAIENG-57679

Conversation

pawelpaszki commented Apr 23, 2026

Contributor

Issue link

RHOAIENG-57679: add checkpointing guided notebook

What changes have been made

Addition of new guided notebook showing checkpointing functionality

Verification steps

Run the notebook on a RHOAI cluster and confirm that all cells work as expected and that the Ray training state is saved to, and then retrieved from, an S3 bucket.
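
For reviewers, the save-then-resume behaviour being verified can be pictured with a minimal sketch. This is not the notebook's code: it uses a local temporary directory as a stand-in for the S3 bucket, and `train`, `latest_checkpoint`, and the JSON file layout are hypothetical names chosen only for illustration.

```python
import json
import tempfile
from pathlib import Path


def latest_checkpoint(ckpt_dir: Path):
    """Return the saved state with the highest epoch, or None if none exist."""
    # Zero-padded file names sort in epoch order.
    files = sorted(ckpt_dir.glob("epoch_*.json"))
    if not files:
        return None
    return json.loads(files[-1].read_text())


def train(ckpt_dir: Path, total_epochs: int, stop_after=None) -> int:
    """Run epochs, saving state after each one; resume from the latest checkpoint."""
    state = latest_checkpoint(ckpt_dir) or {"epoch": 0, "loss": None}
    ran = 0
    for epoch in range(state["epoch"] + 1, total_epochs + 1):
        # Placeholder for a real training step.
        state = {"epoch": epoch, "loss": 1.0 / epoch}
        (ckpt_dir / f"epoch_{epoch:04d}.json").write_text(json.dumps(state))
        ran += 1
        if stop_after is not None and ran >= stop_after:
            break  # simulate the job being suspended
    return state["epoch"]


with tempfile.TemporaryDirectory() as tmp:
    ckpt_dir = Path(tmp)  # stand-in for the S3 bucket (assumption)
    first_run = train(ckpt_dir, total_epochs=5, stop_after=2)
    second_run = train(ckpt_dir, total_epochs=5)
    saved = len(list(ckpt_dir.glob("epoch_*.json")))
```

The second call starts from the epoch after the last saved one rather than from scratch, which is the property the verification step checks against the real bucket.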

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • Testing is not required for this change

@openshift-ci-robot

openshift-ci-robot commented Apr 23, 2026

Collaborator

@pawelpaszki: This pull request references RHOAIENG-57679 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci

openshift-ci Bot commented Apr 23, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign laurafitzgerald for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov

codecov Bot commented Apr 23, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.46%. Comparing base (0de8ae3) to head (3206f56).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1064      +/-   ##
==========================================
- Coverage   96.62%   96.46%   -0.16%     
==========================================
  Files          23       23              
  Lines        2309     2318       +9     
==========================================
+ Hits         2231     2236       +5     
- Misses         78       82       +4     
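
The percentages in the diff above follow from hits divided by lines. A short sketch to check the arithmetic, using only the numbers reported in the table (`coverage_pct` is a hypothetical helper name, not part of Codecov):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, rounded to two decimals as in the report."""
    return round(100 * hits / lines, 2)


base = coverage_pct(2231, 2309)   # coverage on main
head = coverage_pct(2236, 2318)   # coverage on this PR
delta = round(head - base, 2)     # change introduced by the PR

base_misses = 2309 - 2231
head_misses = 2318 - 2236
```

The drop comes from 9 added lines of which only 5 are hit, so misses rise from 78 to 82.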


@chipspeak

Contributor

Don't know why the CI is failing but I assume it should pass once the branch is up to date. Otherwise looks good to me! I assume we'll want to run notebook tests here too? @pawelpaszki

kryanbeane left a comment

Contributor

Mostly just questions! Other than those, this looks great.

Comment thread demo-notebooks/guided-demos/train_with_checkpoints.py Outdated
Comment thread demo-notebooks/guided-demos/train_with_checkpoints.py
Comment thread demo-notebooks/guided-demos/6_rayjob_checkpointing_example.ipynb
pawelpaszki added the test-guided-notebooks (Run PR check to verify Guided notebooks), test-ui-notebooks (Run PR check to verify UI notebooks), and test-additional-notebooks labels Jun 9, 2026
"source": [
"import urllib3\n",
"\n",
"urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",

Contributor

Should we remove this line before merging?

Contributor Author

I believe I have addressed all your comments. Can you take another look, please?

"cell_type": "markdown",
"metadata": {},
"source": [
"## Red Hat build of Kueue (required before submit)\n",

Contributor

Given this is covered as optional and for admins in the blog, do we need to include it here?

"source": [
"# Set your AWS credentials\n",
"# WARNING: Do not commit credentials to version control. For production,\n",
"# use OpenShift AI Data Connections or OpenShift Secrets instead.\n",

Contributor

Same feedback here as for the blog.
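
One way to follow the warning in the cell above and keep credentials out of the notebook is to read them from the environment. The sketch below is an assumption about how that might look, not what the notebook does: `load_s3_credentials` is a hypothetical helper, and it relies only on the standard `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` variable names, which would need to be populated by something like a Data Connection or Secret.

```python
import os


def load_s3_credentials(env=None) -> dict:
    """Read AWS credentials from environment variables instead of hard-coding them."""
    env = os.environ if env is None else env
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail early with a clear message rather than submitting a job that cannot reach S3.
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in required}


# Usage with an explicit mapping (placeholder values, for illustration only).
example = load_s3_credentials(
    {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
```

Calling `load_s3_credentials()` with no argument reads from the real process environment.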

"\n",
"After logs show at least **one epoch** and a checkpoint written to **S3**, suspend the RayJob. This is a **manual** suspend for the demo (distinct from Kueue holding the job until admission right after submit).\n",
"\n",
"Use **Pause** in the OpenShift AI UI, or run the next cell (`job.stop()`)."

Contributor

Is there any way to easily build the URL for the user?

Comment thread demo-notebooks/guided-demos/6_rayjob_checkpointing_example.ipynb Outdated

Labels

jira/valid-reference, test-additional-notebooks, test-guided-notebooks (Run PR check to verify Guided notebooks), test-ui-notebooks (Run PR check to verify UI notebooks)

5 participants