RHOAIENG-57679: add checkpointing guided notebook #1064
@pawelpaszki: This pull request references RHOAIENG-57679 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the epic to target the "5.0.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is NOT APPROVED
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #1064      +/-   ##
==========================================
- Coverage   96.62%   96.46%   -0.16%
==========================================
  Files          23       23
  Lines        2309     2318       +9
==========================================
+ Hits         2231     2236       +5
- Misses         78       82       +4
Don't know why the CI is failing, but I assume it should pass once the branch is up to date. Otherwise looks good to me! I assume we'll want to run notebook tests here too? @pawelpaszki
kryanbeane left a comment
mostly just questions! Other than those this looks great
b68ab80 to 7dd7c5c
7dd7c5c to 2209b68
"source": [
  "import urllib3\n",
  "\n",
  "urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)\n",
Should we remove this line before merging?
I believe I addressed all your comments. Can you take another look, please?
"cell_type": "markdown",
"metadata": {},
"source": [
  "## Red Hat build of Kueue (required before submit)\n",
Given this is covered as optional and for admins in the blog, do we need to include it here?
"source": [
  "# Set your AWS credentials\n",
  "# WARNING: Do not commit credentials to version control. For production,\n",
  "# use OpenShift AI Data Connections or OpenShift Secrets instead.\n",
same feedback here as per blog
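For reference, a minimal sketch of one way to keep credentials out of the notebook cell itself: read them from environment variables and fall back to a non-echoing prompt. The helper name `load_aws_credentials` and the prompt fallback are assumptions for illustration, not part of this PR; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS variable names.

```python
import os
from getpass import getpass


def load_aws_credentials(env=None, prompt=getpass):
    """Return (access_key_id, secret_access_key) without hard-coding them.

    Hypothetical helper: values come from the environment when present,
    otherwise from an interactive prompt that does not echo input.
    """
    env = os.environ if env is None else env
    creds = []
    for name in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        # Prefer the environment; only prompt when the variable is missing or empty.
        value = env.get(name) or prompt(f"{name}: ")
        if not value:
            raise ValueError(f"{name} is not set")
        creds.append(value)
    return tuple(creds)
```

In a notebook this would be called once, e.g. `access_key, secret_key = load_aws_credentials()`, so no secret ends up in the saved cell source.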
"\n",
"After logs show at least **one epoch** and a checkpoint written to **S3**, suspend the RayJob. This is a **manual** suspend for the demo (distinct from Kueue holding the job until admission right after submit).\n",
"\n",
"Use **Pause** in the OpenShift AI UI, or run the next cell (`job.stop()`)."
Is there any way to easily build the URL for the user?
Co-authored-by: Laura Fitzgerald <lfitzger@redhat.com>
Issue link
RHOAIENG-57679: add checkpointing guided notebook
What changes have been made
Addition of new guided notebook showing checkpointing functionality
Verification steps
Run the notebook on a RHOAI cluster and confirm that all cells work as expected and that the Ray training state is saved to, and then retrieved from, an S3 bucket.
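The save-then-resume behaviour that this verification step checks can be sketched in plain Python. This is illustrative only: a local temporary directory stands in for the S3 bucket, and `save_checkpoint`, `latest_checkpoint`, and `train` are hypothetical names, not the notebook's code or Ray's API.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir, epoch, state):
    """Write one checkpoint file per completed epoch."""
    path = Path(ckpt_dir) / f"epoch_{epoch:04d}.json"
    path.write_text(json.dumps({"epoch": epoch, "state": state}))
    return path


def latest_checkpoint(ckpt_dir):
    """Return the most recent checkpoint as a dict, or None if there is none."""
    files = sorted(Path(ckpt_dir).glob("epoch_*.json"))
    return json.loads(files[-1].read_text()) if files else None


def train(ckpt_dir, total_epochs, stop_after=None):
    """Run a toy training loop, resuming from the latest checkpoint if present.

    stop_after simulates a manual suspend after that many epochs in this run.
    """
    ckpt = latest_checkpoint(ckpt_dir)
    start = ckpt["epoch"] + 1 if ckpt else 0
    state = ckpt["state"] if ckpt else {"steps": 0}
    ran = []
    for epoch in range(start, total_epochs):
        if stop_after is not None and len(ran) >= stop_after:
            break
        state["steps"] += 1  # stand-in for one epoch of real training
        save_checkpoint(ckpt_dir, epoch, state)
        ran.append(epoch)
    return ran


with tempfile.TemporaryDirectory() as d:
    first_run = train(d, total_epochs=3, stop_after=1)  # suspended after one epoch
    second_run = train(d, total_epochs=3)               # resumes from the checkpoint
```

The second call picks up at the epoch after the last saved checkpoint instead of starting from zero, which is the property the manual suspend and resubmit in the notebook is meant to demonstrate.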
Checks