stabilize GHA test workflows and use local checkpoints#4042
Merged
Conversation
d5130c5 to
96c54d4
Compare
- Wrap library path discovery in directory check to prevent crash in image-testing mode (when .venv is missing). - Force checkpoint integration tests to use local /tmp directory to avoid GCS permission failures on GHA runners.
96c54d4 to
eaf7dd6
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
bvandermoon
approved these changes
Jun 2, 2026
4 tasks
Shuwen-Fang
approved these changes
Jun 2, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
This PR stabilizes the GitHub Actions (GHA) test workflow by fixing two critical failures in the CI/CD pipeline:
.venv) is missing (e.g., when testing pre-installed packages in the Docker image).FileNotFoundErroronrun_2_metrics.txtin checkpoint integration tests by forcing checkpoints to be written to a local temporary directory instead of Google Cloud Storage (GCS).Why these changes are being made & The problems solved:
1. GPU Library Discovery Crash
nvidiasub-libraries inside.venv/lib/and prepends them toLD_LIBRARY_PATHusingfind .venv/lib/.maxtext_installed=true), the.venvdirectory is never created. Because GHA executes bash steps withset -e, runningfindon the non-existent.venv/lib/directory returned exit code1, immediately crashing the entire "Run Tests" step before tests could begin.if [ -d ".venv/lib" ]). This safely skips the path override in image-testing mode where system/image-level libraries are already correctly configured, while preserving it for local virtual environment runs.2. GCS Checkpoint Write Failure
checkpointing_test.pyandcheckpoint_compatibility_test.py) verify that saving and restoring checkpoints works across different steps and input pipelines.gs://runner-maxtext-logs. GHA runners do not have write access to this bucket. When the training run attempted to save a checkpoint, it failed and raised aStopTrainingexception. Because this exception is caught and swallowed by the training loop's graceful exit logic, the run terminated early (at step 0) before writing step metrics torun_2_metrics.txt, causing the test to fail with aFileNotFoundError.get_checkpointing_commandto accept an optionalbase_output_directoryparameter, and modified both checkpointing integration tests to write their checkpoints locally to/tmp/maxtext_local_output. This makes the tests hermetic, faster, and completely independent of external cloud write permissions.Tests
Maxtext Package Test workflow is passing: https://github.com/AI-Hypercomputer/maxtext/actions/runs/26788575971
Build Images workflow is passing: https://github.com/AI-Hypercomputer/maxtext/actions/runs/26837419916
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-reviewlabel.