stabilize GHA test workflows and use local checkpoints#4042

Merged
copybara-service[bot] merged 1 commit into main from fix-gha-nvidia-find
Jun 2, 2026

@darisoy (Collaborator) commented Jun 2, 2026

Description

This PR stabilizes the GitHub Actions (GHA) test workflow by fixing two critical failures in the CI/CD pipeline:

  1. Workflow Crash on GPU Image Testing: Prevents the test runner from crashing when the local virtual environment (.venv) is missing (e.g., when testing pre-installed packages in the Docker image).
  2. GCS Checkpoint Write Failures: Fixes a FileNotFoundError on run_2_metrics.txt in checkpoint integration tests by forcing checkpoints to be written to a local temporary directory instead of Google Cloud Storage (GCS).

Why these changes are being made and the problems they solve:

1. GPU Library Discovery Crash

  • Context: To avoid conflicts with system-level CUDA libraries on GHA GPU runners, the workflow dynamically discovers pip-installed nvidia sub-libraries inside .venv/lib/ and prepends them to LD_LIBRARY_PATH using find .venv/lib/.
  • The Problem: When running tests in "pre-installed image mode" (maxtext_installed=true), the .venv directory is never created. Because GHA executes bash steps with set -e, running find on the non-existent .venv/lib/ directory returned exit code 1, immediately crashing the entire "Run Tests" step before tests could begin.
  • The Solution: Wrap the discovery block in a directory check (if [ -d ".venv/lib" ]). This safely skips the path override in image-testing mode where system/image-level libraries are already correctly configured, while preserving it for local virtual environment runs.
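The guard described above can be sketched in Python for illustration. This is not the workflow's code: the actual fix is a bash `if [ -d ".venv/lib" ]` check around the `find` call. The function names and the assumed directory layout (`python*/site-packages/nvidia/<package>/lib`) are hypothetical and are not taken from the PR.

```python
import os
import tempfile
from pathlib import Path


def discover_nvidia_lib_dirs(venv_lib: Path) -> list[str]:
    """Return pip-installed nvidia library dirs, or [] when the venv is absent."""
    # Guard: in pre-installed image mode .venv/lib does not exist, so skip
    # discovery instead of failing on a missing directory.
    if not venv_lib.is_dir():
        return []
    # Assumed layout (hypothetical): <venv>/lib/pythonX.Y/site-packages/nvidia/<pkg>/lib
    pattern = "python*/site-packages/nvidia/*/lib"
    return sorted(str(p) for p in venv_lib.glob(pattern) if p.is_dir())


def prepend_to_ld_library_path(env: dict, dirs: list[str]) -> dict:
    """Return a copy of env with dirs prepended to LD_LIBRARY_PATH."""
    new_env = dict(env)
    if dirs:  # leave the environment untouched when nothing was discovered
        existing = new_env.get("LD_LIBRARY_PATH", "")
        parts = dirs + ([existing] if existing else [])
        new_env["LD_LIBRARY_PATH"] = os.pathsep.join(parts)
    return new_env


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # No .venv here, so this mimics image-testing mode: nothing is overridden.
        dirs = discover_nvidia_lib_dirs(Path(tmp) / ".venv" / "lib")
        print(prepend_to_ld_library_path({"LD_LIBRARY_PATH": "/usr/lib"}, dirs))
```

The key design point is that a missing directory is treated as "nothing to do" rather than as an error, so image-testing runs keep whatever library path the image already provides.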

2. GCS Checkpoint Write Failure

  • Context: Checkpoint integration tests (checkpointing_test.py and checkpoint_compatibility_test.py) verify that saving and restoring checkpoints works across different steps and input pipelines.
  • The Problem: By default, these tests were configured to write checkpoints to the cloud bucket gs://runner-maxtext-logs. GHA runners do not have write access to this bucket. When the training run attempted to save a checkpoint, it failed and raised a StopTraining exception. Because this exception is caught and swallowed by the training loop's graceful exit logic, the run terminated early (at step 0) before writing step metrics to run_2_metrics.txt, causing the test to fail with a FileNotFoundError.
  • The Solution: We updated get_checkpointing_command to accept an optional base_output_directory parameter, and modified both checkpointing integration tests to write their checkpoints locally to /tmp/maxtext_local_output. This makes the tests hermetic, faster, and completely independent of external cloud write permissions.
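A minimal sketch of the optional-parameter pattern described above. The function below is a simplified, hypothetical stand-in and not the real `get_checkpointing_command`: its name, its other parameters, the `key=value` argument style, and the fallback to the GCS bucket when no directory is passed are all assumptions for illustration. Only the bucket name, the local path, and the `base_output_directory` parameter name come from the PR description.

```python
from typing import List, Optional

# Both locations are named in the PR description.
GCS_OUTPUT_DIRECTORY = "gs://runner-maxtext-logs"
LOCAL_OUTPUT_DIRECTORY = "/tmp/maxtext_local_output"


def checkpointing_command_sketch(
    run_name: str,
    steps: int,
    metrics_file: str,
    base_output_directory: Optional[str] = None,
) -> List[str]:
    """Build a key=value argument list for a checkpointing test run (hypothetical)."""
    # Assumption: callers that pass nothing keep the previous cloud destination.
    if base_output_directory is None:
        base_output_directory = GCS_OUTPUT_DIRECTORY
    return [
        f"run_name={run_name}",
        f"steps={steps}",
        f"metrics_file={metrics_file}",
        f"base_output_directory={base_output_directory}",
    ]


if __name__ == "__main__":
    # A test that must not depend on cloud write access passes the local path.
    print(checkpointing_command_sketch(
        "runner_example", 5, "run_2_metrics.txt",
        base_output_directory=LOCAL_OUTPUT_DIRECTORY,
    ))
```

Making the parameter optional lets existing callers stay unchanged, while the two integration tests opt in to local output explicitly.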

Tests

Maxtext Package Test workflow is passing: https://github.com/AI-Hypercomputer/maxtext/actions/runs/26788575971

Build Images workflow is passing: https://github.com/AI-Hypercomputer/maxtext/actions/runs/26837419916

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@darisoy force-pushed the fix-gha-nvidia-find branch from d5130c5 to 96c54d4 on June 2, 2026 20:18
- Wrap library path discovery in directory check to prevent crash in image-testing mode (when .venv is missing).

- Force checkpoint integration tests to use local /tmp directory to avoid GCS permission failures on GHA runners.

codecov Bot commented Jun 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

copybara-service Bot merged commit d8763ef into main on Jun 2, 2026
47 checks passed
copybara-service Bot deleted the fix-gha-nvidia-find branch on June 2, 2026 23:59

3 participants