Rework CI into tiered fast / medium / slow lanes with Modal offload #4801
Open
safoinme wants to merge 64 commits into
✅ No broken links found!
def _write_output(name: str, value: str) -> None:
    """Write a GitHub Actions output."""
    if not GITHUB_OUTPUT:
        print(f"{name}={value}")
        return
    with Path(GITHUB_OUTPUT).open("a", encoding="utf-8") as output_file:
        output_file.write(f"{name}={value}\n")
Documentation Link Check Results: ✅ Absolute links check passed
…ml into feature/new-ci-architecture
…onment variable refactor: optimize root_run_id retrieval in SqlZenStore class
…ml into feature/new-ci-architecture
- Update Docker build command to use dynamic Python version argument.
- Modify integration installation script to skip AzureML for Windows with Python 3.13.
- Adjust test cases to ensure server images match the active CI Python version.
- Change Docker Compose pull policy to 'missing' for better image management.
Summary
Replaces the old two-tier CI (ci-fast + label-gated ci-slow) with a four-stage architecture aligned to the PR → merge-queue → develop lifecycle, and moves the bulk of unit/integration test execution off GitHub runners onto Modal sandboxes ("offload") for parallelism and speed. Also adds Python 3.13 support across the matrix and a develop-health gate for the merge queue.

The architecture is documented in .github/CLAUDE.md (updated in this PR), which is the canonical reference for reviewers.

CI architecture (new)
- ci-fast.yml
- ci-medium.yml
- ci-slow-develop.yml: … workflow_call … ci-slow.yml; intentionally does not cancel in-progress runs
- slow-ci-on-pr.yml: … run-slow-ci label … ci-slow-develop matrix via workflow_call
- develop-health-gate.yml: … (MAX_AGE_HOURS=30, GRACE_AFTER_SCHEDULE_HOURS=6) via scripts/develop_health_gate.py

Modal offload mechanism (the core new piece)
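As an aside on the develop-health gate from the CI architecture list: the description quotes MAX_AGE_HOURS=30 and GRACE_AFTER_SCHEDULE_HOURS=6 but does not spell out how they interact. The sketch below is one plausible reading, not the logic of scripts/develop_health_gate.py. The function name, its parameters, and the meaning given to the grace window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Values quoted in the PR description.
MAX_AGE_HOURS = 30
GRACE_AFTER_SCHEDULE_HOURS = 6


def develop_is_healthy(
    last_green_at: Optional[datetime],
    last_schedule_at: datetime,
    now: datetime,
) -> bool:
    """Hypothetical freshness check for the merge-queue gate."""
    # Fresh enough: a green develop run finished within MAX_AGE_HOURS.
    if last_green_at is not None:
        if now - last_green_at <= timedelta(hours=MAX_AGE_HOURS):
            return True
    # ASSUMPTION: a stale (or missing) green result is tolerated only while
    # the most recent scheduled run may still be in progress.
    since_schedule = now - last_schedule_at
    grace = timedelta(hours=GRACE_AFTER_SCHEDULE_HOURS)
    return timedelta(0) <= since_schedule <= grace
```

Under this reading, the gate blocks the merge queue only when the last green develop run is older than 30 hours and the latest scheduled run has had more than 6 hours to finish.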
linux-fast-offload.yml drives test execution inside Modal sandboxes built from a dedicated Dockerfile.ci. Configuration lives in offload.toml / offload-modal-server-mysql.toml; the sandbox driver is scripts/ci_modal_mysql_sandbox.py. Supporting helpers in scripts/ci/: compute_offload_cache_keys.py, classify_offload_result.py, normalize_offload_junit.py, export_offload_integration_requirements.py, emit_timing_manifest.py, print_junit_summary.py / print_junit_failures.py, verify_required_jobs.py.

Caching note for reviewers (detailed in .github/CLAUDE.md): there are three cache families:
- offload-uv-v1 (runner deps)
- offload-image-v2 (Modal image metadata; deliberately excludes runtime/CPU/filter fields)
- offload-junit-v2 (duration seeds only)
Restored JUnit XML is a timing seed, not test output, unless it is newer than the run-start marker.

Python 3.13
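The two caching rules from the Modal offload section (the image key ignores runtime/CPU/filter fields, and restored JUnit XML counts as test output only when newer than the run-start marker) can be pictured with a minimal sketch. This is not the code in compute_offload_cache_keys.py; the field names, function names, and hashing scheme are hypothetical, and only the offload-image-v2 prefix comes from the description.

```python
import hashlib
import json
from typing import Dict

# ASSUMPTION: hypothetical field names standing in for the
# "runtime/CPU/filter fields" that the image key excludes.
EXCLUDED_FIELDS = {"runtime", "cpu", "filter"}


def image_cache_key(config: Dict[str, str], prefix: str = "offload-image-v2") -> str:
    """Hash only the fields that affect the built image."""
    relevant = {k: v for k, v in config.items() if k not in EXCLUDED_FIELDS}
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return f"{prefix}-{hashlib.sha256(payload).hexdigest()[:16]}"


def is_fresh_test_output(junit_mtime: float, run_start: float) -> bool:
    """A JUnit file is real output only if written after the run started."""
    return junit_mtime > run_start
```

Excluding fields that change per job keeps the image cache hit rate high: two jobs that differ only in CPU count or test filter resolve to the same key.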
Matrix bumped to Py 3.13; migration/dependency overrides in scripts/ci/python313-migration-overrides.txt and python313-pydantic-v2-migration-overrides.txt; scripts/test-migrations.sh updated.

Production code changes (non-CI; please review closely)
Two small src/ changes were needed to make tests pass under the offload sandbox:
- src/zenml/utils/source_utils.py: set_custom_source_root() now falls back to ENV_ZENML_CUSTOM_SOURCE_ROOT. Needed because offload sandboxes run code from /app (ZENML_CUSTOM_SOURCE_ROOT=/app is set in offload.toml).
- src/zenml/zen_stores/sql_zen_store.py: the wait/resume root_run_id is now read via a direct SELECT instead of the ORM relationship (updated_schema.run.root_run_id), avoiding a stale-relationship read. Surfaced by the new parallel test coverage. Worth a careful look since it touches run-resume logic.

Tests & harness
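The environment-variable fallback described for set_custom_source_root() under the production code changes can be sketched as follows. This is a simplified stand-in, not the body of the function in src/zenml/utils/source_utils.py: the module-level variable, the getter, and the exact condition that triggers the fallback (here, a missing or empty argument) are assumptions.

```python
import os
from typing import Optional

# The variable name appears in the PR description (set to /app in offload.toml).
ENV_ZENML_CUSTOM_SOURCE_ROOT = "ZENML_CUSTOM_SOURCE_ROOT"

# Hypothetical module-level state for this sketch.
_custom_source_root: Optional[str] = None


def set_custom_source_root(source_root: Optional[str]) -> None:
    """Use the explicit root if given, else fall back to the environment."""
    global _custom_source_root
    # ASSUMPTION: the fallback applies when no explicit root is passed.
    _custom_source_root = source_root or os.environ.get(
        ENV_ZENML_CUSTOM_SOURCE_ROOT
    )


def get_custom_source_root() -> Optional[str]:
    """Hypothetical accessor used only to inspect the result."""
    return _custom_source_root
```

With ZENML_CUSTOM_SOURCE_ROOT=/app exported in the sandbox, a call with no explicit root still resolves sources relative to /app.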
- … scripts/ci/* helper under tests/unit/scripts/ci/ and harness behavior in tests/unit/test_harness_*.py.
- … cfg/deployments.yaml / cfg/environments.yaml entries.
- … (tests/integration/conftest.py, CLI test cleanup, active-project/source-root handling) to make tests safe under parallel sandbox execution.

Reviewer attention
- … src/zenml changes above: only production-behavior changes in an otherwise-CI PR.
- … verify_required_jobs.py and the path-filtering note in .github/CLAUDE.md (required rollups must still report status on docs-only changes).

Test plan
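The rollup requirement noted under reviewer attention (required checks must still report a status on docs-only changes) can be illustrated with a small sketch. It is not the implementation of verify_required_jobs.py. The function name and the choice to treat "skipped" as passing are assumptions; the result strings are the ones GitHub Actions exposes through the needs context.

```python
from typing import Dict, Iterable, List

# Result strings reported by GitHub Actions in `needs.<job>.result`
# are "success", "failure", "cancelled", and "skipped".
# ASSUMPTION: a path-filtered (skipped) job counts as passing.
PASSING_RESULTS = {"success", "skipped"}


def failing_required_jobs(
    results: Dict[str, str], required: Iterable[str]
) -> List[str]:
    """Return a description of every required job that did not pass."""
    failing = []
    for job in required:
        # A job absent from the results is treated as a failure.
        result = results.get(job, "missing")
        if result not in PASSING_RESULTS:
            failing.append(f"{job}: {result}")
    return failing
```

A rollup job that always runs and exits non-zero when this list is non-empty gives branch protection a status to evaluate even when path filters skip every test job.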
- ci-fast green on this PR (fast offload + Modal MySQL lanes)
- ci-medium green in merge queue
- ci-slow-develop run (or run-slow-ci label) passes the full matrix
- develop-health-gate behaves on a stale vs. fresh develop

Pre-requisites
Please ensure you have done the following:
- … develop and the open PR is targeting develop. If your branch wasn't based on develop, read Contribution guide on rebasing branch to develop.

Types of changes