Rework CI into tiered fast / medium / slow lanes with Modal offload #4801

Open
safoinme wants to merge 64 commits into develop from feature/new-ci-architecture
@safoinme
Contributor

@safoinme safoinme commented May 6, 2026

Summary

Replaces the old two-tier CI (ci-fast + label-gated ci-slow) with a four-stage architecture aligned to the PR → merge-queue → develop lifecycle, and moves the bulk of unit/integration test execution off GitHub runners onto Modal sandboxes ("offload") for parallelism and speed. Also adds Python 3.13 support across the matrix and a develop-health gate for the merge queue.

The architecture is documented in .github/CLAUDE.md (updated in this PR) — that's the canonical reference for reviewers.

CI architecture (new)

| Workflow | Trigger | Scope |
|---|---|---|
| ci-fast.yml | every PR, merge_group, push to develop | SQLite migrations, static checks (spellcheck/Ruff/pydoclint, Py 3.13), fast unit + non-slow integration via offload, separate Modal MySQL offload lane, API-docs build |
| ci-medium.yml | merge_group, manual | random-DB migrations, Py 3.13 lint + unit, Modal-hosted MySQL integration |
| ci-slow-develop.yml | schedule / manual / workflow_call | full qualification matrix: multi-OS (Ubuntu/Win/macOS), full migrations (MySQL/MariaDB/SQLite), tutorial + base-package tests. Renamed from ci-slow.yml; intentionally does not cancel in-progress runs |
| slow-ci-on-pr.yml | run-slow-ci label | advisory PR slow CI; reuses the ci-slow-develop matrix via workflow_call |
| develop-health-gate.yml | merge_group, manual | blocks merges if the last successful develop qualification is stale (MAX_AGE_HOURS=30, GRACE_AFTER_SCHEDULE_HOURS=6) via scripts/develop_health_gate.py |
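
The staleness rule behind develop-health-gate.yml could look roughly like the sketch below. It is an illustration only, not the contents of scripts/develop_health_gate.py: the function name `is_develop_healthy`, its parameters, and especially the reading of GRACE_AFTER_SCHEDULE_HOURS (a window after the latest scheduled run during which a stale result is tolerated while that run finishes) are assumptions. Only the two constant values come from the description above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MAX_AGE_HOURS = 30  # value quoted in the PR description
GRACE_AFTER_SCHEDULE_HOURS = 6  # value quoted in the PR description


def is_develop_healthy(
    last_success: Optional[datetime],
    last_schedule: Optional[datetime],
    now: datetime,
) -> bool:
    """Return True if the merge queue may proceed (hypothetical rule)."""
    # A recent successful qualification run always passes the gate.
    if last_success is not None:
        if now - last_success <= timedelta(hours=MAX_AGE_HOURS):
            return True
    # ASSUMPTION: shortly after a scheduled run has started, a stale result
    # is tolerated so the in-flight qualification has time to finish.
    if last_schedule is not None:
        since_schedule = now - last_schedule
        grace = timedelta(hours=GRACE_AFTER_SCHEDULE_HOURS)
        if timedelta(0) <= since_schedule <= grace:
            return True
    return False


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_develop_healthy(now - timedelta(hours=40), None, now))
```

Keeping the decision in a pure function of three timestamps makes the gate easy to unit test without calling the GitHub API.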

Modal offload mechanism (the core new piece)

linux-fast-offload.yml drives test execution inside Modal sandboxes built from a dedicated Dockerfile.ci. Configuration lives in offload.toml / offload-modal-server-mysql.toml; the sandbox driver is scripts/ci_modal_mysql_sandbox.py. Supporting helpers in scripts/ci/:

  • compute_offload_cache_keys.py, classify_offload_result.py, normalize_offload_junit.py, export_offload_integration_requirements.py, emit_timing_manifest.py, print_junit_summary.py / print_junit_failures.py, verify_required_jobs.py.

Caching note for reviewers (detailed in .github/CLAUDE.md): three cache families — offload-uv-v1 (runner deps), offload-image-v2 (Modal image metadata, deliberately excludes runtime/CPU/filter fields), offload-junit-v2 (duration seeds only). Restored JUnit XML is a timing seed, not test output, unless newer than the run-start marker.
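
As a rough illustration of the two caching rules above, the sketch below hashes only image-relevant fields into a cache key and treats a JUnit file as real test output only when it is newer than a run-start marker. This is not the code in compute_offload_cache_keys.py: the field names in `EXCLUDED_FIELDS`, the function names, and the use of file modification times are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict

# HYPOTHETICAL field names: the PR only states that runtime/CPU/filter
# fields are excluded from the image key, not what they are called.
EXCLUDED_FIELDS = frozenset({"runtime", "cpu", "test_filter"})


def image_cache_key(
    config: Dict[str, Any], prefix: str = "offload-image-v2"
) -> str:
    """Build a cache key from the fields that affect the built image."""
    relevant = {k: v for k, v in config.items() if k not in EXCLUDED_FIELDS}
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return f"{prefix}-{hashlib.sha256(payload).hexdigest()[:16]}"


def is_test_output(junit_xml: Path, run_start_marker: Path) -> bool:
    """Return True only if the JUnit file was written after the run began.

    ASSUMPTION: freshness is judged by comparing modification times.
    Older files are restored cache entries and serve as timing seeds only.
    """
    return junit_xml.stat().st_mtime > run_start_marker.stat().st_mtime
```

With this shape, changing the CPU count or the test filter leaves the image key unchanged, so the Modal image metadata cache is still hit.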

Python 3.13

Matrix bumped to Py 3.13; migration/dependency overrides in scripts/ci/python313-migration-overrides.txt and python313-pydantic-v2-migration-overrides.txt; scripts/test-migrations.sh updated.

Production code changes (non-CI — please review closely)

Two small src/ changes were needed to make tests pass under the offload sandbox:

  • src/zenml/utils/source_utils.py: set_custom_source_root() now falls back to ENV_ZENML_CUSTOM_SOURCE_ROOT. Needed because offload sandboxes run code from /app (ZENML_CUSTOM_SOURCE_ROOT=/app is set in offload.toml).
  • src/zenml/zen_stores/sql_zen_store.py — wait/resume root_run_id is now read via a direct SELECT instead of the ORM relationship (updated_schema.run.root_run_id), avoiding a stale-relationship read. Surfaced by the new parallel test coverage. Worth a careful look since it touches run-resume logic.
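
The environment fallback described in the first bullet can be pictured with the standalone sketch below. It is not the ZenML implementation: the module-level variable and the `get_custom_source_root` helper are hypothetical, and the constant is assumed to hold the variable name quoted above.

```python
import os
from typing import Optional

# Assumed to match the ZenML constant; the variable name itself is the one
# the PR says is set in offload.toml.
ENV_ZENML_CUSTOM_SOURCE_ROOT = "ZENML_CUSTOM_SOURCE_ROOT"

_custom_source_root: Optional[str] = None


def set_custom_source_root(source_root: Optional[str]) -> None:
    """Set the custom source root, falling back to the environment."""
    global _custom_source_root
    # An explicit argument wins; otherwise use the sandbox-provided value.
    _custom_source_root = source_root or os.environ.get(
        ENV_ZENML_CUSTOM_SOURCE_ROOT
    )


def get_custom_source_root() -> Optional[str]:
    """Hypothetical accessor used only to show the effect."""
    return _custom_source_root
```

In a sandbox where ZENML_CUSTOM_SOURCE_ROOT=/app is exported, calling `set_custom_source_root(None)` in this sketch leaves the root at /app instead of clearing it.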

Tests & harness

  • New unit coverage for every scripts/ci/* helper under tests/unit/scripts/ci/ and harness behavior in tests/unit/test_harness_*.py.
  • Harness updates for Dockerized MySQL/MariaDB deployments + new cfg/deployments.yaml / cfg/environments.yaml entries.
  • Test-isolation fixes (tests/integration/conftest.py, CLI test cleanup, active-project/source-root handling) to make tests safe under parallel sandbox execution.

Reviewer attention

  1. The two src/zenml changes above — only production-behavior changes in an otherwise-CI PR.
  2. Branch protection / required checks: required-check names changed (job renames, new rollup jobs). Settings must be updated or the merge queue will block — see verify_required_jobs.py and the path-filtering note in .github/CLAUDE.md (required rollups must still report status on docs-only changes).
  3. Modal credentials/secrets must be available to the offload lanes.
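
For item 2, a rollup check can be sketched as a small function over the GitHub Actions `needs` context serialized as JSON, where each job entry carries a `result` field. This is not the code in verify_required_jobs.py: the function name and the job names are hypothetical, and counting `skipped` as acceptable (so path-filtered jobs do not fail the rollup on docs-only changes) is an assumption.

```python
import json
from typing import Dict, Iterable, List

# ASSUMPTION: a skipped job is acceptable for the rollup.
ALLOWED_RESULTS = frozenset({"success", "skipped"})


def failed_required_jobs(needs_json: str, required: Iterable[str]) -> List[str]:
    """Return 'job: result' entries for required jobs that did not pass."""
    needs: Dict[str, Dict[str, object]] = json.loads(needs_json)
    problems: List[str] = []
    for job in required:
        # A required job absent from the needs context is reported too.
        result = needs.get(job, {}).get("result", "missing")
        if result not in ALLOWED_RESULTS:
            problems.append(f"{job}: {result}")
    return problems


if __name__ == "__main__":
    example = json.dumps(
        {"lint": {"result": "success"}, "unit": {"result": "failure"}}
    )
    print(failed_required_jobs(example, ["lint", "unit"]))
```

Because the rollup job itself always runs and reports one status, branch protection only needs to require the rollup name rather than every individual job.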

Test plan

  • ci-fast green on this PR (fast offload + Modal MySQL lanes)
  • ci-medium green in merge queue
  • One ci-slow-develop run (or run-slow-ci label) passes the full matrix
  • Confirm develop-health-gate behaves on a stale vs. fresh develop
  • Update branch-protection required checks before merge

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • I have added tests to cover my changes.
  • I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the Contribution guide on rebasing a branch to develop.
  • IMPORTANT: I made sure that my changes are reflected properly in the following resources:
    • ZenML Docs
    • Dashboard: Needs to be communicated to the frontend team.
    • Templates: Might need adjustments (that are not reflected in the template tests) in case of non-breaking changes and deprecations.
    • Projects: Depending on the version dependencies, different projects might get affected.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@github-actions
Contributor

github-actions Bot commented May 6, 2026

✅ No broken links found!

Comment thread scripts/ci_modal_mysql_sandbox.py Outdated
def _write_output(name: str, value: str) -> None:
    """Write a GitHub Actions output."""
    if not GITHUB_OUTPUT:
        print(f"{name}={value}")
        return
    with Path(GITHUB_OUTPUT).open("a", encoding="utf-8") as output_file:
        output_file.write(f"{name}={value}\n")
@github-actions github-actions Bot added the internal and enhancement labels May 6, 2026
@socket-security

socket-security Bot commented May 6, 2026

Review the following changes in direct dependencies. Learn more about Socket for GitHub.

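| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | modal@1.4.1 | 75 | 100 | 100 | 100 | 100 |
| Added | click@8.4.0 | 96 | 100 | 100 | 100 | 100 |
| Added | dockerfile-parse@2.0.1 | 100 | 100 | 100 | 100 | 100 |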

View full report

@github-actions
Contributor

github-actions Bot commented May 6, 2026

Documentation Link Check Results

Absolute links check passed
Relative links check passed
Last checked: 2026-05-22 00:06:00 UTC

@bcdurak bcdurak linked an issue May 8, 2026 that may be closed by this pull request
safoinme added 26 commits May 9, 2026 00:14
…onment variable

refactor: optimize root_run_id retrieval in SqlZenStore class
- Update Docker build command to use dynamic Python version argument.
- Modify integration installation script to skip AzureML for Windows with Python 3.13.
- Adjust test cases to ensure server images match the active CI Python version.
- Change Docker Compose pull policy to 'missing' for better image management.
@safoinme safoinme changed the title New CI Architecture Rework CI into tiered fast / medium / slow lanes with Modal offload May 21, 2026
@safoinme safoinme requested review from bcdurak and strickvl May 21, 2026 23:36
Labels

enhancement (New feature or request), internal (To filter out internal PRs and issues)

Development

Successfully merging this pull request may close these issues.

Faster CI suite - sub 10min
