Enable Modal memory snapshots on household API worker #1528
Merged
Convert `handle_household_request` from a top-level Modal function to a `HouseholdWorker` class with `@modal.enter(snap=True)` so the heavy Flask app import (which pulls in policyengine country packages) runs at snapshot creation time. Subsequent cold starts restore the post-import state from the snapshot in seconds rather than re-running the ~45s import chain on every fresh container.

Update the gateway dispatch site to look up the class via `modal.Cls.from_name` and the doc reference accordingly. Add tests asserting `enable_memory_snapshot=True` is set in every Modal environment and that the class exposes both the snapshot hook and the method entrypoint.

Fixes #1527

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
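The load-once, restore-many boundary described above can be illustrated without Modal at all. The sketch below is a stdlib-only simulation, not the worker's code: `ToyWorker`, `load_app`, `take_snapshot`, `cold_start`, and the use of `copy.deepcopy` as a stand-in for a memory snapshot are all assumptions made for illustration.

```python
import copy


class ToyWorker:
    """Hypothetical stand-in for a worker class with a snapshot-time hook."""

    def __init__(self):
        self.app = None
        self.load_calls = 0  # counts how often the expensive load ran

    def load_app(self):
        # Plays the role of the snapshot-time hook: the expensive import
        # chain would run here, once, before the snapshot is taken.
        self.load_calls += 1
        self.app = {"name": "household-api", "ready": True}

    def handle(self, payload):
        # Plays the role of the per-invocation method: it only uses state
        # that was already captured in the snapshot.
        return {"ready": self.app["ready"], "echo": payload}


def take_snapshot():
    worker = ToyWorker()
    worker.load_app()
    return worker  # the loaded object stands in for the frozen memory image


def cold_start(snapshot):
    # Restore a copy of the captured state instead of re-running load_app.
    return copy.deepcopy(snapshot)


snapshot = take_snapshot()
responses = [cold_start(snapshot).handle({"id": i}) for i in range(3)]
```

Each simulated cold start serves a request without calling `load_app` again, which mirrors the intent of moving the import chain into the snapshot hook.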
Importing `policyengine_household_api.api` runs `initialize_analytics_db_if_enabled` at module level, which opens a Cloud SQL connection via the Google Cloud SQL Connector in environments where analytics is enabled. That connector authenticates from `GOOGLE_APPLICATION_CREDENTIALS`, set by `configure_google_credentials()`.

The previous shape ran the credentials configuration before the deferred import on every cold start. Hoisting the import into the snapshot hook also requires hoisting the credentials configuration; otherwise the worker can fail at snapshot time before any request method runs.

Keep an idempotent call in `handle_household_request` so a snapshot-restored container re-establishes the credentials file on disk if Modal does not preserve `/tmp` filesystem state across snapshots; the function short-circuits when the env var is already set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes flagged by review on #1528:

1. Gateway backward compatibility during the release transition. The default Modal release flow promotes the existing frontier worker to current without redeploying it. With this PR shipped, the new worker exposes the `HouseholdWorker` class but the promoted current worker still only exposes the pre-PR top-level `handle_household_request` function. `call_worker_function` now tries `modal.Cls.from_name(..., "HouseholdWorker")` first and falls back to `modal.Function.from_name(..., "handle_household_request")` on `modal.exception.NotFoundError`, so both shapes route correctly during the one release cycle where both coexist. Added two gateway unit tests covering the class path and the function-fallback path.

2. Reset live network state captured by the memory snapshot. Memory snapshots preserve Python object state but not live TCP sockets; the SQLAlchemy pool and the Cloud SQL Connector created at snapshot time hold sockets that closed when Modal froze the container. Add `@modal.enter(snap=False) reset_post_snapshot_state` on `HouseholdWorker` to run on every snapshot-restored container start: re-establish the credentials file in case `/tmp` was not preserved, then call `analytics_setup.cleanup()` (drops the Cloud SQL Connector singleton) and `db.engine.dispose()` (drops the SQLAlchemy connection pool). Subsequent queries open fresh connections. Added a unit test asserting the hook is declared on the class.

References:
- https://modal.com/docs/guide/memory-snapshot
- https://modal.com/blog/mem-snapshots

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
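The class-first, function-fallback dispatch from the first fix can be sketched in isolation. This is a minimal illustration, not the gateway's code: the local `NotFoundError` stands in for `modal.exception.NotFoundError`, and `call_worker`, `call_class_worker`, and `call_function_worker` are hypothetical names for callables that would wrap the class lookup and the function lookup respectively.

```python
class NotFoundError(Exception):
    """Stand-in for modal.exception.NotFoundError (assumption for this sketch)."""


def call_worker(payload, call_class_worker, call_function_worker):
    """Try the class-shaped worker first, then the pre-PR function shape.

    Both callables are hypothetical: each is expected to look up its target
    and invoke it, raising NotFoundError if the target is not deployed.
    """
    try:
        # The try covers lookup and invocation together, in case the lookup
        # is resolved lazily and only fails when the call is attempted.
        return call_class_worker(payload)
    except NotFoundError:
        return call_function_worker(payload)


# Fake workers for demonstration only.
def new_worker(payload):
    return {"via": "HouseholdWorker", "payload": payload}


def missing_class(payload):
    raise NotFoundError("HouseholdWorker not deployed")


def old_worker(payload):
    return {"via": "handle_household_request", "payload": payload}


result_new = call_worker({"x": 1}, new_worker, old_worker)
result_old = call_worker({"x": 1}, missing_class, old_worker)
```

Catching only the not-found error keeps the fallback narrow: any other failure from the class path propagates instead of being silently retried against the older worker shape.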
…napshot

# Conflicts:
# tests/unit/modal_release/test_gateway.py
# tests/unit/modal_release/test_worker_app.py
Modal preserves env vars across snapshot restore but not `/tmp`. `configure_google_credentials()` short-circuits when `GOOGLE_APPLICATION_CREDENTIALS` is set, so the env var survived the snapshot while the file at that path may not, leaving Cloud SQL Connector reads pointing at a missing credentials file.

Pop the env var before calling `configure_google_credentials()` in the post-restore reset hook so the credentials file is rewritten on every container start.

Addresses P1 review feedback on PR #1528.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
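The failure mode and the fix can be reproduced with the standard library alone. In this sketch, `configure_google_credentials` is a hypothetical stand-in that takes explicit arguments (the real function's signature and file location are not shown in this PR text), and `reset_credentials_after_restore` is an illustrative name for the pop-then-configure step.

```python
import os
import tempfile
from pathlib import Path

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def configure_google_credentials(credentials_json, directory):
    """Hypothetical stand-in: writes the file unless the env var is already set."""
    if os.environ.get(ENV_VAR):
        return os.environ[ENV_VAR]  # short-circuits even if the file is gone
    path = Path(directory) / "credentials.json"
    path.write_text(credentials_json)
    os.environ[ENV_VAR] = str(path)
    return str(path)


def reset_credentials_after_restore(credentials_json, directory):
    # Drop the stale env var so the short-circuit cannot skip the rewrite.
    os.environ.pop(ENV_VAR, None)
    return configure_google_credentials(credentials_json, directory)


os.environ.pop(ENV_VAR, None)  # start the demo from a clean environment
with tempfile.TemporaryDirectory() as tmp:
    path = configure_google_credentials('{"type": "demo"}', tmp)
    Path(path).unlink()  # simulate the file not surviving the restore

    # Without popping, the env var is still set, so nothing is rewritten.
    configure_google_credentials('{"type": "demo"}', tmp)
    exists_without_pop = Path(path).exists()

    # With the pop, the file is written again.
    reset_credentials_after_restore('{"type": "demo"}', tmp)
    exists_with_pop = Path(path).exists()
os.environ.pop(ENV_VAR, None)  # remove the env var set by the demo
```

The first re-configure leaves the path dangling because the env var alone satisfies the short-circuit; only the pop-then-configure sequence restores the file.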
This was referenced May 17, 2026
Fixes #1527
Summary
Enables Modal memory snapshots on the household API worker to drop cold-start latency from ~45s to a few seconds. Implements Option B from the issue's implementation-choice comment: convert the worker to an `@app.cls` with an `@modal.enter(snap=True)` hook so the heavy Flask import runs at snapshot creation time, not on every cold start.

Changes
- `policyengine_household_api/modal_release/worker_app.py`: convert `handle_household_request` from a top-level `@app.function` to a method on a `HouseholdWorker` class decorated with `@app.cls`. The new `load_flask_app` method uses `@modal.enter(snap=True)` so the `policyengine_household_api.api` import (which pulls in `policyengine_us`, `policyengine_uk`, etc.) executes once at snapshot creation. Add `enable_memory_snapshot=True` to `worker_function_options` so it applies to all Modal environments.
- `policyengine_household_api/modal_release/gateway.py`: `call_worker_function` now resolves the worker via `modal.Cls.from_name(app_name, "HouseholdWorker")` and invokes `.handle_household_request.remote(payload)`.
- `docs/engineering/skills/modal-release-prs.md`: update the Request Routing section to refer to `HouseholdWorker.handle_household_request` rather than the old bare function name.
- `tests/unit/modal_release/test_worker_app.py`: add two tests: one asserts `enable_memory_snapshot=True` is set for every Modal environment, one verifies the class exposes both the snapshot hook and the method entrypoint.
- `changelog.d/1527.changed.md`: Towncrier fragment.

Why a class instead of just adding the flag to the function?
`enable_memory_snapshot=True` only captures whatever is in memory at the time Modal takes the snapshot. With the previous shape, the heavy import was deferred until first invocation, so Modal would have snapshotted a near-empty state and the first call on every cold container would still pay ~45s. Hoisting the import to module top-level would technically work but mixes "what's snapshotted" with "what runs per call" in a way that's easy to break.

The Modal-idiomatic pattern, `@app.cls` + `@modal.enter(snap=True)`, makes the boundary explicit: anything assigned to `self` inside the snap hook is captured in the snapshot, and anything in `@modal.method` bodies runs per invocation. `configure_google_credentials()` stays per-call (as it should), and the Flask app load goes into the snapshot (as it should).

Expected impact
`CountryTaxBenefitSystem()` chain)

Complements (does not overlap with) #1525, which keeps a warm pool of 3 containers in production. This PR makes the cold starts that do still happen (concurrency bursts, deploys, staging, after scale-down) fast.
Local checks
- `make format-check` → clean (128 files already formatted).
- `.github/scripts`, `tests/to_refactor`, `tests/unit`) → 419 passed.
- `tests/unit/modal_release/test_worker_app.py` pass.

Verification plan after merge (staging-first)
- `modal app list` shows 0 containers).
- `modal app logs <worker-app-name> --timestamps`.
- `authlib -> Initialising API -> API initialised` sequence.
- `duration:` in single-digit seconds, not 40-85s.
- `main` per the usual release workflow.

Out of scope
- `min_containers` / warm pool (already implemented in "Keep production Modal workers warm" #1526, addressing "Keep production Modal household API workers warm" #1525).

Test plan
- `make format-check`)