Enable Modal memory snapshots on household API worker #1528

Merged. hua7450 merged 5 commits into main from enable-modal-memory-snapshot on May 17, 2026

@hua7450 hua7450 commented May 15, 2026

Fixes #1527

Summary

Enables Modal memory snapshots on the household API worker to drop cold-start latency from ~45s to a few seconds. Implements Option B from the issue's implementation-choice comment — convert the worker to a @app.cls with an @modal.enter(snap=True) hook so the heavy Flask import runs at snapshot creation time, not on every cold start.

Changes

  • policyengine_household_api/modal_release/worker_app.py — convert handle_household_request from a top-level @app.function to a method on a HouseholdWorker class decorated with @app.cls. The new load_flask_app method uses @modal.enter(snap=True) so the policyengine_household_api.api import (which pulls in policyengine_us, policyengine_uk, etc.) executes once at snapshot creation. Add enable_memory_snapshot=True to worker_function_options so it applies to all Modal environments.
  • policyengine_household_api/modal_release/gateway.py: call_worker_function now resolves the worker via modal.Cls.from_name(app_name, "HouseholdWorker") and invokes .handle_household_request.remote(payload).
  • docs/engineering/skills/modal-release-prs.md — update the Request Routing section to refer to HouseholdWorker.handle_household_request rather than the old bare function name.
  • tests/unit/modal_release/test_worker_app.py — add two tests: one asserts enable_memory_snapshot=True is set for every Modal environment, one verifies the class exposes both the snapshot hook and the method entrypoint.
  • changelog.d/1527.changed.md — Towncrier fragment.

Why a class instead of just adding the flag to the function?

enable_memory_snapshot=True only captures whatever's in memory at the time Modal takes the snapshot. With the previous shape:

def handle_household_request(payload):
    configure_google_credentials()
    from policyengine_household_api.api import app as flask_app   # heavy import, deferred
    return dispatch_to_flask_app(flask_app, payload)

the heavy import was deferred until first invocation, so Modal would have snapshotted a near-empty state and the first call on every cold container would still pay ~45s. Hoisting the import to module top-level would technically work but mixes "what's snapshotted" with "what runs per call" in a way that's easy to break. The Modal-idiomatic pattern, @app.cls + @modal.enter(snap=True), makes the boundary explicit: anything assigned to self inside the snap hook is captured in the snapshot, and anything in @modal.method bodies runs per invocation. configure_google_credentials() stays per-call, as it should, and the Flask app load goes into the snapshot, where it belongs.
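The boundary described above can be illustrated with a toy model in plain Python. This is not Modal's API: ToyWorker, create_snapshot, and cold_start are hypothetical stand-ins, and copy.deepcopy merely plays the role of snapshot creation and restore. The sketch only shows the intended behaviour, namely that the expensive load runs once while every restored container can serve requests.

```python
import copy


class ToyWorker:
    """Hypothetical stand-in for a class-based worker; not Modal's API."""

    def __init__(self):
        self.flask_app = None

    def load_flask_app(self, load_log):
        # Plays the role of the snapshot hook: expensive work whose result is
        # stored on self and is therefore part of the captured state.
        load_log.append("heavy import")
        self.flask_app = {"name": "flask_app"}  # placeholder for the real app

    def handle_household_request(self, payload):
        # Plays the role of the per-invocation method body.
        return {"served_by": self.flask_app["name"], "payload": payload}


def create_snapshot(load_log):
    """Run the hook once and freeze the resulting state."""
    worker = ToyWorker()
    worker.load_flask_app(load_log)
    return copy.deepcopy(worker)


def cold_start(snapshot):
    """Restore a container from the frozen state without re-running the hook."""
    return copy.deepcopy(snapshot)


load_log = []
snapshot = create_snapshot(load_log)
containers = [cold_start(snapshot) for _ in range(3)]
responses = [
    container.handle_household_request({"id": index})
    for index, container in enumerate(containers)
]
```

After three simulated cold starts, load_log still holds a single entry, which is the property the class-based shape is meant to guarantee.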

Expected impact

| Metric | Before | After |
| --- | --- | --- |
| Cold-start init time | ~45s (full import + CountryTaxBenefitSystem() chain) | ~2-5s (memory snapshot restore) |
| Worst-case observed wall-clock for a partner-facing cold request | 85.8s (real gateway log measurement from 2026-05-15) | ~5-10s expected |
| Concurrency-burst tail (P99 in scale-out events) | up to ~56s (real log measurement) | ~5-10s expected |
| Storage cost | n/a | ~$0.30-1/month for the snapshot |
| Steady-state compute spend | baseline | strictly lower (no repeated 45s import billing) |

Complements (does not overlap with) #1525, which keeps a warm pool of 3 containers in production. This PR makes the cold starts that do still happen — concurrency bursts, deploys, staging, after scale-down — fast.

Local checks

  • make format-check → clean (128 files already formatted).
  • Full unit test suite (.github/scripts, tests/to_refactor, tests/unit) → 419 passed.
  • The two new tests in tests/unit/modal_release/test_worker_app.py pass.

Verification plan after merge (staging-first)

  1. Deploy to the staging Modal environment.
  2. Wait for the worker app to scale to zero (modal app list shows 0 containers).
  3. Stream the worker log: modal app logs <worker-app-name> --timestamps.
  4. Fire one calculate request.
  5. Confirm:
    • Worker log shows snapshot restore (a single short line) rather than the full authlib -> Initialising API -> API initialised sequence.
    • Gateway log shows the first request's duration in single-digit seconds, not 40-85s.
    • Parity against GCP household-api remains exact (no numerical regression from frozen snapshot state).
  6. If staging looks good, promote to main per the usual release workflow.
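One way to make the "parity remains exact" check in step 5 concrete is a recursive comparison that reports every path at which two JSON responses differ. The sketch below is an assumption about how such a check could look, not the project's actual parity test; diff_paths and the response shape (including the income_tax key and its value) are hypothetical.

```python
import json


def diff_paths(expected, actual, path=""):
    """Return the paths at which two JSON-like values differ (empty if equal)."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        diffs = []
        for key in sorted(set(expected) | set(actual)):
            child = f"{path}.{key}" if path else str(key)
            if key not in expected or key not in actual:
                diffs.append(child)  # key present on one side only
            else:
                diffs.extend(diff_paths(expected[key], actual[key], child))
        return diffs
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return [path or "<root>"]
        diffs = []
        for index, (left, right) in enumerate(zip(expected, actual)):
            diffs.extend(diff_paths(left, right, f"{path}[{index}]"))
        return diffs
    # Leaves must match exactly; no numerical tolerance is applied.
    return [] if expected == actual else [path or "<root>"]


# Hypothetical responses: one from the GCP deployment, two from Modal.
gcp_response = {"result": {"people": {"you": {"income_tax": {"2026": 1234.5}}}}}
modal_response = json.loads(json.dumps(gcp_response))
drifted_response = json.loads(json.dumps(gcp_response))
drifted_response["result"]["people"]["you"]["income_tax"]["2026"] = 1234.6

matching = diff_paths(gcp_response, modal_response)
drifted = diff_paths(gcp_response, drifted_response)
```

An empty list means the two responses agree exactly; any non-empty list names the variables to investigate for a regression from frozen snapshot state.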

Out of scope

Test plan

  • Lint passes (make format-check)
  • Full unit test suite passes locally
  • CI passes on this PR
  • Staging deploy verifies cold-start latency drop (manual, post-merge)
  • GCP parity test (sequential + burst) confirms no numerical regression on snapshotted state

hua7450 and others added 5 commits May 15, 2026 14:40
Convert `handle_household_request` from a top-level Modal function to a
`HouseholdWorker` class with `@modal.enter(snap=True)` so the heavy
Flask app import (which pulls in policyengine country packages) runs at
snapshot creation time. Subsequent cold starts restore the post-import
state from the snapshot in seconds rather than re-running the ~45s
import chain on every fresh container.

Update the gateway dispatch site to look up the class via
`modal.Cls.from_name` and the doc reference accordingly.

Add tests asserting `enable_memory_snapshot=True` is set in every Modal
environment and that the class exposes both the snapshot hook and the
method entrypoint.

Fixes #1527

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Importing `policyengine_household_api.api` runs
`initialize_analytics_db_if_enabled` at module level, which opens a Cloud
SQL connection via the Google Cloud SQL Connector in environments where
analytics is enabled. That connector authenticates using the file named by
GOOGLE_APPLICATION_CREDENTIALS, which `configure_google_credentials()` sets.

The previous shape ran the credentials configuration before the deferred
import on every cold start. Hoisting the import into the snapshot hook
also requires hoisting the credentials configuration, otherwise the
worker can fail at snapshot-time before any request method runs.

Keep an idempotent call in `handle_household_request` so a snapshot-
restored container re-establishes the credentials file on disk if Modal
does not preserve /tmp filesystem state across snapshots; the function
short-circuits when the env var is already set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two fixes flagged by review on #1528:

1. Gateway backward compatibility during the release transition.
   The default Modal release flow promotes the existing frontier worker
   to current without redeploying it. With this PR shipped, the new
   worker exposes the `HouseholdWorker` class but the promoted current
   worker still only exposes the pre-PR top-level
   `handle_household_request` function. `call_worker_function` now tries
   `modal.Cls.from_name(..., "HouseholdWorker")` first and falls back to
   `modal.Function.from_name(..., "handle_household_request")` on
   `modal.exception.NotFoundError`, so both shapes route correctly
   during the one release cycle where both coexist. Added two gateway
   unit tests covering the class path and the function-fallback path.

2. Reset live network state captured by the memory snapshot.
   Memory snapshots preserve Python object state but not live TCP
   sockets; the SQLAlchemy pool and the Cloud SQL Connector created at
   snapshot time hold sockets that closed when Modal froze the
   container. Add `@modal.enter(snap=False) reset_post_snapshot_state`
   on `HouseholdWorker` to run on every snapshot-restored container
   start: re-establish the credentials file in case `/tmp` was not
   preserved, then call `analytics_setup.cleanup()` (drops the Cloud SQL
   Connector singleton) and `db.engine.dispose()` (drops the SQLAlchemy
   connection pool). Subsequent queries open fresh connections. Added a
   unit test asserting the hook is declared on the class.
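The class-first, function-fallback routing in item 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: the real signature of `call_worker_function` is not shown in this PR, `lookup_cls` and `lookup_function` are hypothetical injected parameters standing in for `modal.Cls.from_name` and `modal.Function.from_name`, and `NotFoundError` here is a local stand-in for `modal.exception.NotFoundError` so the example runs without Modal installed.

```python
class NotFoundError(Exception):
    """Local stand-in for modal.exception.NotFoundError."""


def call_worker_function(app_name, payload, lookup_cls, lookup_function):
    """Route to the class-based worker, falling back to the old function."""
    try:
        worker_cls = lookup_cls(app_name, "HouseholdWorker")
        # The call sits inside the try block on the assumption that a lazy
        # lookup may only report a missing class when it is first used.
        return worker_cls().handle_household_request.remote(payload)
    except NotFoundError:
        function = lookup_function(app_name, "handle_household_request")
        return function.remote(payload)


class FakeRemote:
    """Test double exposing a .remote(payload) call."""

    def __init__(self, label):
        self.label = label

    def remote(self, payload):
        return {"via": self.label, "payload": payload}


class FakeWorker:
    handle_household_request = FakeRemote("class")


def cls_found(app_name, name):
    return FakeWorker


def cls_missing(app_name, name):
    raise NotFoundError(name)


def function_found(app_name, name):
    return FakeRemote("function")


new_worker = call_worker_function("worker-app", {"x": 1}, cls_found, function_found)
old_worker = call_worker_function("worker-app", {"x": 1}, cls_missing, function_found)
```

Injecting the lookups keeps both paths unit-testable with plain test doubles, which mirrors the two gateway tests the commit describes.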

References:
- https://modal.com/docs/guide/memory-snapshot
- https://modal.com/blog/mem-snapshots

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…napshot

# Conflicts:
#	tests/unit/modal_release/test_gateway.py
#	tests/unit/modal_release/test_worker_app.py
Modal preserves env vars across snapshot restore but not /tmp.
configure_google_credentials() short-circuits when
GOOGLE_APPLICATION_CREDENTIALS is set, so after a restore the env var
survives while the file at that path may not. The function then skips
the rewrite and the Cloud SQL Connector reads a missing credentials file.

Pop the env var before calling configure_google_credentials()
in the post-restore reset hook so the credentials file is
rewritten on every container start.
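The failure mode and the fix can be reproduced in a small stdlib-only sketch. It assumes a writer that behaves as the commit message describes (skip the write when the variable is set); write_credentials, its arguments, and the RAW_JSON placeholder are hypothetical and are not the project's configure_google_credentials().

```python
import os
import tempfile

CREDENTIALS_ENV = "GOOGLE_APPLICATION_CREDENTIALS"
RAW_JSON = '{"type": "service_account"}'  # placeholder content only


def write_credentials(environ, raw_json):
    """Idempotent writer: skips the write when the env var is already set."""
    if environ.get(CREDENTIALS_ENV):
        return environ[CREDENTIALS_ENV]
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as handle:
        handle.write(raw_json)
    environ[CREDENTIALS_ENV] = path
    return path


def reset_post_snapshot_state(environ, raw_json):
    """Drop the stale pointer first so the writer cannot short-circuit."""
    environ.pop(CREDENTIALS_ENV, None)
    return write_credentials(environ, raw_json)


env = {}  # a plain dict stands in for os.environ
stale_path = write_credentials(env, RAW_JSON)
os.remove(stale_path)  # simulate /tmp not surviving the snapshot restore

# Without the pop, the writer short-circuits and the file stays missing.
unchanged_path = write_credentials(env, RAW_JSON)
missing_after_plain_call = not os.path.exists(unchanged_path)

# With the pop, a fresh file is written and the env var points at it.
fresh_path = reset_post_snapshot_state(env, RAW_JSON)
```

Popping the variable keeps the writer itself unchanged, so its short-circuit still avoids redundant writes on the per-request path.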

Addresses P1 review feedback on PR #1528.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hua7450 hua7450 marked this pull request as ready for review May 17, 2026 14:49
@hua7450 hua7450 merged commit f95c85c into main May 17, 2026
5 checks passed

Development

Successfully merging this pull request may close these issues.

Enable memory snapshots on Modal worker to fix ~45s cold starts