Skip to content

Set Modal worker concurrency to 5 and cap autoscale at 100#1534

Merged
hua7450 merged 2 commits into
mainfrom
add-modal-concurrent-inputs-and-max-cap
May 18, 2026
Merged

Set Modal worker concurrency to 5 and cap autoscale at 100#1534
hua7450 merged 2 commits into
mainfrom
add-modal-concurrent-inputs-and-max-cap

Conversation

@hua7450
Copy link
Copy Markdown
Collaborator

@hua7450 hua7450 commented May 17, 2026

Closes #1533

Summary

Two-setting Modal config tweak that:

  1. Enables allow_concurrent_inputs=5 — each warm container now processes up to 5 requests in parallel instead of 1. Multiplies effective warm-pool capacity 5x with no additional always-on container cost.
  2. Caps autoscale at max_containers=100 — sanity ceiling against runaway scaling. Today the cap is unbounded (workspace quota only), so a buggy client or traffic spike could rack up unbounded cost.

Both settings apply to every Modal environment (main, staging, testing) so concurrency issues surface in staging before production.

Why now

PR #1528 (memory snapshots) just shipped and dropped cold-start latency from 50-105s to ~26s. Burst testing in production today (~/modal_burst_test.py against the live gateway) confirmed the snapshot is working and surfaced these two cheap optimizations.

Why concurrent_inputs=5 (and not 1 or 10)

PolicyEngine calculations are CPU-bound (~3s on 1 core). With N requests sharing the same core, each takes roughly N × 3s wall-time:

concurrent_inputs Per-request latency Slots per container
1 ~3s 1
5 ~15s 5
8 ~24s 8

5 is the sweet spot: 5x throughput per container, per-request latency stays within typical partner SLAs (<20s), 5 PolicyEngine Simulation objects fit comfortably in the 9.6 GiB container memory.

Capacity before vs after (current min_containers=3, buffer_containers=2)

Before this PR After this PR
Warm slots ready 5 (1 each on 5 containers) 25 (5 each on 5 containers)
Burst absorbed without cold start 5 25
Per-request latency when slots full ~3s ~15s
Always-on container cost unchanged unchanged

Why max_containers=100

During the burst test today, Modal scaled to 36 containers absorbing a 100-burst — fine for tests, but it demonstrated that with no cap, runaway behavior is possible.

100 covers any realistic burst we expect: 100 concurrent × 5 inputs/container = 500 in-flight, all hitting at once. Plus headroom for traffic-shape variance.

Files changed

  • policyengine_household_api/modal_release/worker_app.py — added allow_concurrent_inputs: 5 and max_containers: 100 to the base options dict (applies to all environments).
  • tests/unit/modal_release/test_worker_app.py — two new tests asserting both settings are present in main, staging, testing.
  • changelog.d/1533.changed.md — Towncrier fragment.

Out of scope

  • Bumping min_containers / buffer_containers. The Layer 1 change here gives 5x capacity for free; we want to see real partner traffic with the new concurrency setting before deciding whether to also pay for more always-on containers (Bump Modal worker concurrency to 5 and cap autoscale at 100 containers #1533 discussion).
  • Per-CPU-core tuning. Default Modal allocation is 1 core/container, which is what these latency numbers assume.

Verification plan after merge (staging-first via existing automation)

  1. CI auto-deploys to Modal staging.
  2. Integration tests against staging pass (they cover current + frontier × channel + exact paths — same as PR Enable Modal memory snapshots on household API worker #1528).
  3. Re-run `~/modal_burst_test.py --sizes 5,10,20,25,30,50,100` against staging gateway:
    • Expect burst 5-25 to be all warm with p95 around 15s
    • Expect burst >25 to have a cold tail but each cold still ~26s (snapshot still works)
  4. If staging looks good, the same workflow promotes to production.
  5. Re-run the same burst test against production for confirmation.

Risks and mitigations

  • Flask thread-safety: audited in detail before this PR. Per-request state (test_client per call, Simulation per call, household copy per call) is properly isolated. Shared state (TBS, SQLAlchemy session, Limiter, ResourceProtector) is either read-only or thread-safe by design. SQLAlchemy default pool of 5+10 overflow handles 5 concurrent DB writes easily. See discussion in Bump Modal worker concurrency to 5 and cap autoscale at 100 containers #1533.
  • Per-request latency increase under load: documented (~3s → ~15s when slots fully saturated). This is a deliberate trade for 5x capacity.
  • Memory contention: 5 simultaneous PolicyEngine Simulation objects per container. Each ~50-200 MB working state → ~1 GiB peak. Well within 9.6 GiB container budget.
  • If anything goes wrong: trivial rollback (delete 2 lines from worker_app.py).

Test plan

  • New unit tests pass locally
  • make format-check clean
  • CI passes on this PR
  • Staging integration tests pass on this PR
  • Burst test against staging shows expected behavior (p95 ~15s for burst≤25)
  • Burst test against production after promotion confirms same

@hua7450
Copy link
Copy Markdown
Collaborator Author

hua7450 commented May 17, 2026

CI failure is unrelated to this PR — upstream package incompatibility.

The Test job fails during conftest import, before any of this PR's test code runs:

ValueError: Variable "afcs" mixes computation modes: adds/subtracts and uprating.
Variables must use at most one of formula, adds/subtracts, or uprating;
plain input or constant variables should use none.

Root cause:

  • pyproject.toml pins policyengine_uk==2.31.0 (whose gov/dwp/afcs.py mixes adds/subtracts with uprating) but pins policyengine-core>=3.26.5 with no upper bound.
  • A newer policyengine-core released between PR Enable Modal memory snapshots on household API worker #1528's CI run (passed at 14:37 UTC today) and this PR's run (failed at 16:33 UTC) tightened variable validation to reject this combination.
  • The country package now refuses to load, so from policyengine_household_api.api import app crashes during conftest.

Nothing in this PR's diff touches policyengine_uk, policyengine-core, or any variable definitions — it only changes Modal worker concurrency options and adds tests for them.

Suggested separate fix (not in scope for this PR):

  1. Cap policyengine-core to the last compatible release in pyproject.toml, OR
  2. Bump policyengine_uk to a version whose afcs.py no longer mixes modes, OR
  3. Fix afcs.py upstream in policyengine_uk.

This PR should be rebased on whichever fix lands first.

@hua7450 hua7450 marked this pull request as ready for review May 17, 2026 16:58
@anth-volk anth-volk force-pushed the add-modal-concurrent-inputs-and-max-cap branch from 3615278 to be89367 Compare May 18, 2026 14:14
hua7450 and others added 2 commits May 18, 2026 23:04
Each container now processes up to 5 requests in parallel
(allow_concurrent_inputs=5), multiplying effective warm-pool
capacity 5x without any additional always-on container cost.
PolicyEngine work is CPU-bound (~3s on 1 core), so 5-way
concurrency on a single core makes each request take ~15s
wall-time when fully saturated. Throughput goes up 5x; per-
request latency goes up proportionally. Fair trade for the
~25-concurrent burst absorption it buys.

Add max_containers=100 as a sanity cap. Today autoscale is
bounded only by the workspace quota, so a buggy partner or
runaway loop could scale to hundreds of containers and rack
up unbounded cost. 100 covers any realistic partner burst
(100 concurrent x 5 inputs = 500 in-flight) with headroom.

Both settings apply to all environments so staging behavior
mirrors production and concurrency issues surface there.

Closes #1533

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrate the Modal worker from the legacy `allow_concurrent_inputs=5`
kwarg to the current `@modal.concurrent(max_inputs=5, target_inputs=4)`
decorator form documented for Modal 1.3.x.

`target_inputs=4` tells the autoscaler to aim for 80% steady-state
utilisation so each container retains one free slot to absorb a
single-request spike without waiting on a cold start, while containers
still burst up to `max_inputs=5` under load before queueing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anth-volk anth-volk force-pushed the add-modal-concurrent-inputs-and-max-cap branch from be89367 to e059c85 Compare May 18, 2026 21:07
@hua7450 hua7450 merged commit da2bfcc into main May 18, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bump Modal worker concurrency to 5 and cap autoscale at 100 containers

1 participant