Set Modal worker concurrency to 5 and cap autoscale at 100 by hua7450 · Pull Request #1534 · PolicyEngine/policyengine-household-api

hua7450 · 2026-05-17T16:33:05Z

Summary

Two-setting Modal config tweak that:

Enables allow_concurrent_inputs=5 — each warm container now processes up to 5 requests in parallel instead of 1. Multiplies effective warm-pool capacity 5x with no additional always-on container cost.
Caps autoscale at max_containers=100 — sanity ceiling against runaway scaling. Today the cap is unbounded (workspace quota only), so a buggy client or traffic spike could rack up unbounded cost.

Both settings apply to every Modal environment (main, staging, testing) so concurrency issues surface in staging before production.

Why now

PR #1528 (memory snapshots) just shipped and dropped cold-start latency from 50-105s to ~26s. Burst testing in production today (~/modal_burst_test.py against the live gateway) confirmed the snapshot is working and surfaced these two cheap optimizations.

Why concurrent_inputs=5 (and not 1 or 10)

PolicyEngine calculations are CPU-bound (~3s on 1 core). With N requests sharing the same core, each takes roughly N × 3s wall-time:

`concurrent_inputs`	Per-request latency	Slots per container
1	~3s	1
5	~15s	5
8	~24s	8

5 is the sweet spot: 5x throughput per container, per-request latency stays within typical partner SLAs (<20s), 5 PolicyEngine Simulation objects fit comfortably in the 9.6 GiB container memory.

Capacity before vs after (current `min_containers=3, buffer_containers=2`)

	Before this PR	After this PR
Warm slots ready	5 (1 each on 5 containers)	25 (5 each on 5 containers)
Burst absorbed without cold start	5	25
Per-request latency when slots full	~3s	~15s
Always-on container cost	unchanged	unchanged

Why max_containers=100

During the burst test today, Modal scaled to 36 containers absorbing a 100-burst — fine for tests, but it demonstrated that with no cap, runaway behavior is possible.

100 covers any realistic burst we expect: 100 concurrent × 5 inputs/container = 500 in-flight, all hitting at once. Plus headroom for traffic-shape variance.

Files changed

policyengine_household_api/modal_release/worker_app.py — added allow_concurrent_inputs: 5 and max_containers: 100 to the base options dict (applies to all environments).
tests/unit/modal_release/test_worker_app.py — two new tests asserting both settings are present in main, staging, testing.
changelog.d/1533.changed.md — Towncrier fragment.

Out of scope

Bumping min_containers / buffer_containers. The Layer 1 change here gives 5x capacity for free; we want to see real partner traffic with the new concurrency setting before deciding whether to also pay for more always-on containers (Bump Modal worker concurrency to 5 and cap autoscale at 100 containers #1533 discussion).
Per-CPU-core tuning. Default Modal allocation is 1 core/container, which is what these latency numbers assume.

Verification plan after merge (staging-first via existing automation)

CI auto-deploys to Modal staging.
Integration tests against staging pass (they cover current + frontier × channel + exact paths — same as PR Enable Modal memory snapshots on household API worker #1528).
Re-run `~/modal_burst_test.py --sizes 5,10,20,25,30,50,100` against staging gateway:
- Expect burst 5-25 to be all warm with p95 around 15s
- Expect burst >25 to have a cold tail but each cold still ~26s (snapshot still works)
If staging looks good, the same workflow promotes to production.
Re-run the same burst test against production for confirmation.

Risks and mitigations

Flask thread-safety: audited in detail before this PR. Per-request state (test_client per call, Simulation per call, household copy per call) is properly isolated. Shared state (TBS, SQLAlchemy session, Limiter, ResourceProtector) is either read-only or thread-safe by design. SQLAlchemy default pool of 5+10 overflow handles 5 concurrent DB writes easily. See discussion in Bump Modal worker concurrency to 5 and cap autoscale at 100 containers #1533.
Per-request latency increase under load: documented (~3s → ~15s when slots fully saturated). This is a deliberate trade for 5x capacity.
Memory contention: 5 simultaneous PolicyEngine Simulation objects per container. Each ~50-200 MB working state → ~1 GiB peak. Well within 9.6 GiB container budget.
If anything goes wrong: trivial rollback (delete 2 lines from worker_app.py).

Test plan

New unit tests pass locally
make format-check clean
CI passes on this PR
Staging integration tests pass on this PR
Burst test against staging shows expected behavior (p95 ~15s for burst≤25)
Burst test against production after promotion confirms same

hua7450 · 2026-05-17T16:57:39Z

CI failure is unrelated to this PR — upstream package incompatibility.

The Test job fails during conftest import, before any of this PR's test code runs:

ValueError: Variable "afcs" mixes computation modes: adds/subtracts and uprating.
Variables must use at most one of formula, adds/subtracts, or uprating;
plain input or constant variables should use none.

Root cause:

pyproject.toml pins policyengine_uk==2.31.0 (whose gov/dwp/afcs.py mixes adds/subtracts with uprating) but pins policyengine-core>=3.26.5 with no upper bound.
A newer policyengine-core released between PR Enable Modal memory snapshots on household API worker #1528's CI run (passed at 14:37 UTC today) and this PR's run (failed at 16:33 UTC) tightened variable validation to reject this combination.
The country package now refuses to load, so from policyengine_household_api.api import app crashes during conftest.

Nothing in this PR's diff touches policyengine_uk, policyengine-core, or any variable definitions — it only changes Modal worker concurrency options and adds tests for them.

Suggested separate fix (not in scope for this PR):

Cap policyengine-core to the last compatible release in pyproject.toml, OR
Bump policyengine_uk to a version whose afcs.py no longer mixes modes, OR
Fix afcs.py upstream in policyengine_uk.

This PR should be rebased on whichever fix lands first.

Each container now processes up to 5 requests in parallel (allow_concurrent_inputs=5), multiplying effective warm-pool capacity 5x without any additional always-on container cost. PolicyEngine work is CPU-bound (~3s on 1 core), so 5-way concurrency on a single core makes each request take ~15s wall-time when fully saturated. Throughput goes up 5x; per- request latency goes up proportionally. Fair trade for the ~25-concurrent burst absorption it buys. Add max_containers=100 as a sanity cap. Today autoscale is bounded only by the workspace quota, so a buggy partner or runaway loop could scale to hundreds of containers and rack up unbounded cost. 100 covers any realistic partner burst (100 concurrent x 5 inputs = 500 in-flight) with headroom. Both settings apply to all environments so staging behavior mirrors production and concurrency issues surface there. Closes #1533 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Migrate the Modal worker from the legacy `allow_concurrent_inputs=5` kwarg to the current `@modal.concurrent(max_inputs=5, target_inputs=4)` decorator form documented for Modal 1.3.x. `target_inputs=4` tells the autoscaler to aim for 80% steady-state utilisation so each container retains one free slot to absorb a single-request spike without waiting on a cold start, while containers still burst up to `max_inputs=5` under load before queueing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

hua7450 marked this pull request as ready for review May 17, 2026 16:58

anth-volk force-pushed the add-modal-concurrent-inputs-and-max-cap branch from 3615278 to be89367 Compare May 18, 2026 14:14

hua7450 and others added 2 commits May 18, 2026 23:04

anth-volk force-pushed the add-modal-concurrent-inputs-and-max-cap branch from be89367 to e059c85 Compare May 18, 2026 21:07

hua7450 merged commit da2bfcc into main May 18, 2026
5 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Set Modal worker concurrency to 5 and cap autoscale at 100#1534

Set Modal worker concurrency to 5 and cap autoscale at 100#1534
hua7450 merged 2 commits into
mainfrom
add-modal-concurrent-inputs-and-max-cap

hua7450 commented May 17, 2026 •

edited by anth-volk

Loading

Uh oh!

hua7450 commented May 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hua7450 commented May 17, 2026 • edited by anth-volk Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why now

Why concurrent_inputs=5 (and not 1 or 10)

Capacity before vs after (current min_containers=3, buffer_containers=2)

Why max_containers=100

Files changed

Out of scope

Verification plan after merge (staging-first via existing automation)

Risks and mitigations

Test plan

Uh oh!

hua7450 commented May 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hua7450 commented May 17, 2026 •

edited by anth-volk

Loading

Capacity before vs after (current `min_containers=3, buffer_containers=2`)