Set Modal worker concurrency to 5 and cap autoscale at 100 #1534
Merged
CI failure is unrelated to this PR — upstream package incompatibility. The Root cause:
Nothing in this PR's diff touches Suggested separate fix (not in scope for this PR):
This PR should be rebased on whichever fix lands first.
Each container now processes up to 5 requests in parallel (allow_concurrent_inputs=5), multiplying effective warm-pool capacity 5x without any additional always-on container cost.

PolicyEngine work is CPU-bound (~3s on 1 core), so 5-way concurrency on a single core makes each request take ~15s wall-time when fully saturated. Concurrent capacity per container goes up 5x; per-request latency goes up proportionally. Fair trade for the ~25-concurrent burst absorption it buys.

Add max_containers=100 as a sanity cap. Today autoscale is bounded only by the workspace quota, so a buggy partner or runaway loop could scale to hundreds of containers and rack up unbounded cost. 100 covers any realistic partner burst (100 containers x 5 inputs = 500 in-flight) with headroom.

Both settings apply to all environments so staging behavior mirrors production and concurrency issues surface there.

Closes #1533

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
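The ~25-concurrent figure appears to follow from the warm-pool size quoted in the PR summary (min_containers=3, buffer_containers=2) multiplied by the per-container concurrency. A minimal sketch of that arithmetic, assuming every min and buffer container is warm at the same moment; `warm_slots` is a hypothetical helper for illustration, not part of the codebase:

```python
def warm_slots(min_containers: int, buffer_containers: int, inputs_per_container: int) -> int:
    """Requests the warm pool can accept at once without a cold start.

    Assumption: all min and buffer containers are warm simultaneously.
    """
    return (min_containers + buffer_containers) * inputs_per_container


# Values taken from the PR text: 3 min containers, 2 buffer containers.
before = warm_slots(3, 2, inputs_per_container=1)
after = warm_slots(3, 2, inputs_per_container=5)
print(f"warm slots before: {before}, after: {after}")
```

Under that assumption the same five containers go from 5 to 25 simultaneously accepted requests, with no change in always-on cost.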
Migrate the Modal worker from the legacy `allow_concurrent_inputs=5` kwarg to the current `@modal.concurrent(max_inputs=5, target_inputs=4)` decorator form documented for Modal 1.3.x.

`target_inputs=4` tells the autoscaler to aim for 80% steady-state utilisation so each container retains one free slot to absorb a single-request spike without waiting on a cold start, while containers still burst up to `max_inputs=5` under load before queueing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
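The 80% figure and the single spare slot follow directly from the two decorator arguments. A small sketch of that arithmetic in plain Python (no Modal API calls; the helper names are hypothetical):

```python
MAX_INPUTS = 5     # hard per-container ceiling (max_inputs in the commit message)
TARGET_INPUTS = 4  # level the autoscaler aims for (target_inputs in the commit message)


def target_utilisation(target_inputs: int, max_inputs: int) -> float:
    """Fraction of a container's slots in use at the autoscaler's target."""
    return target_inputs / max_inputs


def spare_slots(target_inputs: int, max_inputs: int) -> int:
    """Slots left free at the target level to absorb a spike."""
    return max_inputs - target_inputs


print(f"target utilisation: {target_utilisation(TARGET_INPUTS, MAX_INPUTS):.0%}")
print(f"spare slots per container: {spare_slots(TARGET_INPUTS, MAX_INPUTS)}")
```

Setting `target_inputs` equal to `max_inputs` would leave no spare slot in this model, which is the trade-off the commit message describes.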
Closes #1533
Summary
Two-setting Modal config tweak:
- `allow_concurrent_inputs=5`: each warm container now processes up to 5 requests in parallel instead of 1. This multiplies effective warm-pool capacity 5x with no additional always-on container cost.
- `max_containers=100`: sanity ceiling against runaway scaling. Today autoscaling is bounded only by the workspace quota, so a buggy client or traffic spike could rack up unbounded cost.

Both settings apply to every Modal environment (`main`, `staging`, `testing`) so concurrency issues surface in staging before production.

Why now
PR #1528 (memory snapshots) just shipped and dropped cold-start latency from 50-105s to ~26s. Burst testing in production today (`~/modal_burst_test.py` against the live gateway) confirmed the snapshot is working and surfaced these two cheap optimizations.

Why concurrent_inputs=5 (and not 1 or 10)
PolicyEngine calculations are CPU-bound (~3s on 1 core). With N requests sharing the same core, each takes roughly N × 3s wall-time.
`concurrent_inputs=5` is the sweet spot: concurrent capacity per container goes up 5x, per-request latency stays within typical partner SLAs (<20s), and 5 PolicyEngine Simulation objects fit comfortably in the 9.6 GiB container memory.
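The N × 3s rule of thumb can be tabulated for the three candidate settings. This is an idealized sketch that assumes the core is shared evenly and ignores scheduling overhead; `wall_time_s` and the constant names are hypothetical:

```python
BASE_SECONDS = 3.0      # single-request CPU time on one core, from the PR text
SLA_SECONDS = 20.0      # "typical partner SLA" bound quoted in the PR text


def wall_time_s(concurrent_inputs: int, base_seconds: float = BASE_SECONDS) -> float:
    """Approximate per-request wall-time when N requests share one core."""
    return concurrent_inputs * base_seconds


for n in (1, 5, 10):
    t = wall_time_s(n)
    verdict = "within" if t < SLA_SECONDS else "over"
    print(f"concurrent_inputs={n:>2}: ~{t:.0f}s per request ({verdict} the <20s SLA)")
```

In this model 5 is the largest of the three candidates that stays under the 20s bound when the container is saturated; 10 would roughly double the worst-case latency.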

Capacity before vs after (current `min_containers=3`, `buffer_containers=2`)

Why max_containers=100
During the burst test today, Modal scaled to 36 containers to absorb a 100-request burst. That is fine for a test, but it demonstrated that with no cap, runaway scaling is possible.
100 covers any realistic burst we expect: 100 containers × 5 inputs/container = 500 in-flight requests, all hitting at once, plus headroom for traffic-shape variance.
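The ceiling implied by the cap is a single multiplication. A brief sketch using only the numbers quoted above (the function and constant names are hypothetical):

```python
MAX_CONTAINERS = 100            # proposed autoscale cap
INPUTS_PER_CONTAINER = 5        # per-container concurrency
OBSERVED_BURST_CONTAINERS = 36  # peak container count seen in the burst test


def in_flight_ceiling(max_containers: int, inputs_per_container: int) -> int:
    """Upper bound on simultaneously processed requests at the container cap."""
    return max_containers * inputs_per_container


ceiling = in_flight_ceiling(MAX_CONTAINERS, INPUTS_PER_CONTAINER)
print(f"in-flight ceiling at the cap: {ceiling}")
print(f"burst-test peak of {OBSERVED_BURST_CONTAINERS} containers is below the cap: "
      f"{OBSERVED_BURST_CONTAINERS < MAX_CONTAINERS}")
```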
Files changed
- `policyengine_household_api/modal_release/worker_app.py`: added `allow_concurrent_inputs: 5` and `max_containers: 100` to the base options dict (applies to all environments).
- `tests/unit/modal_release/test_worker_app.py`: two new tests asserting both settings are present in `main`, `staging`, `testing`.
- `changelog.d/1533.changed.md`: Towncrier fragment.

Out of scope
- `min_containers`/`buffer_containers`. The Layer 1 change here gives 5x capacity for free; we want to see real partner traffic with the new concurrency setting before deciding whether to also pay for more always-on containers (#1533 discussion).

Verification plan after merge (staging-first via existing automation)
Risks and mitigations
worker_app.py).

Test plan
- `make format-check` clean
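The two unit tests listed under Files changed could look roughly like the sketch below. It mirrors the description as written (a shared base options dict applied to every environment); `BASE_OPTIONS` and `options_for` are hypothetical stand-ins, not the actual names in `worker_app.py`:

```python
ENVIRONMENTS = ("main", "staging", "testing")

# Hypothetical stand-in for the base options dict described in the PR.
BASE_OPTIONS = {"allow_concurrent_inputs": 5, "max_containers": 100}


def options_for(environment: str) -> dict:
    """Hypothetical helper: every environment starts from the shared base options."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return dict(BASE_OPTIONS)


def test_concurrency_set_in_every_environment():
    for env in ENVIRONMENTS:
        assert options_for(env)["allow_concurrent_inputs"] == 5


def test_max_containers_set_in_every_environment():
    for env in ENVIRONMENTS:
        assert options_for(env)["max_containers"] == 100


# Run directly for a quick check; under pytest the two test_ functions are collected as usual.
test_concurrency_set_in_every_environment()
test_max_containers_set_in_every_environment()
print("both settings present in", ", ".join(ENVIRONMENTS))
```

Asserting per environment, rather than once on the base dict, is what would catch an environment-specific override silently dropping either setting.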