Context
After PR #1528 enabled memory snapshots, cold-start latency dropped from 50-105s to ~26s. Empirical burst testing (5/10/20/50/100 concurrent against production) confirmed the snapshot is working and surfaced two cheap optimizations we should ship together.
Problem 1: Each container only handles 1 request at a time
With allow_concurrent_inputs unset (Modal default = 1), each warm container serves only one request at a time, so every additional concurrent request is routed to another container. With min_containers=3 + buffer_containers=2 = 5 warm slots, any burst >5 hits cold containers (now ~26s with snapshot, was 50-105s before).
PolicyEngine calculations are CPU-bound (~3s on 1 core). With 5-way concurrency on the same core, each request takes up to ~15s wall-time when all five slots are busy. That trades a 5x capacity boost for up to a 5x per-request latency increase, which is worth it because:
- Capacity boost is free (no new containers, same idle cost)
- Bursts up to 25-concurrent get fully absorbed in warm slots
- 15s per-request is well within typical partner SLA
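The arithmetic behind those numbers, as a rough sketch. It assumes the CPU-bound work is split evenly across concurrent inputs on one core, so it gives a worst case rather than a measurement; the function names are made up for illustration and are not part of worker_app.py.

```python
def warm_capacity(min_containers: int, buffer_containers: int, concurrent_inputs: int) -> int:
    """Requests that can be served without waiting on a cold start."""
    return (min_containers + buffer_containers) * concurrent_inputs


def worst_case_latency_s(cpu_seconds: float, concurrent_inputs: int) -> float:
    """Wall-time per request if every slot on the single core is busy.

    Assumption: CPU time is shared evenly, so N concurrent requests each
    take about N times as long as one request running alone.
    """
    return cpu_seconds * concurrent_inputs


if __name__ == "__main__":
    print("warm capacity today:", warm_capacity(3, 2, 1))
    print("warm capacity with 5 inputs:", warm_capacity(3, 2, 5))
    print("worst-case latency (s):", worst_case_latency_s(3.0, 5))
```

Requests that arrive when fewer than five inputs are in flight on a container should finish faster than the worst case.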
Problem 2: No max_containers cap
Currently max_containers is unset, so autoscale is bounded only by the workspace quota. A buggy partner client or runaway loop could scale to hundreds of containers and rack up unbounded cost. During our burst testing today, Modal scaled to 36 containers to absorb a 100-request burst, which is fine for tests but worth bounding.
A cap of 100 covers any realistic partner burst with headroom (100 containers x 5 concurrent inputs = 500 requests in flight).
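To sanity-check the cap, here is a small sketch of the ceiling it implies and of how many containers a burst would need. It assumes an idealized autoscaler that fills every container to its concurrency limit before starting another, which Modal may not do exactly; the helper names are hypothetical.

```python
import math


def max_in_flight(max_containers: int, concurrent_inputs: int) -> int:
    """Upper bound on simultaneously running requests under the cap."""
    return max_containers * concurrent_inputs


def containers_needed(burst_size: int, concurrent_inputs: int) -> int:
    """Containers required if each one is packed to its concurrency limit."""
    return math.ceil(burst_size / concurrent_inputs)


if __name__ == "__main__":
    print("in-flight ceiling:", max_in_flight(100, 5))
    for burst in (5, 10, 20, 50, 100):
        print(f"burst of {burst}: {containers_needed(burst, 5)} containers")
```

Under that assumption even the largest tested burst stays well below the cap, so the limit should only bite on runaway traffic.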
Proposed change
Add to worker_function_options in worker_app.py:
```python
"allow_concurrent_inputs": 5,
"max_containers": 100,
```
Both apply to all environments (main, staging, testing) for consistency.
Verification plan
After merge, redeploy via the existing automation, then rerun the same burst test (`~/modal_burst_test.py --sizes 5,10,20,50,100`) to confirm:
- Bursts up to ~25 stay all-warm with p95 ≈ 15s
- Bursts >25 still complete (cold tail at ~26s)
- Container count stops at max_containers=100 if we deliberately push past it (on the order of 500+ concurrent requests at 5 inputs per container)
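One way to check the first two bullets against the burst-test output, sketched under assumptions: it takes a plain list of per-request latencies in seconds (the actual output format of modal_burst_test.py is not described here), uses a nearest-rank p95, and treats anything under a hypothetical 20s threshold as warm.

```python
def p95(latencies_s):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_s)
    # Integer ceiling of 0.95 * n, avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def looks_all_warm(latencies_s, cold_threshold_s=20.0):
    """True if no request was slow enough to suggest a cold start.

    The 20s default is an assumed midpoint between the ~15s warm worst
    case and the ~26s cold start; tune it against real results.
    """
    return max(latencies_s) < cold_threshold_s


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured results.
    sample = [3.1, 6.0, 9.2, 12.4, 14.8]
    print("p95:", p95(sample), "all warm:", looks_all_warm(sample))
```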
Out of scope
- Tuning min_containers / buffer_containers — leave at 3/2 baseline until we observe real partner traffic with the new concurrency setting.