Bump Modal worker concurrency to 5 and cap autoscale at 100 containers #1533

Description

@hua7450

Context

After PR #1528 enabled memory snapshots, cold-start latency dropped from 50-105s to ~26s. Empirical burst testing (5/10/20/50/100 concurrent against production) confirmed the snapshot is working and surfaced two cheap optimizations we should ship together.

Problem 1: Each container only handles 1 request at a time

With allow_concurrent_inputs unset (Modal default = 1), each warm container serves exactly one request at a time, and any additional request is routed to another container. With min_containers=3 + buffer_containers=2 = 5 warm slots, any burst >5 hits cold containers (now ~26s with snapshot, was 50-105s before).

PolicyEngine calculations are CPU-bound (~3s on 1 core). With 5-way concurrency on the same core, each request takes up to ~15s wall-time when all five slots are busy. That trades a 5x capacity boost for up to a 5x per-request latency increase under full load, which is worth it because:

  • Capacity boost is free (no new containers, same idle cost)
  • Bursts up to 25-concurrent get fully absorbed in warm slots
  • 15s per-request is well within typical partner SLA
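
The arithmetic behind these numbers can be sketched as below. This is a simplified model, not a measurement: it assumes the five concurrent inputs share one core evenly, and the function names are hypothetical helpers introduced only for illustration.

```python
def warm_capacity(min_containers: int, buffer_containers: int, concurrent_inputs: int) -> int:
    """Requests that can be served without starting a cold container."""
    return (min_containers + buffer_containers) * concurrent_inputs


def worst_case_wall_time(cpu_seconds: float, concurrent_inputs: int) -> float:
    """Wall time per request when every slot on a single core is busy.

    Assumes CPU-bound work that is time-sliced evenly across requests.
    """
    return cpu_seconds * concurrent_inputs


# Current settings: 3 + 2 containers, 1 input each.
print(warm_capacity(3, 2, 1))
# Proposed settings: 3 + 2 containers, 5 inputs each.
print(warm_capacity(3, 2, 5))
# ~3s of CPU per request, 5 requests sharing one core.
print(worst_case_wall_time(3.0, 5))
```

A container handling fewer than five requests at once would finish each one faster than this worst case, so 15s is the upper bound under this model rather than the expected latency for light traffic.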

Problem 2: No max_containers cap

Currently max_containers is unset, so autoscale is bounded only by the workspace quota. A buggy partner client or runaway loop could scale to hundreds of containers and rack up unbounded cost. During our burst testing today, Modal scaled to 36 containers to absorb a 100-request burst, which is fine for tests but worth bounding.

A cap of 100 covers any realistic partner burst (100 containers x 5 concurrent inputs = 500 in-flight requests) with headroom.
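
As a rough check on the cap, the ceiling and the container count a given burst would need can be computed as follows. This is an idealized sketch that assumes every container is filled to its concurrency limit; the real autoscaler may behave differently, and both function names are hypothetical.

```python
import math


def max_in_flight(max_containers: int, concurrent_inputs: int) -> int:
    """Upper bound on simultaneously running requests under the cap."""
    return max_containers * concurrent_inputs


def containers_for_burst(burst_size: int, concurrent_inputs: int) -> int:
    """Containers needed if each one is packed to its concurrency limit."""
    return math.ceil(burst_size / concurrent_inputs)


print(max_in_flight(100, 5))
print(containers_for_burst(100, 5))
# A burst just above the ceiling would need more containers than the cap allows.
print(containers_for_burst(501, 5) > 100)
```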

Proposed change

Add to worker_function_options in worker_app.py:

```python
"allow_concurrent_inputs": 5,
"max_containers": 100,
```

Both apply to all environments (main, staging, testing) for consistency.
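
For illustration, one way the two keys could be layered onto an existing options dict is sketched below. The contents of `BASE_OPTIONS` and the helper `build_worker_function_options` are assumptions; only `min_containers=3`, `buffer_containers=2`, and the two new values come from this issue, and the real structure of `worker_function_options` in worker_app.py may differ.

```python
# Hypothetical baseline; the real dict in worker_app.py likely has more keys.
BASE_OPTIONS = {
    "min_containers": 3,
    "buffer_containers": 2,
}


def build_worker_function_options(base: dict) -> dict:
    """Return a copy of the base options with the two proposed settings added."""
    options = dict(base)  # copy so the baseline is left untouched
    options["allow_concurrent_inputs"] = 5
    options["max_containers"] = 100
    return options


# Same options for every environment, as proposed.
options_by_env = {
    env: build_worker_function_options(BASE_OPTIONS)
    for env in ("main", "staging", "testing")
}
worker_function_options = options_by_env["main"]
print(worker_function_options)
```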

Verification plan

After merge, redeploy via the existing automation, then rerun the same burst test (`~/modal_burst_test.py --sizes 5,10,20,50,100`) to confirm:

  • Bursts up to ~25 stay all-warm with p95 ≈ 15s
  • Bursts >25 still complete (cold tail at ~26s)
  • max_containers caps as expected if we deliberately push beyond 100
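
The first two checks could be scripted along these lines. This is a minimal sketch, not the contents of `modal_burst_test.py`: the helper names, the nearest-rank percentile method, and the 26s cold threshold are assumptions, and the sample latencies are made-up values rather than measurements.

```python
import math


def p95(latencies_s: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(latencies_s)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def count_cold(latencies_s: list[float], cold_threshold_s: float = 26.0) -> int:
    """Requests slow enough to have plausibly hit a cold container.

    The threshold is an assumption based on the ~26s cold-start figure.
    """
    return sum(1 for s in latencies_s if s >= cold_threshold_s)


# Made-up latencies for a 5-request burst, for illustration only.
sample = [14.2, 15.1, 14.8, 15.3, 14.9]
print(p95(sample))
print(count_cold(sample))
```

A burst of 25 or fewer would be expected to report zero cold requests under this classification, while larger bursts would show a nonzero count without failing.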

Out of scope

  • Tuning min_containers / buffer_containers — leave at 3/2 baseline until we observe real partner traffic with the new concurrency setting.
