Bump Modal worker concurrency to 5 and cap autoscale at 100 containers #1533

## Context

After PR #1528 enabled memory snapshots, cold-start latency dropped from 50-105s to ~26s. Empirical burst testing (5/10/20/50/100 concurrent against production) confirmed the snapshot is working and surfaced two cheap optimizations we should ship together.

## Problem 1: Each container only handles 1 request at a time

With `allow_concurrent_inputs` unset (Modal default = 1), each warm container serves only one request at a time, so every additional concurrent request has to be routed to another container. With `min_containers=3` plus `buffer_containers=2` giving 5 warm slots, any burst >5 hits cold containers (now ~26s with snapshot, was 50-105s before).

PolicyEngine calculations are CPU-bound (~3s on 1 core). With 5-way concurrency on the same core, each request takes up to ~15s wall-time when all five slots are busy. That trades a 5x per-request latency increase under full load for a 5x capacity boost, which is worth it because:
- Capacity boost is free (no new containers, same idle cost)
- Bursts of up to 25 concurrent requests (5 warm containers x 5 inputs) are fully absorbed by warm slots
- 15s per-request is well within typical partner SLA
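
The arithmetic behind these numbers can be sketched as follows. This is a back-of-the-envelope model, not a measurement: it assumes the 5 concurrent inputs share one core evenly, and the function names are made up for illustration.

```python
def warm_capacity(min_containers: int, buffer_containers: int, concurrent_inputs: int) -> int:
    """Requests that can be in flight at once without needing a cold container."""
    return (min_containers + buffer_containers) * concurrent_inputs


def worst_case_latency(cpu_seconds: float, concurrent_inputs: int) -> float:
    """Wall-time per request when every slot on the core is busy (even CPU sharing assumed)."""
    return cpu_seconds * concurrent_inputs


print(warm_capacity(3, 2, 1))       # today: 5 warm slots
print(warm_capacity(3, 2, 5))       # proposed: 25 warm slots
print(worst_case_latency(3.0, 5))   # 15.0 seconds under full load
```

A request that lands on a container with free slots would finish faster than the worst case, so ~15s is best read as an upper bound under this model.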

## Problem 2: No max_containers cap

Currently `max_containers` is unset, so autoscale is bounded only by the workspace quota. A buggy partner client or runaway loop could scale to hundreds of containers and rack up unbounded cost. During our burst testing today, Modal scaled to 36 containers to absorb a 100-request burst, which is fine for tests but worth bounding.

A cap of 100 covers any realistic partner burst (100 containers x 5 inputs = 500 in-flight) with headroom.
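
As a rough illustration of what the cap implies, the sketch below computes the in-flight ceiling and how many requests in a burst would exceed it. It assumes perfect packing of inputs onto containers, which the real autoscaler may not achieve; the helper names are hypothetical.

```python
import math


def in_flight_ceiling(max_containers: int, concurrent_inputs: int) -> int:
    """Most requests that can run at the same time once the cap is reached."""
    return max_containers * concurrent_inputs


def containers_needed(burst: int, concurrent_inputs: int, max_containers: int) -> int:
    """Idealized container count for a burst, assuming perfect packing, capped at the limit."""
    return min(math.ceil(burst / concurrent_inputs), max_containers)


def overflow(burst: int, max_containers: int, concurrent_inputs: int) -> int:
    """Requests in a burst that cannot be in flight at once under the cap."""
    return max(0, burst - in_flight_ceiling(max_containers, concurrent_inputs))


print(in_flight_ceiling(100, 5))        # 500
print(containers_needed(100, 5, 100))   # 20
print(overflow(600, 100, 5))            # 100
```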

## Proposed change

Add to `worker_function_options` in `worker_app.py`:

```python
"allow_concurrent_inputs": 5,
"max_containers": 100,
```

Both apply to all environments (main, staging, testing) for consistency.
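
For orientation, the resulting options might look like the sketch below. Only the two new keys and the 3/2 warm baseline come from this issue; the rest of the real dict in `worker_app.py` is not shown here, and the `check_options` helper is a hypothetical sanity check, not an existing function or a Modal requirement.

```python
# Hypothetical view of worker_function_options after the change.
# The real dict likely contains other keys that this issue does not list.
worker_function_options = {
    "min_containers": 3,            # existing baseline
    "buffer_containers": 2,         # existing baseline
    "allow_concurrent_inputs": 5,   # new
    "max_containers": 100,          # new
}


def check_options(options: dict) -> None:
    """Illustrative guard: the cap should not sit below the warm baseline."""
    warm = options["min_containers"] + options["buffer_containers"]
    if options["max_containers"] < warm:
        raise ValueError("max_containers is below min_containers + buffer_containers")
    if options["allow_concurrent_inputs"] < 1:
        raise ValueError("allow_concurrent_inputs must be at least 1")


check_options(worker_function_options)
```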

## Verification plan

After merge, redeploy via the existing automation, then rerun the same burst test (`~/modal_burst_test.py --sizes 5,10,20,50,100`) to confirm:
- Bursts up to ~25 stay all-warm with p95 ≈ 15s
- Bursts >25 still complete (cold tail at ~26s)
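- `max_containers` caps as expected if we deliberately push beyond 100 containers; with 5 inputs per container that needs a burst larger than 500 concurrent requests, well above the largest size in the run above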
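
One way to check these criteria against a burst's per-request latencies is sketched below. This is not the actual `modal_burst_test.py`; the function names, the nearest-rank p95, and the 20s threshold separating warm (~15s) from cold (~26s) requests are all assumptions, and the sample latencies are synthetic.

```python
import math


def p95(latencies: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize_burst(latencies: list[float], cold_threshold_s: float = 20.0) -> dict:
    """Count requests at or above the assumed cold threshold and report p95."""
    cold = sum(1 for seconds in latencies if seconds >= cold_threshold_s)
    return {
        "n": len(latencies),
        "p95_s": p95(latencies),
        "cold": cold,
        "all_warm": cold == 0,
    }


# Synthetic data for illustration only, not measured results.
warm_burst = [15.0] * 25
mixed_burst = [15.0] * 25 + [26.0] * 5
print(summarize_burst(warm_burst))
print(summarize_burst(mixed_burst))
```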

## Out of scope

- Tuning `min_containers` / `buffer_containers`: leave at the 3/2 baseline until we observe real partner traffic with the new concurrency setting.
