docs: add max_concurrency to SDK reference and README (#317)
Add documentation for the max_concurrency parameter introduced in PR #301:
- Add max_concurrency to Endpoint constructor signature and parameter table
- Add validation rule (must be >= 1)
- Add usage example with async QB handler and behavior notes
- Add concurrency control bullet to README feature list
|`template`|`PodTemplate`|`None`| Pod template overrides (e.g., `PodTemplate(containerDiskInGb=100)`). |
|`min_cuda_version`|`str`|`None`| Minimum CUDA version for GPU host selection. GPU endpoints default to `"12.8"` when not set. Has no effect on CPU endpoints. |
|`max_concurrency`|`int`|`1`| Max concurrent jobs per worker (QB endpoints only). Values greater than 1 achieve true concurrency only with async handlers; with a sync handler, a warning is logged at build time. Ignored on LB endpoints. |

**Validation rules:**
- `gpu` and `cpu` are mutually exclusive
- `id` and `image` are mutually exclusive
- `name` or `id` is required
- `workers` rejects negative values and `min > max`
- `max_concurrency` must be >= 1
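
As a rough illustration, the rules above could be checked along the following lines. This is a standalone sketch, not the SDK's implementation: the function name `validate_endpoint_args`, the error messages, and the representation of `workers` as a `(min, max)` tuple are assumptions made for the example.

```python
from typing import Optional, Tuple


def validate_endpoint_args(
    name: Optional[str] = None,
    id: Optional[str] = None,
    image: Optional[str] = None,
    gpu: Optional[object] = None,
    cpu: Optional[object] = None,
    workers: Optional[Tuple[int, int]] = None,  # assumed (min, max) form
    max_concurrency: int = 1,
) -> None:
    """Hypothetical checker mirroring the documented rules; raises ValueError."""
    if gpu is not None and cpu is not None:
        raise ValueError("gpu and cpu are mutually exclusive")
    if id is not None and image is not None:
        raise ValueError("id and image are mutually exclusive")
    if name is None and id is None:
        raise ValueError("name or id is required")
    if workers is not None:
        lo, hi = workers
        if lo < 0 or hi < 0:
            raise ValueError("workers must not be negative")
        if lo > hi:
            raise ValueError("workers min must not exceed max")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")


# A valid combination passes silently; an invalid one raises ValueError.
validate_endpoint_args(name="demo", gpu="any", max_concurrency=4)
```

Raising on the first violated rule keeps the sketch short; a real validator might collect all violations before reporting them.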
### Usage Patterns
@@ -99,6 +102,28 @@ class MyModel:
The class is instantiated once per worker (singleton). For single-method classes, input is auto-dispatched to the method. For multi-method classes, include `"method"` in the input payload.
> **Concurrency:** Use `max_concurrency` to let a single worker handle multiple jobs at once. See the example below.
```python
# ...
    result = await run_model(prompt)  # your model inference call
    return {"output": result}
```

`max_concurrency` controls how many jobs a single worker processes simultaneously. Each worker pulls up to `max_concurrency` jobs from the queue at once.

**Behavior by handler type:**
- **Async handlers** (`async def`): True concurrent execution. Multiple jobs interleave on the event loop. This is the intended usage.
- **Sync handlers** (`def`): The runtime pulls multiple jobs from the queue, but the sync handler processes them one at a time. A warning is logged at build time. Consider making the handler async.
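
The difference between the two handler types can be seen in a small standalone simulation. This is only a sketch using `asyncio` from the standard library. It does not use the SDK, and the names `async_handler`, `sync_handler`, and `run_jobs` are hypothetical. It also assumes the sync handler is called directly on the event loop, which may not match how the actual runtime dispatches jobs.

```python
import asyncio
import inspect
import time


async def async_handler(job: int) -> int:
    await asyncio.sleep(0.05)  # yields to the event loop while "working"
    return job


def sync_handler(job: int) -> int:
    time.sleep(0.05)  # blocks the event loop while "working"
    return job


async def run_jobs(handler, jobs, max_concurrency: int) -> float:
    """Run jobs with at most max_concurrency in flight; return elapsed seconds."""
    slots = asyncio.Semaphore(max_concurrency)

    async def run_one(job):
        async with slots:
            if inspect.iscoroutinefunction(handler):
                return await handler(job)
            return handler(job)  # a sync call holds the loop until it returns

    start = time.perf_counter()
    await asyncio.gather(*(run_one(j) for j in jobs))
    return time.perf_counter() - start


async_elapsed = asyncio.run(run_jobs(async_handler, range(4), max_concurrency=4))
sync_elapsed = asyncio.run(run_jobs(sync_handler, range(4), max_concurrency=4))
print(f"async: {async_elapsed:.2f}s, sync: {sync_elapsed:.2f}s")
```

With four jobs and four slots, the async run should take roughly as long as one job, while the sync run should take roughly as long as all four back to back. Exact timings vary by machine.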

**Warnings:**

- `max_concurrency > 1` on a sync handler logs a build-time warning
- `max_concurrency > 100` logs a capacity warning (most GPU workloads saturate well below this)
- `max_concurrency` on LB endpoints is ignored (FastAPI/uvicorn manages its own concurrency)
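
One way such checks might look is sketched below. This is purely illustrative: the function `check_max_concurrency`, its `is_lb` flag, and the warning texts are hypothetical, and the SDK's actual build step may detect and report these cases differently. The sketch relies only on `inspect.iscoroutinefunction` and `warnings.warn` from the standard library.

```python
import inspect
import warnings


def check_max_concurrency(handler, max_concurrency: int, is_lb: bool = False) -> None:
    """Hypothetical build-time check mirroring the documented warnings."""
    if is_lb:
        return  # documented as ignored on LB endpoints
    if max_concurrency > 1 and not inspect.iscoroutinefunction(handler):
        warnings.warn(
            "max_concurrency > 1 with a sync handler: jobs will run one at a time"
        )
    if max_concurrency > 100:
        warnings.warn(
            "max_concurrency > 100: most GPU workloads saturate well below this"
        )


def sync_job(job):
    return job


check_max_concurrency(sync_job, max_concurrency=5)  # emits the sync-handler warning
```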
#### Load-Balanced (LB) -- instance with route decorators