Commit cabb2b7

docs: add max_concurrency to SDK reference and README (#317)
Add documentation for the max_concurrency parameter introduced in PR #301:

- Add max_concurrency to Endpoint constructor signature and parameter table
- Add validation rule (must be >= 1)
- Add usage example with async QB handler and behavior notes
- Add concurrency control bullet to README feature list
1 parent df9fc31 commit cabb2b7

2 files changed

Lines changed: 26 additions & 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -118,6 +118,7 @@ First run takes 30-60 seconds (provisioning). Subsequent runs take 2-3 seconds.
 - **Auto-scaling**: Workers scale from 0 to N based on demand
 - **Dependency management**: Packages install automatically on remote workers
 - **Two patterns**: Queue-based (`@Endpoint`) for batch work, load-balanced (`Endpoint()` + routes) for REST APIs
+- **Concurrency control**: `max_concurrency` lets each worker process multiple jobs simultaneously

 ## Documentation

docs/Flash_SDK_Reference.md

Lines changed: 25 additions & 0 deletions
@@ -31,6 +31,7 @@ Endpoint(
     scaler_value: int = 4,
     template: Optional[PodTemplate] = None,
     min_cuda_version: Optional[CudaVersion | str] = None,
+    max_concurrency: int = 1,
 )
 ```

@@ -58,13 +59,15 @@ Endpoint(
 | `scaler_value` | `int` | `4` | Scaling threshold value. |
 | `template` | `PodTemplate` | `None` | Pod template overrides (e.g., `PodTemplate(containerDiskInGb=100)`). |
 | `min_cuda_version` | `str` | `None` | Minimum CUDA version for GPU host selection. GPU endpoints default to `"12.8"` when not set. Has no effect on CPU endpoints. |
+| `max_concurrency` | `int` | `1` | Max concurrent jobs per worker (QB endpoints only). Values >1 only achieve true concurrency with async handlers; sync handlers log a build-time warning. Ignored on LB endpoints. |

 **Validation rules:**

 - `gpu` and `cpu` are mutually exclusive
 - `id` and `image` are mutually exclusive
 - `name` or `id` is required
 - `workers` rejects negative values and `min > max`
+- `max_concurrency` must be >= 1

 ### Usage Patterns

@@ -99,6 +102,28 @@ class MyModel:

 The class is instantiated once per worker (singleton). For single-method classes, input is auto-dispatched to the method. For multi-method classes, include `"method"` in the input payload.

+> **Concurrency:** Use `max_concurrency` to let a single worker handle multiple jobs at once. See the example below.
+
+```python
+@Endpoint(name="batch-inference", gpu=GpuGroup.AMPERE_80, max_concurrency=4)
+async def infer(prompt: str) -> dict:
+    result = await run_model(prompt) # your model inference call
+    return {"output": result}
+```
+
+`max_concurrency` controls how many jobs a single worker processes simultaneously. Each worker pulls up to `max_concurrency` jobs from the queue at once.
+
+**Behavior by handler type:**
+
+- **Async handlers** (`async def`): True concurrent execution. Multiple jobs interleave on the event loop. This is the intended usage.
+- **Sync handlers** (`def`): The runtime pulls multiple jobs from the queue, but the sync handler processes them one at a time. A warning is logged at build time. Consider making the handler async.
+
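The difference between the two handler types described above can be illustrated with plain `asyncio`, independent of the Flash runtime. This is a simulation under stated assumptions, not the SDK's implementation: `MAX_CONCURRENCY`, `run_batch`, `awaiting_handler`, and `blocking_handler` are hypothetical names, a semaphore stands in for the per-worker concurrency cap, and `time.sleep` stands in for a sync handler that blocks the event loop.

```python
import asyncio
import time

MAX_CONCURRENCY = 4  # hypothetical stand-in for the max_concurrency setting
JOB_SECONDS = 0.05   # simulated latency of one job


async def awaiting_handler(job_id: int) -> int:
    # Awaiting yields control to the event loop, so other jobs can progress.
    await asyncio.sleep(JOB_SECONDS)
    return job_id


async def blocking_handler(job_id: int) -> int:
    # A blocking call never yields, so jobs end up running one after another.
    time.sleep(JOB_SECONDS)
    return job_id


async def run_batch(handler, jobs: int) -> float:
    """Run `jobs` jobs with at most MAX_CONCURRENCY in flight; return elapsed seconds."""
    limit = asyncio.Semaphore(MAX_CONCURRENCY)

    async def run_one(job_id: int) -> int:
        async with limit:
            return await handler(job_id)

    start = time.perf_counter()
    await asyncio.gather(*(run_one(i) for i in range(jobs)))
    return time.perf_counter() - start


async_elapsed = asyncio.run(run_batch(awaiting_handler, 4))
blocking_elapsed = asyncio.run(run_batch(blocking_handler, 4))
print(f"awaiting: {async_elapsed:.2f}s, blocking: {blocking_elapsed:.2f}s")
```

With four jobs, the awaiting handler should finish in roughly one job's latency, while the blocking handler should take about four times as long, even though both run under the same cap.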
+**Warnings:**
+
+- `max_concurrency > 1` on a sync handler logs a build-time warning
+- `max_concurrency > 100` logs a capacity warning (most GPU workloads saturate well below this)
+- `max_concurrency` on LB endpoints is ignored (FastAPI/uvicorn manages its own concurrency)
+
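The validation rule and the warnings listed above could be expressed roughly as follows. This is a sketch of the documented rules only, not the SDK's actual code: `check_max_concurrency`, its `is_lb` flag, the logger name, and the message strings are all hypothetical, and whether the real build step logs anything for LB endpoints is an assumption.

```python
import inspect
import logging
from typing import Callable, List

logger = logging.getLogger("max_concurrency_sketch")  # hypothetical logger name

CAPACITY_WARNING_THRESHOLD = 100  # threshold taken from the warning list above


def check_max_concurrency(
    handler: Callable, max_concurrency: int, is_lb: bool = False
) -> List[str]:
    """Raise on invalid values; return the warnings the documented rules imply."""
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be >= 1")

    if is_lb:
        # Documented as ignored on LB endpoints; this sketch only records a note.
        return ["max_concurrency is ignored on LB endpoints"]

    warnings: List[str] = []
    if max_concurrency > 1 and not inspect.iscoroutinefunction(handler):
        warnings.append("sync handler: jobs will be processed one at a time")
    if max_concurrency > CAPACITY_WARNING_THRESHOLD:
        warnings.append("max_concurrency > 100: check worker capacity")

    for message in warnings:
        logger.warning(message)
    return warnings


async def async_infer(prompt: str) -> dict:
    return {"output": prompt}


def sync_infer(prompt: str) -> dict:
    return {"output": prompt}


print(check_max_concurrency(async_infer, 4))
print(check_max_concurrency(sync_infer, 4))
```

Returning the warnings as a list, in addition to logging them, keeps the checks easy to assert on in tests.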
 #### Load-Balanced (LB) -- instance with route decorators

 ```python
