Commit aaea837

fix: harden async scheduling admission follow-ups
Signed-off-by: Nabin Mulepati <nmulepati@nvidia.com>
1 parent ea3d4a0 commit aaea837

20 files changed

Lines changed: 538 additions & 99 deletions

fern/versions/v0.5.8/pages/concepts/architecture-and-performance.mdx

Lines changed: 44 additions & 14 deletions
@@ -51,7 +51,7 @@ This guide explains the architecture, execution model, and how to tune performan
 ## Execution Model

 <Note title="Two execution engines">
-The default execution path is the **async engine**, which dispatches work at the cell level and overlaps independent columns — see [Async Engine](#async-engine) below for its semantics. The legacy **sync engine** is still available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0` and is what this section describes. The public configuration knobs documented below (`buffer_size`, `max_parallel_requests`, error handling) apply to both engines; the differences are flagged inline.
+The default execution path is the **async engine**, which dispatches work at the cell level and overlaps independent columns — see [Async Engine](#async-engine) below for its semantics. The legacy **sync engine** is still available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0` and is what this section describes. The configuration knobs documented below (`buffer_size`, `max_parallel_requests`, AIMD throttle config, error handling) apply to both engines; the differences are flagged inline.
 </Note>

 The sync engine processes datasets in **batches**, with **parallel** operations within each batch.
@@ -108,20 +108,20 @@ At any moment, the number of concurrent LLM requests is:
 ```python
 concurrent_requests = min(
     buffer_size,                 # Records in current batch
-    current_request_limit,       # AIMD-managed limit (≤ max_parallel_requests)
+    current_throttle_limit,      # AIMD-managed limit (≤ max_parallel_requests)
     remaining_cells_in_column    # Cells left to generate
 )
 ```

-`max_parallel_requests` sets the **ceiling**. The actual limit (`current_request_limit`) is managed at runtime by adaptive request admission that reacts to rate-limit signals from the inference server:
+`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:

 - **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again — they release their permits and hold the limit steady.
 - **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.

 This means Data Designer automatically finds the right concurrency level for your server without manual tuning.
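
The two rules above can be sketched as a toy limiter. This is a minimal illustration, not Data Designer's controller: the class name `AimdLimit`, its methods, and the way a "burst" is detected (any 429 that arrives while the cooldown is still running) are assumptions made for the example. Only the parameter names and defaults are taken from the `ThrottleConfig` lines in this diff.

```python
from dataclasses import dataclass


@dataclass
class AimdLimit:
    """Toy AIMD limiter (hypothetical; burst detection is an assumption)."""

    ceiling: int                   # plays the role of max_parallel_requests
    reduce_factor: float = 0.75    # multiply the limit by this on the first 429
    additive_increase: int = 1     # slots added per recovery step
    success_window: int = 25       # consecutive successes per recovery step
    cooldown_seconds: float = 2.0  # window in which further 429s are ignored

    def __post_init__(self) -> None:
        self.limit = self.ceiling
        self._successes = 0
        self._cooldown_until = 0.0

    def on_rate_limited(self, now: float) -> None:
        """Handle a 429 observed at time `now` (seconds)."""
        self._successes = 0
        if now < self._cooldown_until:
            return  # same burst: hold the limit steady
        self.limit = max(1, int(self.limit * self.reduce_factor))
        self._cooldown_until = now + self.cooldown_seconds

    def on_success(self) -> None:
        """Count a success; grow the limit after a full success window."""
        self._successes += 1
        if self._successes >= self.success_window:
            self._successes = 0
            self.limit = min(self.ceiling, self.limit + self.additive_increase)
```

Starting from a ceiling of 32, two separate bursts take this sketch to 24 and then 18, and a run of 25 successes brings it back up by one slot.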

 <Note title="Engine paths">
-Adaptive request admission is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.
+AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.
 </Note>

 **Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
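
To make the example concrete, the `min(...)` expression from the hunk above can be evaluated for each throttle limit mentioned. The function below only wraps that expression; its name and signature are chosen for this sketch and are not a Data Designer API.

```python
def concurrent_requests(
    buffer_size: int,
    current_throttle_limit: int,
    remaining_cells_in_column: int,
) -> int:
    """Effective concurrency: the smallest of the three bounds."""
    return min(buffer_size, current_throttle_limit, remaining_cells_in_column)


# buffer_size=100, full column still pending: the throttle limit is the bound.
for limit in (32, 24, 18):
    print(limit, concurrent_requests(100, limit, 100))

# Near the end of a column, the remaining cells become the bound instead.
print(concurrent_requests(100, 18, 5))
```

With a batch of 100 records the throttle limit is the binding term throughout, until fewer cells remain in the column than the limit allows.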
@@ -198,7 +198,7 @@ Only resume datasets from trusted artifact directories. Resume reads local `meta

 ### `max_parallel_requests` (InferenceParams)

-Sets the **maximum** concurrent LLM API calls **per model**. This is the ceiling that adaptive request admission can ramp up to — the actual concurrency at runtime may be lower if the server signals rate limits.
+Sets the **maximum** concurrent LLM API calls **per model**. This is the ceiling that the AIMD throttle controller can ramp up to — the actual concurrency at runtime may be lower if the server signals rate limits.

 ```python
 import data_designer.config as dd
@@ -215,15 +215,15 @@ model = dd.ModelConfig(

 **Default**: 4

-**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. With adaptive request admission, setting an aggressively high value is safer than before — the system will self-correct downward if the server can't keep up. The salvage queue on the async engine (default) reclaims failed rows; on the sync engine the initial burst of 429s before AIMD stabilizes can drop rows, so start with a more conservative ceiling if you've opted into sync.
+**When to increase**: Your inference backend has high throughput capacity, you're using a cloud API with generous rate limits, or you're running vLLM/TensorRT-LLM with multiple GPUs. With AIMD, setting an aggressively high value is safer than before — the system will self-correct downward if the server can't keep up. The salvage queue on the async engine (default) reclaims failed rows; on the sync engine the initial burst of 429s before AIMD stabilizes can drop rows, so start with a more conservative ceiling if you've opted into sync.

 **When to decrease**: You want to cap resource usage to a known safe level, or you want more predictable/debuggable execution.

 <Tip>
 Finding the optimal value
 The right value depends on your inference stack and model. Self-hosted vLLM servers can often handle values as high as 256, 512, or even 1024 depending on your hardware.

-With adaptive request admission, a practical approach is to set `max_parallel_requests` to the **upper bound** you're comfortable with and let the request controller find the sustainable level automatically. If you see frequent 429 → recovery cycles in the logs, your ceiling is above the server's true capacity but the system is handling it. If you never see any request-admission activity, you may have room to increase the ceiling further.
+With AIMD, a practical approach is to set `max_parallel_requests` to the **upper bound** you're comfortable with and let the throttle controller find the sustainable level automatically. If you see frequent 429 → recovery cycles in the logs, your ceiling is above the server's true capacity but the system is handling it. If you never see any throttle activity, you may have room to increase the ceiling further.

 **Benchmark approach**: Run a small dataset (e.g., 100 records) with increasing `max_parallel_requests` values (4 → 8 → 16 → 32 → ...) and measure generation time. Stop increasing when the runtime stops decreasing—that's when your inference server is saturated.
 </Tip>
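
The benchmark approach in the tip can be sketched as a small search loop. Everything here is hypothetical scaffolding: `measure_seconds` stands in for whatever you use to run the 100-record job at a given `max_parallel_requests` and time it, and the 5% `min_gain` threshold is an arbitrary choice for deciding that the runtime has stopped decreasing.

```python
from typing import Callable, Iterable


def find_saturation_point(
    measure_seconds: Callable[[int], float],
    candidates: Iterable[int],
    min_gain: float = 0.05,
) -> int:
    """Return the last candidate that still improved runtime by at least min_gain."""
    best_value = None
    best_time = float("inf")
    for value in candidates:
        elapsed = measure_seconds(value)
        # Stop once a larger value no longer beats the best time by min_gain.
        if best_value is not None and elapsed > best_time * (1 - min_gain):
            break
        best_value, best_time = value, elapsed
    if best_value is None:
        raise ValueError("candidates must not be empty")
    return best_value


# Synthetic timings for illustration only (not measured data).
fake_timings = {4: 100.0, 8: 52.0, 16: 30.0, 32: 29.5, 64: 29.4}
print(find_saturation_point(fake_timings.__getitem__, [4, 8, 16, 32, 64]))
```

In practice the timing callable would launch a real generation run, so each probe is expensive. Keeping the dataset small, as the tip suggests, is what makes the sweep affordable.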
@@ -245,9 +245,39 @@ designer.set_run_config(run_config)

 ---

-### Adaptive Request Admission
+### Adaptive Throttling (RunConfig)

-Data Designer uses AIMD (Additive Increase / Multiplicative Decrease) request admission to automatically adjust concurrency per provider/model/domain based on rate-limit feedback from the inference server. This is an internal runtime controller, not a public `RunConfig` knob. Set `max_parallel_requests` as the user-facing ceiling and inspect `AsyncCapacityPlan`/logs to understand the effective runtime limits.
+Data Designer uses an AIMD (Additive Increase / Multiplicative Decrease) controller to automatically adjust concurrency per model based on rate-limit feedback from the inference server. The defaults work well for most workloads. Override them via `ThrottleConfig` only when you understand the trade-offs.
+
+<Note title="Engine paths">
+Adaptive throttling is fully active on the default **async engine**, where 429 responses propagate directly to the AIMD controller. On the legacy **sync engine** (`DATA_DESIGNER_ASYNC_ENGINE=0`), 429s are first retried at the HTTP transport layer; `ThrottleConfig` settings only take effect as a fallback if transport retries are exhausted.
+</Note>
+
+```python
+import data_designer.config as dd
+from data_designer.interface import DataDesigner
+
+run_config = dd.RunConfig(
+    throttle=dd.ThrottleConfig(
+        reduce_factor=0.75,      # Multiply limit by this on a 429 (default: 0.75)
+        additive_increase=1,     # Add this many slots after success_window successes (default: 1)
+        success_window=25,       # Consecutive successes before increasing (default: 25)
+        cooldown_seconds=2.0,    # Pause after a 429 when no Retry-After header (default: 2.0)
+        ceiling_overshoot=0.10,  # Probe 10% above observed server limit (default: 0.10)
+    ),
+)
+
+designer = DataDesigner()
+designer.set_run_config(run_config)
+```
+
+| Parameter | Default | Effect |
+|-----------|---------|--------|
+| `reduce_factor` | 0.75 | How aggressively to cut concurrency on a 429. Lower = more aggressive. |
+| `additive_increase` | 1 | Slots added per recovery step. Higher = faster ramp-up, but riskier. |
+| `success_window` | 25 | Consecutive successes required before each increase step. |
+| `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
+| `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
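
One way to reason about `additive_increase` and `success_window` together is to estimate how many successful requests a full recovery takes. The helper below is a back-of-the-envelope sketch, not part of Data Designer. It assumes every recovery step needs a full window of consecutive successes, that no 429 interrupts the ramp, and it ignores `ceiling_overshoot` and cooldowns.

```python
import math


def successes_to_recover(
    current_limit: int,
    target_limit: int,
    additive_increase: int = 1,
    success_window: int = 25,
) -> int:
    """Estimate successes needed to ramp from current_limit up to target_limit."""
    if target_limit <= current_limit:
        return 0
    steps = math.ceil((target_limit - current_limit) / additive_increase)
    return steps * success_window


# Recovering from 18 back to a ceiling of 32 with the defaults:
print(successes_to_recover(18, 32))                       # 14 steps * 25 = 350
# The same recovery with additive_increase=4:
print(successes_to_recover(18, 32, additive_increase=4))  # 4 steps * 25 = 100
```

Under these assumptions the defaults need 350 uninterrupted successes to climb from 18 back to 32, which is why raising `additive_increase` or lowering `success_window` speeds up recovery at the cost of hitting the rate limit again sooner.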

 <Tip>
 How it works in practice
@@ -283,11 +313,11 @@ designer.set_run_config(run_config)

 ## Async Engine

-The async engine is the default execution path. It dispatches work at the cell level rather than the column level, so independent columns overlap in time and provider/model/domain request resources tune themselves independently. See the [Async All the Way Down](/dev-notes/async-all-the-way-down) dev note for the full architecture.
+The async engine is the default execution path. It dispatches work at the cell level rather than the column level, so independent columns overlap in time and per-(provider, model) AIMD pools tune themselves independently. See the [Async All the Way Down](/dev-notes/async-all-the-way-down) dev note for the full architecture.

 ### Per-model timeouts drive every deadline

-The `inference_parameters.timeout` field on a `ModelConfig` sets the per-request HTTP timeout. The same value also drives the sync→async bridge that custom columns use when they call `model.generate()`. There is no separate queue-wait deadline — waits scale with provider speed and adaptive request admission. Slow self-hosted endpoints (e.g. large models on a single GPU) only need this one knob raised:
+The `inference_parameters.timeout` field on a `ModelConfig` sets the per-request HTTP timeout. The same value also drives the sync→async bridge that custom columns use when they call `model.generate()`. There is no separate queue-wait deadline — waits scale with provider speed and AIMD's adaptive concurrency. Slow self-hosted endpoints (e.g. large models on a single GPU) only need this one knob raised:

 ```python
 import data_designer.config as dd
@@ -336,8 +366,8 @@ DATA_DESIGNER_ASYNC_ENGINE=0 python my_pipeline.py

 | Problem | Symptom | Solution |
 |---------|---------|----------|
-| **Low throughput** | Low GPU utilization | Increase `max_parallel_requests` and/or `buffer_size`. If request admission has self-reduced due to earlier 429s (check logs for "concurrency reduced" messages), the server may need more capacity or you can wait for AIMD recovery. |
-| **Frequent 429 → recovery cycles** | Logs show repeated concurrency drops and ramp-ups | The `max_parallel_requests` ceiling is above the server's sustained capacity. This is handled automatically, but you can lower the ceiling to reduce the sawtooth. |
+| **Low throughput** | Low GPU utilization | Increase `max_parallel_requests` and/or `buffer_size`. If the throttle has self-reduced due to earlier 429s (check logs for "concurrency reduced" messages), the server may need more capacity or you can wait for AIMD recovery. |
+| **Frequent 429 → recovery cycles** | Logs show repeated concurrency drops and ramp-ups | The `max_parallel_requests` ceiling is above the server's sustained capacity. This is handled automatically, but you can lower the ceiling to reduce the sawtooth or tune `reduce_factor` / `success_window`. |
 | **Long tail of slow generations** | Most records fast, few very slow | Reduce `max_conversation_restarts`, simplify schemas, improve prompts |
 | **Multi-model idle periods** | One model busy, others idle | Reduce `buffer_size` for faster cycling, or consolidate models |
 | **Memory errors** | OOM crashes | Reduce `buffer_size` and `max_parallel_requests` |
@@ -350,7 +380,7 @@ DATA_DESIGNER_ASYNC_ENGINE=0 python my_pipeline.py
 1. **Start with defaults** for initial development — AIMD handles rate-limit adaptation automatically
 2. **Profile your workload**: How many LLM columns? How many records? What models?
 3. **Identify bottleneck**: Low GPU util → increase `max_parallel_requests` (AIMD will self-correct if you overshoot). Memory issues → decrease `buffer_size`. Long tails → tune retry settings.
-4. **Check request-admission logs**: Look for "concurrency reduced" / "concurrency increased" messages to understand whether rate limits are the bottleneck
+4. **Check throttle logs**: Look for "concurrency reduced" / "concurrency increased" messages to understand whether rate limits are the bottleneck
 5. **Iterate**: Make one change at a time, measure impact before next change

 ---
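
Step 4 of the workflow can be partly automated by counting the two log phrases quoted above. This is a rough sketch that matches on those substrings only. The surrounding log format is not specified in this diff, so the sample lines are invented for illustration and the helper name is hypothetical.

```python
from collections import Counter
from typing import Iterable

# Phrases quoted in the workflow above; the rest of each log line is unknown.
MARKERS = ("concurrency reduced", "concurrency increased")


def count_throttle_events(lines: Iterable[str]) -> Counter:
    """Count log lines mentioning each throttle phrase (case-insensitive)."""
    counts = Counter({marker: 0 for marker in MARKERS})
    for line in lines:
        lowered = line.lower()
        for marker in MARKERS:
            if marker in lowered:
                counts[marker] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    "INFO concurrency reduced 32 -> 24",
    "INFO request completed",
    "INFO Concurrency increased 24 -> 25",
    "WARN concurrency reduced 25 -> 18",
]
print(count_throttle_events(sample))
```

Many reductions relative to increases would suggest that rate limits are the bottleneck. Zero of either would suggest the ceiling has not been reached.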

fern/versions/v0.5.8/pages/devnotes/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ Welcome to NeMo Data Designer Dev Notes — in-depth guides, benchmark write-ups
 <BlogCard
   href="/dev-notes/owning-the-model-stack"
   title="Owning the Model Stack"
-  description="Adaptive concurrency, request-resource keying, retry boundaries — owning the whole model client to discover provider capacity at runtime."
+  description="Adaptive concurrency, throttle keying, retry boundaries — owning the whole model client to discover provider capacity at runtime."
   date="Mar 25, 2026"
   authors={["nmulepati"]}
   image={<img src="/assets/owning-the-model-stack/native-model-client-hero.png" alt="" loading="lazy" />}
