2 changes: 2 additions & 0 deletions architecture/models.md
Expand Up @@ -39,6 +39,8 @@ Manages concurrency limits per `ThrottleDomain` (CHAT, EMBEDDING, IMAGE, HEALTHC

`ThrottledModelClient` wraps each API call in a context manager that acquires/releases throttle capacity and adjusts limits on success (additive increase) or rate-limit errors (multiplicative decrease).

When `rampup_seconds` is configured, `ThrottleManager` starts new domains at one concurrent request, climbs linearly toward the peak, and falls back to normal AIMD behavior on the first 429.

### ModelFacade

The primary interface for generators. Holds a `ModelConfig`, `ModelClient`, optional `MCPRegistry`, and `ModelUsageStats`.
5 changes: 4 additions & 1 deletion docs/concepts/architecture-and-performance.md
Expand Up @@ -111,6 +111,7 @@ concurrent_requests = min(

`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:

- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
- **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again; they release their permits and hold the limit steady.
- **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.
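
The decrease and increase rules above can be sketched in a few lines of Python. This is a minimal illustration, not Data Designer's `ThrottleManager`: the `AimdLimit` class and its method names are hypothetical, rounding the reduced limit down is an assumption, and cooldowns, per-burst handling of repeated 429s, and ceiling stabilization are deliberately left out.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AimdLimit:
    """Hypothetical AIMD limit tracker (illustration only)."""

    ceiling: int                  # plays the role of max_parallel_requests
    reduce_factor: float = 0.75   # a 25% reduction on a 429
    additive_increase: int = 1    # step size after a success window
    success_window: int = 25      # consecutive successes per increase step
    limit: int = field(default=0, init=False)
    successes: int = field(default=0, init=False)

    def __post_init__(self) -> None:
        # Without a startup ramp, the limit starts at the ceiling.
        self.limit = self.ceiling

    def on_rate_limit(self) -> None:
        # Multiplicative decrease; rounding down is an assumption.
        self.limit = max(1, math.floor(self.limit * self.reduce_factor))
        self.successes = 0

    def on_success(self) -> None:
        # Additive increase once a full window of successes has been seen.
        self.successes += 1
        if self.successes >= self.success_window:
            self.limit = min(self.ceiling, self.limit + self.additive_increase)
            self.successes = 0
```

With `ceiling=32`, two rate-limit events in this sketch take the limit to 24 and then 18, and each run of 25 consecutive successes adds 1 back.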

Expand All @@ -119,7 +120,7 @@ This means Data Designer automatically finds the right concurrency level for you
!!! note "Engine paths"
AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.

**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, the startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.

---

Expand Down Expand Up @@ -216,6 +217,7 @@ run_config = dd.RunConfig(
success_window=25, # Consecutive successes before increasing (default: 25)
cooldown_seconds=2.0, # Pause after a 429 when no Retry-After header (default: 2.0)
ceiling_overshoot=0.10, # Probe 10% above observed server limit (default: 0.10)
rampup_seconds=0.0, # Optional startup ramp duration; 0 disables it (default: 0.0)
),
)

Expand All @@ -230,6 +232,7 @@ designer.set_run_config(run_config)
| `success_window` | 25 | Consecutive successes required before each increase step. |
| `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
| `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
| `rampup_seconds` | 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. |

!!! tip "How it works in practice"
When a model endpoint returns HTTP 429, the controller reduces the concurrency limit for that model and pauses briefly. After enough consecutive successes, it begins ramping back up. If the server rate-limits again, the controller records that level as a ceiling and stabilizes just below it, with a small overshoot band to detect when the server can handle more load.
13 changes: 11 additions & 2 deletions docs/devnotes/posts/owning-the-model-stack.md
Expand Up @@ -64,12 +64,13 @@ What you actually want is a system that *discovers* the provider's capacity at r

If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency:

- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration.
- **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth.
- **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds.
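
The cooldown choice in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name `cooldown_seconds_for` is hypothetical, headers are assumed to arrive as a plain string mapping, and only the delay-in-seconds form of `Retry-After` is handled (the HTTP-date form falls back to the default).

```python
import math
from typing import Mapping


def cooldown_seconds_for(headers: Mapping[str, str], default_seconds: float = 2.0) -> float:
    """Return the pause to apply after a 429 (hypothetical helper)."""
    for name, value in headers.items():
        # Header names are case-insensitive in HTTP.
        if name.lower() == "retry-after":
            try:
                seconds = float(value)
            except ValueError:
                # Not a number (for example an HTTP-date): use the default.
                return default_seconds
            if math.isfinite(seconds) and seconds >= 0:
                return seconds
            return default_seconds
    # The server sent no Retry-After header.
    return default_seconds
```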

The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything*: every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively.

The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes.
The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes.

<div style="text-align: center;" markdown>

Expand All @@ -79,6 +80,12 @@ The result is that the system converges on the provider's actual capacity withou

This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle.

### **Startup ramp**

Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration.

The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery.
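
A time-based linear ramp of this kind can be sketched as a pure function of elapsed time. This is a minimal illustration, assuming the limit is interpolated between 1 and the peak and rounded down; the function name `ramp_limit` and the exact rounding are assumptions, not the `ThrottleManager` implementation. Aborting the ramp on a 429 is not modeled here.

```python
def ramp_limit(elapsed_seconds: float, rampup_seconds: float, peak: int) -> int:
    """Concurrency limit during a startup ramp (hypothetical helper).

    `peak` plays the role of max_parallel_requests.
    """
    # A ramp of 0 seconds is disabled: start at the peak immediately.
    if rampup_seconds <= 0 or elapsed_seconds >= rampup_seconds:
        return peak
    fraction = max(0.0, elapsed_seconds) / rampup_seconds
    # Interpolate from 1 request up to the peak, rounding down.
    return max(1, min(peak, 1 + int(fraction * (peak - 1))))
```

Under these assumptions, `rampup_seconds=30` with a peak of 32 allows 1 request at the start, 16 at the halfway point, and 32 once 30 seconds have elapsed.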

### **Ceiling stabilization**

Classic AIMD has a well-known problem, the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable.
Expand Down Expand Up @@ -147,6 +154,7 @@ data_designer.set_run_config(
success_window=25,
cooldown_seconds=2.0,
ceiling_overshoot=0.10,
rampup_seconds=0.0,
)
)
)
Expand Down Expand Up @@ -178,8 +186,9 @@ create_result = data_designer.create(
| `success_window` | 25 | Consecutive successes before additive increase |
| `cooldown_seconds` | 2.0 | Default cooldown when no `Retry-After` header |
| `ceiling_overshoot` | 0.10 | How far above the observed ceiling to probe (10%) |
| `rampup_seconds` | 0.0 | Optional startup ramp duration. `0.0` keeps the previous immediate-start behavior |

In practice, the parameter most worth adjusting is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all.
In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior.

Most users will never need to touch any of these. The system adapts automatically.

Expand Up @@ -115,6 +115,7 @@ concurrent_requests = min(

`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:

- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
- **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again; they release their permits and hold the limit steady.
- **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.

Expand All @@ -124,7 +125,7 @@ This means Data Designer automatically finds the right concurrency level for you
AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.
</Note>

**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, the startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.

---

Expand Down Expand Up @@ -264,6 +265,7 @@ run_config = dd.RunConfig(
success_window=25, # Consecutive successes before increasing (default: 25)
cooldown_seconds=2.0, # Pause after a 429 when no Retry-After header (default: 2.0)
ceiling_overshoot=0.10, # Probe 10% above observed server limit (default: 0.10)
rampup_seconds=0.0, # Optional startup ramp duration; 0 disables it (default: 0.0)
),
)

Expand All @@ -278,6 +280,7 @@ designer.set_run_config(run_config)
| `success_window` | 25 | Consecutive successes required before each increase step. |
| `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
| `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
| `rampup_seconds` | 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. |

<Tip>
How it works in practice
Expand Up @@ -66,12 +66,13 @@ What you actually want is a system that *discovers* the provider's capacity at r

If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency:

- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration.
- **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth.
- **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds.

The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything*: every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively.

The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes.
The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes.

<div style="text-align: center;" markdown>

Expand All @@ -81,6 +82,12 @@ The result is that the system converges on the provider's actual capacity withou

This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle.

### **Startup ramp**

Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration.

The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery.

### **Ceiling stabilization**

Classic AIMD has a well-known problem, the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable.
Expand Down Expand Up @@ -149,6 +156,7 @@ data_designer.set_run_config(
success_window=25,
cooldown_seconds=2.0,
ceiling_overshoot=0.10,
rampup_seconds=0.0,
)
)
)
Expand Down Expand Up @@ -180,8 +188,9 @@ create_result = data_designer.create(
| `success_window` | 25 | Consecutive successes before additive increase |
| `cooldown_seconds` | 2.0 | Default cooldown when no `Retry-After` header |
| `ceiling_overshoot` | 0.10 | How far above the observed ceiling to probe (10%) |
| `rampup_seconds` | 0.0 | Optional startup ramp duration. `0.0` keeps the previous immediate-start behavior |

In practice, the parameter most worth adjusting is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all.
In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior.

Most users will never need to touch any of these. The system adapts automatically.

Expand Up @@ -40,13 +40,19 @@ class ThrottleConfig(ConfigBase):
ceiling_overshoot: Fraction above the observed rate-limit ceiling
that additive increase is allowed to probe before capping.
Default is 0.10 (10% overshoot).
rampup_seconds: Optional startup ramp duration. When greater than
zero, each throttle domain starts at one concurrent request and
linearly ramps to its configured peak over this many seconds.
A 429 aborts the startup ramp and switches to normal AIMD recovery.
Default is 0.0 (disabled).
"""

DEFAULT_REDUCE_FACTOR: ClassVar[float] = 0.75
DEFAULT_ADDITIVE_INCREASE: ClassVar[int] = 1
DEFAULT_SUCCESS_WINDOW: ClassVar[int] = 25
DEFAULT_COOLDOWN_SECONDS: ClassVar[float] = 2.0
DEFAULT_CEILING_OVERSHOOT: ClassVar[float] = 0.10
DEFAULT_RAMPUP_SECONDS: ClassVar[float] = 0.0

reduce_factor: float = Field(
default=DEFAULT_REDUCE_FACTOR,
Expand Down Expand Up @@ -74,6 +80,14 @@ class ThrottleConfig(ConfigBase):
ge=0.0,
description="Fraction above the rate-limit ceiling that additive increase is allowed to probe.",
)
rampup_seconds: float = Field(
default=DEFAULT_RAMPUP_SECONDS,
ge=0.0,
description=(
"Startup ramp duration in seconds. When greater than zero, each throttle domain starts at one "
"concurrent request and linearly ramps to the configured peak. A 429 aborts the startup ramp."
),
)


class RunConfig(ConfigBase):