Commit 162e63b

Merge branch 'main' into andreatgretel/fix/fern-versioned-docs
2 parents c71a27f + a4085c4 commit 162e63b

12 files changed

Lines changed: 546 additions & 27 deletions

architecture/models.md

Lines changed: 2 additions & 0 deletions
@@ -39,6 +39,8 @@ Manages concurrency limits per `ThrottleDomain` (CHAT, EMBEDDING, IMAGE, HEALTHC

 `ThrottledModelClient` wraps each API call in a context manager that acquires/releases throttle capacity and adjusts limits on success (additive increase) or rate-limit errors (multiplicative decrease).

+When `rampup_seconds` is configured, `ThrottleManager` starts new domains at one concurrent request, climbs linearly toward the peak, and aborts to normal AIMD behavior on the first 429.
+
 ### ModelFacade

 The primary interface for generators. Holds a `ModelConfig`, `ModelClient`, optional `MCPRegistry`, and `ModelUsageStats`.
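
The startup ramp added in this hunk can be pictured with a small sketch. This is not the `ThrottleManager` implementation: `ramp_limit` is a hypothetical helper, and rounding down to whole requests is an assumption made here so the ramp never overshoots the peak.

```python
import math


def ramp_limit(elapsed_seconds: float, rampup_seconds: float, peak: int) -> int:
    """Concurrency allowed `elapsed_seconds` into a linear startup ramp.

    Hypothetical helper: starts at 1 and reaches `peak` at `rampup_seconds`.
    """
    if rampup_seconds <= 0 or elapsed_seconds >= rampup_seconds:
        return peak  # ramp disabled, or already finished
    fraction = max(elapsed_seconds, 0.0) / rampup_seconds
    # Interpolate between 1 and peak; floor is an assumed rounding rule.
    return max(1, min(peak, 1 + math.floor(fraction * (peak - 1))))


if __name__ == "__main__":
    for t in (0, 15, 30):
        print(t, ramp_limit(t, rampup_seconds=30, peak=32))
```

In this sketch a 429 is not modelled; per the description above, the first 429 would end the ramp and hand control to the usual AIMD adjustments.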

docs/concepts/architecture-and-performance.md

Lines changed: 4 additions & 1 deletion
@@ -111,6 +111,7 @@ concurrent_requests = min(

 `max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:

+- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
 - **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again; they release their permits and hold the limit steady.
 - **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.

@@ -119,7 +120,7 @@ This means Data Designer automatically finds the right concurrency level for you
 !!! note "Engine paths"
     AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.

-**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
+**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.

 ---

@@ -216,6 +217,7 @@ run_config = dd.RunConfig(
         success_window=25,  # Consecutive successes before increasing (default: 25)
         cooldown_seconds=2.0,  # Pause after a 429 when no Retry-After header (default: 2.0)
         ceiling_overshoot=0.10,  # Probe 10% above observed server limit (default: 0.10)
+        rampup_seconds=0.0,  # Optional startup ramp duration; 0 disables it (default: 0.0)
     ),
 )

@@ -230,6 +232,7 @@ designer.set_run_config(run_config)
 | `success_window` | 25 | Consecutive successes required before each increase step. |
 | `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
 | `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
+| `rampup_seconds` | 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. |

 !!! tip "How it works in practice"
     When a model endpoint returns HTTP 429, the controller reduces the concurrency limit for that model and pauses briefly. After enough consecutive successes, it begins ramping back up. If the server rate-limits again, the controller records that level as a ceiling and stabilizes just below it, with a small overshoot band to detect when the server can handle more load.
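
The drop from 32 to 24 to 18 quoted in this file follows directly from the default reduce factor of 0.75. A minimal sketch of the two AIMD adjustments, assuming a hypothetical `AimdLimit` class (not the library's controller) and assuming the reduced limit is truncated to a whole number of requests; cooldown and ceiling stabilization are left out:

```python
from dataclasses import dataclass


@dataclass
class AimdLimit:
    """Hypothetical stand-in for the controller's limit bookkeeping."""

    ceiling: int                 # plays the role of max_parallel_requests
    limit: int                   # current concurrency limit
    reduce_factor: float = 0.75
    additive_increase: int = 1
    success_window: int = 25
    successes: int = 0

    def on_rate_limited(self) -> None:
        # Multiplicative decrease; truncation is an assumption of this sketch.
        self.limit = max(1, int(self.limit * self.reduce_factor))
        self.successes = 0

    def on_success(self) -> None:
        # Additive increase after a full window of consecutive successes.
        self.successes += 1
        if self.successes >= self.success_window:
            self.limit = min(self.ceiling, self.limit + self.additive_increase)
            self.successes = 0


if __name__ == "__main__":
    ctl = AimdLimit(ceiling=32, limit=32)
    ctl.on_rate_limited()
    ctl.on_rate_limited()
    print(ctl.limit)
```

With the defaults, recovering one step of concurrency takes 25 consecutive successes, which is why the decrease is fast and the climb back is slow.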

docs/devnotes/posts/owning-the-model-stack.md

Lines changed: 11 additions & 2 deletions
@@ -64,12 +64,13 @@ What you actually want is a system that *discovers* the provider's capacity at r

 If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency:

+- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration.
 - **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth.
 - **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds.

 The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything* because every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively.

-The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes.
+The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes.

 <div style="text-align: center;" markdown>

@@ -79,6 +80,12 @@ The result is that the system converges on the provider's actual capacity withou

 This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle.

+### **Startup ramp**
+
+Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration.
+
+The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery.
+
 ### **Ceiling stabilization**

 Classic AIMD has a well-known problem, the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable.
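
The sawtooth described above can be reproduced with a toy simulation. Everything here is an illustrative assumption rather than the project's code: `classic_aimd_trace` is a hypothetical function, each step stands for one full success window, and the server is modelled as rejecting with a 429 whenever the limit exceeds a fixed `server_capacity`.

```python
def classic_aimd_trace(
    server_capacity: int,
    ceiling: int,
    steps: int,
    reduce_factor: float = 0.75,
) -> list[int]:
    """Return the concurrency limit at each step of plain AIMD (no ceiling memory)."""
    limit = ceiling
    trace = []
    for _ in range(steps):
        trace.append(limit)
        if limit > server_capacity:
            # Simulated 429: multiplicative decrease, truncated to an int.
            limit = max(1, int(limit * reduce_factor))
        else:
            # Simulated success window: additive increase by one.
            limit = min(ceiling, limit + 1)
    return trace


if __name__ == "__main__":
    print(classic_aimd_trace(server_capacity=20, ceiling=32, steps=10))
    # prints [32, 24, 18, 19, 20, 21, 15, 16, 17, 18]
```

In this toy model the limit keeps climbing past 20, gets cut, and climbs again, which is the repeating pattern that ceiling stabilization is meant to damp.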
@@ -147,6 +154,7 @@ data_designer.set_run_config(
             success_window=25,
             cooldown_seconds=2.0,
             ceiling_overshoot=0.10,
+            rampup_seconds=0.0,
         )
     )
 )
@@ -178,8 +186,9 @@ create_result = data_designer.create(
 | `success_window` | 25 | Consecutive successes before additive increase |
 | `cooldown_seconds` | 2.0 | Default cooldown when no `Retry-After` header |
 | `ceiling_overshoot` | 0.10 | How far above the observed ceiling to probe (10%) |
+| `rampup_seconds` | 0.0 | Optional startup ramp duration. `0.0` keeps the previous immediate-start behavior |

-In practice, the parameter most worth adjusting is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all.
+In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior.

 Most users will never need to touch any of these. The system adapts automatically.

fern/versions/v0.5.8/pages/concepts/architecture-and-performance.mdx

Lines changed: 4 additions & 1 deletion
@@ -115,6 +115,7 @@ concurrent_requests = min(

 `max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:

+- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
 - **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again; they release their permits and hold the limit steady.
 - **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.

@@ -124,7 +125,7 @@ This means Data Designer automatically finds the right concurrency level for you
 AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.
 </Note>

-**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
+**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.

 ---

@@ -264,6 +265,7 @@ run_config = dd.RunConfig(
         success_window=25,  # Consecutive successes before increasing (default: 25)
         cooldown_seconds=2.0,  # Pause after a 429 when no Retry-After header (default: 2.0)
         ceiling_overshoot=0.10,  # Probe 10% above observed server limit (default: 0.10)
+        rampup_seconds=0.0,  # Optional startup ramp duration; 0 disables it (default: 0.0)
     ),
 )

@@ -278,6 +280,7 @@ designer.set_run_config(run_config)
 | `success_window` | 25 | Consecutive successes required before each increase step. |
 | `cooldown_seconds` | 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
 | `ceiling_overshoot` | 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
+| `rampup_seconds` | 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. |

 <Tip>
 How it works in practice

fern/versions/v0.5.8/pages/devnotes/posts/owning-the-model-stack.mdx

Lines changed: 11 additions & 2 deletions
@@ -66,12 +66,13 @@ What you actually want is a system that *discovers* the provider's capacity at r

 If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency:

+- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration.
 - **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth.
 - **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds.

 The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything* because every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively.

-The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes.
+The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes.

 <div style="text-align: center;" markdown>

@@ -81,6 +82,12 @@ The result is that the system converges on the provider's actual capacity withou

 This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle.

+### **Startup ramp**
+
+Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration.
+
+The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery.
+
 ### **Ceiling stabilization**

 Classic AIMD has a well-known problem, the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable.
@@ -149,6 +156,7 @@ data_designer.set_run_config(
             success_window=25,
             cooldown_seconds=2.0,
             ceiling_overshoot=0.10,
+            rampup_seconds=0.0,
         )
     )
 )
@@ -180,8 +188,9 @@ create_result = data_designer.create(
 | `success_window` | 25 | Consecutive successes before additive increase |
 | `cooldown_seconds` | 2.0 | Default cooldown when no `Retry-After` header |
 | `ceiling_overshoot` | 0.10 | How far above the observed ceiling to probe (10%) |
+| `rampup_seconds` | 0.0 | Optional startup ramp duration. `0.0` keeps the previous immediate-start behavior |

-In practice, the parameter most worth adjusting is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all.
+In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior.

 Most users will never need to touch any of these. The system adapts automatically.

packages/data-designer-config/src/data_designer/config/run_config.py

Lines changed: 14 additions & 0 deletions
@@ -40,13 +40,19 @@ class ThrottleConfig(ConfigBase):
         ceiling_overshoot: Fraction above the observed rate-limit ceiling
             that additive increase is allowed to probe before capping.
             Default is 0.10 (10% overshoot).
+        rampup_seconds: Optional startup ramp duration. When greater than
+            zero, each throttle domain starts at one concurrent request and
+            linearly ramps to its configured peak over this many seconds.
+            A 429 aborts the startup ramp and switches to normal AIMD recovery.
+            Default is 0.0 (disabled).
     """

     DEFAULT_REDUCE_FACTOR: ClassVar[float] = 0.75
     DEFAULT_ADDITIVE_INCREASE: ClassVar[int] = 1
     DEFAULT_SUCCESS_WINDOW: ClassVar[int] = 25
     DEFAULT_COOLDOWN_SECONDS: ClassVar[float] = 2.0
     DEFAULT_CEILING_OVERSHOOT: ClassVar[float] = 0.10
+    DEFAULT_RAMPUP_SECONDS: ClassVar[float] = 0.0

     reduce_factor: float = Field(
         default=DEFAULT_REDUCE_FACTOR,
@@ -74,6 +80,14 @@ class ThrottleConfig(ConfigBase):
         ge=0.0,
         description="Fraction above the rate-limit ceiling that additive increase is allowed to probe.",
     )
+    rampup_seconds: float = Field(
+        default=DEFAULT_RAMPUP_SECONDS,
+        ge=0.0,
+        description=(
+            "Startup ramp duration in seconds. When greater than zero, each throttle domain starts at one "
+            "concurrent request and linearly ramps to the configured peak. A 429 aborts the startup ramp."
+        ),
+    )


 class RunConfig(ConfigBase):
