`ThrottledModelClient` wraps each API call in a context manager that acquires/releases throttle capacity and adjusts limits on success (additive increase) or rate-limit errors (multiplicative decrease).
+
When `rampup_seconds` is configured, `ThrottleManager` starts new domains at one concurrent request, climbs linearly toward the peak, and falls back to normal AIMD behavior on the first 429.
+
### ModelFacade
The primary interface for generators. Holds a `ModelConfig`, `ModelClient`, optional `MCPRegistry`, and `ModelUsageStats`.
docs/concepts/architecture-and-performance.md (+4 -1: 4 additions, 1 deletion)
@@ -111,6 +111,7 @@ concurrent_requests = min(
`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:
+
- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
- **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again; they release their permits and hold the limit steady.
- **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.
@@ -119,7 +120,7 @@ This means Data Designer automatically finds the right concurrency level for you
!!! note "Engine paths"
AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.
-
**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
+
**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, the startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.
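The numbers in this example can be reproduced with a small sketch. It assumes the ramp interpolates linearly from 1 to the ceiling and that both the ramp and each decrease round down to a whole request; the helpers `ramp_limit` and `decrease` are hypothetical names for illustration, not part of Data Designer.

```python
import math


def ramp_limit(elapsed: float, rampup_seconds: float, ceiling: int) -> int:
    """Concurrency limit during a linear startup ramp (assumed rounding: floor)."""
    if rampup_seconds <= 0 or elapsed >= rampup_seconds:
        return ceiling  # no ramp configured, or the ramp has finished
    fraction = max(elapsed, 0.0) / rampup_seconds
    return max(1, math.floor(1 + fraction * (ceiling - 1)))


def decrease(limit: int, reduce_factor: float = 0.75) -> int:
    """One multiplicative-decrease step, never dropping below 1."""
    return max(1, math.floor(limit * reduce_factor))


if __name__ == "__main__":
    # Limits at a few points along a 30-second ramp toward 32.
    print([ramp_limit(t, 30, 32) for t in (0, 10, 20, 30)])
    # Two consecutive decreases starting from the ceiling.
    print(decrease(32), decrease(decrease(32)))
```

Under these assumptions, two decreases from 32 give 24 and then 18, which matches the figures quoted in the example.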
---
@@ -216,6 +217,7 @@ run_config = dd.RunConfig(
success_window=25, # Consecutive successes before increasing (default: 25)
cooldown_seconds=2.0, # Pause after a 429 when no Retry-After header (default: 2.0)
ceiling_overshoot=0.10, # Probe 10% above observed server limit (default: 0.10)
|`success_window`| 25 | Consecutive successes required before each increase step. |
|`cooldown_seconds`| 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
|`ceiling_overshoot`| 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
+
|`rampup_seconds`| 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. |
!!! tip "How it works in practice"
When a model endpoint returns HTTP 429, the controller reduces the concurrency limit for that model and pauses briefly. After enough consecutive successes, it begins ramping back up. If the server rate-limits again, the controller records that level as a ceiling and stabilizes just below it, with a small overshoot band to detect when the server can handle more load.
docs/devnotes/posts/owning-the-model-stack.md (+11 -2: 11 additions, 2 deletions)
@@ -64,12 +64,13 @@ What you actually want is a system that *discovers* the provider's capacity at r
If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency:
+
- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration.
- **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth.
- **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds.
The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything*: every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively.
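The two rules above can be sketched as a small stateful controller. This is an illustrative approximation, not Data Designer's implementation: the class `AimdLimit` and its method names are hypothetical, rounding down on decrease is an assumption, and cooldowns, `Retry-After` handling, burst de-duplication, and ceiling stabilization are all omitted.

```python
import math
from dataclasses import dataclass, field


@dataclass
class AimdLimit:
    """Toy AIMD concurrency limit (hypothetical names, simplified behavior)."""

    ceiling: int                  # plays the role of max_parallel_requests
    success_window: int = 25      # consecutive successes per increase step
    reduce_factor: float = 0.75   # multiplicative decrease on a 429
    additive_increase: int = 1    # step size on recovery
    limit: int = field(init=False, default=0)
    streak: int = field(init=False, default=0)

    def __post_init__(self) -> None:
        # Without a startup ramp, the limit starts at the ceiling.
        self.limit = self.ceiling

    def on_success(self) -> None:
        # Additive increase: one step per full window of consecutive successes.
        self.streak += 1
        if self.streak >= self.success_window:
            self.limit = min(self.ceiling, self.limit + self.additive_increase)
            self.streak = 0

    def on_rate_limited(self) -> None:
        # Multiplicative decrease: cut the limit and reset the success streak.
        self.limit = max(1, math.floor(self.limit * self.reduce_factor))
        self.streak = 0
```

With the defaults, a single 429 at a limit of 32 removes 8 slots at once, while regaining them takes 8 windows of 25 consecutive successes, which is the asymmetry described above.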
-
The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes.
+
The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes.
<div style="text-align: center;" markdown>
@@ -79,6 +80,12 @@ The result is that the system converges on the provider's actual capacity withou
This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle.
+
### **Startup ramp**
+
+
Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration.
+
+
The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery.
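The "optimistic but interruptible" behavior can be pictured as a ramp object that either reports a time-based limit or steps aside. The sketch below is an assumption-laden illustration: `StartupRamp`, `limit_at`, and `abort` are hypothetical names, the caller supplies timestamps (for example from `time.monotonic()`), and returning `None` is just one way to signal that normal AIMD logic should take over.

```python
from typing import Optional


class StartupRamp:
    """Time-based ramp from 1 to `ceiling`; hypothetical, for illustration only."""

    def __init__(self, ceiling: int, rampup_seconds: float, started_at: float) -> None:
        self.ceiling = ceiling
        self.rampup_seconds = rampup_seconds
        self.started_at = started_at
        self.active = rampup_seconds > 0  # 0 disables the ramp entirely

    def limit_at(self, now: float) -> Optional[int]:
        """Return the ramp's limit, or None once AIMD should own the limit."""
        if not self.active:
            return None
        elapsed = now - self.started_at
        if elapsed >= self.rampup_seconds:
            self.active = False  # ramp completed without a 429: peak reached
            return self.ceiling
        fraction = max(elapsed, 0.0) / self.rampup_seconds
        return max(1, int(1 + fraction * (self.ceiling - 1)))

    def abort(self) -> None:
        """Call on the first 429 during the ramp; AIMD takes over from here."""
        self.active = False
```

A caller would check `limit_at(now)` before each request and, when it returns `None`, fall through to the AIMD-managed limit; a 429 handler would call `abort()` before applying the multiplicative decrease.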
+
### **Ceiling stabilization**
Classic AIMD has a well-known problem: the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable.
In practice, the parameter most worth adjusting is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all.
+
In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior.
Most users will never need to touch any of these. The system adapts automatically.
fern/versions/v0.5.8/pages/concepts/architecture-and-performance.mdx (+4 -1: 4 additions, 1 deletion)
@@ -115,6 +115,7 @@ concurrent_requests = min(
`max_parallel_requests` sets the **ceiling**. The actual limit (`current_throttle_limit`) is managed at runtime by an AIMD (Additive Increase / Multiplicative Decrease) controller that reacts to rate-limit signals from the inference server:
+
- **During optional startup ramp**: when `rampup_seconds` is greater than 0, a new throttle domain starts at one concurrent request and increases linearly toward `max_parallel_requests` over that duration.
- **On the first 429 in a burst**: the limit is reduced by a configurable factor (default: 25% reduction) and a cooldown is applied. Further 429s from already in-flight requests in the same burst do not reduce the limit again; they release their permits and hold the limit steady.
- **After consecutive successes**: the limit increases by 1 (by default) until it reaches the ceiling or a stabilized rate-limit threshold.
@@ -124,7 +125,7 @@ This means Data Designer automatically finds the right concurrency level for you
AIMD adaptive concurrency is fully active on the default **async engine**. The legacy **sync engine** is available for one transitional release via `DATA_DESIGNER_ASYNC_ENGINE=0`; on that path 429s are first retried at the HTTP transport layer and AIMD only engages as a fallback. See [Async engine](#async-engine) below.
</Note>
-
**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer starts sending up to 32 requests in parallel. If the server returns 429s, concurrency drops automatically (e.g., to 24, then 18) and recovers once the server catches up.
+
**Example**: With `buffer_size=100` and `max_parallel_requests=32`, Data Designer can send up to 32 requests in parallel. If `rampup_seconds=30`, it starts at one request and climbs linearly toward 32 over 30 seconds. If the server returns 429s, the startup ramp stops, concurrency drops automatically (e.g., to 24, then 18), and normal AIMD recovery takes over once the server catches up.
---
@@ -264,6 +265,7 @@ run_config = dd.RunConfig(
success_window=25, # Consecutive successes before increasing (default: 25)
cooldown_seconds=2.0, # Pause after a 429 when no Retry-After header (default: 2.0)
ceiling_overshoot=0.10, # Probe 10% above observed server limit (default: 0.10)
|`success_window`| 25 | Consecutive successes required before each increase step. |
|`cooldown_seconds`| 2.0 | Pause duration after a 429 (used when the server doesn't send `Retry-After`). |
|`ceiling_overshoot`| 0.10 | Fraction above the observed rate-limit ceiling the controller is allowed to probe. |
+
|`rampup_seconds`| 0.0 | Optional startup ramp duration. When greater than 0, domains start at one concurrent request and linearly climb to the configured ceiling unless a 429 aborts the ramp. |
fern/versions/v0.5.8/pages/devnotes/posts/owning-the-model-stack.mdx (+11 -2: 11 additions, 2 deletions)
@@ -66,12 +66,13 @@ What you actually want is a system that *discovers* the provider's capacity at r
If you've studied networking, this will sound familiar. [AIMD](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) is the algorithm behind TCP congestion control. We apply the same idea to LLM API concurrency:
+
- **During optional startup ramp**: if `rampup_seconds` is set, start a new route at one concurrent request and climb linearly toward `max_parallel_requests` over that duration.
- **On success**: after a window of consecutive successful requests (default: 25), increase the concurrency limit by 1. Slow, cautious growth.
- **On 429**: multiply the current limit by a reduce factor (default: 0.75, a 25% cut). Fast, decisive pullback. Then apply a cooldown using the provider's `Retry-After` header when available, or a default of 2 seconds.
The asymmetry is deliberate. You probe upward slowly because overshooting wastes requests. You pull back quickly because staying above the limit wastes *everything*: every request in the burst gets rejected. This is the same insight that makes TCP work: be optimistic cautiously, be pessimistic decisively.
-
The result is that the system converges on the provider's actual capacity without you setting it. It starts at your configured `max_parallel_requests`, discovers the real limit through 429 signals, and settles into a steady state that tracks the provider's capacity as it changes.
+
The result is that the system converges on the provider's actual capacity without you setting it. By default it starts at your configured `max_parallel_requests`; for cold inference servers, you can set `rampup_seconds` to ease in from 1 request to that configured peak. Either way, once a 429 arrives, the controller discovers the real limit through rate-limit signals and settles into a steady state that tracks the provider's capacity as it changes.
<div style="text-align: center;" markdown>
@@ -81,6 +82,12 @@ The result is that the system converges on the provider's actual capacity withou
This is especially useful when you're self-hosting your inference stack (running vLLM or NVIDIA NIM on your own hardware) as long as the serving framework returns 429s when it's at capacity. The capacity of a self-hosted endpoint depends on your GPU count, model size, quantization, batch settings, and whatever else is sharing the cluster. That capacity might change between runs, or even mid-run if other workloads spin up. If your serving layer signals overload with 429s, you don't need to figure any of that out. Point Data Designer at your endpoint, set `max_parallel_requests` to a generous upper bound, and the system self-adjusts to whatever your infrastructure can actually handle.
+
### **Startup ramp**
+
+
Some inference servers do not handle an immediate cold burst well, even when their steady-state capacity is high. For those endpoints, `ThrottleConfig(rampup_seconds=...)` enables a time-based startup ramp. Each throttle domain starts at one concurrent request and linearly climbs toward the configured `max_parallel_requests` ceiling over the ramp duration.
+
+
The ramp is optimistic but interruptible. If no 429s arrive, it reaches the configured peak. If a 429 arrives during the ramp, the ramp is aborted immediately and the domain switches to normal AIMD behavior: multiplicative decrease, cooldown, ceiling recording when the decrease reveals a higher failed limit, and additive recovery.
+
### **Ceiling stabilization**
Classic AIMD has a well-known problem: the sawtooth. After a 429 drops the limit, additive increase climbs all the way back to the configured max, hits another 429, drops again, and repeats. Every climb wastes requests, and the 429 bursts are predictable.
In practice, the parameter most worth adjusting is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all.
+
In practice, the parameter most worth adjusting after a 429 is `success_window`. A smaller window (say, 10) makes the system more aggressive about reclaiming throughput after a pullback, useful when you know the provider's capacity fluctuates quickly. A larger window (say, 50) makes it more conservative, better for providers with strict, stable rate limits where you'd rather not probe at all. For cold self-hosted endpoints, set `rampup_seconds` to ease into the first burst without changing steady-state AIMD behavior.
Most users will never need to touch any of these. The system adapts automatically.