README.md: 10 additions & 1 deletion
@@ -24,6 +24,7 @@ Plexus sits in front of your LLM providers and handles protocol translation, loa
- **Model aliasing & load balancing** — Define virtual model names backed by multiple real providers with `random`, `cost`, `performance`, `latency`, or `in_order` selectors
- **Vision fallthrough** — Automatically convert images to text descriptions for models that don't natively support vision, ensuring compatibility across all providers
- **Usage tracking** — Per-request cost, token counts, latency, and TPS metrics with a built-in dashboard
- **MCP proxy** — Proxy Model Context Protocol servers through Plexus with per-request session isolation
- **User quotas** — Per-API-key rate limiting by requests or tokens with rolling, daily, or weekly windows, along with cost restriction.
@@ -196,7 +197,15 @@ Limit how much each API key can consume using rolling, daily, or weekly windows:
When a provider fails, Plexus removes it from rotation using exponential backoff: 2 min → 4 min → 8 min → ... → 5 hr cap. Successful requests reset the counter. Enable **Disable Cooldown** on a provider to opt it out entirely.

-→ See [Configuration: cooldown](docs/CONFIGURATION.md#cooldown-optional)
+→ See [Configuration: cooldown](docs/CONFIGURATION.md#cooldowns)
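
The cooldown schedule described above can be sketched as a small function. This is only an illustration: the name `cooldown_minutes` and its parameters are hypothetical, and the assumption that the delay doubles on each consecutive failure until it hits the cap is inferred from the `2 min → 4 min → 8 min → ... → 5 hr cap` sequence, not taken from the Plexus source.

```python
def cooldown_minutes(consecutive_failures: int,
                     base_minutes: int = 2,
                     cap_minutes: int = 5 * 60) -> int:
    """Return the cooldown length after a number of consecutive failures.

    Hypothetical helper: assumes the delay doubles per failure and is capped.
    """
    if consecutive_failures < 1:
        return 0  # no failures recorded, so the provider stays in rotation
    return min(base_minutes * 2 ** (consecutive_failures - 1), cap_minutes)


# Cooldown lengths (in minutes) for the first nine consecutive failures.
schedule = [cooldown_minutes(n) for n in range(1, 10)]
print(schedule)
```

Under these assumptions the eighth consecutive failure yields 256 minutes and the ninth is the first to be clamped to the 300-minute cap. A successful request would reset `consecutive_failures` to zero.
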
- Client disconnects now cancel the upstream provider request, reducing wasted tokens/quota on abandoned streams.
- Global and per-provider upstream timeouts cut off requests that run too long.
- Optional stall detection can fail over slow-to-start providers before bytes reach the client, and can abort streams that become too slow mid-flight.

→ See [Configuration: Request Timeouts](docs/CONFIGURATION.md#request-timeouts) and [Configuration: Stall Detection](docs/CONFIGURATION.md#stall-detection)

docs/CONFIGURATION.md: 200 additions & 0 deletions
@@ -90,7 +90,9 @@ A **provider** represents an upstream AI service that Plexus routes requests to.
| **Enabled** | Whether this provider is active for routing | No (default: true) |
| **Headers** | Custom HTTP headers sent with every request | No |
| **Extra Body** | Additional fields merged into every request | No |
| **Upstream Timeout** | Per-provider request timeout override in milliseconds. If unset, the global timeout is used. | No |
| **Disable Cooldown** | Exclude from automatic cooldown on errors | No |
| **Stall Detection Overrides** | Optional per-provider overrides for TTFB/throughput stall detection. Empty = inherit global setting for that field. | No |
| **Adapters** | Request/response rewrite hooks applied to every model under this provider (see [Provider Adapters](#provider-adapters)) | No |

### Multi-Protocol Providers
@@ -185,6 +187,34 @@ Quota checkers monitor upstream provider rate limits and prevent routing to exha

Quota data is available via the Management API — see [API Reference: Quota Management](/docs/openapi/openapi.yaml#/paths/~1v0~1management~1quotas).

### Provider Timeout Overrides

Each provider can optionally set `timeoutMs` to override the global upstream timeout for requests routed to that provider.

- **Unset / omitted**: inherit the global timeout
- **Set to a number**: use that provider-specific timeout instead
- **Valid range**: any positive integer millisecond value in provider config; the Admin UI exposes this as **1–3600 seconds**

Use provider overrides when one backend is predictably slower or faster than the rest. For example, you might keep a **300s** global timeout but set a **30s** timeout on a fast inference endpoint so stuck requests fail over much sooner.
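
The override rule above amounts to a simple fallback. The sketch below is a minimal illustration, not Plexus's implementation: `effective_timeout_ms` and `GLOBAL_TIMEOUT_MS` are hypothetical names, and only `timeoutMs` and the 300s/30s figures come from the text.

```python
from typing import Optional

# The 300s global timeout from the example above, expressed in milliseconds.
GLOBAL_TIMEOUT_MS = 300_000


def effective_timeout_ms(provider_timeout_ms: Optional[int],
                         global_timeout_ms: int = GLOBAL_TIMEOUT_MS) -> int:
    """Pick the timeout for a request routed to one provider.

    Hypothetical helper: None stands for an unset / omitted `timeoutMs`.
    """
    if provider_timeout_ms is None:
        return global_timeout_ms  # inherit the global timeout
    if provider_timeout_ms <= 0:
        raise ValueError("timeoutMs must be a positive integer")
    return provider_timeout_ms  # provider-specific override wins
```

With these definitions, `effective_timeout_ms(None)` yields the global 300000 ms, while `effective_timeout_ms(30_000)` yields the 30s override for the fast endpoint.
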

### Provider Stall Detection Overrides

Each provider can optionally override any of the global stall detection settings with these fields:

- `stallGracePeriodMs` — per-provider grace period override

**Important inheritance rules:**

- If a field is **omitted**, the provider inherits the global setting for that field.
- For the nullable threshold fields (`stallTtfbMs`, `stallMinBps`), setting the value to **`null`** disables that stall dimension for the provider.
- For the non-null tuning fields (`stallTtfbBytes`, `stallWindowMs`, `stallGracePeriodMs`), `null`/empty in practice means “use the inherited global value”.

This lets you keep one global policy while tightening or relaxing stall protection for known outliers.
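
The three inheritance rules can be sketched as follows. This is an illustration under stated assumptions: `resolve_stall_settings` is a hypothetical helper, a missing dictionary key stands for an omitted field, `None` stands for `null`, and the global settings are assumed to be already expressed with the same field names and millisecond units as the provider fields (the actual global config uses second-based names).

```python
from typing import Dict, Optional

# Field groups taken from the inheritance rules above.
NULLABLE_THRESHOLDS = ("stallTtfbMs", "stallMinBps")
TUNING_FIELDS = ("stallTtfbBytes", "stallWindowMs", "stallGracePeriodMs")

Settings = Dict[str, Optional[int]]


def resolve_stall_settings(global_settings: Settings,
                           provider_overrides: Settings) -> Settings:
    """Merge provider overrides onto global stall settings (hypothetical)."""
    resolved = dict(global_settings)
    for field in NULLABLE_THRESHOLDS:
        # Present key: use it, and an explicit None disables that dimension.
        if field in provider_overrides:
            resolved[field] = provider_overrides[field]
    for field in TUNING_FIELDS:
        # None or a missing key both mean "keep the inherited global value".
        value = provider_overrides.get(field)
        if value is not None:
            resolved[field] = value
    return resolved


global_settings = {
    "stallTtfbMs": 15_000,
    "stallMinBps": 500,
    "stallTtfbBytes": 100,
    "stallWindowMs": 10_000,
    "stallGracePeriodMs": 30_000,
}
provider_overrides = {
    "stallMinBps": None,           # disable throughput checks for this provider
    "stallWindowMs": None,         # tuning field: null means inherit
    "stallGracePeriodMs": 60_000,  # relax the grace period
}
resolved = resolve_stall_settings(global_settings, provider_overrides)
```

In this example the provider keeps the global TTFB threshold, byte threshold and window, turns off throughput enforcement, and doubles the grace period.
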

---

## Model Aliases
@@ -435,6 +465,176 @@ See [API Reference: Cooldown Management](/docs/openapi/openapi.yaml#/paths/~1v0~

---

## Request Timeouts

Plexus can abort upstream requests that run too long instead of waiting forever.

Timeouts matter for both reliability and cost:

- they stop abandoned or hung upstream requests from continuing to burn quota,
- they allow the dispatcher to fail over to another provider when appropriate,
- and they put a hard cap on runaway or extremely slow streams.
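
The points above can be illustrated with a generic timeout-and-failover loop. This is a sketch of the general technique using Python's `asyncio`, not Plexus's dispatcher: `dispatch_with_timeout`, `slow_provider` and `fast_provider` are hypothetical names, and the providers are simulated coroutines.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, Tuple

ProviderCall = Callable[[], Awaitable[str]]


async def dispatch_with_timeout(
    providers: Sequence[Tuple[str, ProviderCall]],
    timeout_seconds: float,
) -> Tuple[str, str]:
    """Try providers in order, abandoning any attempt that exceeds the timeout."""
    for name, call in providers:
        try:
            # wait_for cancels the pending call when the timeout expires,
            # so a hung upstream request does not keep running.
            result = await asyncio.wait_for(call(), timeout=timeout_seconds)
            return name, result
        except asyncio.TimeoutError:
            continue  # fail over to the next provider
    raise RuntimeError("all providers timed out")


async def slow_provider() -> str:
    await asyncio.sleep(1.0)  # simulates a hung upstream
    return "slow response"


async def fast_provider() -> str:
    return "fast response"


def demo() -> Tuple[str, str]:
    """Run the dispatcher with a slow first provider and a fast fallback."""
    return asyncio.run(
        dispatch_with_timeout(
            [("slow", slow_provider), ("fast", fast_provider)],
            timeout_seconds=0.05,
        )
    )
```

Calling `demo()` abandons the slow provider after 50 ms and returns the fast provider's response instead.
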

### Default Behavior

- **Global default timeout**: `300` seconds
- **Per-provider override**: optional via `timeoutMs`
- `windowSeconds` — sliding window width for measuring throughput
- `gracePeriodSeconds` — delay after TTFB success before throughput enforcement starts

The grace period is especially important for reasoning-heavy models that may pause naturally after their first chunk.

If throughput drops below the configured floor, Plexus aborts the stream and records the request as `stall`.

**Important:** once bytes have already been sent to the client, Plexus cannot transparently fail over the same response stream. The client must retry.
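
A sliding-window throughput check of the kind described above might look like the sketch below. It is an assumption-heavy illustration: the `ThroughputMonitor` class is hypothetical, enforcement is assumed to begin exactly `gracePeriodSeconds` after TTFB success, and throughput is assumed to be the bytes received in the last `windowSeconds` divided by the window width. Timestamps are passed in explicitly so the logic is easy to follow.

```python
from collections import deque
from typing import Deque, Tuple


class ThroughputMonitor:
    """Hypothetical sliding-window throughput check for a single stream."""

    def __init__(self, min_bytes_per_second: float, window_seconds: float,
                 grace_period_seconds: float, ttfb_time: float) -> None:
        self.min_bps = min_bytes_per_second
        self.window = window_seconds
        # Throughput is not enforced until the grace period has elapsed.
        self.enforce_from = ttfb_time + grace_period_seconds
        self.chunks: Deque[Tuple[float, int]] = deque()

    def record(self, now: float, num_bytes: int) -> None:
        """Remember a chunk received at time `now` (seconds)."""
        self.chunks.append((now, num_bytes))

    def is_stalled(self, now: float) -> bool:
        """Return True if throughput over the window is below the floor."""
        if now < self.enforce_from:
            return False  # still inside the grace period
        cutoff = now - self.window
        while self.chunks and self.chunks[0][0] < cutoff:
            self.chunks.popleft()  # drop chunks that left the window
        bytes_in_window = sum(size for _, size in self.chunks)
        return bytes_in_window / self.window < self.min_bps
```

A caller would invoke `record` for every chunk and poll `is_stalled` periodically, aborting the stream when it returns `True`.
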

### Global Stall Settings

Global stall settings live in the Admin UI under **Config → Stall Detection** and in the management API.

| Field | Type | Default | Range | Meaning |
|------|------|---------|-------|---------|
| `ttfbSeconds` | integer or `null` | `null` | `5–120` or `null` | Max time to wait for the first meaningful bytes. `null` disables TTFB stall detection. |
| `ttfbBytes` | integer | `100` | `50–10000` | Byte threshold that must arrive within `ttfbSeconds` to count as “started”. |
| `windowSeconds` | integer | `10` | `3–30` | Sliding window width used to calculate throughput. |
| `gracePeriodSeconds` | integer | `30` | `0–120` | Delay after TTFB success before throughput enforcement starts. |
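
The ranges in the table lend themselves to a simple client-side check before sending an update. The following is a sketch only: `validate_stall_patch` is a hypothetical helper, it covers just the four fields whose ranges appear in the table above, and it says nothing about how Plexus itself validates input.

```python
from typing import Any, Dict, List

# Inclusive ranges copied from the table above.
RANGES = {
    "ttfbSeconds": (5, 120),
    "ttfbBytes": (50, 10000),
    "windowSeconds": (3, 30),
    "gracePeriodSeconds": (0, 120),
}
# Only this field is documented above as accepting null.
NULLABLE = {"ttfbSeconds"}


def validate_stall_patch(patch: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a partial settings update."""
    errors: List[str] = []
    for field, value in patch.items():
        if field not in RANGES:
            continue  # fields not listed in the table are not checked here
        if value is None:
            if field not in NULLABLE:
                errors.append(f"{field} must not be null")
            continue
        low, high = RANGES[field]
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            errors.append(f"{field} must be an integer in {low}..{high}")
    return errors
```

An empty list means every checked field is within its documented range.
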

### Effective Behavior and Inheritance

- Stall detection is **enabled** for a request if either `ttfbSeconds` or `minBytesPerSecond` is active after global + provider override resolution.
- Provider overrides take precedence over global settings.
- Per-provider overrides can enable stall detection even if the global stall config is disabled.
- Leaving provider fields empty keeps the global value for that field.
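
The enablement rule above can be expressed as a small predicate. This is a minimal sketch with hypothetical names (`INHERIT`, `effective`, `stall_detection_enabled`); it assumes a sentinel distinguishes "field left empty" from an explicit `null`, which is one possible way to model the rule rather than a description of Plexus internals.

```python
from typing import Optional, Union


class _Inherit:
    """Sentinel type meaning 'the provider left this field empty'."""


INHERIT = _Inherit()
Override = Union[int, None, _Inherit]


def effective(global_value: Optional[int],
              override: Override = INHERIT) -> Optional[int]:
    """Provider override wins unless the field was left empty."""
    if isinstance(override, _Inherit):
        return global_value
    return override


def stall_detection_enabled(ttfb_seconds: Optional[int],
                            min_bytes_per_second: Optional[int]) -> bool:
    """Enabled when at least one dimension is active after resolution."""
    return ttfb_seconds is not None or min_bytes_per_second is not None


# Global config fully disabled, but one provider sets a TTFB threshold.
enabled_by_provider = stall_detection_enabled(
    effective(None, 20),  # provider override activates TTFB detection
    effective(None),      # throughput floor stays disabled
)
```

Here `enabled_by_provider` is `True` even though both global values are `None`, matching the rule that a per-provider override can switch stall detection on by itself.
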

### Management API

- `GET /v0/management/config/stall` — returns the current global stall detection config
- `PATCH /v0/management/config/stall` — partial update of stall settings

Example:

```json
PATCH /v0/management/config/stall
{
  "ttfbSeconds": 15,
  "ttfbBytes": 100,
  "minBytesPerSecond": 500,
  "windowSeconds": 10,
  "gracePeriodSeconds": 30
}
```

Disable one dimension while keeping the other:

```json
PATCH /v0/management/config/stall
{
  "ttfbSeconds": null,
  "minBytesPerSecond": 400
}
```
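
For scripted use, the second request above could be built with Python's standard library as sketched below. The base URL is a placeholder assumption, `build_stall_patch` is a hypothetical helper, and any authentication the management API requires is deliberately left out because it is not described here.

```python
import json
import urllib.request

# Assumption: replace with the address of your own Plexus instance.
BASE_URL = "http://localhost:4000"


def build_stall_patch(settings: dict) -> urllib.request.Request:
    """Build (but do not send) a PATCH request for the stall config endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}/v0/management/config/stall",
        data=json.dumps(settings).encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


# Disable TTFB detection while keeping a throughput floor.
request = build_stall_patch({"ttfbSeconds": None, "minBytesPerSecond": 400})

# To send it against a running instance:
#   with urllib.request.urlopen(request) as response:
#       print(response.status, response.read().decode("utf-8"))
```

`json.dumps` serializes Python's `None` as JSON `null`, which is how the disabled dimension is expressed in the request body.
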

### Recommended Starting Points

- **Fail over slow starters only**: set `ttfbSeconds`, leave `minBytesPerSecond` as `null`
- **Protect long streams too**: set both `ttfbSeconds` and `minBytesPerSecond`
- **Reasoning-heavy models**: keep a longer `gracePeriodSeconds`
- **Bursty but healthy streams**: increase `windowSeconds` before raising `minBytesPerSecond`
+
### Relationship to Client Disconnects
633
+
634
+
Separate from timeout/stall settings, Plexus now also cancels upstream provider requests when the downstream client disconnects during streaming. This reduces wasted quota for abandoned requests even when neither timeouts nor stall detection are enabled.
635
+
636
+
---
637
+
438
638
## MCP Servers
439
639
440
640
Plexus proxies [Model Context Protocol](https://modelcontextprotocol.io) servers. Only HTTP streaming transport is supported.