These risks grow as ToolHive deployments move to shared, multi-user Kubernetes environments.
- **Adaptive / auto-tuning rate limits**: Automatically adjusting limits based on observed load or downstream health signals. Limits are static and administrator-configured.
- **Cost or billing integration**: Tracking usage for billing purposes. This is purely a protective mechanism.
- **Request queuing or throttling**: Requests that exceed the limit are rejected, not queued.
- **Rate limiting for `completion/complete`**: Completions can be high-frequency but are not a primary risk vector. Can be added in a follow-up if needed.
## Proposed Solution
Rate limiting is implemented as a new middleware in ToolHive's middleware chain.

The rate limit middleware runs after **auth** (user identity must be available for per-user limits) and after **mcp-parser** (the `ParsedMCPRequest` in context is needed to distinguish `tools/call` from `tools/list` and to resolve the operation name for per-operation limits).
**Request unit**: One token corresponds to one incoming `tools/call`, `prompts/get`, or `resources/read` invocation at the proxy. Lifecycle and discovery methods (`initialize`, `ping`, notifications, list methods) are not rate-limited as they don't perform substantive work. For `VirtualMCPServer`, rate limiting applies to incoming traffic only; outgoing calls to backends are not independently limited.
**Batch JSON-RPC**: MCP transports do not use JSON-RPC batching in practice, but an acceptance test should verify that batch requests cannot be used to bypass rate limits.
Rate limit counters are stored in Redis, reusing the existing session storage Redis connection. Redis-backed session storage is a prerequisite when rate limiting is enabled — this is validated at reconciliation time. This ensures accurate enforcement across multiple replicas in horizontally-scaled deployments.
### Configuration
```yaml
spec:
  rateLimiting:
    # Global limit: total requests across all users
    global:
      maxTokens: 1000
      refillPeriod: "1m"

    # Per-user limit: applied independently to each authenticated user
    perUser:
      maxTokens: 100
      refillPeriod: "1m"
```
**Validation**: Configuration errors are caught at **admission time** via CRD schema validation, giving immediate feedback on `kubectl apply` rather than requiring pod log inspection. Required validation rules:

- `perUser` rate limits require authentication to be enabled (anonymous inbound auth is invalid)
- At least one of `global` or `perUser` must be set when `rateLimiting` is present
- `maxTokens` and `refillPeriod` must meet minimum-value constraints

The operator sets a `RateLimitingConfigValid` status condition with reasons such as `RateLimitRedisNotConfigured` or `RateLimitPerUserRequiresAuth` to surface configuration issues at reconciliation time.
87
97
88
98
#### Per-Operation Limits
Individual tools, prompts, or resources can have their own limits that supplement the server-level limits:

```yaml
spec:
  rateLimiting:
    perUser:
      maxTokens: 100
      refillPeriod: "1m"

    tools:
      - name: "expensive_search"
        perUser:
          maxTokens: 10
          refillPeriod: "1m"
      - name: "shared_resource"
        global:
          maxTokens: 50
          refillPeriod: "1m"

    prompts:
      - name: "generate_report"
        perUser:
          maxTokens: 5
          refillPeriod: "1m"

    resources:
      - name: "large_dataset"
        global:
          maxTokens: 20
          refillPeriod: "1m"
```
When an operation-level limit is defined, it is enforced **in addition to** any server-level limits. A request must pass all applicable limits.
The `tools`, `prompts`, and `resources` lists all follow the same structure.
#### VirtualMCPServer
The same `rateLimiting` configuration is available on `VirtualMCPServer`. Server-level `perUser` limits are based on the user identity at ingress (e.g. the `sub` claim of an OIDC token) and are shared across all backends — they cap the user's total usage of the vMCP, not per-backend. Per-operation limits use post-aggregation tool names matching the configured `PrefixFormat` (default underscore separator, e.g. `backend_a_costly_tool`), so those are inherently scoped to a single backend.
**Optimizer interaction**: When the optimizer is enabled, clients call `call_tool`/`find_tool` meta-tools instead of backend tools directly. Per-tool rate limits extract the inner `tool_name` from `call_tool` arguments to enforce limits on the real backend tool, following the same pattern as the authz middleware ([stacklok/toolhive#4385](https://github.com/stacklok/toolhive/pull/4385)). This is tech debt — the cross-cutting interaction between middleware and optimizer naming is a known problem that [THV-0060](https://github.com/stacklok/toolhive-rfcs/pull/60) aims to resolve structurally.
```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
spec:
  rateLimiting:
    perUser:
      maxTokens: 200
      refillPeriod: "1m"

    tools:
      - name: "backend_a_costly_tool"
        perUser:
          maxTokens: 5
          refillPeriod: "1m"

    prompts:
      - name: "backend_b_heavy_prompt"
        global:
          maxTokens: 30
          refillPeriod: "1m"
```
### Algorithm: Token Bucket
Rate limits are enforced using a **token bucket** algorithm. The configuration maps directly onto it:
- `maxTokens` = bucket capacity (maximum tokens)
- `refillPeriod` = duration to fully refill the bucket from zero (e.g. `"1m"`, `"1h"`)
- Refill rate = `maxTokens / refillPeriod` tokens per second

Each token represents a single allowed request. The bucket starts full, refills at a steady rate, and each incoming request consumes one token. When the bucket is empty, requests are rejected.
Per-user limits work identically — each user gets their own bucket, keyed by identity. Redis keys follow a structure like:
- Global: `thv:rl:{server}:global`
- Global per-tool: `thv:rl:{server}:global:tool:{toolName}`
- Global per-prompt: `thv:rl:{server}:global:prompt:{promptName}`
- Global per-resource: `thv:rl:{server}:global:resource:{resourceName}`
The `{namespace}` and `{server}` components are derived from the CRD metadata at middleware initialization time, never from per-request input.
Each bucket is stored as a Redis hash with two fields: token count and last refill timestamp. Refill is lazy — there is no background process. When a request arrives, an atomic Lua script calculates how many tokens should have accumulated since the last access based on elapsed time, adds them (capped at `maxTokens`), and then attempts to consume one. The Lua script uses `redis.call('TIME')` for all timestamp calculations to avoid clock skew across replicas, and handles the key-does-not-exist case (fresh bucket at full capacity) within the same EVAL. This ensures no race conditions across replicas. Keys auto-expire after `2 * refillPeriod` (enough for a full refill cycle plus buffer), so no garbage collection is needed.
Storage is **O(1) per counter** (two fields per hash). For example, 500 users with 10 per-operation limits = 5,000 hashes — negligible for Redis.
**Burst behavior**: An idle user accumulates tokens up to the full bucket capacity (`maxTokens`). This means they can send a burst of `maxTokens` requests at once after a period of inactivity. This is intentional — it handles legitimate traffic spikes — but administrators should understand it when setting `maxTokens`.
### Redis Unavailability
If Redis is unreachable, the middleware **fails open** — all requests are allowed through. This ensures a Redis hiccup does not become a full MCP outage. The middleware logs state transitions (first failure and recovery) rather than every failed check to avoid log noise during sustained outages, and increments a `toolhive_rate_limit_redis_unavailable_total` metric so that operators can detect and alert on Redis health independently.
**Redis failover**: During Sentinel failover, the new primary has no rate limit state. All users receive a fresh burst allowance. This is expected behavior — bucket state is ephemeral and refills naturally.
### Observability
The rate limiting middleware exposes the following metric categories (specific names and labels to be finalized at implementation time, following the `toolhive_` prefix convention):
- **Decision counter**: Total rate limit decisions (allowed/rejected), broken down by scope and operation type
- **Redis error counter**: Redis failures by error type (timeout, connection refused, auth failure)
- **Fail-open counter**: Requests allowed through without rate limit enforcement during Redis outage
- **Check latency histogram**: Lua script round-trip latency (hot path for every rate-limited request)
**Tracing**: The middleware adds span attributes to the existing request span: `rate_limit.decision`, `rate_limit.rejected_by`, `rate_limit.fail_open`.
### Rejection Response Format
When a request is rate-limited, the middleware returns a **JSON-RPC 2.0 error response** (not a tool result with `isError: true` — the tool was never invoked):
```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "error": {
    "code": -32029,
    "message": "Rate limit exceeded",
    "data": {
      "retryAfterSeconds": 0.6,
      "limitType": "perUser"
    }
  }
}
```
- **Error code**: `-32029` (in the `-32000` to `-32099` range reserved for implementation-defined server errors).
- **`Retry-After`**: Placed in `error.data`, not HTTP headers, so it works across all transports (stdio, SSE, Streamable HTTP). The value is computed as `1 / refill_rate` from the most restrictive bucket that caused the rejection. This is a **best-effort lower bound**, not a guarantee — particularly for global limits, where other users may consume the next available token before this client retries.
- **HTTP 429**: For Streamable HTTP transport, the middleware additionally returns HTTP 429 with a `Retry-After` header as a supplementary transport-level signal.
## Security Considerations
| Threat | Likelihood | Impact | Mitigation |
|--------|-----------|--------|------------|
| Redis key injection via crafted operation names | Medium | High | Validate and sanitize operation names before constructing Redis keys. Reject names containing key-separator characters. |
| Redis as single point of compromise | Low | High | Reuse existing Redis security posture. Rate limiting shares the session storage Redis connection and follows the same security approach (authentication, network-level isolation). |
| Rate limit bypass via identity spoofing | Low | Medium | Rate limiting relies on upstream auth middleware. It is only as strong as the authentication layer. |
| Denial of service via Redis exhaustion | Low | Medium | Keys auto-expire when inactive. Storage is O(1) per counter, bounding memory usage. |
| Unauthenticated DoS | Medium | High | Rate limiting sits after the auth middleware, so unauthenticated requests are rejected before reaching rate limit evaluation. Global limits only count authenticated traffic. |
### Scope and Limitations
Rate limiting operates at the MCP method level. Transport-level attacks (connection exhaustion, slowloris), Redis availability, and IDP identity consistency (e.g. multiple `sub` claims for one human) are outside the scope of rate limiting and addressed by other layers.
### Data Security
No sensitive data is stored in Redis — only token counts and last-refill timestamps. User IDs appear in Redis key names but carry no additional PII.
The original proposal used a sliding window log algorithm, which tracks each request individually. Token bucket was chosen instead:
- **Simpler storage**: Token bucket requires a single Redis hash (two fields) per counter, compared to a sorted set with one entry per request for sliding window — significantly less memory under high throughput.
- **Atomic operations**: The token bucket check-and-decrement is a single Lua script operating on two fields, reducing Redis round-trip complexity.
## Plan
The initial implementation targets **`MCPServer`** and **`VirtualMCPServer`** as the priority. The following are deferred for future work:
- **`MCPRemoteProxy` support**: `MCPRemoteProxy` shares the same `MiddlewareFactory` registry and transparent proxy chain as `MCPServer`, so integrating rate limiting is straightforward. Deferred until demand warrants it.
- **In-memory rate limiting for CLI / single-replica deployments**: The primary motivation for rate limiting is multi-user K8s environments where noisy-neighbor and exfiltration risks are most acute. CLI mode is single-user/single-process where rate limiting adds less value. Could be added using the same dual-mode pattern as session storage (memory vs Redis).
### External Rate Limiting (Envoy/Istio)
282
+
283
+
Infrastructure-level rate limiting operates on HTTP request counts, not MCP method semantics. It cannot distinguish `tools/call` from `tools/list`, cannot enforce per-tool limits, and requires infrastructure-level configuration that operators may not control. Application-level rate limiting is needed for MCP-aware enforcement.
284
+
285
+
### Webhook-Based Rate Limiting (THV-0017)
286
+
287
+
The [dynamic webhook middleware](./THV-0017-dynamic-webhook-middleware.md) allows external services to participate in the request pipeline, and rate limiting is listed as a use case. However, adding a network round-trip to an external rate limiter on every request adds significant latency to the hot path. Rate limiting is a core operational concern (like auth and audit) that belongs in the middleware chain. THV-0017 webhooks remain useful for custom rate limiting logic beyond what this RFC provides.
288
+
289
+
### Fixed Window Counters
290
+
291
+
Simpler than token bucket but suffers from the boundary burst problem — a user can send 2x the limit by timing requests at the edge of two adjacent windows. Token bucket provides smoother rate enforcement with natural burst handling.
## References
- [THV-0017: Dynamic Webhook Middleware](./THV-0017-dynamic-webhook-middleware.md) — mentions rate limiting as an external webhook use case