Commit b664fb4

jerm-dro and claude authored

RFC: Rate limiting for MCP servers (#57)

* RFC: Rate limiting for MCP servers
* Rename RFC to THV-0057 to match PR number
* Fix author GitHub handle to @jerm-dro
* RFC updates: clarify summary, add agent exfiltration risk, wording fixes
* Address review feedback: token bucket algorithm, scope, terminology
  - Adopt token bucket algorithm per Ozz's review (replaces vague windowing section)
  - Document burst behavior, Redis key structure, and precise Retry-After
  - Clarify scope is Kubernetes-based deployments (Derek's question)
  - Replace "operators" with "administrators" throughout (Derek's question)
* Clarify scope formatting, remove vMCP non-goal, document lazy refill
* Address review feedback from yrobla and JAORMX
  - Define request unit (tools/call, prompts/get, resources/read only)
  - Rename requestsPerWindow/windowSeconds to capacity/refillPeriodSeconds
  - Clarify vMCP perUser limits are per ingress identity across all backends
  - Add missing global per-operation and per-user per-operation Redis key patterns
  - Document Retry-After as best-effort lower bound
  - Resolve Redis unavailability: fail open with structured log + metric
  - Add Security Considerations section with threat model
  - Add Alternatives Considered section (sliding window vs token bucket)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent c822120 commit b664fb4

File tree

1 file changed: +246 -0 lines changed

rfcs/THV-0057-rate-limiting.md

Lines changed: 246 additions & 0 deletions
# RFC-0057: Rate Limiting for MCP Servers

- **Status**: Draft
- **Author(s)**: Jeremy Drouillard (@jerm-dro)
- **Created**: 2026-03-18
- **Last Updated**: 2026-03-19
- **Target Repository**: toolhive
- **Related Issues**: None

## Summary

Enable rate limiting for `MCPServer` and `VirtualMCPServer`, supporting per-user and global limits at both the server and individual operation level. Rate limits are configured declaratively on the resource spec and enforced by a new middleware in the middleware chain, with Redis as the shared counter backend.

## Problem Statement

ToolHive currently has no mechanism to limit the rate of requests flowing through its proxy layer. This creates several risks for administrators:

1. **Noisy-neighbor problem**: A single authenticated user can consume unbounded resources, degrading the experience for all other users of a shared MCP server.
2. **Downstream overload**: Aggregate traffic spikes — even when no single user is misbehaving — can overwhelm the upstream MCP server or the external services it depends on (APIs with their own rate limits, databases, etc.).
3. **Agent data exfiltration**: AI agents can invoke tools in tight loops to export large volumes of data. Without per-tool or per-user limits, there is no mechanism to cap this behavior.

These risks grow as ToolHive deployments move to shared, multi-user Kubernetes environments. Without rate limiting, cluster administrators have no knob to turn between "fully open" and "take the server offline."

**Scope**: This RFC targets Kubernetes-based deployments of ToolHive.

## Goals

- Allow administrators to configure **per-user** rate limits so that no single user can monopolize a server.
- Allow administrators to configure **global** rate limits so that total throughput stays within safe bounds for downstream services.
- Allow administrators to configure rate limits **per tool**, **per prompt**, or **per resource**, so that expensive or externally-constrained operations can have tighter limits than the server default.
- Provide a consistent configuration experience across `MCPServer` and `VirtualMCPServer` resources.
- Enforce rate limits in the existing proxy middleware chain with minimal latency overhead.
- Support correct enforcement across multiple replicas using Redis as the shared counter backend.

## Non-Goals

- **Adaptive / auto-tuning rate limits**: Automatically adjusting limits based on observed load or downstream health signals. Limits are static and administrator-configured.
- **Cost or billing integration**: Tracking usage for billing purposes. This is purely a protective mechanism.
- **Request queuing or throttling**: Requests that exceed the limit are rejected, not queued.

## Proposed Solution

### High-Level Design

Rate limiting is implemented as a new middleware in ToolHive's middleware chain. When a request arrives, the middleware checks the applicable limits (global, per-user, per-operation) and either allows the request to proceed or returns an appropriate error response.

```mermaid
flowchart LR
    Client -->|request| Auth
    Auth -->|identified user| RateLimit
    RateLimit -->|allowed| Downstream[Remaining Middleware + MCP Server]
    RateLimit -->|rejected| Error[Error Response]
```

The rate limit middleware sits after authentication (so user identity is available) and before the rest of the middleware chain.

**Request unit**: One token corresponds to one incoming `tools/call`, `prompts/get`, or `resources/read` invocation at the proxy. Other MCP methods (`tools/list`, `prompts/list`, `resources/list`, etc.) are not rate-limited — the motivation is to cap excessive use of operations that do real work, not to restrict discovery or listing. For `VirtualMCPServer`, rate limiting applies to incoming traffic only; outgoing calls to backends are not independently limited.
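
The request-unit rule above can be sketched as a small predicate. The method strings come from the MCP protocol; the helper name is hypothetical, not part of the proposal:

```go
package main

import "fmt"

// isRateLimited reports whether an incoming MCP method consumes a rate
// limit token. Only operations that do real work are counted; discovery
// and listing methods pass through uncounted. (Illustrative helper.)
func isRateLimited(method string) bool {
	switch method {
	case "tools/call", "prompts/get", "resources/read":
		return true
	default:
		return false
	}
}

func main() {
	for _, m := range []string{"tools/call", "tools/list", "resources/read"} {
		fmt.Printf("%-15s rate-limited: %v\n", m, isRateLimited(m))
	}
}
```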

Rate limit counters are stored in Redis, which ToolHive already depends on. This ensures accurate enforcement across multiple replicas in horizontally-scaled deployments.

### Configuration

Rate limits are configured via a `rateLimiting` field on the server spec. The same structure applies to both `MCPServer` and `VirtualMCPServer`.

#### Server-Level Limits

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: my-server
spec:
  # ... existing fields ...
  rateLimiting:
    # Global limit: total requests across all users
    global:
      capacity: 1000
      refillPeriodSeconds: 60

    # Per-user limit: applied independently to each authenticated user
    perUser:
      capacity: 100
      refillPeriodSeconds: 60
```

**Validation**: Per-user rate limits require authentication to be enabled. If `perUser` limits are configured with anonymous inbound auth, the server will raise an error at startup.

#### Per-Operation Limits

Individual tools, prompts, or resources can have their own limits that supplement the server-level defaults. Per-operation limits can be either global or per-user:

```yaml
spec:
  rateLimiting:
    perUser:
      capacity: 100
      refillPeriodSeconds: 60

    tools:
      - name: "expensive_search"
        perUser:
          capacity: 10
          refillPeriodSeconds: 60
      - name: "shared_resource"
        global:
          capacity: 50
          refillPeriodSeconds: 60

    prompts:
      - name: "generate_report"
        perUser:
          capacity: 5
          refillPeriodSeconds: 60

    resources:
      - name: "large_dataset"
        global:
          capacity: 20
          refillPeriodSeconds: 60
```

When an operation-level limit is defined, it is enforced **in addition to** any server-level limits. A request must pass all applicable limits.

The `tools`, `prompts`, and `resources` lists all follow the same structure — each entry specifies an operation name and either a `global` or `perUser` limit (or both).
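
The "must pass all applicable limits" rule can be illustrated with a minimal in-memory sketch. The `bucket` type, the helper names, and the refund-on-rejection policy are all illustrative assumptions; the RFC does not prescribe how partial token consumption is unwound:

```go
package main

import "fmt"

// bucket is a minimal in-memory stand-in for one rate-limit counter.
// The real design keeps these counters in Redis.
type bucket struct {
	name   string
	tokens int
}

// admit consumes one token from every applicable bucket. If any bucket
// is empty, previously taken tokens are refunded so a rejected request
// does not partially drain the other limits (one possible policy).
func admit(buckets []*bucket) (ok bool, limitHit string) {
	var taken []*bucket
	for _, b := range buckets {
		if b.tokens == 0 {
			for _, t := range taken {
				t.tokens++ // refund
			}
			return false, b.name
		}
		b.tokens--
		taken = append(taken, b)
	}
	return true, ""
}

func main() {
	serverPerUser := &bucket{name: "perUser", tokens: 100}
	toolPerUser := &bucket{name: "tool:expensive_search perUser", tokens: 0}
	ok, hit := admit([]*bucket{serverPerUser, toolPerUser})
	// Rejected by the exhausted tool limit; the server bucket is refunded.
	fmt.Println(ok, hit, serverPerUser.tokens)
}
```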

#### VirtualMCPServer

The same `rateLimiting` configuration is available on `VirtualMCPServer`. Server-level `perUser` limits are based on the user identity at ingress (e.g. the `sub` claim of an OIDC token) and are shared across all backends — they cap the user's total usage of the vMCP, not per-backend. Per-operation limits target specific backend operations by name (e.g. `backend_a/costly_tool`), so those are inherently scoped to a single backend.

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
spec:
  config:
    rateLimiting:
      perUser:
        capacity: 200
        refillPeriodSeconds: 60
      tools:
        - name: "backend_a/costly_tool"
          perUser:
            capacity: 5
            refillPeriodSeconds: 60

      prompts:
        - name: "backend_b/heavy_prompt"
          global:
            capacity: 30
            refillPeriodSeconds: 60
```

### Algorithm: Token Bucket

Rate limits are enforced using a **token bucket** algorithm. The configuration maps directly onto it:

- `capacity` = bucket capacity (maximum tokens)
- `refillPeriodSeconds` = time in seconds to fully refill the bucket from zero
- Refill rate = `capacity / refillPeriodSeconds` tokens per second

Each token represents a single allowed request. The bucket starts full, refills at a steady rate, and each incoming request consumes one token. When the bucket is empty, requests are rejected.

Per-user limits work identically — each user gets their own bucket, keyed by identity. Redis keys follow a structure like:

- Global: `thv:rl:{server}:global`
- Global per-tool: `thv:rl:{server}:global:tool:{toolName}`
- Global per-prompt: `thv:rl:{server}:global:prompt:{promptName}`
- Global per-resource: `thv:rl:{server}:global:resource:{resourceName}`
- Per-user: `thv:rl:{server}:user:{userId}`
- Per-user per-tool: `thv:rl:{server}:user:{userId}:tool:{toolName}`
- Per-user per-prompt: `thv:rl:{server}:user:{userId}:prompt:{promptName}`
- Per-user per-resource: `thv:rl:{server}:user:{userId}:resource:{resourceName}`
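
A hypothetical key builder illustrating the patterns above (the helper name and argument shapes are assumptions for illustration, not part of the proposal):

```go
package main

import "fmt"

// rlKey builds one rate-limit bucket key following the patterns above.
// scope is "global" or "user:<id>"; kind ("tool", "prompt", "resource")
// and name are empty for server-level buckets. Operation names must be
// sanitized before reaching this point (see Security Considerations).
func rlKey(server, scope, kind, name string) string {
	key := fmt.Sprintf("thv:rl:%s:%s", server, scope)
	if kind != "" {
		key = fmt.Sprintf("%s:%s:%s", key, kind, name)
	}
	return key
}

func main() {
	fmt.Println(rlKey("my-server", "global", "", ""))
	fmt.Println(rlKey("my-server", "user:alice", "tool", "expensive_search"))
}
```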
174+
175+
Each bucket is stored as a Redis hash with two fields: token count and last refill timestamp. Refill is lazy — there is no background process. When a request arrives, an atomic Lua script calculates how many tokens should have accumulated since the last access based on elapsed time, adds them (capped at capacity), and then attempts to consume one. This ensures no race conditions across replicas. Keys auto-expire when inactive, so no garbage collection is needed.

Storage is **O(1) per counter** (two fields per hash). For example, 500 users with 10 per-operation limits = 5,000 hashes — negligible for Redis.

**Burst behavior**: An idle user accumulates tokens up to the bucket capacity. This means they can send a full burst of `capacity` requests at once after a period of inactivity. This is intentional — it handles legitimate traffic spikes — but administrators should understand it when setting capacity.

### Redis Unavailability

If Redis is unreachable, the middleware **fails open** — all requests are allowed through. This ensures a Redis hiccup does not become a full MCP outage. When this occurs, the middleware emits a structured log entry and increments a `rate_limit_redis_unavailable` metric so that administrators can detect and alert on Redis health independently.

### Rejection Behavior

When a request is rate-limited, the middleware returns an MCP-appropriate error response. The middleware includes a `Retry-After` value computed as `1 / refill_rate` from the most restrictive bucket that caused the rejection. This value is a **best-effort lower bound**, not a guarantee — particularly for global limits, where other users may consume the next available token before this client retries.
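
As a sketch, the `Retry-After` hint for a single bucket might be computed as below. The function name and the choice to round up to whole seconds (as the HTTP header carries an integer) are assumptions:

```go
package main

import (
	"fmt"
	"math"
)

// retryAfterSeconds computes the best-effort Retry-After hint: the time
// to refill one token in the bucket that rejected the request, rounded
// up to a whole second. This is a lower bound, not a guarantee.
func retryAfterSeconds(capacity, refillPeriodSeconds float64) int {
	secondsPerToken := refillPeriodSeconds / capacity // = 1 / refill rate
	return int(math.Ceil(secondsPerToken))
}

func main() {
	// capacity 100 per 60s: one token every 0.6s, advertised as 1s
	fmt.Println(retryAfterSeconds(100, 60))
	// capacity 5 per 60s: one token every 12s
	fmt.Println(retryAfterSeconds(5, 60))
}
```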

## Security Considerations

### Threat Model

| Threat | Likelihood | Impact | Mitigation |
|--------|-----------|--------|------------|
| Redis key injection via crafted operation names | Medium | High | Validate and sanitize operation names before constructing Redis keys. Reject names containing key-separator characters. |
| Redis as single point of compromise | Low | High | Require Redis authentication and TLS. Reuse the existing Redis security posture established in [THV-0035](./THV-0035-auth-server-redis-storage.md). |
| Rate limit bypass via identity spoofing | Low | Medium | Rate limiting relies on upstream auth middleware. It is only as strong as the authentication layer. |
| Denial of service via Redis exhaustion | Low | Medium | Keys auto-expire when inactive. Storage is O(1) per counter, bounding memory usage. |
| Unauthenticated DoS | Medium | High | Rate limiting sits after the auth middleware, so unauthenticated requests are rejected before reaching rate limit evaluation. Global limits only count authenticated traffic. |

### Data Security

No sensitive data is stored in Redis — only token counts and last-refill timestamps. User IDs appear in Redis key names but carry no additional PII.

### Input Validation

Operation names used to construct Redis keys must be validated and sanitized to prevent key injection. Names should be checked against a strict allowlist of characters (alphanumeric, hyphens, underscores, forward slashes for vMCP backend-prefixed names).
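
One possible allowlist check, assuming a regular-expression implementation; the exact pattern is illustrative, not normative:

```go
package main

import (
	"fmt"
	"regexp"
)

// validOpName enforces the character allowlist described above:
// alphanumerics, hyphens, underscores, and forward slashes (the latter
// for vMCP backend-prefixed names). Colons are rejected because ':' is
// the Redis key separator.
var validOpName = regexp.MustCompile(`^[A-Za-z0-9_/-]+$`)

func main() {
	fmt.Println(validOpName.MatchString("backend_a/costly_tool")) // allowed
	fmt.Println(validOpName.MatchString("evil:tool"))             // rejected: contains key separator
}
```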

### Audit and Logging

Rate-limit rejections are logged with user identity, operation name, and which limit was hit (global, per-user, per-operation). Token counts and refill timestamps are not included in log output.

## Alternatives Considered

### Sliding Window Log

The original proposal used a sliding window log algorithm, which tracks each request timestamp in a sorted set and counts entries within the window. This was replaced with a **token bucket** for the following reasons:

- **Burst handling**: Token bucket naturally allows legitimate bursts after idle periods, while sliding window enforces a strict ceiling regardless of usage pattern.
- **Simpler storage**: Token bucket requires a single Redis hash (two fields) per counter, compared to a sorted set with one entry per request for sliding window — significantly less memory under high throughput.
- **Atomic operations**: The token bucket check-and-decrement is a single Lua script operating on two fields, reducing Redis round-trip complexity.

## References

- [THV-0017: Dynamic Webhook Middleware](./THV-0017-dynamic-webhook-middleware.md) — mentions rate limiting as an external webhook use case
- [THV-0047: Horizontal Scaling for vMCP and Proxy Runner](./THV-0047-vmcp-proxyrunner-horizontal-scaling.md) — relevant to distributed rate limiting concerns
- [THV-0035: Auth Server Redis Storage](./THV-0035-auth-server-redis-storage.md) — existing Redis dependency
- [IETF RFC 6585](https://tools.ietf.org/html/rfc6585) — HTTP 429 Too Many Requests status code

---

## RFC Lifecycle

<!-- This section is maintained by RFC reviewers -->

### Review History

| Date | Reviewer | Decision | Notes |
|------|----------|----------|-------|
| 2026-03-18 | @jerm-dro | Draft | Initial submission |

### Implementation Tracking

| Repository | PR | Status |
|------------|-----|--------|
| toolhive | TBD | Not started |
