Merged — new file: `rfcs/THV-0057-rate-limiting.md` (246 additions, 0 deletions)
# RFC-0057: Rate Limiting for MCP Servers

- **Status**: Draft
- **Author(s)**: Jeremy Drouillard (@jerm-dro)
- **Created**: 2026-03-18
- **Last Updated**: 2026-03-19
- **Target Repository**: toolhive
- **Related Issues**: None

## Summary

Enable rate limiting for `MCPServer` and `VirtualMCPServer`, supporting per-user and global limits at both the server and individual operation level. Rate limits are configured declaratively on the resource spec and enforced by a new middleware in the middleware chain, with Redis as the shared counter backend.

## Problem Statement

ToolHive currently has no mechanism to limit the rate of requests flowing through its proxy layer. This creates several risks for administrators:

1. **Noisy-neighbor problem**: A single authenticated user can consume unbounded resources, degrading the experience for all other users of a shared MCP server.
2. **Downstream overload**: Aggregate traffic spikes — even when no single user is misbehaving — can overwhelm the upstream MCP server or the external services it depends on (APIs with their own rate limits, databases, etc.).
3. **Agent data exfiltration**: AI agents can invoke tools in tight loops to export large volumes of data. Without per-tool or per-user limits, there is no mechanism to cap this behavior.

These risks grow as ToolHive deployments move to shared, multi-user Kubernetes environments. Without rate limiting, cluster administrators have no knob to turn between "fully open" and "take the server offline."

**Scope**: This RFC targets Kubernetes-based deployments of ToolHive.

## Goals

**Review comment (Contributor):** *MCPRemoteProxy Not Addressed*

[Kubernetes Expert] MCPRemoteProxy also proxies MCP traffic through the middleware chain and would benefit from the same rate limiting. The RFC should either explicitly include it or document why it's excluded.

Also — should in-memory rate limiting be available for non-K8s / single-replica deployments? The session storage system already has this dual-mode pattern (memory vs Redis). Scoping exclusively to K8s means CLI-mode users get no rate limiting at all.

**Author reply:**

Good catch on MCPRemoteProxy — it shares the same MiddlewareFactory registry and transparent proxy chain as MCPServer, so integrating it should be straightforward. That said, the priorities for this RFC are VirtualMCPServer and MCPServer. I've added a Plan section noting that MCPRemoteProxy support is deferred but can be integrated as a follow-up since the middleware architecture is the same.

On in-memory rate limiting for CLI mode — same answer: deferred. The primary motivation is multi-user K8s environments where noisy-neighbor and exfiltration risks are most acute.

- Allow administrators to configure **per-user** rate limits so that no single user can monopolize a server.
- Allow administrators to configure **global** rate limits so that total throughput stays within safe bounds for downstream services.
- Allow administrators to configure rate limits **per tool**, **per prompt**, or **per resource**, so that expensive or externally-constrained operations can have tighter limits than the server default.
- Provide a consistent configuration experience across `MCPServer` and `VirtualMCPServer` resources.
- Enforce rate limits in the existing proxy middleware chain with minimal latency overhead.
- Support correct enforcement across multiple replicas using Redis as the shared counter backend.

## Non-Goals

- **Adaptive / auto-tuning rate limits**: Automatically adjusting limits based on observed load or downstream health signals. Limits are static and administrator-configured.
- **Cost or billing integration**: Tracking usage for billing purposes. This is purely a protective mechanism.
- **Request queuing or throttling**: Requests that exceed the limit are rejected, not queued.

## Proposed Solution

### High-Level Design

Rate limiting is implemented as a new middleware in ToolHive's middleware chain. When a request arrives, the middleware checks the applicable limits (global, per-user, per-operation) and either allows the request to proceed or returns an appropriate error response.

```mermaid
flowchart LR
Client -->|request| Auth
Auth -->|identified user| RateLimit
RateLimit -->|allowed| Downstream[Remaining Middleware + MCP Server]
RateLimit -->|rejected| Error[Error Response]
```

The rate limit middleware sits after authentication (so user identity is available) and before the rest of the middleware chain.

**Request unit**: One token corresponds to one incoming `tools/call`, `prompts/get`, or `resources/read` invocation at the proxy. Other MCP methods (`tools/list`, `prompts/list`, `resources/list`, etc.) are not rate-limited — the motivation is to cap excessive use of operations that do real work, not to restrict discovery or listing. For `VirtualMCPServer`, rate limiting applies to incoming traffic only; outgoing calls to backends are not independently limited.
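The request-unit rule reduces to a method classifier — a hypothetical helper for illustration (`isRateLimited` is not an actual ToolHive function):

```go
package main

import "fmt"

// isRateLimited reports whether an MCP method consumes a rate-limit token
// under this RFC's request-unit definition: only tools/call, prompts/get,
// and resources/read do substantive work; discovery and lifecycle methods
// pass through unmetered.
func isRateLimited(method string) bool {
	switch method {
	case "tools/call", "prompts/get", "resources/read":
		return true
	default: // tools/list, prompts/list, resources/list, initialize, ping, ...
		return false
	}
}

func main() {
	fmt.Println(isRateLimited("tools/call")) // true
	fmt.Println(isRateLimited("tools/list")) // false
}
```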

Rate limit counters are stored in Redis, which ToolHive already depends on. This ensures accurate enforcement across multiple replicas in horizontally-scaled deployments.

**Review comment (Contributor):** *Dual Middleware Chain Integration Undefined*

[ToolHive Expert] MCPServer and VirtualMCPServer have structurally different middleware chains:

- **MCPServer**: uses the MiddlewareFactory registry in pkg/runner/middleware.go (PopulateMiddlewareConfigs(), GetSupportedMiddlewareFactories()). The chain is applied in reverse in pkg/transport/proxy/transparent/transparent_proxy.go.
- **VirtualMCPServer**: manually wraps http.Handler in pkg/vmcp/server/server.go's Handler() method. It does NOT use the MiddlewareFactory pattern.

The RFC must describe integration into both chains.

Also, the rate limiter needs access to the parsed MCP request (to know tools/call vs tools/list). The mcp-parser middleware stores ParsedMCPRequest in context, so it must run before the rate limiter. This ordering dependency should be explicit.

A full middleware chain diagram (like THV-0017 provides) showing exactly where rate limiting sits relative to auth, mcp-parser, tool-filter, webhooks, telemetry, authz, and audit would be very helpful here.

**Author reply:**

Good points on the ordering dependency — the rate limiter depends on the parsed MCP request from mcp-parser to distinguish tools/call from tools/list, and on auth for user identity. I've updated the RFC to make this ordering constraint explicit and updated the diagram to show Auth → MCP Parser → RateLimit → [remaining chain].

The specific integration mechanics for each middleware chain (MiddlewareFactory vs manual handler wrapping) are implementation details — the architectural constraint is that rate limiting must run after both auth and mcp-parser regardless of chain structure, and I've captured that.

### Configuration

Rate limits are configured via a `rateLimiting` field on the server spec. The same structure applies to both `MCPServer` and `VirtualMCPServer`.

#### Server-Level Limits
**Review comment (Contributor):** *Method Coverage Gaps*

[MCP Protocol Expert] The selected methods (tools/call, prompts/get, resources/read) are correct — the MCP spec's own Security Considerations explicitly states "Servers MUST rate limit tool invocations." Good.

However, two issues:

1. completion/complete is missing. The MCP spec says "Servers SHOULD rate limit completion requests" and "Clients SHOULD debounce rapid completion requests." Either include it or explicitly document why it's excluded.
2. The RFC should enumerate methods that are explicitly NEVER rate-limited, with rationale:
   - initialize / initialized — blocking these prevents session establishment entirely
   - ping — the spec says receivers MUST respond promptly; rate-limiting pings causes false connection-death detection
   - all notifications/* — fire-and-forget, with no response to reject
   - all server-to-client methods (sampling/createMessage, elicitation/create)
   - resources/subscribe / resources/unsubscribe — worth calling out explicitly, since unbounded subscriptions could be a resource-exhaustion vector even if not rate-limited

Also: how does this interact with batch JSON-RPC requests? For a batch containing both tools/call and tools/list, is the rate limit applied per-call-in-batch or per-batch?

**Author reply:**

1. completion/complete: acknowledged, but keeping this out of scope for now. The primary risk vectors are tools/call, prompts/get, and resources/read. Completions can be added in a follow-up if needed. Added to Non-Goals.
2. Explicit exclusion list: lifecycle and discovery methods (initialize, ping, notifications, list methods) are not rate-limited, as they don't perform substantive work. Added a one-liner to the RFC.
3. Batch JSON-RPC: MCP transports don't use JSON-RPC batching in practice, but we should verify this isn't a bypass vector. Added a note to include an acceptance test confirming batch requests cannot circumvent rate limits.


```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: my-server
spec:
  # ... existing fields ...
  rateLimiting:
    # Global limit: total requests across all users
    global:
      capacity: 1000
      refillPeriodSeconds: 60

    # Per-user limit: applied independently to each authenticated user
    perUser:
      capacity: 100
      refillPeriodSeconds: 60
```

**Review comment (Contributor):** *Redis Connection Strategy Unspecified*

[ToolHive Expert, Kubernetes Expert] The RFC doesn't specify whether rate limiting creates its own Redis client, shares the session storage connection, or shares the auth server connection (THV-0035).

ToolHive currently has two Redis integration points:

- Auth Server (THV-0035): redis.NewFailoverClient with Sentinel-only, ACL user auth
- Session Storage: pkg/transport/session/redis_config.go, supports both standalone and Sentinel

Creating a separate connection pool would be wasteful. Recommendation: reuse the existing SessionStorageConfig Redis connection and validate at reconciliation time that Redis-backed session storage is configured when rateLimiting is enabled.

Also: rate limiting is a write-heavy workload (every request does a read-modify-write via Lua), while auth storage is read-heavy. If sharing Redis, this workload difference should be acknowledged. Should a dedicated Redis instance be recommended for high-throughput deployments?

**Author reply:**

Good question. The intent is to reuse the existing session storage Redis connection — rate limiting will validate at reconciliation time that Redis-backed session storage is configured. The write-heavy workload concern is valid, but the actual volume is modest — one Lua EVAL per rate-limited request on a tiny hash. For most deployments this is negligible alongside session storage. We'll revisit a dedicated Redis option if workload separation is justified by real usage patterns.

Updated the RFC to state that rate limiting reuses the session storage Redis connection and that Redis-backed session storage is a prerequisite.

**Validation**: Per-user rate limits require authentication to be enabled. If `perUser` limits are configured with anonymous inbound auth, the server will raise an error at startup.

#### Per-Operation Limits
**Review comment (Contributor):** *Use Duration Type Instead of Integer Seconds*

[Kubernetes Expert] `refillPeriodSeconds: 3600` is less readable than `refillPeriod: "1h"`. The project already has a config.Duration type in pkg/vmcp/config/config.go that serializes as Go duration strings, and Kubernetes conventions prefer metav1.Duration for API types.

Replace refillPeriodSeconds (int) with refillPeriod (Duration) for consistency.

Also: capacity is ambiguous in a Kubernetes API context (it could be confused with storage/resource capacity). Consider renaming to maxTokens or bucketSize. Either way, add +kubebuilder:validation:Minimum=1 markers.

**Author reply:**

Agree on the duration type — `refillPeriod: "1h"` is more readable and consistent with existing ToolHive conventions. Renamed throughout: `refillPeriodSeconds` → `refillPeriod` (Duration), `capacity` → `maxTokens`. All YAML examples updated to use duration strings. Added a note that both fields require minimum-value validation.

Individual tools, prompts, or resources can have their own limits that supplement the server-level defaults. Per-operation limits can be either global or per-user:

```yaml
spec:
  rateLimiting:
    perUser:
      capacity: 100
      refillPeriodSeconds: 60

    tools:
      - name: "expensive_search"
        perUser:
          capacity: 10
          refillPeriodSeconds: 60
      - name: "shared_resource"
        global:
          capacity: 50
          refillPeriodSeconds: 60

    prompts:
      - name: "generate_report"
        perUser:
          capacity: 5
          refillPeriodSeconds: 60

    resources:
      - name: "large_dataset"
        global:
          capacity: 20
          refillPeriodSeconds: 60
```

**Review comment (Contributor):** *Validation Should Be at Admission Time, Not Startup*

[Kubernetes Expert] In the Kubernetes model, this should be caught at admission time via CEL validation rules on the CRD, not at container startup. Example:

`// +kubebuilder:validation:XValidation:rule="!has(self.rateLimiting) || !has(self.rateLimiting.perUser) || has(self.oidcConfig)",message="perUser rate limiting requires authentication"`

This gives immediate feedback via kubectl apply instead of requiring pod log inspection.

Additional validation needed:

- +kubebuilder:validation:Minimum=1 on capacity and refill period
- +kubebuilder:validation:Required + MinLength=1 on per-operation name fields
- +kubebuilder:validation:MaxItems on tools/prompts/resources arrays
- A CEL rule ensuring at least one of global/perUser is set when rateLimiting is present

Also missing: status conditions. Following established patterns (ConditionSessionStorageWarning, ConditionTypeExternalAuthConfigValidated), define a RateLimitingConfigValid condition with reasons like RateLimitRedisNotConfigured and RateLimitPerUserRequiresAuth.

**Author reply:**

Agree that admission-time validation is better UX than startup errors. Updated the validation section to specify: (1) validation happens at admission time via CRD schema validation, (2) required rules are listed at a high level (perUser requires auth, at least one of global/perUser, minimum values), and (3) a RateLimitingConfigValid status condition with expected reasons. Specific CEL rules and kubebuilder markers are deferred to the implementation PRs.
**Review comment (Contributor):** *Atomic Multi-Limit Check Required*

[Security Advisor] If global, per-user, and per-operation limits are checked in separate Redis calls, a race condition exists where a request passes the per-operation check but the global counter has been exhausted by a concurrent request. This is bypass vector BV-3.

The Lua script must atomically check ALL applicable limits (global + per-user + per-operation) in a single EVAL call, accept all applicable keys and limits as parameters, and return a single allow/deny decision. Split checks create exploitable race conditions in multi-replica deployments.

**Author reply:**

Agreed this is a real concern in theory. In practice, phantom token consumption from sequential bucket checks is minimal, since tokens refill continuously. We'll keep the implementation non-atomic for simplicity and revisit based on experience if the trade-off proves problematic under real workloads.


When an operation-level limit is defined, it is enforced **in addition to** any server-level limits. A request must pass all applicable limits.

The `tools`, `prompts`, and `resources` lists all follow the same structure — each entry specifies an operation name and either a `global` or `perUser` limit (or both).

#### VirtualMCPServer

The same `rateLimiting` configuration is available on `VirtualMCPServer`. Server-level `perUser` limits are based on the user identity at ingress (e.g. the `sub` claim of an OIDC token) and are shared across all backends — they cap the user's total usage of the vMCP, not per-backend. Per-operation limits target specific backend operations by name (e.g. `backend_a/costly_tool`), so those are inherently scoped to a single backend.

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: VirtualMCPServer
metadata:
  name: my-vmcp
spec:
  config:
    rateLimiting:
      perUser:
        capacity: 200
        refillPeriodSeconds: 60

      tools:
        - name: "backend_a/costly_tool"
          perUser:
            capacity: 5
            refillPeriodSeconds: 60

      prompts:
        - name: "backend_b/heavy_prompt"
          global:
            capacity: 30
            refillPeriodSeconds: 60
```

**Review comment (Contributor):** *VirtualMCPServer Field Placement Is Inconsistent*

[Kubernetes Expert, ToolHive Expert] This places rateLimiting at spec.config.rateLimiting, but for MCPServer it's at spec.rateLimiting. The spec.config field on VirtualMCPServer is the platform-agnostic config.Config struct from pkg/vmcp/config, and it has +kubebuilder:pruning:PreserveUnknownFields — meaning rate limiting fields inside it would bypass CRD structural schema validation entirely.

The established pattern is that K8s-native operational concerns live on the spec directly (e.g., incomingAuth, outgoingAuth, sessionStorage, and replicas are all on VirtualMCPServerSpec, NOT inside config).

Recommendation: place rateLimiting at spec.rateLimiting on both MCPServer and VirtualMCPServer.

**Author reply:**

You're right — placing it inside spec.config would bypass CRD structural schema validation, since Config has PreserveUnknownFields. Following the established pattern (incomingAuth, outgoingAuth, sessionAffinity on VirtualMCPServerSpec directly), I've moved rateLimiting to spec.rateLimiting on both resource types.

**Review comment (Contributor):** *Tool Naming Convention Conflict + Two-Phase Check Problem*

[ToolHive Expert, MCP Protocol Expert] Two issues with backend_a/costly_tool:

1. Naming conflict: vMCP aggregation uses a dot separator (backend_a.costly_tool) for the conflict-resolution prefix strategy. The MCP spec's recommended tool-name characters are A-Z, a-z, 0-9, _, -, . — forward slash (/) is not in the allowed set. The rate limit config naming must match the actual aggregated tool names.
2. Two-phase check problem: global/perUser limits can be checked pre-routing, but per-backend tool limits require knowing which backend was selected (only known after the vMCP router runs). This means the rate limit check may need to happen at two points in the request lifecycle: pre-route (global/perUser) and post-route (per-backend tool). This is a significant architectural consideration that needs explicit design.

Additionally: what happens when a tool exists on multiple backends (name collision after aggregation)? And what about interaction with the vMCP optimizer — should limits apply to call_tool (the optimizer wrapper) or the underlying backend dispatch?

**Author reply:**

1. Naming: good catch — fixed the examples to use an underscore separator matching the default PrefixConflictResolver (e.g. backend_a_costly_tool). Added a note that per-operation names match the post-aggregation tool name as seen by the client.
2. Two-phase check: no two-phase problem — rate limiting uses the post-aggregation tool name. When the optimizer is enabled, the rate limiter operates on the call_tool line and extracts the inner tool_name argument for per-tool limits, following the same pattern as the authz middleware (stacklok/toolhive#4385). This is acknowledged as tech debt — the cross-cutting interaction between middleware and optimizer naming is a known problem that THV-0060 aims to resolve structurally.
3. Optimizer: rate limits apply to incoming requests at the optimizer meta-tool level; optimizer dispatch is not independently rate-limited.

### Algorithm: Token Bucket

Rate limits are enforced using a **token bucket** algorithm. The configuration maps directly onto it:

- `capacity` = bucket capacity (maximum tokens)
- `refillPeriodSeconds` = time in seconds to fully refill the bucket from zero
- Refill rate = `capacity / refillPeriodSeconds` tokens per second

Each token represents a single allowed request. The bucket starts full, refills at a steady rate, and each incoming request consumes one token. When the bucket is empty, requests are rejected.

Per-user limits work identically — each user gets their own bucket, keyed by identity. Redis keys follow a structure like:

- Global: `thv:rl:{server}:global`
- Global per-tool: `thv:rl:{server}:global:tool:{toolName}`
- Global per-prompt: `thv:rl:{server}:global:prompt:{promptName}`
- Global per-resource: `thv:rl:{server}:global:resource:{resourceName}`
- Per-user: `thv:rl:{server}:user:{userId}`
- Per-user per-tool: `thv:rl:{server}:user:{userId}:tool:{toolName}`
- Per-user per-prompt: `thv:rl:{server}:user:{userId}:prompt:{promptName}`
- Per-user per-resource: `thv:rl:{server}:user:{userId}:resource:{resourceName}`

**Review comment (Contributor):** *Namespace Missing from Redis Key Schema*

[Security Advisor, Kubernetes Expert] THV-0035 established the convention thv:auth:{server-ns}:{server-name}: — the namespace is included to prevent cross-namespace collisions. This schema uses thv:rl:{server}:, which omits the namespace.

If an MCPServer named "foo" in namespace "team-a" and a VirtualMCPServer named "foo" in namespace "team-b" share the same Redis, their rate limit counters would collide.

Must be: thv:rl:{namespace}:{server-name}:{...}

Also:

- Specify the Redis key TTL (auto-expire is mentioned but the actual duration is undefined — it should be a multiple of refillPeriodSeconds).
- Confirm the {server} component is derived from the CRD name at startup, never from per-request input (request headers, URL paths, etc.), to prevent key-namespace spoofing.

**Author reply:**

Good catch — the namespace must be included to prevent cross-namespace collisions. Updated all key patterns to thv:rl:{namespace}:{server}:{...}.

On TTL: keys auto-expire at 2 * refillPeriod.

On key derivation: confirmed — {namespace} and {server} are derived from CRD metadata at middleware initialization time, never from per-request input. Added this to the RFC.
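The key patterns above can be built with a small helper. This is a hypothetical sketch of the `thv:rl:{server}:...` scheme as written in this RFC (not ToolHive code); it also rejects separator characters in operation names, the sanitization the threat model below calls for:

```go
package main

import (
	"fmt"
	"strings"
)

// bucketKey assembles a Redis key for a rate-limit bucket. An empty user
// means a global bucket; an empty opName means a server-level bucket.
// Operation names containing the ":" separator (or whitespace) are rejected
// so crafted names cannot escape their key namespace.
func bucketKey(server, user, opKind, opName string) (string, error) {
	if strings.ContainsAny(opName, ": \t\n") {
		return "", fmt.Errorf("invalid operation name %q", opName)
	}
	parts := []string{"thv", "rl", server}
	if user == "" {
		parts = append(parts, "global")
	} else {
		parts = append(parts, "user", user)
	}
	if opName != "" {
		parts = append(parts, opKind, opName)
	}
	return strings.Join(parts, ":"), nil
}

func main() {
	k, _ := bucketKey("my-server", "alice", "tool", "expensive_search")
	fmt.Println(k) // thv:rl:my-server:user:alice:tool:expensive_search

	_, err := bucketKey("my-server", "", "tool", "evil:name")
	fmt.Println(err != nil) // true: separator rejected
}
```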

Each bucket is stored as a Redis hash with two fields: token count and last refill timestamp. Refill is lazy — there is no background process. When a request arrives, an atomic Lua script calculates how many tokens should have accumulated since the last access based on elapsed time, adds them (capped at capacity), and then attempts to consume one. This ensures no race conditions across replicas. Keys auto-expire when inactive, so no garbage collection is needed.

Storage is **O(1) per counter** (two fields per hash). For example, 500 users with 10 per-operation limits = 5,000 hashes — negligible for Redis.

**Burst behavior**: An idle user accumulates tokens up to the bucket capacity. This means they can send a full burst of `capacity` requests at once after a period of inactivity. This is intentional — it handles legitimate traffic spikes — but administrators should understand it when setting capacity.

### Redis Unavailability

If Redis is unreachable, the middleware **fails open** — all requests are allowed through. This ensures a Redis hiccup does not become a full MCP outage. When this occurs, the middleware emits a structured log entry and increments a `rate_limit_redis_unavailable` metric so that operators can detect and alert on Redis health independently.

### Rejection Behavior

**Review comment (Contributor):** *Lua Script: Clock Skew and Atomicity Requirements*

[Security Advisor] Two concerns with the Lua script:

1. Clock skew: if token bucket refill uses time.Now() from Go replicas rather than redis.call('TIME'), clock skew between horizontally-scaled replicas causes inconsistent refill rates. An attacker targeting a replica with a fast clock gets more tokens. The Lua script should use Redis server time for refill calculations.
2. Key initialization race: if a key expires between the token check and the decrement in a non-atomic operation, the counter resets to zero, granting a fresh burst. The Lua script must handle the key-does-not-exist case (first request or post-expiry) within the same atomic operation.

Also: should the script be loaded via SCRIPT LOAD + EVALSHA (avoiding re-parsing per call)? If the script changes between ToolHive versions, EVALSHA returns NOSCRIPT — is the fallback to EVAL handled?

The RFC should include the actual Lua script (or pseudocode), since the correctness of this atomic operation is critical to the entire design.

**Author reply:**

Good points on clock skew and key initialization. Agree at a high level — the Lua script should use redis.call('TIME') for consistency across replicas and handle key-not-exists atomically. Added brief notes to the RFC on both. The specifics (pseudocode, EVALSHA handling) are best sorted out in code review when we're looking at real code.

When a request is rate-limited, the middleware returns an MCP-appropriate error response. The middleware includes a `Retry-After` value computed as `1 / refill_rate` from the most restrictive bucket that caused the rejection. This value is a **best-effort lower bound**, not a guarantee — particularly for global limits, where other users may consume the next available token before this client retries.

## Security Considerations

### Threat Model

| Threat | Likelihood | Impact | Mitigation |
|--------|------------|--------|------------|
| Redis key injection via crafted operation names | Medium | High | Validate and sanitize operation names before constructing Redis keys. Reject names containing key-separator characters. |
| Redis as single point of compromise | Low | High | Require Redis authentication and TLS. Reuse the existing Redis security posture established in [THV-0035](./THV-0035-auth-server-redis-storage.md). |
| Rate limit bypass via identity spoofing | Low | Medium | Rate limiting relies on upstream auth middleware. It is only as strong as the authentication layer. |
| Denial of service via Redis exhaustion | Low | Medium | Keys auto-expire when inactive. Storage is O(1) per counter, bounding memory usage. |
| Unauthenticated DoS | Medium | High | Rate limiting sits after the auth middleware, so unauthenticated requests are rejected before reaching rate limit evaluation. Global limits only count authenticated traffic. |

**Review comment (Contributor):** *Fail-Open Needs More Design Work*

[Security Advisor, SRE] Three concerns:

1. Configurable fail policy: add failPolicy: open|closed (default: open). Security-sensitive deployments need the option to reject requests when rate limiting can't be enforced. Fail-open means an attacker who can cause Redis unavailability (connection exhaustion, OOM, network partition) effectively disables all rate limiting.
2. Log noise: during a sustained Redis outage at 1,000 req/s, logging every fail-open event generates 1,000 WARN lines/second. Specify rate-sampled logging (e.g., log once per second) or a circuit-breaker pattern — after N consecutive failures, enter a stable fail-open state with aggressive alerting rather than checking Redis on every request.
3. Redis failover bucket reset: during Sentinel failover, the new primary has no data, so all users get a full burst allowance. This is likely acceptable but should be stated explicitly.

Also: what happens when Redis recovers? Log that connectivity is restored (INFO level).

**Author reply:**

Keeping it simple for now: fail-open only. Adding a configurable failPolicy introduces complexity that isn't justified until a concrete need arises.

Updated the RFC to specify state-transition logging (log the first failure and the recovery, not every request) to avoid log noise during sustained outages. Also added a note that Sentinel failover resets bucket state — all users receive a fresh burst allowance. This is expected behavior, since bucket state is ephemeral.

**Review comment (Contributor):** *Observability Is Severely Underspecified*

[SRE] A single metric is not sufficient to operate a rate limiting system in production. Required additions:

Counters (must-have):

- toolhive_rate_limit_decisions_total{server, decision=allowed|rejected, scope=global|per_user, operation_type} — primary SLI
- toolhive_rate_limit_redis_errors_total{server, error_type=timeout|connection_refused|auth_failure} — failure breakdown
- toolhive_rate_limit_fail_open_total{server} — unmetered requests during a Redis outage

Histograms (must-have):

- toolhive_rate_limit_check_duration_seconds{server, decision} — Lua script round-trip latency (this is on the hot path for every request!)

Naming: the current rate_limit_redis_unavailable violates project conventions — it must use the toolhive_ prefix and _total suffix per Prometheus naming conventions.

Tracing: add span attributes on the existing request span: rate_limit.decision, rate_limit.rejected_by, rate_limit.fail_open.

Alerting: the RFC should document recommended alerts — at minimum RateLimitRedisUnavailable (page) and RateLimitHighRejectionRate (ticket).

**Author reply:**

Fair feedback — the observability section was too thin. Added a high-level Observability subsection listing the key metric categories (decision counters, Redis error counters, fail-open counters, check latency histogram) and span attributes. Also fixed the metric naming to follow the toolhive_ prefix convention.

Specific metric names, label cardinality, and alert definitions will be sorted out at implementation time.

Contributor

JSON-RPC Error Response Must Be Specified

[MCP Protocol Expert] "MCP-appropriate error response" is undefined. MCP uses JSON-RPC 2.0 — the RFC must specify:

  1. Error code: Use a code in the -32000 to -32099 range (reserved for implementation-defined server errors). MCP already uses -32002 and -32042. Suggest -32029. Do NOT use -32603 (internal error) — a rate limit is a deliberate policy rejection, not an internal error.

  2. Response structure:

```json
{
  "jsonrpc": "2.0",
  "id": 2,
  "error": {
    "code": -32029,
    "message": "Rate limit exceeded",
    "data": {
      "retryAfterSeconds": 0.6,
      "limitType": "perUser"
    }
  }
}
```

  3. Retry-After location: Must go in error.data, NOT HTTP headers — stdio and SSE transports have no HTTP layer. The error.data field is the canonical location regardless of transport.

  4. Protocol error vs tool error: This is a protocol-level rejection (the tool was never invoked), so it must be a JSON-RPC error response, NOT a tool result with isError: true. This distinction is critical for MCP clients.

  5. HTTP 429: For Streamable HTTP transport, also return HTTP 429 with Retry-After header as a supplementary transport-level signal (helps proxies/LBs), but the JSON-RPC error is authoritative.

The IETF RFC 6585 reference is only relevant for the HTTP transport layer — consider also referencing JSON-RPC 2.0 spec for error handling.

Contributor Author

Agreed — this defines the client-facing contract. Replaced the vague "MCP-appropriate error response" with a concrete Rejection Response Format section specifying:

  • JSON-RPC error code -32029 (implementation-defined server error range)
  • Retry-After in error.data (transport-agnostic, works across stdio/SSE/Streamable HTTP)
  • Explicit distinction: this is a protocol-level rejection (error), not a tool result with isError: true
  • HTTP 429 with Retry-After header as supplementary signal for Streamable HTTP
  • Example JSON-RPC error response included

Also added JSON-RPC 2.0 spec to references.

### Data Security

No sensitive data is stored in Redis — only token counts and last-refill timestamps. User IDs appear in Redis key names but carry no additional PII.

### Input Validation

Contributor

Threat Model Has Gaps

[Security Advisor] Missing threats to add:

  1. TOCTOU race in Lua script (High): If implementation uses separate GET then DECR instead of single EVAL, concurrent requests bypass limits. Must mandate single atomic EVAL.

  2. Distributed clock skew (Medium): Replicas using local time.Now() instead of Redis server time get inconsistent refill rates.

  3. Cross-namespace key collision (Medium): Key schema omits namespace — servers with same name in different namespaces share counters.

  4. Slowloris / connection exhaustion (Medium): Rate limiting counts completed operations. An attacker holding open many slow connections ties up resources without triggering rate limits.

  5. Deliberate Redis DoS to disable rate limiting (Medium): If an attacker can make Redis unavailable, fail-open policy disables all rate limiting. The threat model mentions Redis exhaustion but frames it as storage, not as a strategy to disable rate limiting itself.

  6. User ID aliasing (Medium): If the IDP issues different sub claims for the same human (re-provisioned account, multiple registrations), one person consumes N× the per-user limit.

Also: the "Redis key injection" row should distinguish inherent vs residual risk (post-mitigation likelihood should be Low given the allowlist).

Contributor Author

Several of these are already addressed by other changes in this push:

  • TOCTOU race: Covered by the atomic Lua script design.
  • Clock skew: Covered by using redis.call('TIME').
  • Cross-namespace collision: Covered by adding namespace to key schema.

For the remaining items (slowloris, Redis DoS, user ID aliasing) — these are outside the scope of rate limiting and addressed by other layers. Added a "Scope and Limitations" note to Security Considerations acknowledging that rate limiting operates at the MCP method level; transport-level attacks, Redis availability, and IDP identity consistency are outside its scope.

Operation names used to construct Redis keys must be validated and sanitized to prevent key injection. Names should be checked against a strict allowlist of characters (alphanumeric, hyphens, underscores, forward slashes for vMCP backend-prefixed names).

### Audit and Logging

Rate-limit rejections are logged with user identity, operation name, and which limit was hit (global, per-user, per-operation). Token counts and refill timestamps are not included in log output.

## Alternatives Considered

Contributor

TLS Contradiction with THV-0035

[Security Advisor] This row says "Require Redis authentication and TLS" but THV-0035 explicitly lists TLS as a non-goal: "TLS/mTLS configuration for Redis connections — future enhancement if needed." THV-0053 resolved this with "service mesh only" for TLS.

Either align with THV-0035's position ("no TLS, rely on service mesh") and state that explicitly, or require TLS — but don't contradict the existing architectural decision.

Also: the RFC should specify that Redis ACLs block MONITOR, SLOWLOG, CLIENT LIST, and DEBUG commands — user IDs appear in key names and could be exposed via these commands. Consider a separate ACL user for rate limiting with ~thv:rl:* key pattern restriction to limit blast radius if credentials are compromised.
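For illustration, a redis.conf ACL along these lines could express the suggestion (user name, password, and key prefix are placeholders, and the exact command set should be checked against what the limiter actually calls): `-@all` removes every command, including MONITOR, SLOWLOG, CLIENT, and DEBUG, and only the required calls are added back against the restricted key pattern.

```
user thv-ratelimit on >CHANGE_ME ~thv:rl:* -@all +eval +evalsha +script|load +time +ping
```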

Contributor Author

Good catch on the TLS contradiction. Updated to align with the existing Redis security posture — rate limiting shares the session storage Redis connection and follows the same security approach (authentication, network-level isolation).


### Sliding Window Log

The original proposal used a sliding window log algorithm, which tracks each request timestamp in a sorted set and counts entries within the window. This was replaced with a **token bucket** for the following reasons:

- **Burst handling**: Token bucket naturally allows legitimate bursts after idle periods, while sliding window enforces a strict ceiling regardless of usage pattern.
- **Simpler storage**: Token bucket requires a single Redis hash (two fields) per counter, compared to a sorted set with one entry per request for sliding window — significantly less memory under high throughput.
- **Atomic operations**: The token bucket check-and-decrement is a single Lua script operating on two fields, reducing Redis round-trip complexity.

## References

- [THV-0017: Dynamic Webhook Middleware](./THV-0017-dynamic-webhook-middleware.md) — mentions rate limiting as an external webhook use case
- [THV-0047: Horizontal Scaling for vMCP and Proxy Runner](./THV-0047-vmcp-proxyrunner-horizontal-scaling.md) — relevant to distributed rate limiting concerns
- [THV-0035: Auth Server Redis Storage](./THV-0035-auth-server-redis-storage.md) — existing Redis dependency
- [IETF RFC 6585](https://tools.ietf.org/html/rfc6585) — HTTP 429 Too Many Requests status code

---

Contributor

Only One Alternative Considered

[Tech Lead] The RFC should evaluate at least one or two more alternatives:

  1. External rate limiting via Envoy/Istio — infrastructure-level, no application code change. Why is application-level preferred?
  2. Webhook-based rate limiting via THV-0017 — the dynamic webhook middleware RFC already mentions rate limiting as an external webhook use case. Why build it into core instead of leveraging the extension mechanism?
  3. Fixed window counters — simpler than token bucket, less precise. Why was this not considered?

Documenting why these were rejected strengthens the case for the chosen approach.

Contributor Author

Fair point. Added three more alternatives: Envoy/Istio (can't do MCP-aware enforcement), webhook-based via THV-0017 (adds latency on hot path), and fixed window counters (boundary burst problem).

## RFC Lifecycle

<!-- This section is maintained by RFC reviewers -->

### Review History

| Date | Reviewer | Decision | Notes |
|------|----------|----------|-------|
| 2026-03-18 | @jerm-dro | Draft | Initial submission |

### Implementation Tracking
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing RFC Sections

[Tech Lead] Compared to peer RFCs (THV-0035, THV-0017, THV-0053) and the RFC template, these mandatory or expected sections are absent:

  • Detailed Design: Go type definitions (CRD types with kubebuilder tags), middleware interface, Lua script, sequence diagrams
  • Testing Strategy: Unit (miniredis), integration (testcontainers), E2E (Chainsaw), performance benchmarks
  • Migration / Rollback Plan: What happens during upgrade? How to disable if problems arise?
  • Performance Impact: Latency budget for the Lua script on the hot path. Target? Fallback if Redis latency exceeds threshold?
  • Compatibility: Backward (existing CRDs without rateLimiting), forward (extensibility for future algorithms)
  • Implementation Plan: Phased task breakdown — suggested 5-6 PRs within the 400-line/10-file limit:
    1. Core rate limiter package + Lua script (~300 lines)
    2. Rate limit middleware + MCP parser integration (~250 lines)
    3. CRD type definitions + validation (~350 lines)
    4. Operator controller wiring (~200 lines)
    5. Integration/E2E tests (~300 lines)
  • Open Questions: Several remain unresolved (see other comments)
  • Config hot-reload semantics: When rateLimiting is updated on the CRD, do pods restart or hot-reload? What happens to existing bucket state in Redis if limits change?
  • Documentation plan: User docs, configuration examples, operational runbook

Recommend using THV-0035 and THV-0017 as reference points for the expected level of detail.

Contributor Author

Acknowledged. This is a smaller-scope RFC — rateLimiting is a new optional field with no migration concerns, and we have prior art for Redis testing patterns. The additional sections (testing strategy, Go types, sequence diagrams, benchmarks, hot-reload semantics, PR breakdown) are implementation details that will be addressed in the PRs.


| Repository | PR | Status |
|------------|-----|--------|
| toolhive | TBD | Not started |