2 changes: 2 additions & 0 deletions docs/toolhive/guides-k8s/auth-k8s.mdx
@@ -901,6 +901,8 @@ kubectl logs -n toolhive-system -l app.kubernetes.io/name=weather-server-k8s

## Next steps

- [Configure rate limiting](./rate-limiting.mdx) to set per-user and shared
request limits on MCP servers
- [Configure token exchange](./token-exchange-k8s.mdx) to let MCP servers
authenticate to backend services
- [Set up audit logging](./logging.mdx) to track authentication decisions and
202 changes: 202 additions & 0 deletions docs/toolhive/guides-k8s/rate-limiting.mdx
@@ -0,0 +1,202 @@
---
title: Rate limiting
description:
Configure per-user and shared rate limits on MCPServer resources to prevent
noisy neighbors and protect downstream services.
---

Configure token bucket rate limits on MCPServer resources to control how many
tool invocations users can make. Rate limiting prevents individual users from
monopolizing shared servers and protects downstream services from traffic
spikes.

ToolHive supports two scopes of rate limiting:

- **Shared** limits cap total requests across all users.
- **Per-user** limits cap requests independently for each authenticated user.

Both scopes can be applied at the server level and overridden per tool. A
request must pass all applicable limits to proceed.

:::info[Prerequisites]

Before you begin, ensure you have:

- A Kubernetes cluster with the ToolHive Operator installed
- Redis deployed in your cluster — rate limiting stores token bucket counters in
Redis (see [Redis Sentinel session storage](./redis-session-storage.mdx) for
deployment instructions)
- For per-user limits: authentication enabled on the MCPServer (`oidcConfig`,
`oidcConfigRef`, or `externalAuthConfigRef`)

If you need help with these prerequisites, see:

- [Kubernetes quickstart](./quickstart.mdx)
- [Authentication and authorization](./auth-k8s.mdx)

:::

## How rate limiting works

Rate limits use a **token bucket** algorithm. Each bucket has a capacity
(`maxTokens`) and a refill period (`refillPeriod`). The bucket starts full and
each `tools/call` request consumes one token. When the bucket is empty, requests
are rejected until tokens refill. The refill rate is `maxTokens / refillPeriod`
tokens per second.

Only `tools/call` requests are rate-limited. Lifecycle methods (`initialize`,
`ping`) and discovery methods (`tools/list`, `prompts/list`) pass through
unconditionally.

When a request is rejected, the proxy returns:

- **HTTP 429** with a `Retry-After` header (seconds until a token is available)
- A **JSON-RPC error** with code `-32029` and `retryAfterSeconds` in the error
data

If Redis is unreachable, rate limiting **fails open** and all requests are
allowed through.

## Configure shared rate limits

Shared limits apply a single token bucket across all users. Use them to cap
total throughput to protect downstream services.

```yaml title="mcpserver-shared-ratelimit.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: weather-server
spec:
  image: ghcr.io/stackloklabs/weather-mcp/server
  transport: streamable-http
  sessionStorage:
    provider: redis
    address: <YOUR_REDIS_ADDRESS>
  # highlight-start
  rateLimiting:
    shared:
      maxTokens: 1000
      refillPeriod: 1m0s
  # highlight-end
```

This allows 1,000 total `tools/call` requests per minute across all users.

## Configure per-user rate limits

Per-user limits give each authenticated user their own independent token bucket.
This prevents a single user from consuming the entire server capacity.

Per-user limits **require authentication** to be enabled. The proxy identifies
users by the `sub` claim from their JWT token.

```yaml title="mcpserver-peruser-ratelimit.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
  name: weather-server
spec:
  image: ghcr.io/stackloklabs/weather-mcp/server
  transport: streamable-http
  oidcConfig:
    type: inline
    inline:
      issuer: https://my-idp.example.com
      audience: my-audience
  sessionStorage:
    provider: redis
    address: <YOUR_REDIS_ADDRESS>
  # highlight-start
  rateLimiting:
    perUser:
      maxTokens: 100
      refillPeriod: 1m0s
  # highlight-end
```

This allows each user 100 `tools/call` requests per minute independently.

## Combine shared and per-user limits

You can configure both scopes together. A request must pass **all** applicable
limits. This lets you set a per-user ceiling while also capping total server
throughput.
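
The combination semantics can be sketched as a simple conjunction: reject if any bucket is empty, and consume one token from each bucket only when all of them have capacity. This is illustrative only (`counter` and `allowed` are hypothetical names, not ToolHive internals):

```go
package main

import "fmt"

// counter stands in for one rate-limit scope (shared, per-user, per-tool).
type counter struct{ remaining int }

// allowed checks every applicable limit first, so a rejection
// consumes no tokens from any bucket.
func allowed(limits ...*counter) bool {
	for _, l := range limits {
		if l.remaining <= 0 {
			return false
		}
	}
	for _, l := range limits {
		l.remaining--
	}
	return true
}

func main() {
	shared := &counter{remaining: 1000}
	perUser := &counter{remaining: 0} // this user's bucket is empty
	// Rejected: the per-user limit fails even though capacity remains
	// in the shared bucket.
	fmt.Println(allowed(shared, perUser))
}
```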

```yaml title="mcpserver-combined-ratelimit.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
name: weather-server
spec:
image: ghcr.io/stackloklabs/weather-mcp/server
transport: streamable-http
oidcConfig:
type: inline
inline:
issuer: https://my-idp.example.com
audience: my-audience
sessionStorage:
provider: redis
address: <YOUR_REDIS_ADDRESS>
rateLimiting:
# highlight-start
shared:
maxTokens: 1000
refillPeriod: 1m0s
perUser:
maxTokens: 100
refillPeriod: 1m0s
# highlight-end
```

## Add per-tool overrides

Individual tools can have tighter limits than the server default. Per-tool
limits are enforced **in addition to** server-level limits.

```yaml title="mcpserver-pertool-ratelimit.yaml"
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPServer
metadata:
name: weather-server
spec:
image: ghcr.io/stackloklabs/weather-mcp/server
transport: streamable-http
oidcConfig:
type: inline
inline:
issuer: https://my-idp.example.com
audience: my-audience
sessionStorage:
provider: redis
address: <YOUR_REDIS_ADDRESS>
rateLimiting:
perUser:
maxTokens: 100
refillPeriod: 1m0s
# highlight-start
tools:
- name: expensive_search
perUser:
maxTokens: 10
refillPeriod: 1m0s
- name: shared_resource
shared:
maxTokens: 50
refillPeriod: 1m0s
# highlight-end
```

In this example:

- Each user can make 100 total tool calls per minute.
- Each user can make at most 10 `expensive_search` calls per minute (and those
also count toward the 100 server-level limit).
- All users combined can make 50 `shared_resource` calls per minute.

## Next steps

- [Configure token exchange](./token-exchange-k8s.mdx) to let MCP servers
  authenticate to upstream services
- [CRD reference](../reference/crd-spec.md) for complete field definitions
6 changes: 4 additions & 2 deletions docs/toolhive/guides-k8s/redis-session-storage.mdx
@@ -12,8 +12,10 @@ when pods restart and users must re-authenticate. Redis Sentinel provides
persistent storage with automatic master discovery, ACL-based access control,
and optional failover when replicas are configured.

Redis session storage is also required for horizontal scaling when running
multiple [MCPServer](./run-mcp-k8s.mdx#horizontal-scaling) or
Redis also serves as the backend for [rate limiting](./rate-limiting.mdx),
which stores token bucket counters independently of session data. Redis is
likewise required for horizontal scaling when running multiple
[MCPServer](./run-mcp-k8s.mdx#horizontal-scaling) or
[VirtualMCPServer](../guides-vmcp/scaling-and-performance.mdx#session-storage-for-multi-replica-deployments)
replicas, so that sessions are shared across pods.

1 change: 1 addition & 0 deletions sidebars.ts
@@ -152,6 +152,7 @@ const sidebars: SidebarsConfig = {
'toolhive/guides-k8s/customize-tools',
'toolhive/guides-k8s/auth-k8s',
'toolhive/guides-k8s/redis-session-storage',
'toolhive/guides-k8s/rate-limiting',
'toolhive/guides-k8s/token-exchange-k8s',
'toolhive/guides-k8s/telemetry-and-metrics',
'toolhive/guides-k8s/logging',