diff --git a/docs/features/rate-limiter.mdx b/docs/features/rate-limiter.mdx
index 9dc76760..202a7eec 100644
--- a/docs/features/rate-limiter.mdx
+++ b/docs/features/rate-limiter.mdx
@@ -1,62 +1,213 @@
---
title: "Rate Limiter"
-description: "Token bucket rate limiting for LLM API calls"
+sidebarTitle: "Rate Limiter"
+description: "Cap API request rate and token usage across agents and threads"
+icon: "gauge-high"
---
-## Overview
+Rate Limiter caps how fast your agents call the LLM, keeping you inside provider rate limits and protecting your budget even when many agents share the same limiter.
-Control API request rates with token bucket algorithm. Prevents rate limit errors and manages costs. The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
+```mermaid
+graph LR
+ subgraph "Shared Rate Limiter"
+ A1[🤖 Agent 1] --> L{"⚖️ RateLimiter<br/>60 rpm"}
+ A2[🤖 Agent 2] --> L
+ A3[🤖 Agent 3] --> L
+ L -->|Allow| API[☁️ LLM API]
+ L -->|Wait| Queue[⏳ Queued]
+ Queue --> L
+ end
+
+ classDef agent fill:#8B0000,stroke:#7C90A0,color:#fff
+ classDef limiter fill:#F59E0B,stroke:#7C90A0,color:#fff
+ classDef api fill:#10B981,stroke:#7C90A0,color:#fff
+ classDef queue fill:#6366F1,stroke:#7C90A0,color:#fff
+
+ class A1,A2,A3 agent
+ class L limiter
+ class API api
+ class Queue queue
+```
+
+The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
## Quick Start
-
-
+<Tabs>
+<Tab title="Simplest: max_rpm">
```python
from praisonaiagents import Agent
from praisonaiagents.config.feature_configs import ExecutionConfig
+
+agent = Agent(
+ name="Researcher",
+ instructions="You research topics on the web.",
+ execution=ExecutionConfig(max_rpm=60)
+)
+
+agent.start("Summarise the latest Mars rover news")
+```
+</Tab>
+
+<Tab title="Shared limiter">
+```python
+from praisonaiagents import Agent, PraisonAIAgents
+from praisonaiagents.config.feature_configs import ExecutionConfig
from praisonaiagents.llm import RateLimiter
-limiter = RateLimiter(requests_per_minute=60, burst=5)
+shared = RateLimiter(requests_per_minute=60, burst=5)
-agent = Agent(
- name="Bot",
- instructions="Helper",
- execution=ExecutionConfig(rate_limiter=limiter)
+researcher = Agent(
+ name="Researcher",
+ instructions="Research topics",
+ execution=ExecutionConfig(rate_limiter=shared)
+)
+writer = Agent(
+ name="Writer",
+ instructions="Write articles",
+ execution=ExecutionConfig(rate_limiter=shared)
)
-response = agent.chat("Hello")
+team = PraisonAIAgents(agents=[researcher, writer])
+team.start()
```
-
-
+
+<Note>
+The same `RateLimiter` instance can be shared across any number of agents and threads — the combined throughput stays inside the configured budget.
+</Note>
+</Tab>
+
+<Tab title="Token limits (TPM)">
```python
from praisonaiagents import Agent
from praisonaiagents.config.feature_configs import ExecutionConfig
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(
+ requests_per_minute=60,
+ tokens_per_minute=90_000,
+ burst=5,
+)
agent = Agent(
- name="Bot",
- instructions="Helper",
- execution=ExecutionConfig(max_rpm=60)
+ name="Analyst",
+ instructions="Analyse long documents",
+ execution=ExecutionConfig(rate_limiter=limiter)
)
+```
+</Tab>
+</Tabs>
+
+---
+
+## How It Works
+
+```mermaid
+sequenceDiagram
+ participant Agent1
+ participant Agent2
+ participant Limiter as RateLimiter
+ participant LLM
+
+ Agent1->>Limiter: acquire()
+ Limiter->>Limiter: lock → refill → -1 token
+ Limiter-->>Agent1: ok
+ Agent1->>LLM: request
+
+ Agent2->>Limiter: acquire()
+ Limiter->>Limiter: lock (waits for Agent1)
+ Limiter->>Limiter: refill → -1 token
+ Limiter-->>Agent2: ok
+ Agent2->>LLM: request
+```
+
+| Step | What happens |
+|------|--------------|
+| Refill | Tokens regenerate based on elapsed time and `requests_per_minute` / `tokens_per_minute`. |
+| Acquire | A thread reserves a token; under contention, only one thread mutates state at a time. |
+| Wait | If no tokens are available, the caller sleeps (sync) or awaits (async) until the next refill. |
+| Release | No explicit release — tokens refill automatically on a rolling window. |
+
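+Under the hood this is a classic token bucket. The sketch below is a minimal illustration of the refill-then-acquire loop described above, not the library's actual implementation:
+
+```python
+import threading
+import time
+
+class TokenBucket:
+    """Minimal token bucket: refill on elapsed time, block until a token is free."""
+
+    def __init__(self, requests_per_minute: int, burst: int = 1):
+        self.rate = requests_per_minute / 60.0  # tokens regenerated per second
+        self.capacity = burst                   # max tokens held at once
+        self.tokens = float(burst)
+        self.last_refill = time.monotonic()
+        self.lock = threading.Lock()
+
+    def _refill(self) -> None:
+        now = time.monotonic()
+        elapsed = now - self.last_refill
+        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
+        self.last_refill = now
+
+    def acquire(self) -> None:
+        while True:
+            with self.lock:  # only one thread mutates state at a time
+                self._refill()
+                if self.tokens >= 1:
+                    self.tokens -= 1
+                    return
+                wait = (1 - self.tokens) / self.rate
+            time.sleep(wait)  # sleep outside the lock, then retry
+```
+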
+---
+
+## Choose Your Mode
+
+```mermaid
+graph TB
+ Start[Need rate limiting?] --> Q1{"Single agent,<br/>simple RPM?"}
+ Q1 -->|Yes| A["Use max_rpm=N<br/>in ExecutionConfig"]
+ Q1 -->|No| Q2{"Multiple agents<br/>sharing budget?"}
+ Q2 -->|Yes| B["Create one RateLimiter<br/>and pass to each agent"]
+ Q2 -->|No| Q3{"Provider quotes<br/>TPM not just RPM?"}
+ Q3 -->|Yes| C["Set tokens_per_minute<br/>on RateLimiter"]
+ Q3 -->|No| A
+
+ classDef question fill:#6366F1,stroke:#7C90A0,color:#fff
+ classDef answer fill:#10B981,stroke:#7C90A0,color:#fff
-response = agent.chat("Hello")
+ class Start,Q1,Q2,Q3 question
+ class A,B,C answer
```
-
-
+
+---
+
+## Configuration Options
+
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `requests_per_minute` | `int` | Required | Max LLM requests per rolling 60-second window. |
+| `tokens_per_minute` | `int` | `None` | Optional token-budget limit (for TPM-quoted providers). |
+| `burst` | `int` | `1` | Maximum burst size: requests allowed back-to-back before the rate limit kicks in. |
+
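+For a quick feel for `burst`, the sketch below assumes the non-blocking `try_acquire()` (used under Manual Usage below) returns a boolean: with `burst=3`, roughly three calls clear back-to-back before the limiter starts refusing.
+
+```python
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60, burst=3)
+
+for i in range(5):
+    # First ~3 calls succeed immediately; the rest fail until tokens refill.
+    print(i, limiter.try_acquire())
+```
+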
+---
+
+## Thread Safety & Multi-Agent Use
-The standalone `rate_limiter=` parameter is deprecated. Use `execution=ExecutionConfig(rate_limiter=obj)` instead.
+Every method on `RateLimiter` — both sync (`acquire`, `acquire_tokens`, `try_acquire`, `reset`) and async (`acquire_async`, `acquire_tokens_async`) — is safe to call concurrently. You can share a single `RateLimiter` across threads, `AgentTeam` members, `PraisonAIAgents`, and `ParallelToolCallExecutor` workers without exceeding the configured budget.
-## Parameters
+### Thread pool with shared limiter
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+from praisonaiagents import Agent
+from praisonaiagents.config.feature_configs import ExecutionConfig
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60, burst=5)
+
+def run_agent(question: str) -> str:
+ agent = Agent(
+ name="Worker",
+ instructions="Answer concisely",
+ execution=ExecutionConfig(rate_limiter=limiter),
+ )
+ return agent.start(question)
+
+with ThreadPoolExecutor(max_workers=10) as pool:
+ answers = list(pool.map(run_agent, [f"Q{i}" for i in range(50)]))
+```
+
+### Monitoring available budget
+
+```python
+limiter = RateLimiter(requests_per_minute=60, tokens_per_minute=90_000)
+
+print(f"Requests left: {limiter.available_tokens:.1f}")
+print(f"API tokens left: {limiter.available_api_tokens:.1f}")
+```
+
+<Note>
+`available_tokens` and `available_api_tokens` are safe to read from any thread — they acquire the same locks as `acquire()` internally.
+</Note>
-| Parameter | Description | Default |
-|-----------|-------------|---------|
-| `requests_per_minute` | Max requests per minute | Required |
-| `tokens_per_minute` | Token-based limiting (optional) | None |
-| `burst` | Max burst size | 1 |
+---
## Manual Usage
+When not using `ExecutionConfig`, you can acquire tokens directly:
+
```python
limiter = RateLimiter(requests_per_minute=60)
@@ -72,8 +223,45 @@ if limiter.try_acquire():
pass
```
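+
+The async family follows the same pattern. This is a sketch, assuming `acquire_async()` simply awaits until a request slot is free:
+
+```python
+import asyncio
+
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60)
+
+async def call_llm(i: int) -> None:
+    await limiter.acquire_async()  # awaits instead of blocking the thread
+    print(f"request {i} cleared the limiter")
+
+async def main() -> None:
+    await asyncio.gather(*(call_llm(i) for i in range(5)))
+
+asyncio.run(main())
+```
+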
+---
+
+## Best Practices
+
+<AccordionGroup>
+<Accordion title="Share one limiter per provider key">
+If three agents hit the same provider key, give them the same `RateLimiter` so the combined throughput stays inside quota.
+</Accordion>
+
+<Accordion title="Tune burst to your traffic shape">
+A low burst (1–5) smooths traffic; a high burst tolerates spiky demand.
+</Accordion>
+
+<Accordion title="Limit tokens as well as requests">
+OpenAI / Anthropic quote both RPM and TPM — limiting only on RPM can still trip 429s.
+</Accordion>
+
+<Accordion title="Async calls share the same budget">
+`agent.achat(...)` automatically calls `acquire_async()`; avoid mixing sync and async limiters in one workflow.
+</Accordion>
+</AccordionGroup>
+
+---
+
## CLI
```bash
praisonai "task" --rpm 60
```
+
+---
+
+## Related
+
+<CardGroup cols={2}>
+  <Card title="Thread Safety" href="/docs/features/thread-safety">
+    Thread-safe chat history and caches
+  </Card>
+  <Card title="Parallel Execution">
+    Limit parallel agent runs
+  </Card>
+</CardGroup>
\ No newline at end of file
diff --git a/docs/features/thread-safety.mdx b/docs/features/thread-safety.mdx
index e278da61..dccd8d66 100644
--- a/docs/features/thread-safety.mdx
+++ b/docs/features/thread-safety.mdx
@@ -84,6 +84,10 @@ Internal caches use `threading.RLock` for reentrant locking:
- `_system_prompt_cache` - Cached system prompts
- `_formatted_tools_cache` - Cached tool definitions
+### Rate Limiter
+
+`RateLimiter` can be shared across threads and agents. Both the sync and async method families are fully locked — see [Rate Limiter → Thread Safety & Multi-Agent Use](/docs/features/rate-limiter#thread-safety--multi-agent-use) for patterns.
+
## LiteAgent Thread Safety
The lite package also provides thread-safe operations:
@@ -112,6 +116,8 @@ with threading.ThreadPoolExecutor(max_workers=5) as executor:
|-----------|-----------|--------|
| chat_history | `Lock` | Simple mutual exclusion |
| caches | `RLock` | Allows reentrant access |
+| `RateLimiter` (sync) | `threading.Lock` | Protects `_tokens`, `_api_tokens`, and refill state from races in multi-threaded acquire calls |
+| `RateLimiter` (async) | `asyncio.Lock` | Same protection for coroutine contexts |
### Lock Usage Pattern