diff --git a/docs/features/rate-limiter.mdx b/docs/features/rate-limiter.mdx
index 9dc76760..202a7eec 100644
--- a/docs/features/rate-limiter.mdx
+++ b/docs/features/rate-limiter.mdx
@@ -1,62 +1,213 @@
 ---
 title: "Rate Limiter"
-description: "Token bucket rate limiting for LLM API calls"
+sidebarTitle: "Rate Limiter"
+description: "Cap API request rate and token usage across agents and threads"
+icon: "gauge-high"
 ---
 
-## Overview
+Rate Limiter caps how fast your agents call the LLM, so you stay inside provider rate limits and protect your budget, even when many agents share the same limiter.
 
-Control API request rates with token bucket algorithm. Prevents rate limit errors and manages costs. The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
+```mermaid
+graph LR
+  subgraph "Shared Rate Limiter"
+    A1[🤖 Agent 1] --> L{⚖️ RateLimiter<br/>60 rpm}
+    A2[🤖 Agent 2] --> L
+    A3[🤖 Agent 3] --> L
+    L -->|Allow| API[☁️ LLM API]
+    L -->|Wait| Queue[⏳ Queued]
+    Queue --> L
+  end
+
+  classDef agent fill:#8B0000,stroke:#7C90A0,color:#fff
+  classDef limiter fill:#F59E0B,stroke:#7C90A0,color:#fff
+  classDef api fill:#10B981,stroke:#7C90A0,color:#fff
+  classDef queue fill:#6366F1,stroke:#7C90A0,color:#fff
+
+  class A1,A2,A3 agent
+  class L limiter
+  class API api
+  class Queue queue
+```
+
+The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
 
 ## Quick Start
 
-<Tabs>
-  <Tab>
+<Tabs>
+  <Tab title="Single Agent">
 
 ```python
 from praisonaiagents import Agent
 from praisonaiagents.config.feature_configs import ExecutionConfig
+
+agent = Agent(
+    name="Researcher",
+    instructions="You research topics on the web.",
+    execution=ExecutionConfig(max_rpm=60)
+)
+
+agent.start("Summarise the latest Mars rover news")
+```
+
+  </Tab>
+  <Tab title="Shared Across Agents">
+
+```python
+from praisonaiagents import Agent, PraisonAIAgents
+from praisonaiagents.config.feature_configs import ExecutionConfig
 from praisonaiagents.llm import RateLimiter
 
-limiter = RateLimiter(requests_per_minute=60, burst=5)
+shared = RateLimiter(requests_per_minute=60, burst=5)
 
-agent = Agent(
-    name="Bot",
-    instructions="Helper",
-    execution=ExecutionConfig(rate_limiter=limiter)
+researcher = Agent(
+    name="Researcher",
+    instructions="Research topics",
+    execution=ExecutionConfig(rate_limiter=shared)
+)
+writer = Agent(
+    name="Writer",
+    instructions="Write articles",
+    execution=ExecutionConfig(rate_limiter=shared)
 )
 
-response = agent.chat("Hello")
+team = PraisonAIAgents(agents=[researcher, writer])
+team.start()
 ```
 
-  </Tab>
-  <Tab>
+<Note>
+The same `RateLimiter` instance can be shared across any number of agents and threads — the combined throughput stays inside the configured budget.
+</Note>
+
+  </Tab>
+  <Tab title="Token Budget (TPM)">
+
 ```python
 from praisonaiagents import Agent
 from praisonaiagents.config.feature_configs import ExecutionConfig
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(
+    requests_per_minute=60,
+    tokens_per_minute=90_000,
+    burst=5,
+)
 
 agent = Agent(
-    name="Bot",
-    instructions="Helper",
-    execution=ExecutionConfig(max_rpm=60)
+    name="Analyst",
+    instructions="Analyse long documents",
+    execution=ExecutionConfig(rate_limiter=limiter)
 )
+```
 
-response = agent.chat("Hello")
-```
-
-  </Tab>
-</Tabs>
+  </Tab>
+</Tabs>
+
+---
+
+## How It Works
+
+```mermaid
+sequenceDiagram
+    participant Agent1
+    participant Agent2
+    participant Limiter as RateLimiter
+    participant LLM
+
+    Agent1->>Limiter: acquire()
+    Limiter->>Limiter: lock → refill → -1 token
+    Limiter-->>Agent1: ok
+    Agent1->>LLM: request
+
+    Agent2->>Limiter: acquire()
+    Limiter->>Limiter: lock (waits for Agent1)
+    Limiter->>Limiter: refill → -1 token
+    Limiter-->>Agent2: ok
+    Agent2->>LLM: request
+```
+
+| Step | What happens |
+|------|--------------|
+| Refill | Tokens regenerate based on elapsed time and `requests_per_minute` / `tokens_per_minute`. |
+| Acquire | A thread reserves a token; under contention, only one thread mutates state at a time. |
+| Wait | If no tokens are available, the caller sleeps (sync) or awaits (async) until the next refill. |
+| Release | No explicit release — tokens refill automatically on a rolling window. |
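+
+The refill/acquire cycle above is the classic token-bucket algorithm. The sketch below is an illustrative standalone implementation of that idea, not the library's actual code:
+
+```python
+import threading
+import time
+
+class TokenBucket:
+    """Illustrative token bucket: refills at requests_per_minute / 60 tokens per second."""
+
+    def __init__(self, requests_per_minute: int, burst: int = 1):
+        self.rate = requests_per_minute / 60.0  # tokens regenerated per second
+        self.capacity = float(burst)            # bucket never holds more than `burst`
+        self.tokens = float(burst)
+        self.last = time.monotonic()
+        self.lock = threading.Lock()
+
+    def acquire(self) -> None:
+        while True:
+            with self.lock:
+                now = time.monotonic()
+                # Refill: credit tokens for the elapsed time, capped at capacity
+                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
+                self.last = now
+                if self.tokens >= 1.0:
+                    self.tokens -= 1.0          # reserve one request slot
+                    return
+                wait = (1.0 - self.tokens) / self.rate
+            time.sleep(wait)                    # sleep outside the lock so others can refill
+```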
+
+---
+
+## Choose Your Mode
+
+```mermaid
+graph TB
+    Start[Need rate limiting?] --> Q1{Single agent,<br/>simple RPM?}
+    Q1 -->|Yes| A[Use max_rpm=N<br/>in ExecutionConfig]
+    Q1 -->|No| Q2{Multiple agents<br/>sharing budget?}
+    Q2 -->|Yes| B[Create one RateLimiter<br/>and pass to each agent]
+    Q2 -->|No| Q3{Provider quotes<br/>TPM, not just RPM?}
+    Q3 -->|Yes| C[Set tokens_per_minute<br/>on RateLimiter]
+    Q3 -->|No| A
+
+    classDef question fill:#6366F1,stroke:#7C90A0,color:#fff
+    classDef answer fill:#10B981,stroke:#7C90A0,color:#fff
+
+    class Start,Q1,Q2,Q3 question
+    class A,B,C answer
+```
+
+---
+
+## Configuration Options
+
-The standalone `rate_limiter=` parameter is deprecated. Use `execution=ExecutionConfig(rate_limiter=obj)` instead.
-
-## Parameters
-
-| Parameter | Description | Default |
-|-----------|-------------|---------|
-| `requests_per_minute` | Max requests per minute | Required |
-| `tokens_per_minute` | Token-based limiting (optional) | None |
-| `burst` | Max burst size | 1 |
+| Option | Type | Default | Description |
+|--------|------|---------|-------------|
+| `requests_per_minute` | `int` | Required | Max LLM requests per rolling 60-second window. |
+| `tokens_per_minute` | `int` | `None` | Optional token-budget limit (for TPM-quoted providers). |
+| `burst` | `int` | `1` | Max burst size — requests allowed back-to-back before the rate kicks in. |
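+
+As a concrete example, the three options combine like this (the numbers are illustrative; match them to your provider tier):
+
+```python
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(
+    requests_per_minute=60,    # hard cap: 60 requests per rolling minute
+    tokens_per_minute=90_000,  # also cap total API tokens per minute
+    burst=5,                   # allow up to 5 back-to-back requests
+)
+
+limiter.acquire()  # blocks until a request slot is free
+```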
+
+---
+
+## Thread Safety & Multi-Agent Use
+
+Every method on `RateLimiter` — both sync (`acquire`, `acquire_tokens`, `try_acquire`, `reset`) and async (`acquire_async`, `acquire_tokens_async`) — is safe to call concurrently. You can share a single `RateLimiter` across threads, `AgentTeam` members, `PraisonAIAgents`, and `ParallelToolCallExecutor` workers without exceeding the configured budget.
+
+### Thread pool with shared limiter
+
+```python
+from concurrent.futures import ThreadPoolExecutor
+from praisonaiagents import Agent
+from praisonaiagents.config.feature_configs import ExecutionConfig
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60, burst=5)
+
+def run_agent(question: str) -> str:
+    agent = Agent(
+        name="Worker",
+        instructions="Answer concisely",
+        execution=ExecutionConfig(rate_limiter=limiter),
+    )
+    return agent.start(question)
+
+with ThreadPoolExecutor(max_workers=10) as pool:
+    answers = list(pool.map(run_agent, [f"Q{i}" for i in range(50)]))
+```
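+
+### Sharing across coroutines
+
+The same limiter works from async code. A minimal sketch using the async method family listed above; the fan-out pattern itself is illustrative:
+
+```python
+import asyncio
+
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60, burst=5)
+
+async def worker(i: int) -> None:
+    # Awaits instead of blocking, so the event loop stays responsive
+    await limiter.acquire_async()
+    print(f"request {i} admitted")
+
+async def main() -> None:
+    await asyncio.gather(*(worker(i) for i in range(10)))
+
+asyncio.run(main())
+```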
+
+### Monitoring available budget
+
+```python
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60, tokens_per_minute=90_000)
+
+print(f"Requests left: {limiter.available_tokens:.1f}")
+print(f"API tokens left: {limiter.available_api_tokens:.1f}")
+```
+
+<Note>
+`available_tokens` and `available_api_tokens` are safe to read from any thread — they acquire the same locks as `acquire()` internally.
+</Note>
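+
+For longer runs, a small daemon thread can log the remaining budget periodically; a sketch built on the same two properties:
+
+```python
+import threading
+import time
+
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60, tokens_per_minute=90_000)
+
+def log_budget(interval: float = 5.0) -> None:
+    # Property reads take the same locks as acquire(), so this is thread-safe
+    while True:
+        print(f"requests left: {limiter.available_tokens:.1f}")
+        time.sleep(interval)
+
+threading.Thread(target=log_budget, daemon=True).start()
+```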
+
+---
+
 ## Manual Usage
 
+When not using `ExecutionConfig`, you can acquire tokens directly:
+
 ```python
 limiter = RateLimiter(requests_per_minute=60)
 
@@ -72,8 +223,45 @@ if limiter.try_acquire():
     pass
 ```
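+
+`try_acquire` also enables a non-blocking pattern when the caller has other work to do while the bucket refills. A sketch (`do_other_work` and `call_llm` are hypothetical placeholders):
+
+```python
+import time
+
+from praisonaiagents.llm import RateLimiter
+
+limiter = RateLimiter(requests_per_minute=60)
+
+def send_with_backpressure(payload: str) -> None:
+    # Poll instead of blocking so the loop can keep doing useful work
+    while not limiter.try_acquire():
+        do_other_work()   # hypothetical: any work that doesn't hit the API
+        time.sleep(0.1)
+    call_llm(payload)     # hypothetical: the rate-limited API call
+```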
 
+---
+
+## Best Practices
+
+- **Share one limiter per provider key.** If three agents hit the same provider key, give them the same `RateLimiter` so the combined throughput stays inside quota.
+- **Tune `burst` to your traffic shape.** A low burst (1–5) smooths traffic; a high burst tolerates spiky demand.
+- **Limit tokens as well as requests.** OpenAI / Anthropic quote both RPM and TPM — limiting only on RPM can still trip 429s.
+- **Keep async workflows on the async API.** `agent.achat(...)` automatically calls `acquire_async()`; avoid mixing sync and async limiters in one workflow.
+
+---
+
 ## CLI
 
 ```bash
 praisonai "task" --rpm 60
 ```
+
+---
+
+## Related
+
+<CardGroup cols={2}>
+  <Card title="Thread Safety" href="/docs/features/thread-safety">
+    Thread-safe chat history and caches
+  </Card>
+  <Card title="Concurrency">
+    Limit parallel agent runs
+  </Card>
+</CardGroup>
\ No newline at end of file
diff --git a/docs/features/thread-safety.mdx b/docs/features/thread-safety.mdx
index e278da61..dccd8d66 100644
--- a/docs/features/thread-safety.mdx
+++ b/docs/features/thread-safety.mdx
@@ -84,6 +84,10 @@ Internal caches use `threading.RLock` for reentrant locking:
 - `_system_prompt_cache` - Cached system prompts
 - `_formatted_tools_cache` - Cached tool definitions
 
+### Rate Limiter
+
+`RateLimiter` can be shared across threads and agents. Both the sync and async method families are fully locked — see [Rate Limiter → Thread Safety & Multi-Agent Use](/docs/features/rate-limiter#thread-safety--multi-agent-use) for patterns.
+
 ## LiteAgent Thread Safety
 
 The lite package also provides thread-safe operations:
 
@@ -112,6 +116,8 @@ with threading.ThreadPoolExecutor(max_workers=5) as executor:
 |-----------|-----------|--------|
 | chat_history | `Lock` | Simple mutual exclusion |
 | caches | `RLock` | Allows reentrant access |
+| `RateLimiter` (sync) | `threading.Lock` | Protects `_tokens`, `_api_tokens`, and refill state from races in multi-threaded acquire calls |
+| `RateLimiter` (async) | `asyncio.Lock` | Same protection for coroutine contexts |
 
 ### Lock Usage Pattern