240 changes: 214 additions & 26 deletions docs/features/rate-limiter.mdx
---
title: "Rate Limiter"
description: "Token bucket rate limiting for LLM API calls"
sidebarTitle: "Rate Limiter"
description: "Cap API request rate and token usage across agents and threads"
icon: "gauge-high"
---

## Overview

Rate Limiter caps how fast your agents call the LLM, so you stay inside provider rate limits and protect your budget — safely, even when many agents share the same limiter.

```mermaid
graph LR
subgraph "Shared Rate Limiter"
A1[🤖 Agent 1] --> L{⚖️ RateLimiter<br/>60 rpm}
A2[🤖 Agent 2] --> L
A3[🤖 Agent 3] --> L
L -->|Allow| API[☁️ LLM API]
L -->|Wait| Queue[⏳ Queued]
Queue --> L
end

classDef agent fill:#8B0000,stroke:#7C90A0,color:#fff
classDef limiter fill:#F59E0B,stroke:#7C90A0,color:#fff
classDef api fill:#10B981,stroke:#7C90A0,color:#fff
classDef queue fill:#6366F1,stroke:#7C90A0,color:#fff

class A1,A2,A3 agent
class L limiter
class API api
class Queue queue
```

The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
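
A sketch of what this means in practice (the `get_weather` tool is hypothetical, and `tools=[...]` is assumed as the way tools are attached):

```python
from praisonaiagents import Agent
from praisonaiagents.config.feature_configs import ExecutionConfig

def get_weather(city: str) -> str:
    """Hypothetical tool: return a canned weather report."""
    return f"Sunny in {city}"

agent = Agent(
    name="Assistant",
    instructions="Answer questions, using tools when helpful.",
    tools=[get_weather],  # assumed tool-attachment style
    execution=ExecutionConfig(max_rpm=60),
)

# One streamed turn may spend two slots from the 60-rpm budget:
# the initial call, plus the follow-up call after the tool runs.
agent.start("What's the weather in Paris?")
```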

## Quick Start

<Steps>
<Step title="Simple RPM limit on one agent">
```python
from praisonaiagents import Agent
from praisonaiagents.config.feature_configs import ExecutionConfig

agent = Agent(
    name="Researcher",
    instructions="You research topics on the web.",
    execution=ExecutionConfig(max_rpm=60)
)

agent.start("Summarise the latest Mars rover news")
```
</Step>

<Step title="Share one limiter across multiple agents">
```python
from praisonaiagents import Agent, PraisonAIAgents
from praisonaiagents.config.feature_configs import ExecutionConfig
from praisonaiagents.llm import RateLimiter

shared = RateLimiter(requests_per_minute=60, burst=5)

researcher = Agent(
    name="Researcher",
    instructions="Research topics",
    execution=ExecutionConfig(rate_limiter=shared)
)
writer = Agent(
    name="Writer",
    instructions="Write articles",
    execution=ExecutionConfig(rate_limiter=shared)
)

team = PraisonAIAgents(agents=[researcher, writer])
team.start()
```

<Note>
The same `RateLimiter` instance can be shared across any number of agents and threads — the combined throughput stays inside the configured budget.
</Note>
</Step>

<Step title="Token-based limiting (for TPM-quoted providers)">
```python
from praisonaiagents import Agent
from praisonaiagents.config.feature_configs import ExecutionConfig
from praisonaiagents.llm import RateLimiter

limiter = RateLimiter(
    requests_per_minute=60,
    tokens_per_minute=90_000,
    burst=5,
)

agent = Agent(
    name="Analyst",
    instructions="Analyse long documents",
    execution=ExecutionConfig(rate_limiter=limiter)
)
```
</Step>
</Steps>

---

## How It Works

```mermaid
sequenceDiagram
participant Agent1
participant Agent2
participant Limiter as RateLimiter
participant LLM

Agent1->>Limiter: acquire()
Limiter->>Limiter: lock → refill → -1 token
Limiter-->>Agent1: ok
Agent1->>LLM: request

Agent2->>Limiter: acquire()
Limiter->>Limiter: lock (waits for Agent1)
Limiter->>Limiter: refill → -1 token
Limiter-->>Agent2: ok
Agent2->>LLM: request
```

| Step | What happens |
|------|--------------|
| Refill | Tokens regenerate based on elapsed time and `requests_per_minute` / `tokens_per_minute`. |
| Acquire | A thread reserves a token; under contention, only one thread mutates state at a time. |
| Wait | If no tokens are available, the caller sleeps (sync) or awaits (async) until the next refill. |
| Release | No explicit release — tokens refill automatically on a rolling window. |

> **Review comment (medium):** The description for the 'Acquire' step states that "under contention, only one thread mutates state at a time". This is currently inaccurate for synchronous calls: `RateLimiter.acquire()` in `praisonaiagents/llm/rate_limiter.py` does not use any `threading.Lock`, so concurrent calls from multiple threads will race when updating internal state like `_tokens` and `_last_update`.
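
To make the cycle concrete, here is a minimal, self-contained token-bucket sketch. It is illustrative only, not the library's implementation; it shows the refill/acquire/wait semantics the table describes, with an explicit lock:

```python
import threading
import time

class TokenBucketSketch:
    """Illustrative token bucket: refill on demand, block until a slot frees up."""

    def __init__(self, requests_per_minute: int, burst: int = 1):
        self.rate = requests_per_minute / 60.0  # tokens regenerated per second
        self.capacity = burst                   # most tokens the bucket can hold
        self._tokens = float(burst)
        self._last_update = time.monotonic()
        self._lock = threading.Lock()

    def _refill(self) -> None:
        now = time.monotonic()
        elapsed = now - self._last_update
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last_update = now

    def acquire(self) -> None:
        while True:
            with self._lock:        # one thread mutates state at a time
                self._refill()
                if self._tokens >= 1:
                    self._tokens -= 1
                    return
                wait = (1 - self._tokens) / self.rate
            time.sleep(wait)        # sleep outside the lock so others can proceed
```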

---

## Choose Your Mode

```mermaid
graph TB
Start[Need rate limiting?] --> Q1{Single agent,<br/>simple RPM?}
Q1 -->|Yes| A[Use max_rpm=N<br/>in ExecutionConfig]
Q1 -->|No| Q2{Multiple agents<br/>sharing budget?}
Q2 -->|Yes| B[Create one RateLimiter<br/>and pass to each agent]
Q2 -->|No| Q3{Provider quotes<br/>TPM not just RPM?}
Q3 -->|Yes| C[Set tokens_per_minute<br/>on RateLimiter]
Q3 -->|No| A

classDef question fill:#6366F1,stroke:#7C90A0,color:#fff
classDef answer fill:#10B981,stroke:#7C90A0,color:#fff

class Start,Q1,Q2,Q3 question
class A,B,C answer
```

---

## Configuration Options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `requests_per_minute` | `int` | Required | Max LLM requests per rolling 60-second window. |
| `tokens_per_minute` | `int` | `None` | Optional token-budget limit (for TPM-quoted providers). |
| `burst` | `int` | `1` | Max burst size — requests allowed back-to-back before the rate kicks in. |
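
For example, with `requests_per_minute=60` and `burst=5`, five requests can fire back-to-back; the bucket then refills at one token per second, so sustained throughput settles at 60 requests per minute.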

---

## Thread Safety & Multi-Agent Use

<Note>
Every method on `RateLimiter` — both sync (`acquire`, `acquire_tokens`, `try_acquire`, `reset`) and async (`acquire_async`, `acquire_tokens_async`) — is safe to call concurrently. You can share a single `RateLimiter` across threads, `AgentTeam` members, `PraisonAIAgents`, and `ParallelToolCallExecutor` workers without exceeding the configured budget.
</Note>

> **Review comment (medium):** This section claims that synchronous methods such as `acquire` and `reset` are safe to call concurrently. However, the source for `RateLimiter` lacks `threading.Lock` protection on these methods, so sharing a single instance across threads (e.g. via the `ThreadPoolExecutor` example below) is not thread-safe for synchronous operations in the current implementation.

### Thread pool with shared limiter

```python
from concurrent.futures import ThreadPoolExecutor
from praisonaiagents import Agent
from praisonaiagents.config.feature_configs import ExecutionConfig
from praisonaiagents.llm import RateLimiter

limiter = RateLimiter(requests_per_minute=60, burst=5)

def run_agent(question: str) -> str:
    agent = Agent(
        name="Worker",
        instructions="Answer concisely",
        execution=ExecutionConfig(rate_limiter=limiter),
    )
    return agent.start(question)

with ThreadPoolExecutor(max_workers=10) as pool:
    answers = list(pool.map(run_agent, [f"Q{i}" for i in range(50)]))
```

### Monitoring available budget

```python
limiter = RateLimiter(requests_per_minute=60, tokens_per_minute=90_000)

print(f"Requests left: {limiter.available_tokens:.1f}")
print(f"API tokens left: {limiter.available_api_tokens:.1f}")
```

<Note>
`available_tokens` and `available_api_tokens` are safe to read from any thread — they acquire the same locks as `acquire()` internally.
</Note>

> **Review comment (medium):** The assertion that `available_tokens` and `available_api_tokens` acquire locks internally is incorrect. Neither these properties nor the synchronous `acquire()` method implements locking in the provided source, which makes monitoring the budget from multiple threads unsafe: it triggers state mutation via `_refill()` without synchronization.

---

## Manual Usage

When not using `ExecutionConfig`, you can acquire tokens directly:

```python
from praisonaiagents.llm import RateLimiter

limiter = RateLimiter(requests_per_minute=60)

# Blocking: waits until a request slot is available
limiter.acquire()

# Non-blocking: returns immediately; False means no slot was free
if limiter.try_acquire():
    pass  # proceed with the LLM call
```
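
The async counterparts follow the same pattern. A short sketch, assuming `acquire_tokens_async` takes the estimated token count for the upcoming call (the exact signature is an assumption):

```python
import asyncio
from praisonaiagents.llm import RateLimiter

limiter = RateLimiter(requests_per_minute=60, tokens_per_minute=90_000)

async def call_with_budget() -> None:
    # Await a request slot without blocking the event loop
    await limiter.acquire_async()
    # Also reserve an estimated token budget (argument shape is assumed)
    await limiter.acquire_tokens_async(1_500)
    # ... make the LLM call here ...

asyncio.run(call_with_budget())
```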

---

## Best Practices

<AccordionGroup>
<Accordion title="Share one limiter across related agents">
If three agents hit the same provider key, give them the same `RateLimiter` so the combined throughput stays inside quota.
</Accordion>

<Accordion title="Match burst to your workload">
A low burst (1–5) smooths traffic; a high burst tolerates spiky demand.
</Accordion>

<Accordion title="Use tokens_per_minute when the provider charges by tokens">
OpenAI / Anthropic quote both RPM and TPM — limiting only on RPM can still trip 429s.
</Accordion>

<Accordion title="Prefer async paths in async flows">
`agent.achat(...)` automatically calls `acquire_async()`; avoid mixing sync and async limiters in one workflow.
</Accordion>
</AccordionGroup>

---

## CLI

```bash
praisonai "task" --rpm 60
```

---

## Related

<CardGroup cols={2}>
<Card icon="lock" href="/docs/features/thread-safety">
Thread-safe chat history and caches
</Card>
<Card icon="gauge" href="/docs/features/concurrency">
Limit parallel agent runs
</Card>
</CardGroup>
6 changes: 6 additions & 0 deletions docs/features/thread-safety.mdx
Internal caches use `threading.RLock` for reentrant locking:
- `_system_prompt_cache` - Cached system prompts
- `_formatted_tools_cache` - Cached tool definitions
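
As a rough illustration of the pattern (the names here are hypothetical, not the library's internals):

```python
import threading

_cache_lock = threading.RLock()
_system_prompt_cache: dict[str, str] = {}

def build_prompt(key: str) -> str:
    """Stand-in for real prompt construction (hypothetical helper)."""
    return f"You are agent '{key}'."

def get_system_prompt(key: str) -> str:
    # RLock: a thread already holding the lock may re-enter,
    # e.g. if prompt building itself consults the cache.
    with _cache_lock:
        if key not in _system_prompt_cache:
            _system_prompt_cache[key] = build_prompt(key)
        return _system_prompt_cache[key]
```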

### Rate Limiter

`RateLimiter` can be shared across threads and agents. Both the sync and async method families are fully locked — see [Rate Limiter → Thread Safety & Multi-Agent Use](/docs/features/rate-limiter#thread-safety--multi-agent-use) for patterns.
> **Review comment (medium):** The documentation states that the synchronous method family of `RateLimiter` is 'fully locked'. However, the implementation in `praisonaiagents/llm/rate_limiter.py` only uses locks for the lazy initialization of async locks, not for the synchronous methods. This statement is misleading for users expecting thread safety in multi-threaded environments.


## LiteAgent Thread Safety

The lite package also provides thread-safe operations:
| Component | Mechanism | Notes |
|-----------|-----------|--------|
| chat_history | `Lock` | Simple mutual exclusion |
| caches | `RLock` | Allows reentrant access |
| `RateLimiter` (sync) | `threading.Lock` | Protects `_tokens`, `_api_tokens`, and refill state from races in multi-threaded acquire calls |
| `RateLimiter` (async) | `asyncio.Lock` | Same protection for coroutine contexts |

> **Review comment (medium):** The table entry for `RateLimiter` (sync) incorrectly identifies `threading.Lock` as the protection mechanism for internal state. In reality, the implementation does not use this lock during acquisition calls, leaving the state vulnerable to races in multi-threaded contexts. Either the documentation should be corrected to reflect the actual implementation, or the implementation should be updated to match these claims.

### Lock Usage Pattern
