2 changes: 1 addition & 1 deletion docs/features/rate-limiter.mdx
@@ -5,7 +5,7 @@ description: "Token bucket rate limiting for LLM API calls"

## Overview

Control API request rates with a token bucket algorithm. Prevents rate limit errors and manages costs.
Control API request rates with a token bucket algorithm. Prevents rate limit errors and manages costs. The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
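
To make the shared-limiter claim concrete, here is a minimal, library-independent sketch of the token bucket idea (this illustrates the algorithm only; it is not praisonaiagents' actual implementation). Every call, initial or follow-up, draws from the same bucket:

```python
import time

class TokenBucket:
    """Minimal token bucket sketch: `rate` tokens refill per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def acquire(self, tokens: float = 1.0) -> None:
        """Block until `tokens` are available, then consume them."""
        while True:
            now = time.monotonic()
            # Refill based on elapsed time, capped at capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)  # wait for the deficit to refill

bucket = TokenBucket(rate=2.0, capacity=5.0)  # ~2 requests/second, bursts up to 5
for i in range(3):
    bucket.acquire()
    print(f"request {i} allowed")
```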

## Quick Start

97 changes: 95 additions & 2 deletions docs/features/streaming.mdx
@@ -172,6 +172,89 @@ asyncio.run(main())

---

## Streaming with Tools

When your agent uses tools, streaming happens in two phases: the initial response that decides to call tools, and a follow-up response that synthesizes the tool results.

```mermaid
sequenceDiagram
    participant U as User
    participant A as Agent
    participant L as LLM
    participant T as Tools

    U->>A: Request with stream=True
    A->>L: Phase 1 (streamed)
    L-->>A: "I'll use tool_name..."
    A->>T: Execute tool_name()
    T-->>A: Tool result
    A->>L: Phase 2 follow-up (streamed)
    L-->>A: Synthesized response
    A-->>U: Combined stream

    Note over L: Both phases use retry-wrapped LLM calls
```
Comment on lines +179 to +196

medium

The new Mermaid sequence diagram is missing the standard color scheme mentioned in the PR checklist and used in other diagrams in this file. Applying these colors ensures visual consistency across the documentation.

```mermaid
sequenceDiagram
    participant U as User #6366F1
    participant A as Agent #F59E0B
    participant L as LLM #189AB4
    participant T as Tools #10B981

    U->>A: Request with stream=True
    A->>L: Phase 1 (streamed)
    L-->>A: "I'll use tool_name..."
    A->>T: Execute tool_name()
    T-->>A: Tool result
    A->>L: Phase 2 follow-up (streamed)
    L-->>A: Synthesized response
    A-->>U: Combined stream

    Note over L: Both phases use retry-wrapped LLM calls
```


```python
from praisonaiagents import Agent, tool

@tool
def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: 72°F, sunny"

agent = Agent(
    instructions="You are a weather assistant",
    tools=[get_weather]
)

for chunk in agent.start("What's the weather in Paris?", stream=True):
    print(chunk, end="", flush=True)
```

Both phases go through the same retry-wrapped LLM path, so transient rate-limit or network errors are retried automatically without any caller intervention.
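
The wrapper itself lives inside the library; purely to illustrate the pattern it applies, here is a generic retry-with-backoff helper (a sketch of the concept, not praisonaiagents' internal code):

```python
import random
import time

def call_with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Generic retry-with-exponential-backoff sketch (not the library's internals)."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:  # a real wrapper would catch only transient error types
            if attempt == attempts - 1:
                raise  # retries exhausted: the caller sees the failure
            # Exponential backoff with jitter to avoid synchronized retries.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```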

---

## Error Handling in the Stream

If the LLM call still fails after the built-in retries, the stream ends with a visible error message instead of failing silently.

You may receive this exact sentinel string:

```
[Error: Failed to generate final response after tool execution (ref: followup-1713957912345). Please retry. If it continues, try reducing prompt size.]
```

| Part | Meaning |
|------|---------|
| `ref: followup-<timestamp>` | Correlation ID logged server-side — share this when reporting issues |
| `Please retry` | Retries already ran internally; another attempt may succeed if the root cause was transient |
| `reducing prompt size` | Common root cause is context-length or provider capacity errors |
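
Since the sentinel format is fixed, the correlation ID can be extracted with a small regular expression (pure stdlib; the pattern assumes the `followup-<timestamp>` shape shown above):

```python
import re

SENTINEL_REF = re.compile(r"\[Error: .*?\(ref: (followup-\d+)\)")

text = (
    "[Error: Failed to generate final response after tool execution "
    "(ref: followup-1713957912345). Please retry. If it continues, "
    "try reducing prompt size.]"
)

match = SENTINEL_REF.search(text)
if match:
    print(f"correlation ID: {match.group(1)}")  # -> followup-1713957912345
```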

Detect the error sentinel in your stream consumer:

```python
from praisonaiagents import Agent

agent = Agent(instructions="You are a helpful assistant", tools=[...])
```

medium

The code example uses tools=[...], which is not valid Python and violates the 'copy-paste runnable' requirement mentioned in the PR checklist. Please use an empty list or a valid tool reference.

```python
agent = Agent(instructions="You are a helpful assistant", tools=[])
```


full = ""
for chunk in agent.iter_stream("Research and summarize quantum computing"):
full += chunk
print(chunk, end="", flush=True)

if "[Error:" in full and "ref:" in full:
# Surface ref to your logs / retry externally
print(f"\n⚠️ Error detected, check logs for correlation ID")
```
Comment on lines +237 to +250

⚠️ Potential issue | 🟡 Minor

Replace the tools=[...] placeholder with a runnable example.

The literal [...] makes this snippet fail to execute on copy-paste, which contradicts the documentation standard that every Python example must run unmodified. Reuse the get_weather tool from the section above (or any concrete function) so readers can actually reproduce the sentinel-detection flow.

♻️ Proposed fix

```diff
-from praisonaiagents import Agent
-
-agent = Agent(instructions="You are a helpful assistant", tools=[...])
+from praisonaiagents import Agent
+
+def get_weather(city: str) -> str:
+    """Get weather for a city."""
+    return f"Weather in {city}: 72°F, sunny"
+
+agent = Agent(instructions="You are a helpful assistant", tools=[get_weather])
```

As per coding guidelines: "Every Python code example must include all necessary imports and run without modification".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/features/streaming.mdx` around lines 237 - 250, The code example uses a
non-runnable placeholder tools=[...] which breaks copy-paste; replace it with a
concrete tool (e.g. the existing get_weather function from the previous section)
and include the necessary import/registration so the Agent instantiation is
self-contained: import or define get_weather, wrap it as a tool the Agent
accepts (matching the project’s tool API), pass tools=[get_weather] into
Agent(...), and keep the rest (full accumulation and sentinel detection using
iter_stream) unchanged so the snippet runs unmodified and demonstrates the
sentinel-detection flow with Agent.iter_stream.


<Note>
The **initial** LLM call and the **follow-up** LLM call (after tool execution) now share the same retry and rate-limiting behavior — users no longer need to add their own retry wrapper around streaming + tools.
</Note>
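
If you still want one application-level attempt on top, for example when a transient provider outage outlasts the internal retries, a sketch reusing the `get_weather` tool and `iter_stream()` shown above (the retry count and pause are arbitrary choices, not library defaults):

```python
import time

from praisonaiagents import Agent

def get_weather(city: str) -> str:
    """Get weather for a city."""
    return f"Weather in {city}: 72°F, sunny"

agent = Agent(instructions="You are a weather assistant", tools=[get_weather])

full = ""
for attempt in range(2):  # one external retry on top of the built-in ones
    full = "".join(agent.iter_stream("What's the weather in Paris?"))
    if "[Error:" not in full:
        break
    time.sleep(5)  # arbitrary pause before the second attempt

print(full)
```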

---

## StreamEvent Protocol

Every streaming chunk emits a `StreamEvent` with full context.
@@ -284,7 +367,7 @@ praisonai chat --stream --verbose "Explain quantum computing"
</Accordion>

<Accordion title="Handle errors in callbacks">
The emitter catches callback exceptions silently to avoid breaking the stream. Log errors inside your callback.
There are two layers of error handling. Callback exceptions are still caught by the emitter to avoid breaking the stream — log them inside your callback. LLM call failures, however, are now retried automatically and, on persistent failure, surface as a visible `[Error: ... (ref: ...)]` sentence at the end of the stream — check for this sentinel when consuming `iter_stream()`.
</Accordion>
</AccordionGroup>
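
For the first layer, a defensive callback might look like the sketch below. The hook name `on_chunk` and its signature are hypothetical here; only the log-inside-the-callback pattern is the point.

```python
import logging

logger = logging.getLogger("stream-callbacks")

def on_chunk(event):  # hypothetical hook name and signature; adapt to the real callback API
    try:
        # Your real per-chunk handling goes here.
        print(getattr(event, "chunk", event), end="", flush=True)
    except Exception:
        # The emitter swallows callback exceptions, so log them here
        # or the failure stays invisible.
        logger.exception("stream callback failed")
```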

@@ -303,15 +386,25 @@ This is TTFT, not buffering. The model is generating the first token. Check:

Normal. Providers may batch tokens for efficiency.

### "Stream ends with `[Error: Failed to generate final response after tool execution (ref: followup-...)]`"

The follow-up LLM call (the one that synthesizes tool results into a final answer) failed after the built-in retries. Common causes:
- Persistent rate limit — pair streaming with a [Rate Limiter](/docs/features/rate-limiter) at higher RPM, or back off the caller.
- Context-length overflow — reduce conversation history or tool-result size.
- Provider outage — include the `ref:` ID when reporting. The internal log line (`ref=..., model=..., error=...`) makes it searchable.

---

## Related

<CardGroup cols={2}>
<CardGroup cols={3}>
<Card title="Output & Display" icon="display" href="/docs/features/display-system">
Output formatting options
</Card>
<Card title="Async" icon="clock" href="/docs/features/async">
Async agent execution
</Card>
<Card title="Rate Limiter" icon="gauge" href="/docs/features/rate-limiter">
Control request rates across initial and follow-up LLM calls
</Card>
</CardGroup>