From c4e482b2b341167b0ab3fee4f1835e2e4861cf07 Mon Sep 17 00:00:00 2001
From: MervinPraison <454862+MervinPraison@users.noreply.github.com>
Date: Fri, 24 Apr 2026 10:17:17 +0000
Subject: [PATCH] docs: Update streaming.mdx to cover tool follow-up retries
 and new in-stream error messages

- Add 'Streaming with Tools' section with Mermaid diagram explaining two-phase flow
- Add 'Error Handling in the Stream' section documenting new error sentinel format
- Update 'Handle errors in callbacks' accordion to explain both layers of error handling
- Add troubleshooting entry for '[Error: ... ref: followup-...]' messages
- Extend Related cards to include Rate Limiter with 3-column layout
- Add cross-link in rate-limiter.mdx explaining shared rate limiting behavior

Fixes #247

Co-authored-by: Mervin Praison
---
 docs/features/rate-limiter.mdx |  2 +-
 docs/features/streaming.mdx    | 97 +++++++++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 3 deletions(-)

diff --git a/docs/features/rate-limiter.mdx b/docs/features/rate-limiter.mdx
index 60ed96a99..9dc76760d 100644
--- a/docs/features/rate-limiter.mdx
+++ b/docs/features/rate-limiter.mdx
@@ -5,7 +5,7 @@ description: "Token bucket rate limiting for LLM API calls"
 
 ## Overview
 
-Control API request rates with token bucket algorithm. Prevents rate limit errors and manages costs.
+Control API request rates with token bucket algorithm. Prevents rate limit errors and manages costs. The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
 
 ## Quick Start
diff --git a/docs/features/streaming.mdx b/docs/features/streaming.mdx
index 641a44e94..f86680aef 100644
--- a/docs/features/streaming.mdx
+++ b/docs/features/streaming.mdx
@@ -172,6 +172,89 @@ asyncio.run(main())
 
 ---
 
+## Streaming with Tools
+
+When your agent uses tools, streaming happens in two phases: the initial response that decides to call tools, and a follow-up response that synthesizes the tool results.
+
+```mermaid
+sequenceDiagram
+    participant U as User
+    participant A as Agent
+    participant L as LLM
+    participant T as Tools
+
+    U->>A: Request with stream=True
+    A->>L: Phase 1 (streamed)
+    L-->>A: "I'll use tool_name..."
+    A->>T: Execute tool_name()
+    T-->>A: Tool result
+    A->>L: Phase 2 follow-up (streamed)
+    L-->>A: Synthesized response
+    A-->>U: Combined stream
+
+    Note over L: Both phases use retry-wrapped LLM calls
+```
+
+```python
+from praisonaiagents import Agent, tool
+
+@tool
+def get_weather(city: str) -> str:
+    """Get weather for a city."""
+    return f"Weather in {city}: 72°F, sunny"
+
+agent = Agent(
+    instructions="You are a weather assistant",
+    tools=[get_weather]
+)
+
+for chunk in agent.start("What's the weather in Paris?", stream=True):
+    print(chunk, end="", flush=True)
+```
+
+Both phases go through the same retry-wrapped LLM path, so transient rate-limit or network errors are retried automatically without any caller intervention.
+
+---
+
+## Error Handling in the Stream
+
+If the LLM call fails after retries, the stream ends with a visible error message instead of dropping silently.
+
+You may receive this exact sentinel string:
+
+```
+[Error: Failed to generate final response after tool execution (ref: followup-1713957912345). Please retry. If it continues, try reducing prompt size.]
+```
+
+| Part | Meaning |
+|------|---------|
+| `ref: followup-` | Correlation ID logged server-side — share this when reporting issues |
+| `Please retry` | Retries already ran internally; another attempt may succeed if the root cause was transient |
+| `reducing prompt size` | Common root cause is context-length or provider capacity errors |
+
+Detect the error sentinel in your stream consumer:
+
+```python
+from praisonaiagents import Agent
+
+agent = Agent(instructions="You are a helpful assistant", tools=[...])
+
+full = ""
+for chunk in agent.iter_stream("Research and summarize quantum computing"):
+    full += chunk
+    print(chunk, end="", flush=True)
+
+if "[Error:" in full and "ref:" in full:
+    # Surface ref to your logs / retry externally
+    print("\n⚠️ Error detected, check logs for correlation ID")
+```
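+
+Built-in retries normally make an extra wrapper unnecessary, as noted below, but callers that want a belt-and-braces safeguard can simply re-issue the request when the sentinel appears. The sketch below is illustrative only: the helper name is made up, the attempt count and backoff values are arbitrary, and it assumes `iter_stream()` yields plain text chunks as in the example above.
+
+```python
+import time
+
+from praisonaiagents import Agent
+
+agent = Agent(instructions="You are a helpful assistant", tools=[...])  # add your tools, as above
+
+def stream_with_external_retry(prompt: str, attempts: int = 2, backoff: float = 5.0) -> str:
+    """Re-run a streamed request if the in-stream error sentinel shows up (illustrative helper)."""
+    full = ""
+    for attempt in range(attempts):
+        full = ""
+        for chunk in agent.iter_stream(prompt):
+            full += chunk
+            print(chunk, end="", flush=True)
+        if "[Error:" not in full:
+            return full  # clean stream, nothing to retry
+        print(f"\nSentinel detected on attempt {attempt + 1}, retrying...")
+        time.sleep(backoff * (attempt + 1))  # simple linear backoff between attempts
+    return full  # still failing: surface the ref from the sentinel to your logs
+
+result = stream_with_external_retry("Research and summarize quantum computing")
+```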
+
+The **initial** LLM call and the **follow-up** LLM call (after tool execution) now share the same retry and rate-limiting behavior — users no longer need to add their own retry wrapper around streaming + tools.
+
+---
+
 ## StreamEvent Protocol
 
 Every streaming chunk emits a `StreamEvent` with full context.
 
@@ -284,7 +367,7 @@ praisonai chat --stream --verbose "Explain quantum computing"
-  The emitter catches callback exceptions silently to avoid breaking the stream. Log errors inside your callback.
+  Two layers of error handling. Callback exceptions are still caught by the emitter to avoid breaking the stream — log them inside your callback. LLM call failures, however, are now retried automatically and, on persistent failure, surface as a visible `[Error: ... (ref: ...)]` message at the end of the stream — check for this sentinel when consuming `iter_stream()`.
@@ -303,15 +386,25 @@ This is TTFT, not buffering. The model is generating the first token. Check:
 
 Normal. Providers may batch tokens for efficiency.
 
+### "Stream ends with `[Error: Failed to generate final response after tool execution (ref: followup-...)]`"
+
+The follow-up LLM call (the one that synthesizes tool results into a final answer) failed after the built-in retries. Common causes:
+
+- Persistent rate limit — pair streaming with a [Rate Limiter](/docs/features/rate-limiter) at higher RPM, or back off the caller.
+- Context-length overflow — reduce conversation history or tool-result size.
+- Provider outage — include the `ref:` ID when reporting. The internal log line (`ref=..., model=..., error=...`) makes it searchable.
+
 ---
 
 ## Related
 
-<CardGroup cols={2}>
+<CardGroup cols={3}>
    Output formatting options
 
    Async agent execution
 
+  <Card title="Rate Limiter" href="/docs/features/rate-limiter">
+    Control request rates across initial and follow-up LLM calls
+  </Card>