From c4e482b2b341167b0ab3fee4f1835e2e4861cf07 Mon Sep 17 00:00:00 2001
From: MervinPraison <454862+MervinPraison@users.noreply.github.com>
Date: Fri, 24 Apr 2026 10:17:17 +0000
Subject: [PATCH] docs: Update streaming.mdx to cover tool follow-up retries
 and new in-stream error messages

- Add 'Streaming with Tools' section with Mermaid diagram explaining two-phase flow
- Add 'Error Handling in the Stream' section documenting new error sentinel format
- Update 'Handle errors in callbacks' accordion to explain both layers of error handling
- Add troubleshooting entry for '[Error: ... ref: followup-...]' messages
- Extend Related cards to include Rate Limiter with 3-column layout
- Add cross-link in rate-limiter.mdx explaining shared rate limiting behavior

Fixes #247

Co-authored-by: Mervin Praison
---
 docs/features/rate-limiter.mdx |  2 +-
 docs/features/streaming.mdx    | 97 +++++++++++++++++++++++++++++++++-
 2 files changed, 96 insertions(+), 3 deletions(-)

diff --git a/docs/features/rate-limiter.mdx b/docs/features/rate-limiter.mdx
index 60ed96a99..9dc76760d 100644
--- a/docs/features/rate-limiter.mdx
+++ b/docs/features/rate-limiter.mdx
@@ -5,7 +5,7 @@ description: "Token bucket rate limiting for LLM API calls"
 
 ## Overview
 
-Control API request rates with token bucket algorithm. Prevents rate limit errors and manages costs.
+Control API request rates with token bucket algorithm. Prevents rate limit errors and manages costs. The rate limiter is shared by both the initial LLM call and the follow-up call that runs after tool execution in streaming mode — you don't need to configure them separately.
 
 ## Quick Start
diff --git a/docs/features/streaming.mdx b/docs/features/streaming.mdx
index 641a44e94..f86680aef 100644
--- a/docs/features/streaming.mdx
+++ b/docs/features/streaming.mdx
@@ -172,6 +172,89 @@ asyncio.run(main())
 
 ---
 
+## Streaming with Tools
+
+When your agent uses tools, streaming happens in two phases: the initial response that decides to call tools, and a follow-up response that synthesizes the tool results.
+
+```mermaid
+sequenceDiagram
+    participant U as User
+    participant A as Agent
+    participant L as LLM
+    participant T as Tools
+
+    U->>A: Request with stream=True
+    A->>L: Phase 1 (streamed)
+    L-->>A: "I'll use tool_name..."
+    A->>T: Execute tool_name()
+    T-->>A: Tool result
+    A->>L: Phase 2 follow-up (streamed)
+    L-->>A: Synthesized response
+    A-->>U: Combined stream
+
+    Note over L: Both phases use retry-wrapped LLM calls
+```
+
+```python
+from praisonaiagents import Agent, tool
+
+@tool
+def get_weather(city: str) -> str:
+    """Get weather for a city."""
+    return f"Weather in {city}: 72°F, sunny"
+
+agent = Agent(
+    instructions="You are a weather assistant",
+    tools=[get_weather]
+)
+
+for chunk in agent.start("What's the weather in Paris?", stream=True):
+    print(chunk, end="", flush=True)
+```
+
+Both phases go through the same retry-wrapped LLM path, so transient rate-limit or network errors are retried automatically without any caller intervention.
+
+---
+
+## Error Handling in the Stream
+
+If the LLM call fails after retries, the stream ends with a visible error message instead of dropping silently.
+
+You may receive this exact sentinel string:
+
+```
+[Error: Failed to generate final response after tool execution (ref: followup-1713957912345). Please retry. If it continues, try reducing prompt size.]
+```
+
+| Part | Meaning |
+|------|---------|
+| `ref: followup-` | Correlation ID logged server-side — share this when reporting issues |
+| `Please retry` | Retries already ran internally; another attempt may succeed if the root cause was transient |
+| `reducing prompt size` | Common root cause is context-length or provider capacity errors |
+
+Detect the error sentinel in your stream consumer:
+
+```python
+from praisonaiagents import Agent
+
+agent = Agent(instructions="You are a helpful assistant", tools=[...])
+
+full = ""
+for chunk in agent.iter_stream("Research and summarize quantum computing"):
+    full += chunk
+    print(chunk, end="", flush=True)
+
+if "[Error:" in full and "ref:" in full:
+    # Surface ref to your logs / retry externally
+    print("\n⚠️ Error detected, check logs for correlation ID")
+```
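+
+Built-in retries normally make an extra wrapper unnecessary, as noted below, but callers that want a belt-and-braces safeguard can simply re-issue the request when the sentinel appears. The sketch below is illustrative only: the helper name is made up, the attempt count and backoff values are arbitrary, and it assumes `iter_stream()` yields plain text chunks as in the example above.
+
+```python
+import time
+
+from praisonaiagents import Agent
+
+agent = Agent(instructions="You are a helpful assistant", tools=[...])  # add your tools, as above
+
+def stream_with_external_retry(prompt: str, attempts: int = 2, backoff: float = 5.0) -> str:
+    """Re-run a streamed request if the in-stream error sentinel shows up (illustrative helper)."""
+    full = ""
+    for attempt in range(attempts):
+        full = ""
+        for chunk in agent.iter_stream(prompt):
+            full += chunk
+            print(chunk, end="", flush=True)
+        if "[Error:" not in full:
+            return full  # clean stream, nothing to retry
+        print(f"\nSentinel detected on attempt {attempt + 1}, retrying...")
+        time.sleep(backoff * (attempt + 1))  # simple linear backoff between attempts
+    return full  # still failing: surface the ref from the sentinel to your logs
+
+result = stream_with_external_retry("Research and summarize quantum computing")
+```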
+
+The **initial** LLM call and the **follow-up** LLM call (after tool execution) now share the same retry and rate-limiting behavior — users no longer need to add their own retry wrapper around streaming + tools.
+
+---
+
 ## StreamEvent Protocol
 
 Every streaming chunk emits a `StreamEvent` with full context.
 
@@ -284,7 +367,7 @@ praisonai chat --stream --verbose "Explain quantum computing"
-  The emitter catches callback exceptions silently to avoid breaking the stream. Log errors inside your callback.
+  Two layers of error handling. Callback exceptions are still caught by the emitter to avoid breaking the stream — log them inside your callback. LLM call failures, however, are now retried automatically and, on persistent failure, surface as a visible `[Error: ... (ref: ...)]` message at the end of the stream — check for this sentinel when consuming `iter_stream()`.
@@ -303,15 +386,25 @@ This is TTFT, not buffering. The model is generating the first token. Check:
 
 Normal. Providers may batch tokens for efficiency.
 
+### "Stream ends with `[Error: Failed to generate final response after tool execution (ref: followup-...)]`"
+
+The follow-up LLM call (the one that synthesizes tool results into a final answer) failed after the built-in retries. Common causes:
+
+- Persistent rate limit — pair streaming with a [Rate Limiter](/docs/features/rate-limiter) at higher RPM, or back off the caller.
+- Context-length overflow — reduce conversation history or tool-result size.
+- Provider outage — include the `ref:` ID when reporting. The internal log line (`ref=..., model=..., error=...`) makes it searchable.
+
 ---
 
 ## Related
 
-<CardGroup cols={2}>
+<CardGroup cols={3}>
    Output formatting options
 
    Async agent execution
 
+  <Card title="Rate Limiter" href="/docs/features/rate-limiter">
+    Control request rates across initial and follow-up LLM calls
+  </Card>