
Top 3 Architectural Gaps: Sync/Async Duplication, Sequential Tool Execution, Fragmented Streaming #1392

@MervinPraison

Description


Summary

In-depth architectural analysis of the core SDK (praisonaiagents) identified 3 key gaps that directly violate the project's stated principles (DRY, performance-first, protocol-driven, minimal API). These are distinct from the production robustness issues tracked in #1370 and the handoff unification in #1290 — they concern the execution core (LLM dispatch, tool calling, streaming) rather than session/concurrency/error semantics.

Each gap below includes verified file:line references, user-facing impact, and a concrete protocol-driven fix aligned with PraisonAI's philosophy.


Gap 1: Sync/Async Duplication in Core Execution Paths

Severity: High | Affects: Maintainability, feature parity, bug surface area

Core execution logic is duplicated across parallel sync and async implementations instead of sharing a single async-first core with a thin sync bridge.

Evidence

  • praisonaiagents/agent/chat_mixin.py: chat() / _chat_impl() and achat() / _achat_impl() re-implement the same business logic (retrieval, context assembly, hook dispatch, guardrail application, tool invocation) in two places.
  • praisonaiagents/agent/execution_mixin.py: execute_tool() vs execute_tool_async() maintain parallel code paths for the same concern.
  • praisonaiagents/llm/llm.py: get_response() vs get_response_async() carry overlapping provider dispatch, streaming decisions, retry/error logic, and message building.
  • Partial remediation already exists: praisonaiagents/agent/unified_chat_mixin.py introduces UnifiedLLMDispatcher for LLM dispatch (see #1304), but the broader chat / tool-execution surface still has twin sync+async paths.

Impact

  • Every bug fix must be applied twice. Features silently drift (streaming, guardrails, context truncation, cost tracking) between sync and async surfaces.
  • Complicates contribution: new providers and features must remember to mirror changes in both paths.
  • Hot-path code is larger than necessary → higher cognitive load, more places for regressions.

Proposed Fix (protocol-driven, minimal API)

Finish the unification started in unified_chat_mixin.py across the remaining execution surface:

  1. Single async-first core: one async def _execute(...) method per concern (chat, tool execution, LLM dispatch).
  2. Thin sync bridge: sync entry points become asyncio.run(self._execute(...)) or hand off to a worker thread / run_coroutine_threadsafe when already inside an event loop (see the sketch after this list).
  3. Deprecate duplicated sync-only impls once parity is verified.
  4. Keep public API unchanged: agent.chat(...) / agent.achat(...), agent.start(...) / agent.astart(...) remain as-is.
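
A minimal sketch of the pattern, assuming a hypothetical _execute_chat as the shared core (these names and the bridge details are illustrative, not PraisonAI's actual internals):

import asyncio
import concurrent.futures

class ChatMixin:
    async def _execute_chat(self, prompt: str, **kwargs) -> str:
        # Single source of truth: retrieval, context assembly, hook dispatch,
        # guardrails, and tool invocation live here, exactly once.
        ...

    async def achat(self, prompt: str, **kwargs) -> str:
        return await self._execute_chat(prompt, **kwargs)

    def chat(self, prompt: str, **kwargs) -> str:
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            # No running loop: drive the async core directly.
            return asyncio.run(self._execute_chat(prompt, **kwargs))
        # Called from inside a running loop (e.g. a notebook): run the core
        # on a fresh loop in a worker thread to avoid deadlocking the caller.
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(
                asyncio.run, self._execute_chat(prompt, **kwargs)
            ).result()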

Aligns with: DRY, protocol-driven core, "3-way feature surface" (CLI/YAML/Python all benefit from a single implementation).


Gap 2: LLM Tool Calls Execute Sequentially — No Parallelism for Batched Tool Calls

Severity: High | Affects: Latency, LLM efficiency, user experience

When the LLM returns multiple tool calls in a single assistant message (OpenAI, Anthropic, Gemini all support this), PraisonAI executes them one after another instead of in parallel.

Evidence

praisonaiagents/llm/llm.py:3305-3323 — verified sequential loop:

# Execute tool calls and add results to conversation
for tool_call in tool_calls:
    is_ollama = self._is_ollama_provider()
    function_name, arguments, tool_call_id = self._extract_tool_call_info(tool_call, is_ollama)
    try:
        tool_result = execute_tool_fn(function_name, arguments, tool_call_id=tool_call_id)
        tool_message = self._create_tool_message(function_name, tool_result, tool_call_id, is_ollama)
        messages.append(tool_message)
    except Exception as e:
        logging.error(f"Tool execution error for {function_name}: {e}")
        ...

Per-tool timeout infrastructure already exists in praisonaiagents/agent/tool_execution.py (execute_tool with timeout handling around lines 196-223), and concurrent.futures is imported at agent/agent.py:9 and agent/execution_mixin.py:336 — but only used for autonomous runs, not for batched tool calls.

Note: #1290 addressed parallel handoffs to other agents, which is a separate code path from batched LLM tool calls on a single agent — this gap is not covered there.

Impact

  • If an LLM requests, say, 3 independent tool calls (e.g. fetch_user, fetch_analytics, fetch_config) at 300 ms each, total latency is ~900 ms instead of ~300 ms.
  • Directly hurts agent workflows that batch I/O (web scraping, multi-source aggregation, parallel API queries) — the exact use-case where agent frameworks should shine.
  • Users perceive the agent as "stuck" during the tool phase because no tokens are streamed while tools run serially.

Proposed Fix (opt-in, protocol-driven, backward compatible)

Introduce a ToolCallExecutor protocol with two implementations, selected by a single Agent(parallel_tool_calls=True) flag (default off to preserve current behavior):

# praisonaiagents/tools/call_executor.py (new, lightweight protocol)
from typing import Protocol

class ToolCallExecutor(Protocol):
    def execute_batch(self, calls: list[ToolCall]) -> list[ToolResult]: ...

class SequentialToolCallExecutor:
    # current behavior — safe default fallback
    ...

class ParallelToolCallExecutor:
    # thread pool for sync tools, asyncio.gather for async tools
    # respects existing per-tool timeout; bounded max_workers
    ...

Replace llm.py:3305-3323 with:

executor = self._get_tool_call_executor()  # picks parallel or sequential
tool_results = executor.execute_batch(tool_calls)
for function_name, tool_result, tool_call_id, is_ollama in tool_results:
    messages.append(self._create_tool_message(function_name, tool_result, tool_call_id, is_ollama))
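
For illustration, a hedged sketch of ParallelToolCallExecutor under stated assumptions: ToolCall and ToolResult are modeled here as named tuples (which makes the 4-tuple unpacking above work), and execute_tool_fn is the existing per-tool entry point; none of these exact shapes exist in the SDK today.

import concurrent.futures
from typing import Any, Callable, NamedTuple

class ToolCall(NamedTuple):
    function_name: str
    arguments: dict
    tool_call_id: str
    is_ollama: bool

class ToolResult(NamedTuple):
    function_name: str
    output: Any
    tool_call_id: str
    is_ollama: bool

class ParallelToolCallExecutor:
    def __init__(self, execute_tool_fn: Callable, max_workers: int = 8,
                 batch_timeout: float | None = None):
        self._execute = execute_tool_fn
        self._max_workers = max_workers
        self._timeout = batch_timeout  # bounds the whole batch, not one tool

    def execute_batch(self, calls: list[ToolCall]) -> list[ToolResult]:
        def run_one(call: ToolCall) -> ToolResult:
            output = self._execute(call.function_name, call.arguments,
                                   tool_call_id=call.tool_call_id)
            return ToolResult(call.function_name, output,
                              call.tool_call_id, call.is_ollama)

        # Bounded thread pool; Executor.map preserves input order, so results
        # come back in the order the LLM issued the calls.
        with concurrent.futures.ThreadPoolExecutor(self._max_workers) as pool:
            return list(pool.map(run_one, calls, timeout=self._timeout))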

Aligns with: performance-first, minimal API (one flag), protocol-driven core, no hot-path regressions (off by default).


Gap 3: Streaming Logic Is Fragmented Across Providers — Not Unified Behind a Protocol

Severity: Medium | Affects: Reliability, provider parity, future maintainability

Streaming capability decisions, error handling, and fallback behavior are scattered as provider-specific conditionals inside the main LLM loop rather than isolated behind a streaming-provider protocol.

Evidence

  • praisonaiagents/llm/llm.py:~2080-2300 contains tangled conditionals (Gemini-specific, Ollama-specific, generic-fallback) deciding whether streaming can proceed.

  • praisonaiagents/llm/adapters/__init__.py:176-182 — verified: Anthropic streaming is hard-disabled with a comment explaining a litellm/async-generator bug:

    def supports_streaming(self) -> bool:
        # litellm.acompletion with stream=True returns a ModelResponse (not async generator)
        # for Anthropic in the async path, causing 'async for requires __aiter__' error
        return False
    
    def supports_streaming_with_tools(self) -> bool:
        return False

    Users who request streaming with Anthropic silently fall back to non-streaming with no signal.

  • praisonaiagents/streaming/events.py defines StreamEvent types, but the actual streaming deltas emitted from llm.py don't consistently use this protocol — it's used mostly for trace/observability events.

  • Each provider's streaming error handling lives inline in llm.py rather than on the adapter, so adding a new provider requires editing the core loop instead of dropping in an adapter.

Impact

  • Inconsistent user expectations: some providers stream, some don't, tools-plus-streaming combos vary by provider — behavior is hard to predict without reading source.
  • Silent fallback from streaming to non-streaming is invisible to callers who rely on stream=True for perceived latency.
  • Adding a new LLM provider touches core llm.py instead of just an adapter — violates protocol-driven principle.
  • The existing StreamEvent protocol is underused, leaving streaming semantics informal.

Proposed Fix (complete the existing adapter pattern)

  1. Extend adapters in praisonaiagents/llm/adapters/ with a full streaming interface:

    from collections.abc import AsyncIterator
    from typing import Protocol

    class LLMAdapter(Protocol):
        def can_stream(self, *, tools: list | None, **kwargs) -> bool: ...
        async def stream_completion(self, **kwargs) -> AsyncIterator[StreamEvent]: ...
        def is_stream_error_recoverable(self, exc: Exception) -> bool: ...
  2. Move the Gemini/Ollama/Anthropic-specific branches out of llm.py:2080-2300 and onto each adapter.

  3. Standardize streaming deltas on streaming/events.py::StreamEvent so downstream consumers (UI, telemetry, hooks) have a stable contract.

  4. When an adapter cannot stream in the requested configuration, emit an observable StreamUnavailableEvent instead of silent fallback — users can log, warn, or choose a different provider.
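
To make the flow concrete, a hedged sketch of how the core loop could consume the protocol (stream_or_fallback, adapter.name, adapter.complete, and the emit callback are illustrative assumptions; StreamUnavailableEvent is the event proposed in item 4):

from dataclasses import dataclass

@dataclass
class StreamUnavailableEvent:
    provider: str
    reason: str

async def stream_or_fallback(adapter, emit, **kwargs):
    if not adapter.can_stream(tools=kwargs.get("tools")):
        # Observable fallback: downstream consumers can log, warn, or pick
        # another provider instead of silently losing streaming.
        emit(StreamUnavailableEvent(
            provider=adapter.name,
            reason="streaming unsupported in this configuration"))
        return await adapter.complete(**kwargs)  # assumed non-streaming call
    async for event in adapter.stream_completion(**kwargs):
        emit(event)  # StreamEvent deltas per streaming/events.py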

Aligns with: protocol-driven core, DRY, production-ready (transparent fallback), extensibility (new providers = new adapter only).


Why these three, and not others

Other candidates were filtered out by scope: production robustness belongs to #1370 and handoff unification to #1290 (see Summary). The remaining highest-leverage gaps are: the core execution paths are still duplicated (Gap 1), the hottest performance path loses 2–3x on batched tool calls (Gap 2), and the provider abstraction leaks into the core loop for streaming (Gap 3). Each directly violates a stated pillar, and each has a concrete protocol-driven fix.

Suggested Priority

  1. Gap 2 — Parallel tool execution. Smallest change, largest user-visible performance win, opt-in (zero regression risk).
  2. Gap 1 — Sync/async unification. Pays down long-term maintenance cost; builds on the existing UnifiedLLMDispatcher / unified_chat_mixin.py work (#1304).
  3. Gap 3 — Streaming protocol. Medium-effort cleanup; unlocks reliable multi-provider streaming and new-provider onboarding.

Acceptance Criteria (high level)

  • Gap 1: chat/tool-execution sync paths delegate to a shared async core; no duplicated business logic between _chat_impl / _achat_impl and execute_tool / execute_tool_async.
  • Gap 2: Agent(parallel_tool_calls=True) runs batched LLM tool calls concurrently with bounded workers, per-tool timeout, and identical result ordering semantics; default remains sequential.
  • Gap 3: Each LLM adapter owns its streaming capability decisions and error recovery; llm.py contains no provider-name conditionals for streaming; Anthropic streaming either works end-to-end via the adapter or surfaces an explicit StreamUnavailableEvent.
  • Real agentic test per feature: a live agent.start("…") that exercises each fix against an actual LLM, not just object construction.
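
For Gap 2, such a live test could look like this sketch (parallel_tool_calls is the flag proposed above, not a released parameter; the tool functions and timings are invented for the test):

import time
from praisonaiagents import Agent

def fetch_user(user_id: str) -> str:
    """Simulated 300 ms I/O-bound tool."""
    time.sleep(0.3)
    return f"user:{user_id}"

def fetch_config(env: str) -> str:
    """Simulated 300 ms I/O-bound tool."""
    time.sleep(0.3)
    return f"config:{env}"

agent = Agent(
    instructions="Fetch the user and the config, then summarize both.",
    tools=[fetch_user, fetch_config],
    parallel_tool_calls=True,  # proposed flag (Gap 2)
)

start = time.perf_counter()
result = agent.start("Get user 42 and the prod config.")
elapsed = time.perf_counter() - start
# With parallel execution the tool phase should cost ~0.3 s, not ~0.6 s.
print(result, f"(tool-phase wall clock: {elapsed:.2f}s)")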

    Labels

    bug (Something isn't working), claude (Auto-trigger Claude analysis), enhancement (New feature or request)
