
What's the recommended shape for chat orchestrators with variable step count (up to ~99) and total runtime that may exceed 800s? Per-step gap growth makes per-LLM-round splits unviable #1737

@shtefcs

Problem

Using DurableAgent with the Workflow SDK in development mode results in 7-12 seconds of overhead per workflow step, making even simple agent interactions take 2-4 minutes instead of 10-15 seconds. This makes development and testing of DurableAgent-based applications effectively impossible.

We are building a multi-agent AI orchestration platform with 10 agents, 56+ tools, and complex agent chaining (web research -> coding -> deployment). The original implementation using direct streamText calls completes in 10-30 seconds. The same functionality wrapped in DurableAgent takes 3-12 minutes due to per-step overhead.
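
For reference, the baseline path is a plain ai-sdk call; a minimal sketch (the model id and prompt are placeholders):

import { streamText } from "ai";
import { openai } from "@ai-sdk/openai";

// Direct call: one streaming round-trip to the provider, tokens arrive
// immediately. This is the 10-30s path. DurableAgent wraps the same call so
// that each LLM round and each tool execution becomes its own workflow step,
// which is where the per-step overhead multiplies.
const result = await streamText({
  model: openai("gpt-4o"), // placeholder model id
  prompt: "Research the topic, then draft the deployment plan.",
});

for await (const chunk of result.textStream) {
  process.stdout.write(chunk);
}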

Environment

  • workflow: 4.2.1
  • @workflow/ai: 4.1.1
  • @workflow/core: 4.2.1
  • @workflow/world-postgres: 4.1.0
  • Next.js: 16.1.6 (Turbopack)
  • Node.js: 22.x
  • Platform: WSL2 on Windows, 48GB RAM
  • Database: Neon Postgres (US East)

Measured Latency

Every POST /.well-known/workflow/v1/flow request takes 7-12 seconds:

POST flow 200 in 9.1s  (compile: 2ms, render: 9.1s)
POST flow 200 in 7.9s  (compile: 2ms, render: 7.9s)
POST flow 200 in 6.9s  (compile: 3ms, render: 6.9s)
POST flow 200 in 9.7s  (compile: 2ms, render: 9.7s)
POST flow 200 in 7.1s  (compile: 4ms, render: 7.1s)
POST flow 200 in 7.6s  (compile: 3ms, render: 7.6s)
POST flow 200 in 8.0s  (compile: 4ms, render: 8.0s)

The 2-4ms compile times confirm route compilation is cached; the 7-12s render times are pure SDK processing.

A simple general agent query (1 LLM call, no tools) takes 3.5 minutes total. A coding agent session with 15 tool calls takes 8-12 minutes.

With the original direct streamText approach (no workflow), the same queries complete in 10-30 seconds.

Root Cause Analysis

We traced the issue through the SDK source code and identified three compounding factors:

1. Full event log replay on every flow request

In @workflow/core/dist/runtime/helpers.js, getAllWorkflowRunEvents() loads ALL events from the database on every flow request via paginated queries. The SDK developers have acknowledged this is suboptimal in a TODO comment at lines 231-233:

"TODO: we're currently loading all the data with resolveRef behaviour. We need to update this to lazyload the data from the world instead so that we can optimize and make the event log loading much faster and memory efficient"

For a workflow with 30 completed steps, this means 30+ event records loaded from the database on every single flow request, even though only the latest step result is needed to resume.
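
The shape of that hot path, paraphrased from the compiled output (the world method name and page shape here are our approximations from reading the dist bundle, not the actual source):

async function getAllWorkflowRunEvents(world, runId) {
  const events = [];
  let cursor;
  do {
    // Every flow request starts at the first page and walks the entire log,
    // eagerly resolving referenced payloads (the resolveRef behaviour the
    // TODO mentions), even for steps already replayed on earlier requests.
    const page = await world.listEvents(runId, { cursor });
    events.push(...page.items);
    cursor = page.nextCursor;
  } while (cursor);
  return events;
}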

2. Linear growth of step processing time

This matches the behavior reported in #1315, where step-to-step overhead grows linearly with completed step count (+0.62s per minute of runtime). Our DurableAgent workflows with 15-30 tool calls accumulate significant replay overhead as the workflow progresses.
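
A back-of-envelope model of why this hurts in aggregate: if replaying after k completed steps costs roughly c * k for some per-event cost c, then an n-step workflow pays c * (1 + 2 + ... + n) = c * n(n + 1) / 2 in total replay overhead. Each individual step degrades linearly, but the whole run degrades quadratically. (Our model, not a measured fit.)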

3. Queue round-trip latency

Each step requires a full round-trip through the queue system:

  1. Flow handler suspends at step boundary
  2. Step queued via graphile-worker (or Vercel Queue)
  3. Step handler executes (the actual useful work: 1-5s for LLM calls)
  4. Step re-queues the flow
  5. Flow handler replays entire event log again

This matches the queue delay observations in #1160 where steps spend 4-5s in queue even when execution takes less than 200ms.
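
To isolate queue wait from replay cost, one can probe graphile-worker directly against the same database, outside the SDK; a sketch (the probe task name and enqueuedAt payload field are ours, not SDK names):

import { run, quickAddJob } from "graphile-worker";

const connectionString = process.env.DATABASE_URL!;

async function main() {
  const runner = await run({
    connectionString,
    concurrency: 1,
    taskList: {
      // The task compares the enqueue timestamp embedded in the payload
      // with the time it actually starts executing.
      probe: async (payload) => {
        const { enqueuedAt } = payload as { enqueuedAt: number };
        console.log(`queue wait: ${Date.now() - enqueuedAt}ms`);
      },
    },
  });

  await quickAddJob({ connectionString }, "probe", { enqueuedAt: Date.now() });

  // Give the worker time to pick up the job, then shut down.
  await new Promise((resolve) => setTimeout(resolve, 10_000));
  await runner.stop();
}

main();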

Impact on DurableAgent Development

DurableAgent splits each LLM call and each tool execution into separate workflow steps. A typical coding agent session involves:

  • 1 routing step
  • 1 prepareAgentConfig step
  • 5-15 doStreamStep + tool execution cycles (LLM call + tool call per cycle)
  • 1 processAgentResult step
  • 1 verify loop step
  • 1 finalization step

That is 10-20 steps minimum. At 7-12s overhead per step, the workflow overhead alone (excluding actual LLM/tool execution time) is 70-240 seconds. This makes development iteration painfully slow.

What We Have Tried

  • Local World: Faster step-to-step (in-memory queue) but still has 2-3s per step overhead from VM replay. Also causes OOM crashes on long sessions and loses state on server restart.
  • Postgres World with remote Neon DB: 7-12s per step due to network latency multiplied by event count.
  • Reducing step count: Combining post-agent steps brought them from 5 down to 3. This helps marginally, but the fundamental per-step overhead remains.
  • Adjusting worker concurrency and pool size: No meaningful impact since the bottleneck is event replay, not worker scheduling.

Expected Behavior

Development mode should be usable for testing DurableAgent workflows. A reasonable target would be less than 500ms of SDK overhead per step, making a 20-step workflow add less than 10 seconds of total overhead.

Suggested Improvements

  1. Incremental event loading: Instead of loading ALL events on every flow request, maintain a checkpoint and load only events since the last processed step. The event log is append-only, so this is safe. (See the first sketch after this list.)

  2. In-process step execution for Local World: Instead of routing steps through HTTP (flow -> POST step -> step executes -> POST flow), execute step functions directly in the same process for Local World. This eliminates the queue round-trip entirely. (See the second sketch after this list.)

  3. Event log pagination with cursor caching: Cache the cursor position between flow requests so subsequent calls skip already-replayed events.

  4. Resume-from-checkpoint: Instead of replaying the entire workflow from the beginning, snapshot the VM state after each step and resume from the snapshot. This would make step processing O(1) instead of O(n).
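
A sketch of suggestion 1 under stated assumptions: the WorkflowEvent shape, World interface, and getEvents method below are illustrative, not the real @workflow/core surface. The only property it relies on is the append-only log.

interface WorkflowEvent {
  seq: number; // monotonically increasing; the log is append-only
  type: string;
  payload: unknown;
}

interface World {
  // Paginated read of events strictly after `afterSeq`.
  getEvents(runId: string, afterSeq: number, limit: number): Promise<WorkflowEvent[]>;
}

// Checkpoint per run. A real implementation would persist this next to the
// run record so any instance can resume from it.
const checkpoints = new Map<string, number>();

export async function loadNewEvents(world: World, runId: string): Promise<WorkflowEvent[]> {
  let cursor = checkpoints.get(runId) ?? 0;
  const fresh: WorkflowEvent[] = [];
  for (;;) {
    const page = await world.getEvents(runId, cursor, 100);
    if (page.length === 0) break;
    fresh.push(...page);
    cursor = page[page.length - 1].seq;
  }
  checkpoints.set(runId, cursor);
  // Events before the checkpoint can never change, so skipping them is safe.
  return fresh;
}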

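And a sketch of suggestion 2, direct in-process dispatch for Local World (the step registry and names are hypothetical, not SDK internals):

type StepFn = (input: unknown) => Promise<unknown>;

const stepRegistry = new Map<string, StepFn>();

// Instead of: flow suspends -> enqueue -> POST /step -> execute -> enqueue ->
// POST /flow -> full replay, a Local World can invoke the step function
// directly and hand the result straight back to the flow.
export async function runStepInProcess(name: string, input: unknown): Promise<unknown> {
  const step = stepRegistry.get(name);
  if (!step) throw new Error(`unknown step: ${name}`);
  return step(input); // same process, no queue round-trip
}
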
Related Issues

  • #1315: step-to-step overhead grows linearly with completed step count
  • #1160: steps spend 4-5s in queue even when execution takes under 200ms