Skip to content

perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends#2910

Closed
Sydney Runkle (sydney-runkle) wants to merge 12 commits into
mainfrom
sr/delta-channel-perf-combo
Closed

perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends#2910
Sydney Runkle (sydney-runkle) wants to merge 12 commits into
mainfrom
sr/delta-channel-perf-combo

Conversation

@sydney-runkle
Copy link
Copy Markdown
Collaborator

@sydney-runkle Sydney Runkle (sydney-runkle) commented Apr 23, 2026

Summary

Bundles multiple perf improvements against the latest PyPI releases of langchain / langgraph. Deepagents is pinned to combined sr/deepagents-perf-combo branches on both upstream repos.

Each optimization is a standalone change — the catalog further down lists every contributing commit and which upstream branch it lives on.

Companion branches

  • langchain-ai/langchain#sr/deepagents-perf-combo
  • langchain-ai/langgraph#sr/deepagents-perf-combo

Updated benchmark: heavier realistic workload (LG 1.2.0a1 + LC PR #37101)

Workload: create_deep_agent + InMemorySaver, parallel tool dispatch (all tools in one model call). Per turn: 2 × 30 KB file writes + 1 × 4 KB tool result (stays in messages) + 1 × 20 KB AI response; every 10th turn an 80 KB tool result is evicted to files by FilesystemMiddleware. ~6,144 tokens/turn~1.23M tokens at N=200.

Envs tested:

  • Baseline: LG 1.1.9 + LC 1.2.15, no delta, no schema cache
  • Delta only: LG 1.2.0a1 + LC 1.2.16, delta channel, no schema cache
  • Full combo: LG 1.2.0a1 + LC PR #37101 (tool schema caching), delta channel

N=200 head-to-head

config run time checkpoint storage get_state
baseline: no delta 17.3s 28 GB 16 ms
delta only (no schema cache) 16.4s 78 MB 22 ms
delta + schema cache 5.4s 78 MB 24 ms

The 28 GB baseline breaks down as: __pregel_tasks 9.8 GB (54%) + messages 6.7 GB (37%) + files 1.7 GB (9%) in blobs, plus ~10 GB of the same in the writes table. Two distinct O(N²) problems:

channel baseline (LC 1.2.15) LC 1.2.16 + DeltaChannel
__pregel_tasks O(N²) — full state inlined per Send O(N) fixed by stop-inlining same
messages O(N²) O(N²) O(N)
files O(N²) O(N²) O(N)
  • LC 1.2.16 stop-inlining fix addresses the dominant 54% of baseline storage independently of delta channel.
  • Delta channel takes messages + files from O(N²) to O(N) — the remaining 46%.
  • Schema caching (PR #37101): 16.4s → 5.4s (3× speedup). _create_subset_model_v2 was rebuilding 9 unique Pydantic models from scratch ~10,800 times per 50 turns.
  • Combined vs baseline: 17.3s → 5.4s (3.2×), 28 GB → 78 MB (360×).

Profiling: what's in the remaining 5.4s (delta + schema cached, N=200)

category share
Thread pool spin-up/join (3 parallel tools/turn) ~50%
Checkpoint reads (ormsgpack.unpackb) ~15%
LangGraph pregel loop / routing ~15%
builtins.repr (LangSmith tracing) ~10%
Token counting in FilesystemMiddleware ~5%
Everything else ~5%

No single hotspot dominates — this is the LangGraph execution floor for the current architecture.

Snapshot frequency (delta + schema cached, N=200, ~1.23M tokens)

snapshot_frequency is in pregel steps; ~8 steps/turn with parallel dispatch.

config run time storage get_state
snap=None 4.6s 77 MB 22 ms
snap=5 turns (40 steps) 4.3s 715 MB 14 ms
snap=10 turns (80 steps) 4.1s 388 MB 16 ms
snap=25 turns (200 steps) 4.1s 192 MB 17 ms

snap=None wins on storage (O(N) — only sentinels in blobs, deltas in writes table). Snapshots are still O(N²) total because each snapshot blob captures the full accumulated state. Read time differences are ~2 ms on InMemorySaver — real gap requires Postgres to quantify.


Original benchmark setup (lighter workload)

  • Workload: create_deep_agent + InMemorySaver, deterministic mock model. Per turn: 1 × 1 KB file write + 1 tool call. Tool results are 2 KB most turns; every 10th turn returns 82 KB which exceeds the 20k-token threshold and is evicted into the files channel by FilesystemMiddleware. Per-turn state growth ≈ 11 KB ≈ ~2.75k tokens (4 chars/token). At N=200 that accumulates to ~550k tokens — a realistic long-running agent thread.
  • Reproduce: uv run --project libs/deepagents python bench_perf_combo.py (script + results in this PR).
  • Metrics:
    • Elapsed = time.perf_counter() around the N-turn invoke loop (construction excluded).
    • Peak memory = tracemalloc.get_traced_memory()[1] — Python allocator peak across the full loop.
    • Checkpoint storage = serialized bytes resident in InMemorySaver.{blobs, writes, storage} at end of run (what a durable saver would persist).

Headline: async — baseline → full combo across N

Wall clock

N ~tokens baseline combo speedup
5 14k 1.65s 0.08s 21×
25 70k 8.72s 0.56s 16×
50 140k 17.33s 1.67s 10×
100 280k 37.08s 5.96s 6.2×
200 550k 79.15s 20.37s 3.9×

Peak memory

N ~tokens baseline combo reduction
5 14k 2.3 MB 0.6 MB 3.8×
25 70k 32.5 MB 2.0 MB 16×
50 140k 120.3 MB 3.5 MB 34×
100 280k 467.7 MB 7.0 MB 67×
200 550k 1950 MB 14.6 MB 134×

Checkpoint storage

N ~tokens baseline combo reduction
5 14k 1.3 MB 0.2 MB 7.2×
25 70k 30.1 MB 0.8 MB 40×
50 140k 117.2 MB 1.4 MB 83×
100 280k 461.9 MB 2.9 MB 162×
200 550k 1939 MB 6.2 MB 314×

durability="exit"

Wall clock

N baseline combo speedup
25 8.53s 0.28s 30×
100 35.32s 1.19s 30×
200 75.34s 2.61s 29×

(Peak memory and storage reductions are even larger in exit mode because there's only one checkpoint per invoke — see bench_results.json.)


Ablation: who contributes what (async wall-clock)

Five configs, each adds one change cumulatively. Every perf commit is isolated to exactly one step — no bundling.

Config langchain langgraph
A baseline released 1.2.15 released 1.1.9
B1 + DeltaChannel only sr/perf-delta-channel-only delta-channel-writes-based
B2 + tool_call_schema caching (5 langchain-core commits) sr/delta-channel-messages same as B1
C + add_messages fast-path same as B2 sr/deepagents-perf-delta-plus-addmsgs
D + Sends/hydrate + new openai-dict/chars caching + truncate-args dedup sr/deepagents-perf-combo sr/deepagents-perf-combo

Wall clock (s)

N A B1 B2 C D full speedup
5 1.65 1.51 0.31 0.32 0.08 21×
25 8.72 7.93 1.92 1.96 0.56 16×
50 17.33 16.94 5.06 4.73 1.67 10×
100 37.08 40.28 16.54 12.46 5.96 6.2×
200 79.15 124.13 72.97 35.24 20.37 3.9×

Surprising finding: DeltaChannel alone (B1) is slower than baseline at high N in async mode — 124s vs 79s at N=200. DeltaChannel saves per-step serialization cost but pays an O(N²) load-time cost (each invoke's channels_from_checkpoint walks the delta write history and replays through add_messages). Without the other optimizations to offset that, async becomes net-negative for wall clock. Memory and storage still drop by ~40% in B1 — the slowdown is pure CPU.

Which step delivers what (async elapsed):

  • A→B1 (DeltaChannel alone): noisy / negative at N≥100 on wall clock. Saves ~40% memory and storage.
  • B1→B2 (tool_call_schema caching, 5 commits): dominant wall-clock win — N=200 goes 124s → 73s (−41%). Zero memory/storage effect.
  • B2→C (add_messages fast-path): matters at high N — N=200 goes 73s → 35s (−52%). Zero memory/storage effect.
  • C→D (Sends/hydrate + today's 3 fixes): big jump — N=200 goes 35s → 20s (−42%). Also kills the remaining 60% of memory/storage.

Peak memory (MB)

N A B1 B2 C D full reduction
5 2.3 1.8 1.5 1.6 0.6 3.8×
25 32.5 20.9 20.3 20.3 2.0 16×
50 120.3 74.3 74.3 74.3 3.5 34×
100 467.7 284.1 283.5 283.6 7.0 67×
200 1950.0 1191.8 1191.7 1191.6 14.6 134×

Checkpoint storage (MB)

N A B1 B2 C D full reduction
5 1.3 0.8 0.8 0.8 0.2 7.2×
25 30.1 18.2 18.2 18.2 0.8 40×
50 117.2 70.8 70.8 70.8 1.4 83×
100 461.9 277.4 277.4 277.4 2.9 162×
200 1939.2 1179.2 1179.2 1179.2 6.2 314×

Memory/storage story: DeltaChannel (B1) takes ~40% off; Sends/hydrate (C→D) handles the remaining ~60%. tool_call_schema caching and add_messages are pure CPU — zero effect on these axes.


Individual optimizations (catalog with links)

Every perf change is self-contained on its own upstream branch.

langgraph

Optimization Branch Key commit(s)
DeltaChannel channel type + InMemorySaver / PostgresSaver reconstruct path delta-channel-writes-based Series ending at afec98f3 (rename to channels/_delta)
add_messages fast-path (skip left-side conversion, append-only short-circuit) optimize/add-messages-fast-path 2a974d1b + 3 follow-ups
ToolNode hydrate state from channels via CONFIG_KEY_READ sr/tool-call-no-state-inline 18cbe46b

langchain

Optimization Branch Commit
AgentState.messages annotated with DeltaChannel sr/perf-delta-channel-only 5f2da29a
BaseTool.tool_call_schema + .args as cached_property sr/delta-channel-messages 8669f027
Invalidate cached tool_call_schema / args on field mutation same f54de971
_create_subset_modellru_cache same 63dd915a
Avoid repeated tool_call_schema access in _format_tool_to_openai_function same 29979178
Defer tracer imports in runnables/base.py same 21237799
Send("tools", [call]) no longer inlines full state sr/tool-call-context-fix 22753f68 + state_keys= drop fixup
NEW: Cache _format_tool_to_openai_function(tool) dict on tool instance sr/summarization-tool-schema-cache-spike 06e351a4
NEW: Cache len(json.dumps(tool_dict)) on tool instance (avoid re-dumping) same 43faee1f

deepagents (this PR)

Optimization Commit
DeltaChannel applied to messages + files 1cdebbdc
Drop typ=dict (new DeltaChannel infers from Annotated); track _delta rename 9a7e71af
Pass tools=[] to ToolRuntime in production middlewares + tests e5817dba
NEW: Reuse token count between _truncate_args and _should_summarize 268b8558

Files touched in this PR

  • libs/deepagents/pyproject.toml, uv.lock (+ sibling workspace locks): pin langchain/langgraph packages to sr/deepagents-perf-combo.
  • libs/deepagents/deepagents/middleware/filesystem.py: drop typ=dict kwarg; track channels/_delta rename.
  • libs/deepagents/deepagents/middleware/summarization.py: return total_tokens from _truncate_args, reuse for _should_summarize.
  • 4 production middlewares + test fixtures: pass tools=[] to ToolRuntime (new required arg introduced by the ToolNode hydrate change).
  • bench_perf_combo.py: reproducer for the tables above.

Test plan

  • libs/deepagents unit tests (1181 passed / 0 failed, 1 xfail tracking upstream DeltaChannel + add_messages dedup edge case)
  • libs/langchain_v1 agents unit tests (2 failures tracked — upstream _DeltaSentinel msgpack serde gap in a test-only saver)
  • CLI / repl / partner workspaces: locks refreshed, tests pass
  • bench_perf_combo.py reproduces the tables above locally

🤖 Generated with Claude Code

- AgentState.messages now uses DeltaChannel(add_messages) (via
  langchain sr/delta-channel-messages) — checkpoint storage drops
  from O(N²) to O(N) for long-running threads
- FilesystemState.files now uses DeltaChannel(_file_data_reducer)
  directly; removes the now-redundant DeltaFilesystemMiddleware
- pyproject.toml sources langchain, langchain-core, langsmith,
  langgraph, and langgraph-sdk from their respective dev branches
  via git URLs so testers can install without local clones

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Drop langsmith git override (PyPI is fine). Point langgraph + sdk
to diff-channel-incremental-checkpointing instead of sr/deferred-imports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Also drops the langchain-core override-dependencies workaround —
delta-channel-writes-based uses langchain-core>=1.3.0,<2 (no exact pin).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e Sends

Repoints all langchain/langgraph sources to the combined `sr/deepagents-perf-combo`
branches, which stack three perf improvements:

1. DeltaChannel for `messages` + `files` (langgraph + langchain AgentState):
   checkpoint storage drops from O(N²) to O(N) for long threads.
2. `add_messages` fast-path (langgraph): skip left-side conversion and
   fast-path pure appends on the hot add_messages call.
3. No-state-inline tool dispatch (langchain + langgraph ToolNode):
   `Send("tools", [tool_call])` no longer carries a serialized snapshot of
   the full messages list; ToolNode hydrates state from channels via
   CONFIG_KEY_READ at execution time. Eliminates O(N²) __pregel_tasks growth.

Also drops `typ=dict` from the FilesystemState DeltaChannel — the new
DeltaChannel infers type from the Annotated outer type.

Combo branches:
- langchain-ai/langchain#sr/deepagents-perf-combo
- langchain-ai/langgraph#sr/deepagents-perf-combo

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot changed the title perf(deepagents): DeltaChannel + add_messages fast-path + no-inline Sends perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends Apr 23, 2026
@github-actions github-actions Bot added deepagents Related to the `deepagents` SDK / agent harness dependencies Pull requests that update a dependency file internal User is a member of the `langchain-ai` GitHub organization performance Code change that improves performance size: XS < 50 LOC labels Apr 23, 2026
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Apr 23, 2026

Merging this PR will not alter performance

✅ 32 untouched benchmarks
⏩ 15 skipped benchmarks1


Comparing sr/delta-channel-perf-combo (268b855) with main (f4a2309)

Open in CodSpeed

Footnotes

  1. 15 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Standalone deterministic benchmark (mock model, no API calls) that runs
create_deep_agent across multiple (N, durability) configs and writes a
bench_results.json with wall-clock, tracemalloc peak, and per-store
checkpoint storage for each config.

Usage:
    uv run --project libs/deepagents python bench_perf_combo.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size: M 200-499 LOC and removed size: XS < 50 LOC labels Apr 23, 2026
@mdrxy Mason Daugherty (mdrxy) changed the title perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends Apr 23, 2026
Upstream langgraph moved channels/delta.py → channels/_delta.py and
removed the public re-export to signal the API is still experimental.
Update the FilesystemState import and refresh pinned commits on both
combo branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ate API

The ToolNode hydrate-from-channels change added a required `tools` arg to
`ToolRuntime.__init__`; update all test fixtures and the four production
middlewares that build a synthetic ToolRuntime (summarization / filesystem /
memory / skills) to pass `tools=[]`.

Memory + skills default-backend tests now use `agent.get_state(config).values`
instead of reading `checkpoint['channel_values'][...]` directly — DeltaChannel
stores a sentinel in the raw checkpoint, and get_state resolves it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Run make lock across libs/* and examples/* so every workspace pulls in
  the renamed langgraph.channels._delta module (otherwise cli / runloop /
  example lockfiles import from the old channels.delta path and CI fails
  with ModuleNotFoundError).
- Apply ruff format to files touched by the automated tools=[] injection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DeltaChannel(add_messages) replay re-applies each step delta through
add_messages, but across-step ID dedup isn't preserved — so an evicted
HumanMessage and its replacement both survive reconstruction. The test
asserts the replacement alone should win, which is correct behaviour
against the baseline reducer.

Mark xfail strict=True so we notice once the upstream fix lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also relax the multi_turn_eviction xfail to strict=False — the bug only
reproduces on macOS; Linux CI passes the test under DeltaChannel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added cli Related to `deepagents-cli` repl labels Apr 24, 2026
Previous params (30KB×2 writes + 80KB tool result every turn) accumulated
~10M tokens of state at N=200 — an unrealistic stress test. Resized to
~500k tokens at N=200: 1 KB file write every turn, small (2 KB) tool
results most turns, with a larger 82 KB result every 10th turn that still
triggers FilesystemMiddleware eviction into the `files` channel so the
eviction path stays exercised.

Also enables N=200 on async mode now that baseline peak fits (was OOMing
at the old payload size).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ould_summarize

_truncate_args already calls token_counter(messages, tools=...) to decide
whether to truncate. The subsequent _should_summarize check was re-running
the same count on the same messages and tools — ~10s of duplicate work per
100-turn async run. Have _truncate_args return (messages, modified, total_tokens)
and reuse that count downstream. If truncation modifies messages, recount.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cli Related to `deepagents-cli` deepagents Related to the `deepagents` SDK / agent harness dependencies Pull requests that update a dependency file internal User is a member of the `langchain-ai` GitHub organization performance Code change that improves performance size: M 200-499 LOC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant