perf(sdk): `DeltaChannel` + `add_messages` fast-path + no-inline Sends by sydney-runkle · Pull Request #2910 · langchain-ai/deepagents

Sydney Runkle (sydney-runkle) · 2026-04-23T18:00:03Z

Summary

Bundles multiple perf improvements against the latest PyPI releases of langchain / langgraph. Deepagents is pinned to combined sr/deepagents-perf-combo branches on both upstream repos.

Each optimization is a standalone change — the catalog further down lists every contributing commit and which upstream branch it lives on.

Companion branches

langchain-ai/langchain#sr/deepagents-perf-combo
langchain-ai/langgraph#sr/deepagents-perf-combo

Updated benchmark: heavier realistic workload (LG 1.2.0a1 + LC PR #37101)

Workload: create_deep_agent + InMemorySaver, parallel tool dispatch (all tools in one model call). Per turn: 2 × 30 KB file writes + 1 × 4 KB tool result (stays in messages) + 1 × 20 KB AI response; every 10th turn an 80 KB tool result is evicted to files by FilesystemMiddleware. ~6,144 tokens/turn → ~1.23M tokens at N=200.

Envs tested:

Baseline: LG 1.1.9 + LC 1.2.15, no delta, no schema cache
Delta only: LG 1.2.0a1 + LC 1.2.16, delta channel, no schema cache
Full combo: LG 1.2.0a1 + LC PR #37101 (tool schema caching), delta channel

N=200 head-to-head

config	run time	checkpoint storage	get_state
baseline: no delta	17.3s	28 GB	16 ms
delta only (no schema cache)	16.4s	78 MB	22 ms
delta + schema cache	5.4s	78 MB	24 ms

The 28 GB baseline breaks down as: __pregel_tasks 9.8 GB (54%) + messages 6.7 GB (37%) + files 1.7 GB (9%) in blobs, plus ~10 GB of the same in the writes table. Two distinct O(N²) problems:

channel	baseline (LC 1.2.15)	LC 1.2.16	+ DeltaChannel
`__pregel_tasks`	O(N²) — full state inlined per Send	O(N) fixed by `stop-inlining`	same
`messages`	O(N²)	O(N²)	O(N)
`files`	O(N²)	O(N²)	O(N)

LC 1.2.16 stop-inlining fix addresses the dominant 54% of baseline storage independently of delta channel.
Delta channel takes messages + files from O(N²) to O(N) — the remaining 46%.
Schema caching (PR #37101): 16.4s → 5.4s (3× speedup). _create_subset_model_v2 was rebuilding 9 unique Pydantic models from scratch ~10,800 times per 50 turns.
Combined vs baseline: 17.3s → 5.4s (3.2×), 28 GB → 78 MB (360×).

Profiling: what's in the remaining 5.4s (delta + schema cached, N=200)

category	share
Thread pool spin-up/join (3 parallel tools/turn)	~50%
Checkpoint reads (`ormsgpack.unpackb`)	~15%
LangGraph pregel loop / routing	~15%
`builtins.repr` (LangSmith tracing)	~10%
Token counting in `FilesystemMiddleware`	~5%
Everything else	~5%

No single hotspot dominates — this is the LangGraph execution floor for the current architecture.

Snapshot frequency (delta + schema cached, N=200, ~1.23M tokens)

snapshot_frequency is in pregel steps; ~8 steps/turn with parallel dispatch.

config	run time	storage	get_state
snap=None	4.6s	77 MB	22 ms
snap=5 turns (40 steps)	4.3s	715 MB	14 ms
snap=10 turns (80 steps)	4.1s	388 MB	16 ms
snap=25 turns (200 steps)	4.1s	192 MB	17 ms

snap=None wins on storage (O(N) — only sentinels in blobs, deltas in writes table). Snapshots are still O(N²) total because each snapshot blob captures the full accumulated state. Read time differences are ~2 ms on InMemorySaver — real gap requires Postgres to quantify.

Original benchmark setup (lighter workload)

Workload: create_deep_agent + InMemorySaver, deterministic mock model. Per turn: 1 × 1 KB file write + 1 tool call. Tool results are 2 KB most turns; every 10th turn returns 82 KB which exceeds the 20k-token threshold and is evicted into the files channel by FilesystemMiddleware. Per-turn state growth ≈ 11 KB ≈ ~2.75k tokens (4 chars/token). At N=200 that accumulates to ~550k tokens — a realistic long-running agent thread.
Reproduce: uv run --project libs/deepagents python bench_perf_combo.py (script + results in this PR).
Metrics:
- Elapsed = time.perf_counter() around the N-turn invoke loop (construction excluded).
- Peak memory = tracemalloc.get_traced_memory()[1] — Python allocator peak across the full loop.
- Checkpoint storage = serialized bytes resident in InMemorySaver.{blobs, writes, storage} at end of run (what a durable saver would persist).

Headline: async — baseline → full combo across N

Wall clock

N	~tokens	baseline	combo	speedup
5	14k	1.65s	0.08s	21×
25	70k	8.72s	0.56s	16×
50	140k	17.33s	1.67s	10×
100	280k	37.08s	5.96s	6.2×
200	550k	79.15s	20.37s	3.9×

Peak memory

N	~tokens	baseline	combo	reduction
5	14k	2.3 MB	0.6 MB	3.8×
25	70k	32.5 MB	2.0 MB	16×
50	140k	120.3 MB	3.5 MB	34×
100	280k	467.7 MB	7.0 MB	67×
200	550k	1950 MB	14.6 MB	134×

Checkpoint storage

N	~tokens	baseline	combo	reduction
5	14k	1.3 MB	0.2 MB	7.2×
25	70k	30.1 MB	0.8 MB	40×
50	140k	117.2 MB	1.4 MB	83×
100	280k	461.9 MB	2.9 MB	162×
200	550k	1939 MB	6.2 MB	314×

`durability="exit"`

Wall clock

N	baseline	combo	speedup
25	8.53s	0.28s	30×
100	35.32s	1.19s	30×
200	75.34s	2.61s	29×

(Peak memory and storage reductions are even larger in exit mode because there's only one checkpoint per invoke — see bench_results.json.)

Ablation: who contributes what (async wall-clock)

Five configs, each adds one change cumulatively. Every perf commit is isolated to exactly one step — no bundling.

Config	langchain	langgraph
A baseline	released 1.2.15	released 1.1.9
B1 + DeltaChannel only	`sr/perf-delta-channel-only`	`delta-channel-writes-based`
B2 + tool_call_schema caching (5 langchain-core commits)	`sr/delta-channel-messages`	same as B1
C + `add_messages` fast-path	same as B2	`sr/deepagents-perf-delta-plus-addmsgs`
D + Sends/hydrate + new openai-dict/chars caching + truncate-args dedup	`sr/deepagents-perf-combo`	`sr/deepagents-perf-combo`

Wall clock (s)

N	A	B1	B2	C	D	full speedup
5	1.65	1.51	0.31	0.32	0.08	21×
25	8.72	7.93	1.92	1.96	0.56	16×
50	17.33	16.94	5.06	4.73	1.67	10×
100	37.08	40.28	16.54	12.46	5.96	6.2×
200	79.15	124.13	72.97	35.24	20.37	3.9×

Surprising finding: DeltaChannel alone (B1) is slower than baseline at high N in async mode — 124s vs 79s at N=200. DeltaChannel saves per-step serialization cost but pays an O(N²) load-time cost (each invoke's channels_from_checkpoint walks the delta write history and replays through add_messages). Without the other optimizations to offset that, async becomes net-negative for wall clock. Memory and storage still drop by ~40% in B1 — the slowdown is pure CPU.

Which step delivers what (async elapsed):

A→B1 (DeltaChannel alone): noisy / negative at N≥100 on wall clock. Saves ~40% memory and storage.
B1→B2 (tool_call_schema caching, 5 commits): dominant wall-clock win — N=200 goes 124s → 73s (−41%). Zero memory/storage effect.
B2→C (add_messages fast-path): matters at high N — N=200 goes 73s → 35s (−52%). Zero memory/storage effect.
C→D (Sends/hydrate + today's 3 fixes): big jump — N=200 goes 35s → 20s (−42%). Also kills the remaining 60% of memory/storage.

Peak memory (MB)

N	A	B1	B2	C	D	full reduction
5	2.3	1.8	1.5	1.6	0.6	3.8×
25	32.5	20.9	20.3	20.3	2.0	16×
50	120.3	74.3	74.3	74.3	3.5	34×
100	467.7	284.1	283.5	283.6	7.0	67×
200	1950.0	1191.8	1191.7	1191.6	14.6	134×

Checkpoint storage (MB)

N	A	B1	B2	C	D	full reduction
5	1.3	0.8	0.8	0.8	0.2	7.2×
25	30.1	18.2	18.2	18.2	0.8	40×
50	117.2	70.8	70.8	70.8	1.4	83×
100	461.9	277.4	277.4	277.4	2.9	162×
200	1939.2	1179.2	1179.2	1179.2	6.2	314×

Memory/storage story: DeltaChannel (B1) takes ~40% off; Sends/hydrate (C→D) handles the remaining ~60%. tool_call_schema caching and add_messages are pure CPU — zero effect on these axes.

Individual optimizations (catalog with links)

Every perf change is self-contained on its own upstream branch.

langgraph

Optimization	Branch	Key commit(s)
DeltaChannel channel type + InMemorySaver / PostgresSaver reconstruct path	`delta-channel-writes-based`	Series ending at `afec98f3` (rename to `channels/_delta`)
`add_messages` fast-path (skip left-side conversion, append-only short-circuit)	`optimize/add-messages-fast-path`	`2a974d1b` + 3 follow-ups
`ToolNode` hydrate state from channels via `CONFIG_KEY_READ`	`sr/tool-call-no-state-inline`	`18cbe46b`

langchain

Optimization	Branch	Commit
`AgentState.messages` annotated with DeltaChannel	`sr/perf-delta-channel-only`	`5f2da29a`
`BaseTool.tool_call_schema` + `.args` as `cached_property`	`sr/delta-channel-messages`	`8669f027`
Invalidate cached `tool_call_schema` / `args` on field mutation	same	`f54de971`
`_create_subset_model` → `lru_cache`	same	`63dd915a`
Avoid repeated `tool_call_schema` access in `_format_tool_to_openai_function`	same	`29979178`
Defer tracer imports in `runnables/base.py`	same	`21237799`
`Send("tools", [call])` no longer inlines full state	`sr/tool-call-context-fix`	`22753f68` + `state_keys=` drop fixup
NEW: Cache `_format_tool_to_openai_function(tool)` dict on tool instance	`sr/summarization-tool-schema-cache-spike`	`06e351a4`
NEW: Cache `len(json.dumps(tool_dict))` on tool instance (avoid re-dumping)	same	`43faee1f`

deepagents (this PR)

Optimization	Commit
DeltaChannel applied to messages + files	`1cdebbdc`
Drop `typ=dict` (new DeltaChannel infers from `Annotated`); track `_delta` rename	`9a7e71af`
Pass `tools=[]` to `ToolRuntime` in production middlewares + tests	`e5817dba`
NEW: Reuse token count between `_truncate_args` and `_should_summarize`	`268b8558`

Files touched in this PR

libs/deepagents/pyproject.toml, uv.lock (+ sibling workspace locks): pin langchain/langgraph packages to sr/deepagents-perf-combo.
libs/deepagents/deepagents/middleware/filesystem.py: drop typ=dict kwarg; track channels/_delta rename.
libs/deepagents/deepagents/middleware/summarization.py: return total_tokens from _truncate_args, reuse for _should_summarize.
4 production middlewares + test fixtures: pass tools=[] to ToolRuntime (new required arg introduced by the ToolNode hydrate change).
bench_perf_combo.py: reproducer for the tables above.

Test plan

libs/deepagents unit tests (1181 passed / 0 failed, 1 xfail tracking upstream DeltaChannel + add_messages dedup edge case)
libs/langchain_v1 agents unit tests (2 failures tracked — upstream _DeltaSentinel msgpack serde gap in a test-only saver)
CLI / repl / partner workspaces: locks refreshed, tests pass
bench_perf_combo.py reproduces the tables above locally

🤖 Generated with Claude Code

- AgentState.messages now uses DeltaChannel(add_messages) (via langchain sr/delta-channel-messages) — checkpoint storage drops from O(N²) to O(N) for long-running threads - FilesystemState.files now uses DeltaChannel(_file_data_reducer) directly; removes the now-redundant DeltaFilesystemMiddleware - pyproject.toml sources langchain, langchain-core, langsmith, langgraph, and langgraph-sdk from their respective dev branches via git URLs so testers can install without local clones Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Drop langsmith git override (PyPI is fine). Point langgraph + sdk to diff-channel-incremental-checkpointing instead of sr/deferred-imports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Also drops the langchain-core override-dependencies workaround — delta-channel-writes-based uses langchain-core>=1.3.0,<2 (no exact pin). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e Sends Repoints all langchain/langgraph sources to the combined `sr/deepagents-perf-combo` branches, which stack three perf improvements: 1. DeltaChannel for `messages` + `files` (langgraph + langchain AgentState): checkpoint storage drops from O(N²) to O(N) for long threads. 2. `add_messages` fast-path (langgraph): skip left-side conversion and fast-path pure appends on the hot add_messages call. 3. No-state-inline tool dispatch (langchain + langgraph ToolNode): `Send("tools", [tool_call])` no longer carries a serialized snapshot of the full messages list; ToolNode hydrates state from channels via CONFIG_KEY_READ at execution time. Eliminates O(N²) __pregel_tasks growth. Also drops `typ=dict` from the FilesystemState DeltaChannel — the new DeltaChannel infers type from the Annotated outer type. Combo branches: - langchain-ai/langchain#sr/deepagents-perf-combo - langchain-ai/langgraph#sr/deepagents-perf-combo Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codspeed-hq · 2026-04-23T18:03:24Z

Merging this PR will not alter performance

✅ 32 untouched benchmarks
⏩ 15 skipped benchmarks¹

_{Comparing sr/delta-channel-perf-combo (268b855) with main (f4a2309)}

15 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

Standalone deterministic benchmark (mock model, no API calls) that runs create_deep_agent across multiple (N, durability) configs and writes a bench_results.json with wall-clock, tracemalloc peak, and per-store checkpoint storage for each config. Usage: uv run --project libs/deepagents python bench_perf_combo.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Upstream langgraph moved channels/delta.py → channels/_delta.py and removed the public re-export to signal the API is still experimental. Update the FilesystemState import and refresh pinned commits on both combo branches. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ate API The ToolNode hydrate-from-channels change added a required `tools` arg to `ToolRuntime.__init__`; update all test fixtures and the four production middlewares that build a synthetic ToolRuntime (summarization / filesystem / memory / skills) to pass `tools=[]`. Memory + skills default-backend tests now use `agent.get_state(config).values` instead of reading `checkpoint['channel_values'][...]` directly — DeltaChannel stores a sentinel in the raw checkpoint, and get_state resolves it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Run make lock across libs/* and examples/* so every workspace pulls in the renamed langgraph.channels._delta module (otherwise cli / runloop / example lockfiles import from the old channels.delta path and CI fails with ModuleNotFoundError). - Apply ruff format to files touched by the automated tools=[] injection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DeltaChannel(add_messages) replay re-applies each step delta through add_messages, but across-step ID dedup isn't preserved — so an evicted HumanMessage and its replacement both survive reconstruction. The test asserts the replacement alone should win, which is correct behaviour against the baseline reducer. Mark xfail strict=True so we notice once the upstream fix lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Also relax the multi_turn_eviction xfail to strict=False — the bug only reproduces on macOS; Linux CI passes the test under DeltaChannel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previous params (30KB×2 writes + 80KB tool result every turn) accumulated ~10M tokens of state at N=200 — an unrealistic stress test. Resized to ~500k tokens at N=200: 1 KB file write every turn, small (2 KB) tool results most turns, with a larger 82 KB result every 10th turn that still triggers FilesystemMiddleware eviction into the `files` channel so the eviction path stays exercised. Also enables N=200 on async mode now that baseline peak fits (was OOMing at the old payload size). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ould_summarize _truncate_args already calls token_counter(messages, tools=...) to decide whether to truncate. The subsequent _should_summarize check was re-running the same count on the same messages and tools — ~10s of duplicate work per 100-turn async run. Have _truncate_args return (messages, modified, total_tokens) and reuse that count downstream. If truncation modifies messages, recount. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sydney Runkle (sydney-runkle) and others added 4 commits April 22, 2026 17:59

chore: simplify git sources to delta-channel branches only

bbcae94

Drop langsmith git override (PyPI is fine). Point langgraph + sdk to diff-channel-incremental-checkpointing instead of sr/deferred-imports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore: switch langgraph source to delta-channel-writes-based

8138643

Also drops the langchain-core override-dependencies workaround — delta-channel-writes-based uses langchain-core>=1.3.0,<2 (no exact pin). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot changed the title ~~perf(deepagents): DeltaChannel + add_messages fast-path + no-inline Sends~~ perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends Apr 23, 2026

github-actions Bot added deepagents Related to the `deepagents` SDK / agent harness dependencies Pull requests that update a dependency file internal User is a member of the `langchain-ai` GitHub organization performance Code change that improves performance size: XS < 50 LOC labels Apr 23, 2026

github-actions Bot added size: M 200-499 LOC and removed size: XS < 50 LOC labels Apr 23, 2026

Mason Daugherty (mdrxy) changed the title ~~perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends~~ perf(sdk): DeltaChannel + add_messages fast-path + no-inline Sends Apr 23, 2026

Sydney Runkle (sydney-runkle) and others added 3 commits April 24, 2026 07:32

Sydney Runkle (sydney-runkle) requested review from Eugene Yurtsev (eyurtsev), Jacob Lee (jacoblee93), Maahir Sachdev (maahir30), Mason Daugherty (mdrxy) and vivek (vtrivedy) as code owners April 24, 2026 11:48

Sydney Runkle (sydney-runkle) and others added 2 commits April 24, 2026 07:50

fix(repl,cli): pass tools=[] to ToolRuntime ctors

b4eec91

Also relax the multi_turn_eviction xfail to strict=False — the bug only reproduces on macOS; Linux CI passes the test under DeltaChannel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot added cli Related to `deepagents-cli` repl labels Apr 24, 2026

Sydney Runkle (sydney-runkle) and others added 2 commits April 24, 2026 08:26

Sydney Runkle (sydney-runkle) closed this May 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(sdk): `DeltaChannel` + `add_messages` fast-path + no-inline Sends#2910

perf(sdk): `DeltaChannel` + `add_messages` fast-path + no-inline Sends#2910
Sydney Runkle (sydney-runkle) wants to merge 12 commits into
mainfrom
sr/delta-channel-perf-combo

Sydney Runkle (sydney-runkle) commented Apr 23, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Apr 23, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Sydney Runkle (sydney-runkle) commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Companion branches

Updated benchmark: heavier realistic workload (LG 1.2.0a1 + LC PR #37101)

N=200 head-to-head

Profiling: what's in the remaining 5.4s (delta + schema cached, N=200)

Snapshot frequency (delta + schema cached, N=200, ~1.23M tokens)

Original benchmark setup (lighter workload)

Headline: async — baseline → full combo across N

Wall clock

Peak memory

Checkpoint storage

durability="exit"

Wall clock

Ablation: who contributes what (async wall-clock)

Wall clock (s)

Peak memory (MB)

Checkpoint storage (MB)

Individual optimizations (catalog with links)

langgraph

langchain

deepagents (this PR)

Files touched in this PR

Test plan

Uh oh!

codspeed-hq Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merging this PR will not alter performance

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Sydney Runkle (sydney-runkle) commented Apr 23, 2026 •

edited

Loading

`durability="exit"`

codspeed-hq Bot commented Apr 23, 2026 •

edited

Loading