
perf: forward-only cursor in monotonicArena.Alloc #3

Open

jensneuse wants to merge 1 commit into main from perf/monotonic-arena-alloc-cursor

Conversation

@jensneuse
Member

Summary

Closes #2.

monotonicArena.Alloc walked a.buffers from index 0 on every call,
giving O(numBuffers) cost per call and O(N²) total work over the
arena's lifetime.
On the Cosmo Router workload reported in the issue (~180MB JSON
response, ~600-1200 buffers, ~29M Alloc calls per request),
this manifested as ~40s of router-side merge time.

This PR adds a forward-only cursor.
Subsequent walks start at the cursor instead of index 0.
The cursor advances when an allocation lands in a later buffer and
when the arena grows. Reset and Release rewind it to 0 so a reused
arena can re-fill its early buffers from scratch.

For uniform-size allocations the per-call cost becomes O(1).
For mixed sizes the walk is bounded by the number of buffers ahead of
the cursor, with the trade-off that any free space remaining in
skipped buffers is abandoned for the rest of the request.
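The mechanism above can be sketched as follows. This is a minimal toy model, not the repository's actual code: the `buffer` and `monotonicArena` fields and the `bufSize` parameter are illustrative assumptions.

```go
package main

import "fmt"

// Toy buffer: a byte slice plus a fill mark. Illustrative only.
type buffer struct {
	data []byte
	used int
}

func (b *buffer) fits(n int) bool { return b.used+n <= len(b.data) }

// Toy arena with the forward-only cursor described above.
type monotonicArena struct {
	buffers []*buffer
	cursor  int // index of the buffer that last satisfied an Alloc
	bufSize int
}

func (a *monotonicArena) Alloc(n int) []byte {
	// Start the walk at the cursor instead of index 0.
	for i := a.cursor; i < len(a.buffers); i++ {
		if b := a.buffers[i]; b.fits(n) {
			a.cursor = i // advance on a later-buffer hit
			p := b.data[b.used : b.used+n]
			b.used += n
			return p
		}
	}
	// Grow: append a fresh buffer and move the cursor to it.
	b := &buffer{data: make([]byte, max(n, a.bufSize))}
	a.buffers = append(a.buffers, b)
	a.cursor = len(a.buffers) - 1
	b.used = n
	return b.data[:n]
}

// Reset rewinds the cursor so a reused arena re-fills early buffers.
func (a *monotonicArena) Reset() {
	for _, b := range a.buffers {
		b.used = 0
	}
	a.cursor = 0
}

func main() {
	a := &monotonicArena{bufSize: 64}
	for i := 0; i < 10; i++ {
		a.Alloc(48) // each call overflows the 64-byte buffer, forcing growth
	}
	fmt.Println(len(a.buffers), a.cursor) // prints "10 9"
}
```

Note the trade-off in the sketch: once the cursor passes a buffer, the 16 bytes left in each 64-byte buffer are never reused until Reset.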

Benchmarks

Controlled prefix (isolated walk cost):

  Prefix   Before        After        Speedup
  10       17.5 ns/op     2.7 ns/op     6.5x
  100       149 ns/op     2.6 ns/op      57x
  1000     1293 ns/op     2.6 ns/op     497x

Before the fix: clean O(N) scaling with prefix size.
After the fix: flat O(1) regardless of prefix size.
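The controlled-prefix setup can be reproduced with a toy model of the walk (the `toyArena` type, the 64 KiB buffer size, and the 16-byte allocation size are illustrative assumptions, not the library's code): pre-exhaust `prefix` buffers, then time repeated Allocs against them.

```go
package main

import (
	"fmt"
	"testing"
)

// toyArena models only the walk cost: remaining free bytes per buffer.
type toyArena struct {
	remaining []int
	cursor    int
}

func (a *toyArena) Alloc(n int) {
	// Forward-only walk starting at the cursor.
	for i := a.cursor; i < len(a.remaining); i++ {
		if a.remaining[i] >= n {
			a.remaining[i] -= n
			a.cursor = i
			return
		}
	}
	// Grow with a fresh 64 KiB buffer and advance the cursor to it.
	a.remaining = append(a.remaining, 1<<16-n)
	a.cursor = len(a.remaining) - 1
}

func main() {
	for _, prefix := range []int{10, 100, 1000} {
		res := testing.Benchmark(func(b *testing.B) {
			// `prefix` exhausted buffers sit ahead of the interesting one.
			a := &toyArena{remaining: make([]int, prefix)}
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				a.Alloc(16)
			}
		})
		fmt.Printf("prefix=%d: %s\n", prefix, res)
	}
}
```

With the cursor in place, only the first Alloc pays for skipping the exhausted prefix; every later call resumes at the tail, which is why the measured cost stays flat as `prefix` grows.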

Realistic growth workload (AllocCosmoLike):

  Prefix   Before        After        Speedup
  10       5125 ns/op     3.4 ns/op    1500x
  100      5265 ns/op     4.0 ns/op    1300x
  1000     2785 ns/op     4.0 ns/op     700x

The realistic workload speedup is larger because the unpatched arena
grows during the timed loop, so the prefix walk gets longer over time.
This is consistent with the reporter's measurement of ~3x end-to-end
on the full Cosmo Router request, where Alloc was the dominant cost.

What changed

Credit

Original analysis and patch by @thoec in #2.
This PR adapts the patch and adds the test/benchmark coverage.

Test plan

  • go test -race ./... passes
  • All 7 new cursor tests pass
  • Benchmarks confirm the O(N) → O(1) scaling change

🤖 Generated with Claude Code

Alloc walked a.buffers from index 0 on every call, giving O(numBuffers)
cost per Alloc and O(N²) total work over an arena's lifetime. On the
Cosmo Router workload reported in #2 (~180MB JSON response, ~600-1200
buffers, ~29M Allocs per request), this dominated request time at ~40s
of router-side merge.

Track the index of the most recent successful Alloc and start subsequent
walks there. Cursor advances on a later-buffer hit and on grow; Reset
and Release rewind it to 0 so a reused arena can re-fill its early
buffers from scratch. For roughly uniform-size allocations the per-call
cost becomes O(1); for mixed sizes the walk is bounded by the number of
buffers ahead of the cursor.

Benchmarks (controlled prefix, isolated walk cost):
  prefix=10:    17.5 ns/op  →  2.7 ns/op  (6.5x)
  prefix=100:    149 ns/op  →  2.6 ns/op  (57x)
  prefix=1000:  1293 ns/op  →  2.6 ns/op  (497x)

Realistic growth workload (AllocCosmoLike):
  prefix=10:    5125 ns/op  →  3.4 ns/op  (1500x)
  prefix=100:   5265 ns/op  →  4.0 ns/op  (1300x)
  prefix=1000:  2785 ns/op  →  4.0 ns/op  (700x)

Closes #2.

Credit: original analysis and patch by @thoec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

monotonicArena.Alloc scales poorly on large subgraph responses in Cosmo Router
