perf: forward-only cursor in monotonicArena.Alloc #3
Open
jensneuse wants to merge 1 commit into
Alloc walked a.buffers from index 0 on every call, giving O(numBuffers) cost per Alloc and O(N²) total work over an arena's lifetime. On the Cosmo Router workload reported in #2 (~180MB JSON response, ~600-1200 buffers, ~29M Allocs per request), this dominated request time at ~40s of router-side merge.

Track the index of the most recent successful Alloc and start subsequent walks there. Cursor advances on a later-buffer hit and on grow; Reset and Release rewind it to 0 so a reused arena can re-fill its early buffers from scratch.

For roughly uniform-size allocations the per-call cost becomes O(1); for mixed sizes the walk is bounded by the number of buffers ahead of the cursor.

Benchmarks (controlled prefix, isolated walk cost):
  prefix=10:    17.5 ns/op → 2.7 ns/op (6.5x)
  prefix=100:   149 ns/op → 2.6 ns/op (57x)
  prefix=1000:  1293 ns/op → 2.6 ns/op (497x)

Realistic growth workload (AllocCosmoLike):
  prefix=10:    5125 ns/op → 3.4 ns/op (1500x)
  prefix=100:   5265 ns/op → 4.0 ns/op (1300x)
  prefix=1000:  2785 ns/op → 4.0 ns/op (700x)

Closes #2.

Credit: original analysis and patch by @thoec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #2.
monotonicArena.Alloc walked a.buffers from index 0 on every call, giving
O(numBuffers) cost per call and O(N²) total work over the arena's lifetime.
On the Cosmo Router workload reported in the issue (~180MB JSON
response, ~600-1200 buffers, ~29M Alloc calls per request),
this manifested as ~40s of router-side merge time.
This PR adds a forward-only cursor.
Subsequent walks start at cursor instead of 0. The cursor advances on a later-buffer hit and on grow.
Reset and Release rewind it to 0 so a reused arena can re-fill its early buffers from scratch.
For uniform-size allocations the per-call cost becomes O(1).
For mixed sizes the walk is bounded by the number of buffers ahead of the cursor,
with the trade-off that any remaining free space in skipped buffers is
abandoned for the rest of the request.
Benchmarks
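Controlled prefix (isolated walk cost):
  prefix=10:    17.5 ns/op → 2.7 ns/op (6.5x)
  prefix=100:   149 ns/op → 2.6 ns/op (57x)
  prefix=1000:  1293 ns/op → 2.6 ns/op (497x)
Pre-fix: clean O(N) scaling.
Post-fix: flat O(1) regardless of prefix size.
Realistic growth workload (AllocCosmoLike):
  prefix=10:    5125 ns/op → 3.4 ns/op (1500x)
  prefix=100:   5265 ns/op → 4.0 ns/op (1300x)
  prefix=1000:  2785 ns/op → 4.0 ns/op (700x)
The realistic workload speedup is larger because the unpatched arena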
grows during the timed loop, so the prefix walk gets longer over time.
This is consistent with the reporter's measurement of ~3x end-to-end
on the full Cosmo Router request, where Alloc was the dominant cost.
What changed
monotonic_arena.go: added cursor int field; Alloc walks from a.cursor; cursor advances on success and grow; Reset and Release rewind.
monotonic_arena_test.go: 7 unit tests (cursor mechanics + regression test for the hot path in issue #2, "monotonicArena.Alloc scales poorly on large subgraph responses in Cosmo Router") and 2 benchmarks.
Credit
Original analysis and patch by @thoec in #2.
This PR adapts the patch and adds the test/benchmark coverage.
Test plan
go test -race ./... passes
O(N) → O(1) scaling
🤖 Generated with Claude Code