perf(blockvalidation): cut peak RAM during block validation by ~150 GB#886
Conversation
There was a problem hiding this comment.
Pull request overview
This PR targets large, transient allocation spikes during block validation by adding pooling and pre-sizing strategies for several hot-path data structures (validation maps, subtree node slices, and Aerospike setMined batch structures), aiming to reduce peak RSS and GC churn on very large blocks.
Changes:
- Add size-class
sync.Poolreuse for block-validation maps (txMap and parent-spends map) and addClear()/bucket sizing support. - Introduce a size-class pool for per-subtree
[]Nodeallocations, wired in via aNodeAllocatoronmodel.Block, and release pooled slices on cache eviction / failure paths. - Reduce Aerospike
setMinedchurn by reusing[]*aerospike.Keyslices and poolingLuaMapResponseparsing structures.
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
stores/utxo/aerospike/teranode.go |
Adds LuaMapResponse reset + pooling and a pooled parse helper. |
stores/utxo/aerospike/set_mined.go |
Reuses per-Store key slices and pooled Lua map responses in SetMinedMulti. |
stores/utxo/aerospike/aerospike.go |
Extends Store with a per-instance key slice pool. |
services/blockvalidation/pools.go |
New size-class pool for subtree []Node allocations + block node release helper. |
services/blockvalidation/pools_test.go |
Tests for node-slice pooling behavior and concurrency safety. |
services/blockvalidation/BlockValidation.go |
Installs pooled NodeAllocator and releases nodes on eviction/failure paths. |
model/map.go |
Adds per-bucket headroom sizing and Clear()/bucket-count helper for parent-spends map pooling. |
model/Block.go |
Uses pooled txMap/parent-spends maps; wires allocator into subtree deserialization; returns pooled txMap on success. |
model/block_validation_pools.go |
Implements size-class pools for txMap and parent-spends map reuse across blocks. |
model/block_validation_pools_test.go |
Tests pool round-trips and concurrent reuse behavior. |
go.mod |
Bumps go-subtree, go-tx-map, go-safe-conversion, and some indirect test deps. |
go.sum |
Updates module checksums for the bumped dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7089096 to
82bbcfe
Compare
|
🤖 Claude Code Review Status: Complete Summary: This PR implements comprehensive memory pooling across four critical allocation paths in block validation, targeting a ~150 GB peak RAM reduction at 675M-tx scale. The implementation is well-structured with proper separation of concerns and defensive error handling. Findings: ✅ Previously reported issue is FIXED: The txMap resource leak on error paths (comment from Copilot) has been resolved via defer b.releaseTxMap() at model/Block.go:547. The following existing inline comment remains open for author consideration:
Overall Assessment: No new critical issues found. The PR demonstrates careful engineering:
The dependency bumps to go-tx-map v1.3.7 and go-subtree v1.3.3 (both tagged releases) are appropriate. Recommend merging once manual validation on dev-scale-2 confirms the expected memory reduction. |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-18 16:42 UTC |
…path Address PR bsv-blockchain#886 review (Copilot + claude-code-review confirmed Major): the pooled txMap allocated inside checkDuplicateTransactions was only returned to the size-class pool on the success path. Any error between the allocation and the final cleanup (a checkDuplicateTransactions worker error, the txMap.Flush failure, or any validOrderAndBlessed error) returned from Block.Valid with b.txMap still pointing at the multi-GB pooled instance. At 654M-tx scale each leaked map pins ~30 GB of backing storage on the Block until GC, which may not run for a long time if the Block is held by the lastValidatedBlocks cache. Extract the cleanup into a (b *Block).releaseTxMap helper (nil-safe, idempotent, handles disk-backed / pooled-in-memory / generic Closer) and call it via `defer` from Block.Valid immediately before invoking checkDuplicateTransactions. The defer fires on every exit path, so errors no longer leak the map. Also address Copilot Minor on batchKeysPool: document the workload assumption (uniform SetMinedMulti batch sizes within a single Store's lifetime) and explain when size-class bucketing would become necessary.
On dev-scale-2 (675M-tx blocks, ~117 GB serialized, every ~7 min) the
block-validator pod baselined at ~15 GB RSS and spiked to >600 GB
during each block. Live heap was only ~80-100 GB; the gap was
allocation churn, GC pacing, and avoidable map rehash transients.
This change attacks four sources concurrently. Each is independently
revertable; tests cover each in isolation.
1) Pre-size + bucket bump for the swiss maps (model/Block.go,
model/map.go, go-tx-map@e0da75e)
txMap and parentSpendsMap are now constructed with 1.2x per-bucket
headroom inside go-tx-map.NewSplitSwissMapUint64 and
model.NewSplitSyncedParentMap. For N=675M, B=8192 the per-bucket
count is Binomial(N, 1/B) with mean ~82k and sigma ~285 — 20%
headroom absorbs the max-of-B order statistic with a ~250x safety
margin and the underlying dolthub/swiss map no longer triggers a
mid-fill rehash. Live mutex profile on block 1122 showed peak 236
goroutines blocked on SplitSyncedParentMap.SetIfNotExists at
model/map.go b.mu.Lock() — that contention burst correlates 1:1
with the rehash window and disappears once rehash is eliminated.
Bucket count bumped 1024 -> 8192 (txMap) and 4096 -> 8192
(parentSpendsMap) to halve steady-state collision rate at 192
concurrent inserters from ~4.7% to ~2.3%.
2) Pool the maps across blocks (model/block_validation_pools.go,
go-tx-map@e0da75e Clear())
txMap and parentSpendsMap are drawn from size-class-keyed
sync.Pools (4K..1B for tx counts, 16K..4B for parent spends)
so the multi-GB backing storage is reused rather than allocated
fresh per block. Maps are reset via the new Clear() methods on
dolthub/swiss.Map wrappers (SwissMap, SwissMapUint64,
SplitSwissMapUint64, SplitSyncedParentMap) — Clear zeroes the
ctrl/metadata in place without releasing the (often multi-GB)
group arrays.
Network-load shifts (1K <-> 1M subtree size) are handled by
sync.Pool's natural per-GC drainage: classes that stop being Got
age out automatically.
3) Pool the per-subtree []Node arrays (services/blockvalidation/
pools.go, go-subtree@54b2484
DeserializeFromReaderWithAllocator/ReleaseNodes)
For a 675M-tx block, ~620 subtrees x ~1M nodes x 48 B per Node =
~30 GB allocated and freed per block. go-subtree exposes a new
DeserializeFromReaderWithAllocator that takes a caller-supplied
NodeAllocator; teranode installs a size-class-bucketed pool
(1K/4K/16K/64K/256K/1M/4M Node capacities) and releases the slice
back to the pool on cache eviction or any validation/storage
failure path. The existing DeserializeFromReader,
deserializeNodes, and deserializeFromReaderMmap paths are
unchanged.
4) Aerospike setMined per-record pools (stores/utxo/aerospike/
set_mined.go, aerospike.go, teranode.go)
After validation, every tx in the accepted block is marked mined
via Aerospike batch UDFs. Live alloc profile attributed ~975 GB
cumulative allocation per block to this path. Two safe in-repo
pools land here:
- aerospike.Key reuse via a per-Store sync.Pool. Per-Store
because aerospike.Key.SetValue recomputes the digest using the
Key's existing (unexported) setName — a package-level pool
would cross-contaminate stores with different setNames and
produce wrong digests.
- LuaMapResponse pool with Reset() and a private
parseLuaMapResponseInto sink so the parser fills a
caller-owned struct. The existing public ParseLuaMapResponse
remains an allocating thin wrapper so the 12+ unrelated
callers in locked.go / alert_system.go / un_spend.go etc. are
untouched.
The aerospike-client-go-side packer/unpacker buffer pool (the
single largest churn driver, ~395 GB cumulative parseRecord) is
intentionally NOT in this PR — teranode main pins upstream
v8.2.0 and that fix requires the BSV fork. It will land
separately on the feat/teranode-native-ops (PR 828) branch
which already uses the fork.
Benchmark: BenchmarkSplitSwissMapUint64_NoRehash (1M inserts) drops
from doubling-grow churn to 1 B/op, 0 allocs/op after construction,
confirming no mid-fill rehash.
Tests: go test -race -tags testtxmetacache ./model/... and the
targeted ./services/blockvalidation/... runs pass. The unrelated
DiskParentSpendsMap.Stats / writerLoop race surfaced under
block_diskMapDirs (set via local settings_local.conf) exists on
plain main too and is filed separately.
Fork branches:
- github.com/bsv-blockchain/go-tx-map feat/presize-headroom
(1.2x headroom + Clear methods)
- github.com/bsv-blockchain/go-subtree feat/node-allocator
(DeserializeFromReaderWithAllocator + ReleaseNodes)
Both pinned via pseudo-version in go.mod; bump to a tagged release
before merging.
…path Address PR bsv-blockchain#886 review (Copilot + claude-code-review confirmed Major): the pooled txMap allocated inside checkDuplicateTransactions was only returned to the size-class pool on the success path. Any error between the allocation and the final cleanup (a checkDuplicateTransactions worker error, the txMap.Flush failure, or any validOrderAndBlessed error) returned from Block.Valid with b.txMap still pointing at the multi-GB pooled instance. At 654M-tx scale each leaked map pins ~30 GB of backing storage on the Block until GC, which may not run for a long time if the Block is held by the lastValidatedBlocks cache. Extract the cleanup into a (b *Block).releaseTxMap helper (nil-safe, idempotent, handles disk-backed / pooled-in-memory / generic Closer) and call it via `defer` from Block.Valid immediately before invoking checkDuplicateTransactions. The defer fires on every exit path, so errors no longer leak the map. Also address Copilot Minor on batchKeysPool: document the workload assumption (uniform SetMinedMulti batch sizes within a single Store's lifetime) and explain when size-class bucketing would become necessary.
6ff34e8 to
8d71ebe
Compare
…nt-vs-empty semantic processBatchResultsForSetMined checks `res.BlockIDs != nil` to decide whether to insert the per-tx blockIDs into the returned map. The contract observed by that check is "BlockIDs != nil iff the Lua response actually contained a blockIDs field" — parseLuaMapResponseInto only writes BlockIDs when respMap["blockIDs"] is present, so error responses (e.g. TX_NOT_FOUND) leave BlockIDs untouched. The pre-pool code allocated a fresh LuaMapResponse per record, so BlockIDs started as nil and that contract held trivially. The pooled Reset was truncating BlockIDs to [:0] instead of nil-ing it, to preserve the slice capacity across calls. That violated the contract: after a successful record set BlockIDs = [333], the next record's Reset left it as a non-nil empty slice, and the caller then inserted the failed record's hash into the returned map with an empty value. Fixes TestAerospike/aerospike_setmined_multi_partial_failure caught by CI on PR bsv-blockchain#886. The slice-cap reuse was a minor optimization not worth violating the public-facing semantic; per-record BlockIDs are small (typically O(1) for an unforked chain) so the allocation cost is negligible. Errors map is still cleared-not-nil-ed because the loop body checks res.Signal and other scalar fields rather than relying on r.Errors == nil.
|
…path Address PR #886 review (Copilot + claude-code-review confirmed Major): the pooled txMap allocated inside checkDuplicateTransactions was only returned to the size-class pool on the success path. Any error between the allocation and the final cleanup (a checkDuplicateTransactions worker error, the txMap.Flush failure, or any validOrderAndBlessed error) returned from Block.Valid with b.txMap still pointing at the multi-GB pooled instance. At 654M-tx scale each leaked map pins ~30 GB of backing storage on the Block until GC, which may not run for a long time if the Block is held by the lastValidatedBlocks cache. Extract the cleanup into a (b *Block).releaseTxMap helper (nil-safe, idempotent, handles disk-backed / pooled-in-memory / generic Closer) and call it via `defer` from Block.Valid immediately before invoking checkDuplicateTransactions. The defer fires on every exit path, so errors no longer leak the map. Also address Copilot Minor on batchKeysPool: document the workload assumption (uniform SetMinedMulti batch sizes within a single Store's lifetime) and explain when size-class bucketing would become necessary.
…nt-vs-empty semantic processBatchResultsForSetMined checks `res.BlockIDs != nil` to decide whether to insert the per-tx blockIDs into the returned map. The contract observed by that check is "BlockIDs != nil iff the Lua response actually contained a blockIDs field" — parseLuaMapResponseInto only writes BlockIDs when respMap["blockIDs"] is present, so error responses (e.g. TX_NOT_FOUND) leave BlockIDs untouched. The pre-pool code allocated a fresh LuaMapResponse per record, so BlockIDs started as nil and that contract held trivially. The pooled Reset was truncating BlockIDs to [:0] instead of nil-ing it, to preserve the slice capacity across calls. That violated the contract: after a successful record set BlockIDs = [333], the next record's Reset left it as a non-nil empty slice, and the caller then inserted the failed record's hash into the returned map with an empty value. Fixes TestAerospike/aerospike_setmined_multi_partial_failure caught by CI on PR #886. The slice-cap reuse was a minor optimization not worth violating the public-facing semantic; per-record BlockIDs are small (typically O(1) for an unforked chain) so the allocation cost is negligible. Errors map is still cleared-not-nil-ed because the loop body checks res.Signal and other scalar fields rather than relying on r.Errors == nil.
Trivial: origin/main now contains PR #886 (squash-merged as 74e933a). Our branch already cherry-picked the same content earlier, so the three file-level conflicts in stores/utxo/aerospike/ (aerospike.go, set_mined.go, teranode.go) are PR #886's changes-vs-themselves-plus-PR-#828-extras. Kept HEAD in each case — it has both the cherry-picked PR #886 work and the PR #828 additions (useNativeTeranodeOps field on the Store, richer batchRecord diagnostics in processSingleBatchRecordPooled, luaResponse* helpers in parseLuaMapResponseInto for native-op response shapes).



Summary
On dev-scale-2 (675M-tx blocks, ~117 GB serialized, every ~7 min) the `block-validator` pod baselined at ~15 GB RSS and spiked to >600 GB during each block. Live heap during validation was only ~80-100 GB; the gap was allocation churn, GC pacing, and avoidable map rehash transients.
This PR attacks four sources concurrently. Each track is independently revertable; tests cover each in isolation.
1. Pre-size + bucket bump for the swiss maps
`txMap` and `parentSpendsMap` are now constructed with 1.2× per-bucket headroom inside `go-tx-map.NewSplitSwissMapUint64` and `model.NewSplitSyncedParentMap`. For `N=675M, B=8192` the per-bucket count is Binomial(N, 1/B) with mean ~82K and σ ~285 — 20% headroom absorbs the max-of-B order statistic with a ~250× safety margin and the underlying `dolthub/swiss` map no longer triggers a mid-fill rehash.
Live mutex profile on block 1122 showed peak 236 goroutines blocked on `SplitSyncedParentMap.SetIfNotExists` at `model/map.go b.mu.Lock()` — that contention burst correlates 1:1 with the rehash window and disappears once rehash is eliminated.
Bucket count bumped 1024 → 8192 (txMap) and 4096 → 8192 (parentSpendsMap) to halve steady-state collision rate at 192 concurrent inserters from ~4.7% → ~2.3%.
2. Pool the maps across blocks
`txMap` and `parentSpendsMap` are drawn from size-class-keyed `sync.Pool`s (4K..1B for tx counts, 16K..4B for parent spends) so the multi-GB backing storage is reused rather than allocated fresh per block. Maps are reset via the new `Clear()` methods on `dolthub/swiss.Map` wrappers — Clear zeroes the ctrl/metadata in place without releasing the (often multi-GB) group arrays.
Network-load shifts (1K ↔ 1M subtree size) are handled by `sync.Pool`'s natural per-GC drainage: classes that stop being `Get`ed age out automatically.
3. Pool the per-subtree []Node arrays
For a 675M-tx block, ~620 subtrees × ~1M nodes × 48 B per `Node` = ~30 GB allocated and freed per block. `go-subtree` exposes a new `DeserializeFromReaderWithAllocator` that takes a caller-supplied `NodeAllocator`; teranode installs a size-class-bucketed pool (1K/4K/16K/64K/256K/1M/4M Node capacities) and releases the slice back to the pool on cache eviction or any validation/storage failure path.
The existing `DeserializeFromReader`, `deserializeNodes`, and `deserializeFromReaderMmap` paths are byte-for-byte unchanged.
4. Aerospike setMined per-record pools
After validation, every tx in the accepted block is marked mined via Aerospike batch UDFs. Live alloc profile attributed ~975 GB cumulative allocation per block to this path. Two safe in-repo pools land here:
The aerospike-client-go-side packer/unpacker buffer pool (the single largest churn driver) is intentionally not in this PR — main pins upstream `v8.2.0` and that fix requires the bsv fork. It lands separately on the `feat/teranode-native-ops` (PR #828) branch which already uses the fork. See bsv-blockchain/aerospike-client-go#3.
Dependent fork PRs
Both go-tx-map and go-subtree are pinned in this PR via pseudo-version. Bump to a tagged release before merging.
Estimated RAM impact at 675M-tx scale
The packer pool (when PR 828 merges) adds another ~30 GB `bytes.Buffer.grow` saving.
Test plan
Known follow-ups (out of this PR)
Cherry-pick into PR #828
This PR is designed to merge cleanly into `feat/teranode-native-ops` (PR #828). The only meaningful conflict is in `stores/utxo/aerospike/` where 828 uses the renamed bsv fork imports — trivial path-level conflict.