Skip to content

perf(blockassembly): adopt packed TxInpoints + zero-alloc columnar path#889

Merged
icellan merged 6 commits into
mainfrom
perf/txinpoints-packed-layout
May 19, 2026
Merged

perf(blockassembly): adopt packed TxInpoints + zero-alloc columnar path#889
icellan merged 6 commits into
mainfrom
perf/txinpoints-packed-layout

Conversation

@icellan
Copy link
Copy Markdown
Contributor

@icellan icellan commented May 18, 2026

Summary

Bumps go-subtree to v1.4.0 (packed-layout TxInpoints) and reworks both block-assembly receive paths to consume the new shape with minimal heap churn.

Block-assembly is the single non-horizontally-scalable service in the cluster, so every per-tx allocation eliminated there is amortised across every TPS-second.

What go-subtree v1.4.0 brings

  • TxInpoints stores per-parent vouts in a single count-prefixed []uint32 (was nested [][]uint32) — one fewer alloc per parent.
  • NewTxInpoints sizes via len(tx.Inputs), preserving the no-grow guarantee the prior cap-8 / cap-16 was aiming for without paying for unused slack on typical 1-2-input txs.
  • NewTxInpointsFromPacked(parents, voutIdxs) aliases caller-owned slices into a TxInpoints with zero allocation and zero validation — the hot-path constructor used by columnar gRPC.
  • The public Idxs field is removed (rather than renamed) so external code on the old API fails to compile rather than silently misuse the new layout.

What this PR does in teranode

1. AddTxBatchColumnar wire-format bump (proto)

  • Replaced parent_vout_indices + vout_idx_offsets (per-parent) with vout_idxs_packed + vout_idxs_tx_offsets (per-tx).
  • The packed buffer is byte-identical to TxInpoints.voutIdxs's internal layout, so the Server aliases a per-tx slice straight into a TxInpoints via NewTxInpointsFromPacked.
  • Parent hashes reinterpreted with unsafe.Slice once per batch (chainhash.Hash is [32]byte, byte-aligned over []byte backing).
  • Per-tx allocations in the columnar handler: ~4 → 0.

2. AddTxBatch (row-oriented) handler

  • Batch-allocate []TxInpoints values once and take addresses into txInpointsList, instead of escaping a fresh TxInpoints per tx.
  • Per-tx allocations: ~2 → 0.

3. Client.go columnar emit

  • Build vout_idxs_packed as [count_0, vals..., count_1, vals..., ...] directly, ready for zero-decode consumption.

4. Test migrations (17 files)

  • Construction sites now use NewTxInpointsFromInputs with fake *bt.Input values; read sites use GetParentVoutsAtIndex.
  • server_columnar_test.go rewritten end-to-end against the new proto shape including a new validator for vout_idxs_tx_offsets length.

Expected impact at 1 M+ TPS

With UseColumnarBatch=true and the packed layout enabled:

Metric Before After
Server-side allocs/sec (receive path) ~4 M ~3 K
GC-tracked objects in receive path ~1.2 B at 600 M live txs ~1.2 M (≈1000× reduction)
Estimated CPU saved on block-assembly ~20-25 % (dominated by GC mark)
Wire bytes per 1-input tx ~110 B ~95 B

The CPU win comes mostly from collapsing per-tx slice objects in currentTxMap into shared per-batch buffers that GC counts as one object regardless of how many transactions alias into them.

Deploy notes

  • The AddTxBatchColumnar wire-format change is internal between Validator and block-assembly. Validators and block-assembly must deploy together.
  • UseColumnarBatch is still false by default — flip it per environment after the lock-step deploy.
  • The row-oriented AddTxBatch path stays fully compatible.

Test plan

  • go build ./...
  • go vet ./... clean (modulo pre-existing test/utils/ lock-by-value warnings)
  • go test -race ./services/blockassembly/... ./services/asset/httpimpl/... ./stores/utxo/meta/... ./stores/utxo/ ./stores/txmetacache/... ./services/validator/... ./model/... — all green
  • golangci-lint run clean on touched files (only pre-existing prealloc warnings remain in unrelated benchmarks)
  • Smoke test in scale-2 (toggle UseColumnarBatch=true after deploy)

Related

Bumps go-subtree to v1.4.0 (packed-layout TxInpoints) and reworks the
block-assembly receive paths to consume the new shape with minimal heap
churn. The block-assembly Server is the single non-horizontally-scalable
service in the cluster, so every per-tx allocation eliminated there
pays off proportionally to ingestion rate.

go-subtree v1.4.0 changes (consumed here):
  - TxInpoints stores per-parent vouts in a single count-prefixed
    []uint32 (was nested [][]uint32) — drops one alloc per parent.
  - NewTxInpoints sizes via len(tx.Inputs), preserves the no-grow
    guarantee the prior cap-8/cap-16 was aiming for without paying for
    unused slack on typical 1-2-input txs.
  - NewTxInpointsFromPacked aliases caller-owned (parents, voutIdxs)
    slices into a TxInpoints with zero allocation and zero validation.
  - Public field Idxs removed; callers use GetParentVoutsAtIndex.

teranode changes:

1. AddTxBatchColumnar wire format (proto bump):
   - Replaced parent_vout_indices + vout_idx_offsets (per-parent) with
     vout_idxs_packed + vout_idxs_tx_offsets (per-tx). The packed
     buffer is byte-identical to the layout block-assembly stores
     internally, so the Server aliases a per-tx slice straight into a
     TxInpoints via NewTxInpointsFromPacked. Two slice ops per tx,
     zero allocation.
   - Parent hashes reinterpreted with unsafe.Slice once per batch
     (chainhash.Hash is [32]byte, byte-aligned over []byte backing).
   - Per-tx allocations in the columnar handler: ~4 -> 0.

2. AddTxBatch (row-oriented) handler:
   - Batch-allocate []TxInpoints values once and take addresses into
     txInpointsList, instead of escaping a fresh TxInpoints per tx.
   - Per-tx allocations: ~2 -> 0.

3. Client.go columnar emit:
   - Build vout_idxs_packed as [count_0, vals..., count_1, vals..., ...]
     directly, ready for zero-decode consumption.

Test migrations:
   - 17 test files moved off the removed Idxs field. Construction sites
     now use NewTxInpointsFromInputs with fake *bt.Input values; read
     sites use GetParentVoutsAtIndex.
   - server_columnar_test.go rewritten end-to-end against the new
     proto shape including a new validator for vout_idxs_tx_offsets
     length.

Expected at 1M+ TPS (UseColumnarBatch=true, packed layout):
   - Server-side allocs/sec: ~4M -> ~3K (-99.9 %)
   - GC-tracked objects in receive path: ~1000x reduction (per-tx
     TxInpoints inner slices replaced with batch-shared buffers
     aliased into per-tx slice headers).
   - Estimated ~20-25 % CPU saved on block-assembly, dominated by
     reduced GC mark cost.

Wire-protocol bump is internal between Validator and block-assembly;
deploy in lock-step.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

🤖 Claude Code Review

Status: Complete


Current Review:
Both previously flagged issues have been addressed:

  • Offset validation (Server.go:1094-1139): Comprehensive bounds checking prevents slice-out-of-range panics. Validates start positions, endpoints, and monotonicity for both offset arrays. Covered by regression tests.

  • Proto field reservation (blockassembly_api.proto:151-152): Fields 8 and 9 are properly reserved to prevent silent data corruption during version skew. New fields use tags 10 and 11.

The implementation demonstrates careful attention to performance and safety:

  • Zero-allocation columnar deserialization with proper validation
  • Extensive test coverage including malformed input cases
  • Well-documented unsafe pointer usage with clear lifetime assumptions
  • Appropriate use of field reservations to fail-fast on version mismatches

No additional issues found.

@github-actions
Copy link
Copy Markdown
Contributor

Critical Issue Found in Server.go:1089

[Critical] Missing bounds validation for offset arrays

The validation checks that ParentTxOffsets and VoutIdxsTxOffsets have the correct length (txCount+1), but does not verify that the offset values themselves are within bounds of their respective data arrays.

Potential panic scenarios at lines 1131-1132 if offset values exceed array bounds.

See services/blockassembly/Server.go:1089-1132 for details.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades go-subtree to v1.4.0 and adapts block-assembly ingestion paths and tests to the new packed TxInpoints layout, with the main goal of reducing allocations in block-assembly receive paths.

Changes:

  • Updates columnar block-assembly gRPC format to send packed vout indices and reconstruct TxInpoints without per-transaction decoding.
  • Optimizes row-oriented batch ingestion by storing TxInpoints values in a batch slice.
  • Migrates tests and helpers away from the removed TxInpoints.Idxs field.

Reviewed changes

Copilot reviewed 25 out of 27 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
go.mod Bumps go-subtree and go-safe-conversion versions.
go.sum Adds checksums for updated dependencies.
services/blockassembly/Server.go Reworks row and columnar AddTxBatch receive paths for packed inpoints.
services/blockassembly/Client.go Emits count-prefixed packed vout indices for columnar batches.
services/blockassembly/blockassembly_api/blockassembly_api.proto Updates columnar request fields/documentation for packed vouts.
services/blockassembly/blockassembly_api/blockassembly_api.pb.go Regenerates Go protobuf bindings for the updated request shape.
services/blockassembly/server_columnar_test.go Migrates columnar tests to the new packed format.
services/blockassembly/inpoints_helper_test.go Adds a test helper for single-parent TxInpoints.
services/blockassembly/server_test.go Updates test inpoint construction to new APIs.
services/blockassembly/filter_transactions_test.go Replaces direct Idxs literals with helper-built inpoints.
services/blockassembly/data_test.go Updates serialization tests for packed inpoint construction.
services/blockassembly/parent_validation_ordering_test.go Builds parent inpoints via fake inputs.
services/blockassembly/subtreeprocessor/SubtreeProcessor_test.go Updates subtree metadata test inpoint construction.
services/blockassembly/subtreeprocessor/disk_tx_map_test.go Migrates disk tx map serialization test to packed inpoints.
services/blockassembly/subtreeprocessor/conflicting_queue_race_test.go Updates race test inpoints helper usage.
services/validator/Server_test.go Adds shared validator test helper for single-parent inpoints.
services/validator/Server_coverage_test.go Reuses validator inpoint helper in coverage tests.
services/asset/httpimpl/test.go Builds transaction metadata inpoints via NewTxInpointsFromInputs.
services/asset/httpimpl/GetTransactionMeta_test.go Replaces static inpoint literal with constructor helper.
stores/utxo/unmined_serialization_test.go Updates unmined transaction serialization fixtures.
stores/utxo/meta/data_test.go Adds test helpers and updates metadata tests/benchmarks.
stores/txmetacache/txmetacache_test.go Updates TxMetaCache assertions and fixtures for packed inpoints.
model/TestHelper.go Builds generated block subtree metadata using new inpoint constructor.
model/Block_test.go Migrates subtree metadata test construction to fake inputs.
test/longtest/services/blockassembly/subtreeprocessor/SubtreeProcessor_test.go Updates long-test subtree inpoints setup.
test/longtest/services/blockassembly/Server_test.go Updates performance-test serialized inpoints setup.
test/longtest/model/BlockBig_test.go Adjusts expected empty TxInpoints value.
Files not reviewed (1)
  • services/blockassembly/blockassembly_api/blockassembly_api.pb.go: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread services/blockassembly/Server.go
Comment thread services/blockassembly/blockassembly_api/blockassembly_api.proto Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-889 (5e0e4a3)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 142
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.661µ 1.681µ ~ 0.100
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.79n 61.84n ~ 1.000
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 62.15n 61.86n ~ 0.300
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.76n 61.70n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 29.33n 30.44n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 49.67n 50.38n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 108.0n 112.4n ~ 0.100
MiningCandidate_Stringify_Short-4 268.5n 262.5n ~ 0.100
MiningCandidate_Stringify_Long-4 1.869µ 1.870µ ~ 0.700
MiningSolution_Stringify-4 954.1n 963.0n ~ 0.100
BlockInfo_MarshalJSON-4 1.754µ 1.754µ ~ 0.800
NewFromBytes-4 94.19n 104.60n ~ 0.700
Mine_EasyDifficulty-4 67.47µ 67.39µ ~ 0.400
Mine_WithAddress-4 7.343µ 7.329µ ~ 1.000
BlockAssembler_AddTx-4 0.02916n 0.02854n ~ 1.000
AddNode-4 10.50 10.53 ~ 1.000
AddNodeWithMap-4 11.45 10.88 ~ 0.200
DirectSubtreeAdd/4_per_subtree-4 55.44n 60.58n ~ 0.100
DirectSubtreeAdd/64_per_subtree-4 30.01n 28.96n ~ 0.200
DirectSubtreeAdd/256_per_subtree-4 27.76n 27.84n ~ 0.700
DirectSubtreeAdd/1024_per_subtree-4 26.41n 26.57n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 25.98n 26.10n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 293.4n 291.7n ~ 0.400
SubtreeProcessorAdd/64_per_subtree-4 285.8n 292.6n ~ 0.100
SubtreeProcessorAdd/256_per_subtree-4 282.2n 303.6n ~ 0.100
SubtreeProcessorAdd/1024_per_subtree-4 277.0n 280.0n ~ 0.200
SubtreeProcessorAdd/2048_per_subtree-4 278.5n 280.4n ~ 0.700
SubtreeProcessorRotate/4_per_subtree-4 285.5n 283.6n ~ 0.400
SubtreeProcessorRotate/64_per_subtree-4 284.6n 290.0n ~ 0.700
SubtreeProcessorRotate/256_per_subtree-4 280.7n 289.1n ~ 0.100
SubtreeProcessorRotate/1024_per_subtree-4 283.4n 279.5n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 54.98n 54.95n ~ 1.000
SubtreeNodeAddOnly/64_per_subtree-4 36.05n 35.95n ~ 0.700
SubtreeNodeAddOnly/256_per_subtree-4 35.04n 35.04n ~ 1.000
SubtreeNodeAddOnly/1024_per_subtree-4 34.49n 34.28n ~ 0.100
SubtreeCreationOnly/4_per_subtree-4 110.9n 112.3n ~ 0.200
SubtreeCreationOnly/64_per_subtree-4 351.6n 355.4n ~ 1.000
SubtreeCreationOnly/256_per_subtree-4 1.234µ 1.232µ ~ 0.700
SubtreeCreationOnly/1024_per_subtree-4 3.837µ 3.797µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 6.934µ 6.820µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 303.2n 287.0n ~ 0.100
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 348.1n 291.0n ~ 0.100
ParallelGetAndSetIfNotExists/1k_nodes-4 2.027m 2.012m ~ 1.000
ParallelGetAndSetIfNotExists/10k_nodes-4 5.270m 5.205m ~ 0.200
ParallelGetAndSetIfNotExists/50k_nodes-4 7.463m 7.551m ~ 1.000
ParallelGetAndSetIfNotExists/100k_nodes-4 11.15m 10.15m ~ 0.100
SequentialGetAndSetIfNotExists/1k_nodes-4 1.805m 1.781m ~ 0.700
SequentialGetAndSetIfNotExists/10k_nodes-4 4.789m 4.602m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 14.20m 14.54m ~ 1.000
SequentialGetAndSetIfNotExists/100k_nodes-4 25.75m 26.31m ~ 0.400
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 2.046m 2.065m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 8.433m 8.543m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 13.49m 13.68m ~ 0.400
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 1.849m 1.821m ~ 1.000
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 8.916m 8.146m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 44.82m 45.02m ~ 0.700
DiskTxMap_SetIfNotExists-4 3.847µ 3.886µ ~ 0.700
DiskTxMap_SetIfNotExists_Parallel-4 3.587µ 3.534µ ~ 0.400
DiskTxMap_ExistenceOnly-4 321.0n 338.5n ~ 0.100
Queue-4 188.1n 189.4n ~ 0.200
AtomicPointer-4 3.684n 3.239n ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/10K-4 803.0µ 807.9µ ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/10K-4 762.8µ 763.4µ ~ 1.000
ReorgOptimizations/AllMarkFalse/Old/10K-4 102.8µ 104.9µ ~ 0.700
ReorgOptimizations/AllMarkFalse/New/10K-4 64.19µ 64.06µ ~ 1.000
ReorgOptimizations/HashSlicePool/Old/10K-4 50.05µ 49.94µ ~ 0.700
ReorgOptimizations/HashSlicePool/New/10K-4 11.02µ 10.95µ ~ 0.100
ReorgOptimizations/NodeFlags/Old/10K-4 4.714µ 4.332µ ~ 0.100
ReorgOptimizations/NodeFlags/New/10K-4 1.724µ 1.540µ ~ 0.100
ReorgOptimizations/DedupFilterPipeline/Old/100K-4 9.563m 9.531m ~ 1.000
ReorgOptimizations/DedupFilterPipeline/New/100K-4 9.729m 10.147m ~ 0.100
ReorgOptimizations/AllMarkFalse/Old/100K-4 1.068m 1.103m ~ 0.100
ReorgOptimizations/AllMarkFalse/New/100K-4 704.9µ 703.3µ ~ 0.400
ReorgOptimizations/HashSlicePool/Old/100K-4 473.5µ 497.0µ ~ 0.200
ReorgOptimizations/HashSlicePool/New/100K-4 197.0µ 197.0µ ~ 1.000
ReorgOptimizations/NodeFlags/Old/100K-4 48.96µ 48.08µ ~ 0.100
ReorgOptimizations/NodeFlags/New/100K-4 16.83µ 16.73µ ~ 0.700
TxMapSetIfNotExists-4 46.26n 49.71n ~ 0.100
TxMapSetIfNotExistsDuplicate-4 38.74n 42.57n ~ 0.100
ChannelSendReceive-4 574.3n 576.9n ~ 0.700
CalcBlockWork-4 499.8n 470.4n ~ 0.700
CalculateWork-4 624.4n 672.4n ~ 0.200
BuildBlockLocatorString_Helpers/Size_10-4 1.361µ 1.385µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_100-4 13.51µ 13.06µ ~ 0.700
BuildBlockLocatorString_Helpers/Size_1000-4 130.8µ 162.1µ ~ 0.700
CatchupWithHeaderCache-4 104.7m 104.5m ~ 0.700
_BufferPoolAllocation/16KB-4 3.813µ 4.018µ ~ 0.100
_BufferPoolAllocation/32KB-4 8.995µ 9.407µ ~ 0.400
_BufferPoolAllocation/64KB-4 18.42µ 16.03µ ~ 0.100
_BufferPoolAllocation/128KB-4 36.20µ 25.59µ ~ 0.100
_BufferPoolAllocation/512KB-4 96.89µ 101.34µ ~ 0.400
_BufferPoolConcurrent/32KB-4 18.62µ 18.50µ ~ 1.000
_BufferPoolConcurrent/64KB-4 29.52µ 31.31µ ~ 0.100
_BufferPoolConcurrent/512KB-4 144.0µ 143.0µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/16KB-4 623.4µ 627.1µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/32KB-4 624.8µ 622.5µ ~ 1.000
_SubtreeDeserializationWithBufferSizes/64KB-4 613.5µ 616.1µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 601.8µ 598.3µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/512KB-4 617.2µ 606.8µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 36.63m 36.15m ~ 0.700
_SubtreeDataDeserializationWithBufferSizes/32KB-4 36.70m 36.24m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 36.52m 36.21m ~ 0.200
_SubtreeDataDeserializationWithBufferSizes/128KB-4 36.53m 36.01m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/512KB-4 36.06m 35.96m ~ 1.000
_PooledVsNonPooled/Pooled-4 740.1n 737.9n ~ 0.100
_PooledVsNonPooled/NonPooled-4 7.805µ 7.546µ ~ 0.200
_MemoryFootprint/Current_512KB_32concurrent-4 6.610µ 6.688µ ~ 0.400
_MemoryFootprint/Proposed_32KB_32concurrent-4 9.303µ 9.522µ ~ 0.100
_MemoryFootprint/Alternative_64KB_32concurrent-4 9.119µ 9.712µ ~ 0.100
SubtreeSizes/10k_tx_4_per_subtree-4 1.380m 1.396m ~ 0.400
SubtreeSizes/10k_tx_16_per_subtree-4 322.6µ 322.1µ ~ 0.700
SubtreeSizes/10k_tx_64_per_subtree-4 77.18µ 76.68µ ~ 1.000
SubtreeSizes/10k_tx_256_per_subtree-4 19.15µ 19.09µ ~ 1.000
SubtreeSizes/10k_tx_512_per_subtree-4 9.408µ 9.528µ ~ 0.200
SubtreeSizes/10k_tx_1024_per_subtree-4 4.652µ 4.734µ ~ 0.200
SubtreeSizes/10k_tx_2k_per_subtree-4 2.315µ 2.355µ ~ 0.100
BlockSizeScaling/10k_tx_64_per_subtree-4 74.57µ 74.58µ ~ 0.700
BlockSizeScaling/10k_tx_256_per_subtree-4 18.80µ 18.96µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.638µ 4.762µ ~ 0.100
BlockSizeScaling/50k_tx_64_per_subtree-4 395.3µ 393.9µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 93.20µ 95.44µ ~ 0.200
BlockSizeScaling/50k_tx_1024_per_subtree-4 22.86µ 23.06µ ~ 0.400
SubtreeAllocations/small_subtrees_exists_check-4 156.2µ 161.1µ ~ 0.100
SubtreeAllocations/small_subtrees_data_fetch-4 166.7µ 170.6µ ~ 0.400
SubtreeAllocations/small_subtrees_full_validation-4 321.6µ 328.6µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 9.253µ 9.434µ ~ 0.100
SubtreeAllocations/medium_subtrees_data_fetch-4 9.389µ 9.888µ ~ 0.100
SubtreeAllocations/medium_subtrees_full_validation-4 18.46µ 19.02µ ~ 0.100
SubtreeAllocations/large_subtrees_exists_check-4 2.191µ 2.274µ ~ 0.100
SubtreeAllocations/large_subtrees_data_fetch-4 2.348µ 2.409µ ~ 0.100
SubtreeAllocations/large_subtrees_full_validation-4 4.630µ 4.793µ ~ 0.100
_prepareTxsPerLevel-4 391.3m 388.8m ~ 1.000
_prepareTxsPerLevelOrdered-4 4.897m 4.133m ~ 0.100
_prepareTxsPerLevel_Comparison/Original-4 400.3m 386.5m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 4.603m 4.075m ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 333.2µ 332.6µ ~ 1.000
StoreBlock_Sequential/AboveCSVHeight-4 333.7µ 334.4µ ~ 1.000
GetUtxoHashes-4 276.8n 277.6n ~ 1.000
GetUtxoHashes_ManyOutputs-4 45.42µ 45.53µ ~ 0.400
_NewMetaDataFromBytes-4 232.0n 213.3n ~ 0.100
_Bytes-4 610.2n 395.4n ~ 0.100
_MetaBytes-4 560.8n 336.3n ~ 0.100

Threshold: >10% with p < 0.05 | Generated: 2026-05-19 10:44 UTC

icellan added a commit that referenced this pull request May 18, 2026
PR #889 bumped go-subtree to v1.4.0, which made SubtreeMeta.Serialize
reject empty TxInpoints on non-coinbase entries with "parent tx
hashes are not set for node N". The test helper still seeded every
node with subtreepkg.NewTxInpoints() (no parents), so
TestBlock_ValidateSubtree_MissingParents/multiple_invalid_parents and
TestBlock_ValidateSubtree_NodeIteration/subtree_with_multiple_nodes_iteration
broke as soon as they used the multi-node form.

Helper now assigns a single dummy parent to every node except a
coinbase placeholder at index 0. Matches the construction pattern used
by createSubtreeMetadataWithParents and by the existing PR #889
migration in TestHelper.go.
Copy link
Copy Markdown
Collaborator

@freemans13 freemans13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are failing tests but aside from that this is fab

icellan added 5 commits May 19, 2026 11:06
go-subtree v1.4.0's NewTxInpoints() returns the zero value (nil
ParentTxHashes). Meta.Serialize rejects nil ParentTxHashes for
non-coinbase nodes, so createValidSubtreeMetadata and the two
inline-init sites in Block_test.go fail with "cannot serialize, parent
tx hashes are not set for node N" when constructing test fixtures.

Pre-v1.4.0 NewTxInpoints returned a cap-8 empty slice, so the same
code path worked. Replace those five sites with an explicit
TxInpoints{ParentTxHashes: []chainhash.Hash{}} literal — semantically
identical and unambiguous about the invariant the test relies on.

The proper fix is upstream — go-subtree PR #126 restores the non-nil
empty ParentTxHashes from NewTxInpoints. Once v1.4.1 lands these
literals can be reverted to NewTxInpoints() calls.
v1.4.1 lands the NewTxInpoints empty-but-non-nil ParentTxHashes fix
(go-subtree #126). With that in place the interim Block_test.go
workaround reverted in the previous commit is no longer needed.
Two findings from external review of PR #889 addressed:

1. Proto field-number reuse: fields 8/9 previously carried different
   semantics than the new packed-layout fields. Same wire type means a
   misversioned client silently decodes into the wrong fields. Renumber
   the new fields to 10/11 and add `reserved 8, 9` (+ reserved names).
   Even with the lock-step-deploy operational model, this prevents the
   silent semantic break if a stray old-version request ever arrives.

2. Offset-validation gaps in AddTxBatchColumnar: the handler indexed
   into parents[ParentTxOffsets[i]:ParentTxOffsets[i+1]] and similarly
   for VoutIdxsPacked without checking offset endpoints, bounds, or
   monotonicity. A malformed request triggered a
   slice-bounds-out-of-range panic. grpc-go does not recover handler
   panics — a single bad packet would crash the entire single-instance
   block-assembly process. Add cheap endpoint + monotonicity checks
   that guarantee every per-tx slice expression is bounds-safe.

The validator is trusted at the semantic layer, so we do not walk the
packed voutIdxs to verify the count-prefix invariant (that walk would
be O(B*P) on the hot path); we only do what is required to convert a
process-crashing panic into a clean InvalidArgument response.

Cost: 419 ns per 1000-tx batch (Apple M3 Max, isolated validation
loop benchmark). At 1M TPS with batch 1000 that is ~0.04 % of a core.

Tests:
   - TestAddTxBatchColumnar_RejectsOOBParentTxOffsets
   - TestAddTxBatchColumnar_RejectsNonMonotonicOffsets
   - TestAddTxBatchColumnar_RejectsOOBVoutIdxsTxOffsets
   - BenchmarkAddTxBatchColumnar_Validation (full handler, 1465 ns/op)
   - BenchmarkOffsetValidationLoop (isolated, 419 ns/op)

Defensive comment added in Server.go documenting the proto.Unmarshal
aliasing assumption — the unsafe.Slice on ParentTxHashesPacked relies
on protobuf decoding bytes fields into fresh Go-heap allocations
rather than aliasing the gRPC receive buffer. A future zero-copy codec
would invalidate this and require rework.
@sonarqubecloud
Copy link
Copy Markdown

Copy link
Copy Markdown
Contributor

@oskarszoon oskarszoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both prior critical blockers are closed:

  • C1 (offset validation): cb21b2240 adds 4 endpoint checks + monotonicity loop in AddTxBatchColumnar. Every slice expression in the per-tx loop is bounds-safe; new tests cover OOB and non-monotonic cases. Cost ~0.04% of a core at 1M TPS / batch 1000.
  • C2 (proto field reuse): .proto reserves field numbers 8 and 9 plus the old field names; new fields renumbered to 10/11. .pb.go regenerated cleanly.

go-subtree v1.4.0 → v1.4.1 is a correctness-only bump (empty-but-non-nil ParentTxHashes), zero perf delta. The Merkle-root lift in submitMiningSolution is not new in this PR — it came in via the merge from main.

@icellan icellan merged commit f896e60 into main May 19, 2026
30 checks passed
@icellan icellan deleted the perf/txinpoints-packed-layout branch May 19, 2026 11:28
icellan added a commit to icellan/teranode that referenced this pull request May 19, 2026
A rolling-deploy mismatch where one block-assembly replica predates bsv-blockchain#889
silently drops to the row path; without a log line operators
investigating "why is this batch slow" can't tell the fallback fired.
Debug level keeps it out of normal log volume but available when
needed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants