perf(inpoints): pack TxInpoints vouts into a single allocation #125
Merged
Replace the nested Idxs [][]uint32 field with a count-prefixed packed uint32 slice. This removes one heap allocation per parent per TxInpoints and removes the cap-8/cap-16 over-allocation in NewTxInpoints. That pre-allocation was sized for many-input txs, but in practice most txs have 1-2 inputs and the slack dominated per-tx memory.

The constructor now sizes both internal buffers to len(tx.Inputs), the upper bound on parent count and total vouts. The no-grow guarantee the original cap-8/cap-16 was aiming for is preserved without paying for unused slack.

The public field Idxs is removed (not renamed) so external code on the old API fails to compile on upgrade rather than silently misusing the new layout. ParentTxHashes stays exported; its semantics are unchanged.

Wire format unchanged. The on-wire encoding [count_i, vals...] for each parent is byte-identical to the new internal layout.

Benchmarks (Apple M3 Max, single-input tx, deserialize path measured with identical hand-crafted wire bytes; build path measured via NewTxInpointsFromTx):

- Build (1 input): 146.4 ns / 644 B / 3 allocs → 23.0 ns / 40 B / 2 allocs
- Deserialize 1 input: 81.1 ns / 112 B / 5 allocs → 62.1 ns / 96 B / 4 allocs
- Deserialize 10 inputs: 328.7 ns / 652 B / 14 allocs → 220.0 ns / 452 B / 4 allocs
- Deserialize 100 inputs: 2596 ns / 6340 B / 104 allocs → 1639 ns / 4148 B / 4 allocs

Allocation count for the deserialize path is now constant at 4 regardless of input count, vs scaling linearly before. The build-path win (-94% bytes, -84% time) is the dominant saving at validator-side ingestion rates of millions of TPS.
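To make the count-prefixed layout concrete, here is a minimal sketch of how per-parent vout lists could be flattened into one slice and read back. The helper names `pack` and `voutsAt` are hypothetical and are not the package's API; the sketch only assumes the `[count_i, vals...]` shape described above.

```go
package main

import "fmt"

// pack flattens per-parent vout lists into one count-prefixed slice:
// [count_0, vouts_0..., count_1, vouts_1..., ...].
// Hypothetical helper for illustration only.
func pack(nested [][]uint32) []uint32 {
	total := len(nested) // one count word per parent
	for _, vouts := range nested {
		total += len(vouts)
	}
	packed := make([]uint32, 0, total) // sized exactly once, so append never grows it
	for _, vouts := range nested {
		packed = append(packed, uint32(len(vouts)))
		packed = append(packed, vouts...)
	}
	return packed
}

// voutsAt returns the vouts of parent i as a sub-slice of packed (no copy).
// It walks the count prefixes, so the lookup cost is proportional to i.
// The second result is false if i is out of range or the data is malformed.
func voutsAt(packed []uint32, i int) ([]uint32, bool) {
	pos := 0
	for parent := 0; pos < len(packed); parent++ {
		count := int(packed[pos])
		start, end := pos+1, pos+1+count
		if end > len(packed) {
			return nil, false // count prefix points past the end
		}
		if parent == i {
			return packed[start:end], true
		}
		pos = end
	}
	return nil, false
}

func main() {
	packed := pack([][]uint32{{0, 3}, {7}, {1, 2, 5}})
	fmt.Println(packed) // [2 0 3 1 7 3 1 2 5]

	vouts, ok := voutsAt(packed, 2)
	fmt.Println(vouts, ok) // [1 2 5] true
}
```

The trade-off this sketch illustrates: the nested form needs one backing array per parent plus the outer slice, while the packed form needs a single backing array, at the cost of a linear walk to reach parent `i`.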
@mrz1836 Breaking changes. Deserves bump to v1.4
Adds a hot-path constructor that wraps caller-owned (parents, voutIdxs) slices directly into a TxInpoints without copying, allocating, or validating. The caller asserts the count-prefix invariant; this is the shape produced by Serialize and by upcoming columnar gRPC handlers that already trust the upstream service to emit well-formed data.

Benchmark on Apple M3 Max: 0.27 ns/op, 0 B/op, 0 allocs/op (the compiler inlines the struct literal). Compare with the 23 ns / 40 B / 2 allocs NewTxInpointsFromTx path.

Intended consumer: teranode's AddTxBatchColumnar handler, which already holds the packed layout in a per-batch buffer and previously had to rebuild a [][]uint32 per tx. With this constructor, block-assembly becomes two slice operations per tx.
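A rough sketch of what such a wrapping constructor amounts to, under stated assumptions: the type `inpoints`, the functions `wrap` and `wellFormed`, and the `hash` type are all hypothetical stand-ins, not the names used in this PR. The point is only that wrapping is a struct literal over the caller's slices, and that the invariant check is something the caller is responsible for.

```go
package main

import "fmt"

// hash is a stand-in for a 32-byte transaction hash (assumption).
type hash [32]byte

// inpoints is a hypothetical stand-in for TxInpoints.
type inpoints struct {
	parents []hash
	packed  []uint32 // count-prefixed: [count_0, vouts_0..., count_1, ...]
}

// wrap stores the caller's slices as-is: no copy, no allocation, no validation.
// The returned value aliases the caller's memory.
func wrap(parents []hash, packed []uint32) inpoints {
	return inpoints{parents: parents, packed: packed}
}

// wellFormed is the invariant the caller asserts and wrap deliberately skips:
// exactly nParents count-prefixed groups that consume packed completely.
func wellFormed(nParents int, packed []uint32) bool {
	pos := 0
	for p := 0; p < nParents; p++ {
		if pos >= len(packed) {
			return false
		}
		pos += 1 + int(packed[pos])
	}
	return pos == len(packed)
}

func main() {
	// A per-batch buffer holding two txs back to back:
	// tx0 has one parent with vout 0; tx1 has one parent with vouts 4 and 9.
	parents := make([]hash, 2)
	batch := []uint32{1, 0, 2, 4, 9}

	// Two slice operations per tx, then a struct literal.
	tx0 := wrap(parents[0:1], batch[0:2])
	tx1 := wrap(parents[1:2], batch[2:5])

	fmt.Println(wellFormed(len(tx0.parents), tx0.packed)) // true
	fmt.Println(wellFormed(len(tx1.parents), tx1.packed)) // true

	// Because nothing is copied, later writes to the batch buffer are
	// visible through the wrapped value.
	batch[3] = 5
	fmt.Println(tx1.packed) // [2 5 9]
}
```

The aliasing shown at the end is the cost of skipping the copy: the batch buffer must not be reused or mutated while any wrapped value is still live.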



Summary
Replaces the nested `Idxs [][]uint32` field on `TxInpoints` with an unexported count-prefixed packed `[]uint32`. Drops one heap allocation per parent per `TxInpoints` and removes the cap-8/cap-16 over-allocation in `NewTxInpoints`. That pre-allocation was sized for many-input transactions, but in practice most txs have 1-2 inputs and the unused slack dominated per-tx memory.

The constructor now sizes both internal buffers to `len(tx.Inputs)`, the upper bound on parent count and total vouts. The no-grow guarantee the original cap-8/cap-16 was aiming for is preserved without paying for unused capacity.

Wire format unchanged
The on-wire encoding `[count_i, vals...]` per parent is byte-identical to the new internal layout, so the deserializer writes straight into a single allocation. Serialised blobs interoperate across versions without a version bump.

Breaking change (intentional)
The public field `Idxs` is removed rather than renamed. External callers on the old API will get a compile error rather than silently misusing the new layout. Migration:

- `txi.Idxs[i]` → `txi.GetParentVoutsAtIndex(i)`
- `txi.GetTxInpoints()`
- `Idxs: [][]uint32{...}` → build via `NewTxInpointsFromTx` / `NewTxInpointsFromInputs`, or use `appendInput` from within the package

`ParentTxHashes` stays exported; its semantics are unchanged.

Benchmarks
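Apple M3 Max, single-input tx via `NewTxInpointsFromTx`; deserialize paths use identical hand-crafted wire bytes for apples-to-apples comparison.

| Path | Before | After |
| --- | --- | --- |
| Build (1 input, `NewTxInpointsFromTx`) | 146.4 ns / 644 B / 3 allocs | 23.0 ns / 40 B / 2 allocs |
| Deserialize 1 input | 81.1 ns / 112 B / 5 allocs | 62.1 ns / 96 B / 4 allocs |
| Deserialize 10 inputs | 328.7 ns / 652 B / 14 allocs | 220.0 ns / 452 B / 4 allocs |
| Deserialize 100 inputs | 2596 ns / 6340 B / 104 allocs | 1639 ns / 4148 B / 4 allocs |

Deserialize allocation count is now constant at 4 regardless of input count, vs scaling linearly with parent count before. The build-path win (-94 % bytes, -84 % time) is the dominant saving at validator-side ingestion rates.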
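The linear-versus-constant allocation pattern can be reproduced in isolation with `testing.AllocsPerRun` from the standard library. The sketch below is not the PR's benchmark: `decodeNested`, `decodePacked`, `wireFor`, and `measure` are hypothetical helpers, the "wire" data is modelled as already-decoded `uint32` words, and each parent is assumed to have exactly one vout. Absolute counts will differ from the figures above because the real deserializer also allocates the struct and the parent hashes.

```go
package main

import (
	"fmt"
	"testing"
)

// Package-level sinks force the decoded slices to escape to the heap,
// so the compiler cannot optimise the allocations away.
var (
	sinkNested [][]uint32
	sinkPacked []uint32
)

// wireFor builds count-prefixed words for n parents with one vout each:
// [1, 0, 1, 1, 1, 2, ...].
func wireFor(n int) []uint32 {
	words := make([]uint32, 0, 2*n)
	for i := 0; i < n; i++ {
		words = append(words, 1, uint32(i))
	}
	return words
}

// decodeNested rebuilds the old [][]uint32 shape: one outer slice plus one
// inner slice per parent. Assumes words is well-formed.
func decodeNested(words []uint32, nParents int) [][]uint32 {
	out := make([][]uint32, 0, nParents)
	for pos := 0; pos < len(words); {
		count := int(words[pos])
		vouts := make([]uint32, count)
		copy(vouts, words[pos+1:pos+1+count])
		out = append(out, vouts)
		pos += 1 + count
	}
	return out
}

// decodePacked keeps the count-prefixed shape: one allocation, one copy.
func decodePacked(words []uint32) []uint32 {
	out := make([]uint32, len(words))
	copy(out, words)
	return out
}

// measure reports average allocations per decode for both shapes.
func measure(n int) (nested, packed float64) {
	words := wireFor(n)
	nested = testing.AllocsPerRun(100, func() { sinkNested = decodeNested(words, n) })
	packed = testing.AllocsPerRun(100, func() { sinkPacked = decodePacked(words) })
	return nested, packed
}

func main() {
	for _, n := range []int{1, 10, 100} {
		nested, packed := measure(n)
		fmt.Printf("parents=%-3d nested=%.0f allocs  packed=%.0f allocs\n", n, nested, packed)
	}
}
```

On a typical Go toolchain the nested count should grow with the number of parents while the packed count stays flat, which is the same shape as the deserialize rows in the table.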
Test plan
- `go vet ./...`
- `go test -race ./...`
- `golangci-lint run ./...`: no new issues (4 pre-existing G115 warnings in `mmap_unix.go` / `subtree_fuzz_test.go` remain, unrelated)
- `TestTxInpoints_DedupAndRoundTrip` covers multi-parent dedup + wire round-trip
- `teranode` migrating `services/blockassembly/Client.go` off the removed `Idxs` field