perf: blocked skip-scan for hdr_value_at_percentiles batch (+134%, stacked on #140) #141
Open
fcostaoliveira wants to merge 2 commits into
hdr_value_at_percentiles resolved percentiles via a per-bucket hdr_iter_next walk. When normalizing_index_offset == 0 (the common case), replace it with one tight prefix-sum scan over the flat counts[] array (index->value conversion only at crossings). Decoded/rotated histograms (offset != 0) keep the offset-aware iterator path. Percentiles must be ascending, as before; results are byte-identical. Adds a regression test (test_value_at_percentiles_with_offset) that drives a rotated histogram with normalizing_index_offset != 0 and asserts identical percentiles to the unrotated one, so CI covers the offset-aware fallback.
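The single-pass idea in this commit message can be sketched in isolation. The following is a simplified illustration, not the library's code: the function name scan_per_element is hypothetical, the targets are assumed to be already converted from percentiles to ascending cumulative counts, and the result is a counts[] index rather than a value (the index->value conversion is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of HdrHistogram_c).
 * For each target cumulative count, store the first index i whose
 * prefix sum counts[0..i] reaches that target. Targets that are never
 * reached get the sentinel n. Targets must be ascending.
 */
static void scan_per_element(const int64_t *counts, size_t n,
                             const int64_t *targets, size_t n_targets,
                             size_t *out_index)
{
    int64_t total = 0;   /* running prefix sum */
    size_t at_pos = 0;   /* next pending target */
    size_t i;
    size_t t;

    /* Precondition: targets are sorted in ascending order. */
    for (t = 1; t < n_targets; t++) {
        assert(targets[t - 1] <= targets[t]);
    }

    for (i = 0; i < n && at_pos < n_targets; i++) {
        total += counts[i];
        /* One element may satisfy several consecutive targets. */
        while (at_pos < n_targets && total >= targets[at_pos]) {
            out_index[at_pos++] = i;
        }
    }

    /* Anything still pending was never reached. */
    while (at_pos < n_targets) {
        out_index[at_pos++] = n;
    }
}
```

Because the targets are ascending, a single forward pass resolves all of them; the cost is one comparison against the pending target per counts[] element.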
The single-pass batch scan (offset==0 fast path) tested the next pending target on every counts[] element. Sum a block of 8 counters at a time (a reduction that autovectorizes to AVX2/NEON under -march=native) and skip the whole block while its subtotal cannot reach values[at_pos]; only the crossing block is walked element by element to resolve the thresholds inside it. Counts are non-negative, so a block whose subtotal cannot reach the target contains no crossing element — results are identical to the per-element scan for any input. Offset-aware iterator fallback (normalizing_index_offset != 0) is unchanged. gnr1 (Granite Rapids), single core, -O3 -march=native: hdr_value_at_percentiles 86.8K -> 203.4K calls/sec (+134%); singular read and write unchanged; percentile values byte-identical. Adds test_value_at_percentiles_blocked_parity: dense histogram, fine percentile sweep, blocked fast path vs offset-aware iterator reference, asserts byte-identical.
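A minimal sketch of the blocked skip-scan described in this commit message, under stated assumptions: scan_blocked and BLOCK are hypothetical names, targets are ascending cumulative counts, the result is a counts[] index rather than a value, and the skip test is taken to be "running total plus block subtotal is still below the pending target". It is an illustration of the technique, not the code in the PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 8 /* counters summed per block, as in the description */

/*
 * Hypothetical helper (not part of HdrHistogram_c).
 * Same contract as a per-element prefix-sum scan: for each ascending
 * target, store the first index whose prefix sum reaches it, or n if
 * it is never reached. Counts are assumed non-negative.
 */
static void scan_blocked(const int64_t *counts, size_t n,
                         const int64_t *targets, size_t n_targets,
                         size_t *out_index)
{
    int64_t total = 0;
    size_t at_pos = 0;
    size_t i = 0;
    size_t k;
    size_t t;

    for (t = 1; t < n_targets; t++) {
        assert(targets[t - 1] <= targets[t]);
    }

    while (i + BLOCK <= n && at_pos < n_targets) {
        /* Plain reduction over the block; a candidate for autovectorization. */
        int64_t sub = 0;
        for (k = 0; k < BLOCK; k++) {
            sub += counts[i + k];
        }

        if (total + sub < targets[at_pos]) {
            /* No prefix inside this block can reach the target: skip it. */
            total += sub;
        } else {
            /* Crossing block: walk it element by element. */
            for (k = 0; k < BLOCK; k++) {
                total += counts[i + k];
                while (at_pos < n_targets && total >= targets[at_pos]) {
                    out_index[at_pos++] = i + k;
                }
            }
        }
        i += BLOCK;
    }

    /* Tail shorter than one block: per-element scan. */
    for (; i < n && at_pos < n_targets; i++) {
        total += counts[i];
        while (at_pos < n_targets && total >= targets[at_pos]) {
            out_index[at_pos++] = i;
        }
    }

    while (at_pos < n_targets) {
        out_index[at_pos++] = n;
    }
}
```

The skip is safe because every prefix sum inside a block is bounded above by the running total plus the block subtotal when counts are non-negative, so a block that fails the test cannot contain a crossing for the pending target or for any later (larger) one.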
Summary
Follow-up to #140 (single-pass hdr_value_at_percentiles). Stacked on #140: the first commit here is #140; this PR's contribution is the second commit, a blocked skip-scan for that batch fast path.

#140's offset==0 fast path resolves all percentiles in one counts[] prefix-sum, but it tests the next pending target on every element. This PR sums a block of 8 counters at a time, a reduction the compiler autovectorizes (AVX2 under -march=native, NEON on aarch64, scalar elsewhere; no intrinsics, no runtime dispatch), and skips the whole block while its subtotal cannot reach values[at_pos]. Only the single crossing block is walked element-by-element to resolve the thresholds inside it. Counts are non-negative, so a block whose subtotal cannot reach the target contains no crossing element: results are identical to the per-element scan for any input. The offset-aware iterator fallback (normalizing_index_offset != 0) is untouched.

Benchmark
gnr1 (Intel Granite Rapids), single core, -O3 -march=native, same-session interleaved A/B (base = #140 tip vs this branch), 2 rounds:

| Metric | Base (#140 tip) vs this branch |
| --- | --- |
| hdr_value_at_percentiles (all-7, calls/sec) | 86.8K -> 203.4K (+134%) |
| hdr_value_at_percentile singular (Mq/s) | unchanged |
| hdr_record_value (M ops/sec) | unchanged |

Percentile values byte-identical across base and patch (sink/bsink unchanged).

Tests
test_value_at_percentiles_blocked_parity: a dense histogram (many populated buckets) resolved via the blocked fast path must return byte-identical values to the offset-aware iterator reference, across a fine percentile sweep (edges + block boundaries).

ctest 4/4 green; builds clean under gcc & clang -Wall -Wextra -Werror -std=c99; ASan/UBSan UB-clean; portable (no per-file flags, valid C99).