
perf: blocked skip-scan for hdr_value_at_percentiles batch (+134%, stacked on #140) #141

Open
fcostaoliveira wants to merge 2 commits into HdrHistogram:main from fcostaoliveira:perf/blocked-batch-scan-clean

@fcostaoliveira

Contributor

Summary

Follow-up to #140 (single-pass hdr_value_at_percentiles). Stacked on #140 — the first commit
here is #140; this PR's contribution is the second commit, a blocked skip-scan for that batch
fast path.

#140's offset==0 fast path resolves all percentiles in one prefix-sum pass over counts[], but it
tests the next pending target on every element. This commit instead sums a block of 8 counters at
a time, a reduction the compiler autovectorizes (AVX2 under -march=native, NEON on aarch64, scalar
elsewhere; no intrinsics, no runtime dispatch), and skips the whole block while its subtotal cannot
reach values[at_pos]. Only the crossing block is walked element-by-element to resolve the
thresholds inside it. Counts are non-negative, so a block whose subtotal cannot reach the target
contains no crossing element, and results are identical to the per-element scan for any input. The
offset-aware iterator fallback (normalizing_index_offset != 0) is untouched.
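
For readers who want the shape of the loop without opening the diff, here is a minimal sketch of a blocked skip-scan over a flat counter array. It is not the code in this PR: `blocked_scan`, `BLOCK`, `targets` and `out_idx` are hypothetical names, the targets are assumed to be ascending cumulative-count thresholds, and the index->value conversion that the real fast path performs at each crossing is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 8 /* block width described in the PR text */

/* Hypothetical helper (not the HdrHistogram_c API). For each ascending
 * threshold in targets[], stores in out_idx[] the first index i at which the
 * running sum counts[0..i] reaches that threshold. Returns the number of
 * thresholds resolved. Assumes every count is non-negative. */
static size_t blocked_scan(const int64_t *counts, size_t n,
                           const int64_t *targets, size_t n_targets,
                           size_t *out_idx)
{
    int64_t total = 0;
    size_t at_pos = 0;
    size_t i = 0;

    while (i + BLOCK <= n && at_pos < n_targets) {
        /* Plain reduction over one block; a compiler may vectorize it. */
        int64_t sub = 0;
        for (size_t k = 0; k < BLOCK; k++)
            sub += counts[i + k];

        if (total + sub < targets[at_pos]) {
            /* The subtotal cannot reach the pending target: skip the block. */
            total += sub;
            i += BLOCK;
            continue;
        }

        /* Crossing block: walk it element by element. */
        for (size_t k = 0; k < BLOCK; k++) {
            total += counts[i + k];
            while (at_pos < n_targets && total >= targets[at_pos])
                out_idx[at_pos++] = i + k;
        }
        i += BLOCK;
    }

    /* Tail shorter than one block: per-element scan. */
    for (; i < n && at_pos < n_targets; i++) {
        total += counts[i];
        while (at_pos < n_targets && total >= targets[at_pos])
            out_idx[at_pos++] = i;
    }
    return at_pos;
}
```

Because the skip test uses the same running total that the per-element walk would have produced, a skipped block can only be one in which no threshold is crossed, which is the correctness argument given above.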

Benchmark

gnr1 (Intel Granite Rapids), single core, -O3 -march=native, same-session interleaved A/B (base = #140 tip vs this branch), 2 rounds:

| metric | base (#140) | this branch | Δ |
| --- | --- | --- | --- |
| hdr_value_at_percentiles (all-7, calls/sec) | 86,789 | 203,380 | +134% (2.34×) |
| hdr_value_at_percentile singular (Mq/s) | 0.5550 | 0.5550 | 0 (untouched) |
| hdr_record_value (M ops/sec) | 408.4 | 409.2 | flat |

Percentile values byte-identical across base and patch (sink/bsink unchanged).

Tests

  • New test_value_at_percentiles_blocked_parity: a dense histogram (many populated buckets) resolved
    via the blocked fast path must return byte-identical values to the offset-aware iterator reference,
    across a fine percentile sweep (edges + block boundaries).
  • ctest 4/4 green; builds clean under gcc and clang with -Wall -Wextra -Werror -std=c99;
    ASan/UBSan clean; portable (no per-file flags, valid C99).
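
The idea behind the parity test can be illustrated with a small standalone check. This is only a sketch under stated assumptions: it compares a blocked scan against a plain per-element scan on a synthetic counter array rather than against the offset-aware iterator the real test uses, and `ref_index`, `blocked_index`, `parity_mismatches` and `N_COUNTS` are hypothetical names that do not appear in the PR.

```c
#include <stddef.h>
#include <stdint.h>

#define N_COUNTS 203 /* not a multiple of 8, so the tail path is exercised */

/* Reference: first index where the running sum reaches target, else n. */
static size_t ref_index(const int64_t *counts, size_t n, int64_t target)
{
    int64_t running = 0;
    for (size_t i = 0; i < n; i++) {
        running += counts[i];
        if (running >= target)
            return i;
    }
    return n;
}

/* Blocked variant: skip whole blocks of 8 whose subtotal cannot reach the
 * target, then walk per element from the first block that can. */
static size_t blocked_index(const int64_t *counts, size_t n, int64_t target)
{
    int64_t running = 0;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        int64_t sub = 0;
        for (size_t k = 0; k < 8; k++)
            sub += counts[i + k];
        if (running + sub >= target)
            break; /* crossing block found; running excludes it */
        running += sub;
    }
    for (; i < n; i++) {
        running += counts[i];
        if (running >= target)
            return i;
    }
    return n;
}

/* Sweeps every cumulative-count target from 0 to total + 1 over a dense
 * synthetic array and returns how many targets disagree. */
static size_t parity_mismatches(void)
{
    int64_t counts[N_COUNTS];
    int64_t total = 0;
    for (size_t i = 0; i < N_COUNTS; i++) {
        counts[i] = (int64_t)((i * 7u + 3u) % 5u); /* dense, some zeros */
        total += counts[i];
    }

    size_t bad = 0;
    for (int64_t t = 0; t <= total + 1; t++) {
        if (ref_index(counts, N_COUNTS, t) != blocked_index(counts, N_COUNTS, t))
            bad++;
    }
    return bad;
}
```

Sweeping every integer target is finer than any percentile sweep, so it necessarily lands on block boundaries and on the out-of-range case where the target exceeds the total.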

Filipe Oliveira and others added 2 commits July 2, 2026 15:40
hdr_value_at_percentiles resolved percentiles via a per-bucket hdr_iter_next walk.
When normalizing_index_offset == 0 (the common case), replace it with one tight
prefix-sum scan over the flat counts[] array (index->value conversion only at
crossings). Decoded/rotated histograms (offset != 0) keep the offset-aware iterator
path. Percentiles must be ascending, as before; results are byte-identical.
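
As an illustration of the single-pass idea described above, the sketch below resolves several ascending percentiles in one prefix-sum walk over a flat array. It is not the commit's code: `count_at_percentile` and `single_pass_indices` are hypothetical names, the rounding rule (add 0.5, truncate, clamp to at least 1) is an assumption about how a percentile maps to a cumulative count, and the sketch returns array indices instead of converting them to values.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed mapping from a percentile to a cumulative-count threshold. */
static int64_t count_at_percentile(double percentile, int64_t total_count)
{
    double p = percentile < 100.0 ? percentile : 100.0;
    int64_t c = (int64_t)((p / 100.0) * (double)total_count + 0.5);
    return c < 1 ? 1 : c;
}

/* Hypothetical helper: for ascending percentiles[], writes to out_idx[] the
 * first index whose running count reaches each percentile's threshold.
 * All percentiles are resolved in a single pass over counts[]. */
static void single_pass_indices(const int64_t *counts, size_t n,
                                const double *percentiles, size_t n_p,
                                size_t *out_idx)
{
    int64_t total_count = 0;
    for (size_t i = 0; i < n; i++)
        total_count += counts[i];

    size_t at_pos = 0;
    int64_t running = 0;
    for (size_t i = 0; i < n && at_pos < n_p; i++) {
        running += counts[i];
        /* One element may satisfy several pending percentiles at once. */
        while (at_pos < n_p &&
               running >= count_at_percentile(percentiles[at_pos], total_count))
            out_idx[at_pos++] = i;
    }

    /* Empty input: nothing was crossed, so report index 0 for the rest. */
    while (at_pos < n_p)
        out_idx[at_pos++] = 0;
}
```

The ascending-order requirement is what lets a single cursor (`at_pos`) advance monotonically; unsorted percentiles would need either a sort or one pass per percentile.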

Adds a regression test (test_value_at_percentiles_with_offset) that drives a rotated
histogram with normalizing_index_offset != 0 and asserts identical percentiles to the
unrotated one, so CI covers the offset-aware fallback.

The single-pass batch scan (offset==0 fast path) tested the next pending target on
every counts[] element. Sum a block of 8 counters at a time (a reduction that
autovectorizes to AVX2/NEON under -march=native) and skip the whole block while its
subtotal cannot reach values[at_pos]; only the crossing block is walked element by
element to resolve the thresholds inside it. Counts are non-negative, so a block whose
subtotal cannot reach the target contains no crossing element — results are identical
to the per-element scan for any input. Offset-aware iterator fallback
(normalizing_index_offset != 0) is unchanged.

gnr1 (Granite Rapids), single core, -O3 -march=native: hdr_value_at_percentiles
86.8K -> 203.4K calls/sec (+134%); singular read and write unchanged; percentile
values byte-identical.

Adds test_value_at_percentiles_blocked_parity: dense histogram, fine percentile sweep,
blocked fast path vs offset-aware iterator reference, asserts byte-identical.