
Bulk-fill definition levels for majority-null leaf columns #9967

Open
RyanJamesStewart wants to merge 6 commits into apache:main from RyanJamesStewart:perf/arrow-parquet-null-heavy

Conversation

@RyanJamesStewart

@RyanJamesStewart RyanJamesStewart commented May 13, 2026

Which issue does this PR close?

AI assistance

Implementation drafted with AI assistance and iterated against the benchmarks below. I've reviewed and own the code, including the gate threshold, which I picked from the sweep described under Threshold (BULK_FILL_MIN_LEN). This disclosure follows the project's CONTRIBUTING guidance on AI-generated submissions.

Rationale for this change

When writing a nullable leaf (primitive) Arrow array, write_leaf builds the definition-level buffer one element at a time, mapping each null bit to a level. For columns that are mostly null this does ~num_rows of branchy work and allocates a num_rows-element level buffer even though almost every produced level is the same value. #9954 adds an O(1) fast path for the entirely null case; this PR covers the sparse (mostly-but-not-entirely null) case it doesn't handle, the literal subject of #9731 ("a column that is 99% null … ~100x more work than necessary").

What changes are included in this PR?

A single popcount pass over the null mask (Buffer::count_set_bits_offset, O(num_rows/64)) counts the valid values in the range. When the slice is majority-null, the definition-level buffer is bulk-filled with the null level (a vectorized Vec::resize memset) and only the non-null positions (from NullBuffer::valid_indices()) are overwritten. The existing per-row path is kept for non-majority-null slices, so balanced and null-light columns are unaffected. Both branches share the same let range_nulls = nulls.slice(range.start, len) slicing idiom; the slow path uses range_nulls.iter() for the def-level map and range_nulls.valid_indices().map(|i| i + range.start) for non_null_indices, with no unsafe. Output is byte-identical: the level values are unchanged, just produced via memset+scatter (fast path) or via the high-level NullBuffer iterators (slow path) instead of a manual BitIndexIterator walk.
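The memset+scatter idea can be sketched without the arrow crates. The following is a minimal illustration only, not the PR's code: the function names and the plain `&[u8]` LSB-first bitmap are stand-ins for `NullBuffer` and `valid_indices()`, and it assumes `max_def_level >= 1`.

```rust
/// Read bit `i` of an LSB-first validity bitmap (1 = valid, 0 = null).
fn is_valid(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] >> (i % 8)) & 1 == 1
}

/// Per-row baseline: map every validity bit to a definition level.
fn def_levels_per_row(bitmap: &[u8], len: usize, max_def_level: i16) -> Vec<i16> {
    (0..len)
        .map(|i| max_def_level - (!is_valid(bitmap, i) as i16))
        .collect()
}

/// Bulk-fill: write the null level everywhere, then overwrite only the
/// valid positions. Returns the levels and the indices of the valid rows.
/// (Hypothetical helper; stands in for the NullBuffer-based fast path.)
fn def_levels_bulk_fill(
    bitmap: &[u8],
    len: usize,
    max_def_level: i16,
) -> (Vec<i16>, Vec<usize>) {
    let mut levels: Vec<i16> = Vec::new();
    levels.resize(len, max_def_level - 1); // one memset-like fill

    let mut non_null_indices = Vec::new();
    for (byte_idx, &byte) in bitmap.iter().enumerate() {
        let mut bits = byte;
        while bits != 0 {
            // Index of the lowest set bit still pending in this byte.
            let i = byte_idx * 8 + bits.trailing_zeros() as usize;
            if i < len {
                levels[i] = max_def_level;
                non_null_indices.push(i);
            }
            bits &= bits - 1; // clear that bit
        }
    }
    (levels, non_null_indices)
}
```

For a 16-row bitmap with only rows 2 and 15 valid, both functions produce the same levels; the bulk-fill version touches two positions after the fill instead of branching on all 16.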

Threshold (BULK_FILL_MIN_LEN)

The bulk-fill fast path is gated on two conditions:

  • len >= BULK_FILL_MIN_LEN (currently 64). Per-call slice/popcount/iterator overhead only amortizes on sizable sub-ranges. List/struct paths call write_leaf many times with tiny ranges (avg list length 1-5); paying any per-call popcount there would regress them. A threshold sweep at T = {0, 16, 32, 64, 128, 256} on Ryzen 9 9950X shows the regression floor settles by T=32, and the choice of 64 gives ~12x margin over the average list length without losing the flat-primitive wins.
  • nulls.null_count() * 2 >= nulls.len(). The cached null_count() is O(1), so this check is free. We use the buffer-wide density as a heuristic for the sub-range; for full-array writes (the primary target, flat primitive columns) it's exact.

Even when the gate skips the fast path, evaluating it across high-frequency call sites (~10K calls in some list benchmarks) is a small structural cost (~1-2% on list-sparse cases). The wins on the targeted shapes (-35% sparse-primitive, -66% all-null primitive) far outweigh that. Reducing the cost further would require hoisting the decision into the caller.
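The two gate conditions reduce to a single cheap predicate. A sketch under stated assumptions: the function name and parameter names are illustrative, only the constant's name and value (64) come from the description above.

```rust
/// Minimum sub-range length before the bulk-fill path is considered.
const BULK_FILL_MIN_LEN: usize = 64;

/// Illustrative gate: `range_len` is the length of the sub-range being
/// written, while `null_count` and `buffer_len` describe the whole null
/// buffer (the cached, O(1) values), not the sub-range.
fn use_bulk_fill(range_len: usize, null_count: usize, buffer_len: usize) -> bool {
    range_len >= BULK_FILL_MIN_LEN && null_count * 2 >= buffer_len
}
```

A full-array write of a 99%-null, 1M-row column passes the gate, while a 5-element list sub-range of the same column does not, which is the shape the length condition is meant to protect.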

Are these changes tested?

Existing tests cover this path: cargo test -p parquet --features arrow --lib arrow_writer is green (136 tests, full of nulls and roundtrips); full cargo test -p parquet --features arrow green modulo the pre-existing PARQUET_TEST_DATA submodule failures (unrelated, same on main). cargo clippy -p parquet --features arrow --lib and cargo fmt --check clean. The unsafe get_unchecked_mut flagged in the original revision was replaced via NullBuffer::valid_indices(); the slow-path also dropped its unsafe value_unchecked for the same reason.

Are there any user-facing changes?

None.

Benchmarks

cargo bench -p parquet --bench arrow_writer, 1M rows × 7 nullable primitive columns, local Ryzen 9 9950X:

primitive_sparse_99pct_null/default   11.88 ms -> 9.13 ms   (-23%)   <- the case #9731 calls out
primitive_all_null/default             5.65 ms -> 2.33 ms   (-59%)   (subsumed by #9954's O(1) path if that lands first)
struct_sparse_99pct_null/default       5.67 ms -> 5.32 ms   (-6%)
struct_all_null/default                1.52 ms -> 1.31 ms   (-14%)
list_primitive_sparse_99pct_null, primitive (25% null), primitive_non_null, bool, string:  within noise (no regression)

The CI benchmark bot (GKE c4a-highmem-16, Neoverse-V2) on the post-fixup revision shows the same shape with stronger relative wins on the targeted cases:

primitive_all_null/default              2.47x (11.0ms -> 4.4ms)
primitive_sparse_99pct_null/default     1.60x (16.8ms -> 10.5ms)
primitive_all_null/{bloom_filter,cdc,parquet_2,zstd,zstd_parquet_2}    1.38x to 2.48x
primitive_sparse_99pct_null/{...}        1.28x to 1.59x
list_primitive*, list_primitive_sparse_99pct_null*:                    1.00x to 1.01x (within noise)

Microbench of the definition-level fill in isolation: 10.3x @ 100%-null, 8.6x @ 99%, 5.2x @ 90%, 1.9x @ 50%, 0.93x @ 10%, 0.81x @ 0%. Crossover ≈ 12-15% null, clean win above ~25%; the >= 50% null guard is conservative.
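As a rough check on the crossover figure, linear interpolation between the two measured points that bracket 1.0x (0.93x at 10% null, 1.9x at 50% null) lands at about 12.9% null. This is only a sketch: the helper name is made up for illustration, and the real speedup curve need not be linear between samples.

```rust
/// Linearly interpolate the x at which the speedup crosses 1.0x, given two
/// measured (null_percent, speedup) points on either side of the crossover.
fn crossover_null_pct(x0: f64, y0: f64, x1: f64, y1: f64) -> f64 {
    x0 + (1.0 - y0) * (x1 - x0) / (y1 - y0)
}
```

Calling `crossover_null_pct(10.0, 0.93, 50.0, 1.9)` gives a value at the low end of the 12-15% range quoted above.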

This is the materialization-cost half of #9731 (~30% of the 99%-null write); the walk-cost half, a run-length input to the level encoder so the column writer doesn't even iterate all num_rows levels, is the larger structural change #9653 is heading toward. This PR is deliberately small and isolated so it lands independently of and rebases cleanly under that work.

@github-actions github-actions bot added the parquet label (Changes to the parquet crate) May 13, 2026
Contributor

@HippoBaro HippoBaro left a comment


I love it! I think the code can be simplified a bit, and we can probably remove the unsafe by slicing the NullBuffer and using nulls.valid_indices()/nulls.iter() instead of constructing BitIndexIterator directly.

NullBuffer will do the BitIndexIterator business for you!

max_def_level - (!valid as i16)
}));
}
info.non_null_indices.reserve(len - valid_in_range);
Contributor


This is the null count right? We should reserve valid_in_range directly I think?

@etseidl
Contributor

etseidl commented May 13, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4437340557-25-r92nk 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing perf/arrow-parquet-null-heavy (7341370) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete



@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


group                                              main                                   perf_arrow-parquet-null-heavy
-----                                              ----                                   -----------------------------
bool/bloom_filter                                  1.00     13.2±0.06ms    19.0 MB/sec    1.00     13.2±0.04ms    18.9 MB/sec
bool/cdc                                           1.00     15.9±0.07ms    15.8 MB/sec    1.01     16.1±0.05ms    15.6 MB/sec
bool/default                                       1.00     11.1±0.05ms    22.6 MB/sec    1.00     11.1±0.03ms    22.6 MB/sec
bool/parquet_2                                     1.00     14.8±0.07ms    16.9 MB/sec    1.00     14.9±0.04ms    16.8 MB/sec
bool/zstd                                          1.00     11.6±0.07ms    21.6 MB/sec    1.00     11.6±0.03ms    21.5 MB/sec
bool/zstd_parquet_2                                1.00     15.2±0.07ms    16.5 MB/sec    1.01     15.4±0.03ms    16.3 MB/sec
bool_non_null/bloom_filter                         1.00      7.0±0.03ms    17.8 MB/sec    1.00      7.0±0.03ms    17.8 MB/sec
bool_non_null/cdc                                  1.00      6.9±0.03ms    18.2 MB/sec    1.00      6.9±0.04ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    29.2 MB/sec    1.00      4.3±0.02ms    29.2 MB/sec
bool_non_null/parquet_2                            1.01      9.1±0.04ms    13.7 MB/sec    1.00      9.1±0.04ms    13.8 MB/sec
bool_non_null/zstd                                 1.00      4.6±0.02ms    27.1 MB/sec    1.00      4.6±0.03ms    27.0 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.03ms    13.2 MB/sec    1.00      9.5±0.03ms    13.2 MB/sec
float_with_nans/bloom_filter                       1.01     95.1±0.31ms   147.3 MB/sec    1.00     94.2±0.32ms   148.7 MB/sec
float_with_nans/cdc                                1.01     82.5±0.34ms   169.7 MB/sec    1.00     82.1±0.19ms   170.6 MB/sec
float_with_nans/default                            1.00     74.9±0.23ms   186.9 MB/sec    1.00     74.6±0.19ms   187.7 MB/sec
float_with_nans/parquet_2                          1.01     96.3±0.43ms   145.4 MB/sec    1.00     95.2±0.34ms   147.1 MB/sec
float_with_nans/zstd                               1.00    112.8±0.24ms   124.1 MB/sec    1.00    112.6±0.21ms   124.3 MB/sec
float_with_nans/zstd_parquet_2                     1.01    133.9±0.72ms   104.6 MB/sec    1.00    132.6±0.30ms   105.6 MB/sec
list_primitive/bloom_filter                        1.00    339.7±7.94ms  1605.5 MB/sec    1.06    359.3±1.84ms  1517.9 MB/sec
list_primitive/cdc                                 1.00    362.0±1.69ms  1506.6 MB/sec    1.10    397.4±3.24ms  1372.4 MB/sec
list_primitive/default                             1.00    252.9±0.91ms     2.1 GB/sec    1.11    281.4±4.20ms  1938.1 MB/sec
list_primitive/parquet_2                           1.00    269.8±0.37ms  2021.2 MB/sec    1.12    301.2±0.68ms  1810.8 MB/sec
list_primitive/zstd                                1.00    504.5±0.74ms  1081.0 MB/sec    1.06    533.5±0.96ms  1022.3 MB/sec
list_primitive/zstd_parquet_2                      1.00    494.3±0.46ms  1103.4 MB/sec    1.07    526.5±1.89ms  1035.9 MB/sec
list_primitive_non_null/bloom_filter               1.00   438.7±13.58ms  1240.7 MB/sec    1.01   441.8±15.18ms  1232.0 MB/sec
list_primitive_non_null/cdc                        1.01    440.4±9.52ms  1235.9 MB/sec    1.00    436.1±5.38ms  1247.9 MB/sec
list_primitive_non_null/default                    1.01    305.8±5.62ms  1779.6 MB/sec    1.00    302.4±6.09ms  1799.7 MB/sec
list_primitive_non_null/parquet_2                  1.00    331.0±5.52ms  1644.3 MB/sec    1.02    337.5±5.48ms  1612.8 MB/sec
list_primitive_non_null/zstd                       1.00   725.1±12.58ms   750.6 MB/sec    1.00    728.1±9.90ms   747.5 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    677.4±2.22ms   803.5 MB/sec    1.03    700.8±6.19ms   776.6 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.3±0.09ms     3.2 GB/sec    1.09     12.2±0.11ms     3.0 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     22.8±0.08ms  1636.1 MB/sec    1.04     23.7±0.06ms  1579.1 MB/sec
list_primitive_sparse_99pct_null/default           1.00     10.9±0.08ms     3.3 GB/sec    1.08     11.7±0.07ms     3.1 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     10.9±0.09ms     3.3 GB/sec    1.10     12.0±0.17ms     3.0 GB/sec
list_primitive_sparse_99pct_null/zstd              1.00     12.8±0.07ms     2.8 GB/sec    1.06     13.7±0.07ms     2.7 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00     11.1±0.06ms     3.3 GB/sec    1.08     12.0±0.08ms     3.0 GB/sec
primitive/bloom_filter                             1.02    155.1±0.56ms   289.3 MB/sec    1.00    152.1±0.49ms   295.1 MB/sec
primitive/cdc                                      1.01    161.6±0.60ms   277.7 MB/sec    1.00    159.2±0.56ms   281.8 MB/sec
primitive/default                                  1.01    120.4±0.39ms   372.8 MB/sec    1.00    119.0±0.37ms   377.0 MB/sec
primitive/parquet_2                                1.00    135.8±0.69ms   330.3 MB/sec    1.00    136.3±4.52ms   329.1 MB/sec
primitive/zstd                                     1.01    150.0±0.41ms   299.2 MB/sec    1.00    148.3±0.29ms   302.5 MB/sec
primitive/zstd_parquet_2                           1.02    170.9±4.71ms   262.5 MB/sec    1.00    167.0±0.31ms   268.7 MB/sec
primitive_all_null/bloom_filter                    2.29     11.6±0.14ms     3.8 GB/sec    1.00      5.1±0.08ms     8.6 GB/sec
primitive_all_null/cdc                             1.39     30.6±0.33ms  1464.9 MB/sec    1.00     22.1±0.27ms  2033.6 MB/sec
primitive_all_null/default                         2.48     11.0±0.18ms     4.0 GB/sec    1.00      4.4±0.10ms     9.9 GB/sec
primitive_all_null/parquet_2                       2.50     11.0±0.25ms     4.0 GB/sec    1.00      4.4±0.06ms    10.0 GB/sec
primitive_all_null/zstd                            2.43     11.1±0.19ms     4.0 GB/sec    1.00      4.6±0.09ms     9.6 GB/sec
primitive_all_null/zstd_parquet_2                  2.45     11.0±0.13ms     4.0 GB/sec    1.00      4.5±0.08ms     9.8 GB/sec
primitive_non_null/bloom_filter                    1.09    118.7±1.39ms   370.7 MB/sec    1.00    109.2±0.37ms   402.8 MB/sec
primitive_non_null/cdc                             1.02     92.0±0.49ms   478.3 MB/sec    1.00     90.5±0.28ms   486.0 MB/sec
primitive_non_null/default                         1.02     69.5±0.34ms   633.5 MB/sec    1.00     68.1±0.27ms   645.8 MB/sec
primitive_non_null/parquet_2                       1.02     91.3±0.20ms   481.7 MB/sec    1.00     89.3±0.31ms   492.8 MB/sec
primitive_non_null/zstd                            1.10    107.8±0.22ms   408.1 MB/sec    1.00     98.2±0.26ms   448.0 MB/sec
primitive_non_null/zstd_parquet_2                  1.02    132.4±1.88ms   332.4 MB/sec    1.00    129.7±1.07ms   339.1 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.50     18.6±0.15ms     2.4 GB/sec    1.00     12.4±0.18ms     3.5 GB/sec
primitive_sparse_99pct_null/cdc                    1.27     37.4±0.26ms  1198.4 MB/sec    1.00     29.6±0.27ms  1516.5 MB/sec
primitive_sparse_99pct_null/default                1.58     17.0±0.05ms     2.6 GB/sec    1.00     10.7±0.08ms     4.1 GB/sec
primitive_sparse_99pct_null/parquet_2              1.57     16.9±0.08ms     2.6 GB/sec    1.00     10.8±0.08ms     4.1 GB/sec
primitive_sparse_99pct_null/zstd                   1.43     20.3±0.12ms     2.2 GB/sec    1.00     14.2±0.09ms     3.1 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.47     18.9±0.14ms     2.3 GB/sec    1.00     12.9±0.10ms     3.4 GB/sec
string/bloom_filter                                1.04   246.3±27.74ms     2.1 GB/sec    1.00   236.1±24.11ms     2.2 GB/sec
string/cdc                                         1.02    225.8±8.14ms     2.3 GB/sec    1.00    221.4±3.72ms     2.3 GB/sec
string/default                                     1.20   150.4±26.78ms     3.4 GB/sec    1.00   125.7±12.44ms     4.1 GB/sec
string/parquet_2                                   1.16    129.0±0.30ms     4.0 GB/sec    1.00    111.0±7.31ms     4.6 GB/sec
string/zstd                                        1.03    433.7±9.11ms  1208.8 MB/sec    1.00    420.9±1.71ms  1245.4 MB/sec
string/zstd_parquet_2                              1.00    397.1±0.47ms  1320.1 MB/sec    1.00    396.5±1.96ms  1322.2 MB/sec
string_and_binary_view/bloom_filter                1.04     67.5±0.28ms   477.6 MB/sec    1.00     65.0±0.20ms   495.9 MB/sec
string_and_binary_view/cdc                         1.02     59.9±0.19ms   538.4 MB/sec    1.00     58.9±0.11ms   547.4 MB/sec
string_and_binary_view/default                     1.01     49.2±0.17ms   655.5 MB/sec    1.00     48.5±0.12ms   664.7 MB/sec
string_and_binary_view/parquet_2                   1.02     60.2±0.17ms   535.8 MB/sec    1.00     59.2±0.18ms   544.4 MB/sec
string_and_binary_view/zstd                        1.01     85.7±0.20ms   376.1 MB/sec    1.00     85.1±0.13ms   378.8 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     74.0±0.18ms   435.8 MB/sec    1.00     73.8±0.14ms   436.9 MB/sec
string_dictionary/bloom_filter                     1.19    110.8±1.11ms     2.3 GB/sec    1.00     93.0±0.64ms     2.8 GB/sec
string_dictionary/cdc                              1.42     77.6±1.78ms     3.3 GB/sec    1.00     54.5±0.38ms     4.7 GB/sec
string_dictionary/default                          1.29     64.9±1.71ms     4.0 GB/sec    1.00     50.3±0.42ms     5.1 GB/sec
string_dictionary/parquet_2                        1.26     67.7±0.22ms     3.8 GB/sec    1.00     53.9±0.40ms     4.8 GB/sec
string_dictionary/zstd                             1.03    219.1±2.34ms  1205.4 MB/sec    1.00    211.8±0.44ms  1247.1 MB/sec
string_dictionary/zstd_parquet_2                   1.01    200.4±0.45ms  1318.3 MB/sec    1.00    199.1±0.28ms  1326.9 MB/sec
string_non_null/bloom_filter                       1.09   279.9±17.02ms  1871.8 MB/sec    1.00   256.2±14.57ms  2045.3 MB/sec
string_non_null/cdc                                1.00   277.0±12.86ms  1891.9 MB/sec    1.00   277.0±12.19ms  1892.0 MB/sec
string_non_null/default                            1.03   149.1±15.28ms     3.4 GB/sec    1.00   144.7±20.26ms     3.5 GB/sec
string_non_null/parquet_2                          1.11    147.3±9.72ms     3.5 GB/sec    1.00    132.3±4.12ms     3.9 GB/sec
string_non_null/zstd                               1.03   574.1±12.24ms   912.7 MB/sec    1.00   557.0±19.22ms   940.8 MB/sec
string_non_null/zstd_parquet_2                     1.01   529.2±12.48ms   990.1 MB/sec    1.00    523.2±6.02ms  1001.6 MB/sec
struct_all_null/bloom_filter                       1.01      2.6±0.00ms     6.2 GB/sec    1.00      2.5±0.00ms     6.2 GB/sec
struct_all_null/cdc                                1.05      9.7±0.08ms  1655.1 MB/sec    1.00      9.3±0.15ms  1732.4 MB/sec
struct_all_null/default                            1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.2±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.2±0.00ms     7.0 GB/sec
struct_all_null/zstd                               1.00      2.3±0.00ms     6.8 GB/sec    1.00      2.3±0.00ms     6.9 GB/sec
struct_all_null/zstd_parquet_2                     1.00      2.3±0.00ms     6.9 GB/sec    1.00      2.3±0.00ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.00     46.5±0.13ms   344.2 MB/sec    1.04     48.3±0.14ms   331.3 MB/sec
struct_non_null/cdc                                1.00     45.5±0.15ms   351.5 MB/sec    1.00     45.5±0.18ms   351.9 MB/sec
struct_non_null/default                            1.00     32.2±0.13ms   497.3 MB/sec    1.00     32.0±0.14ms   499.4 MB/sec
struct_non_null/parquet_2                          1.00     41.0±0.12ms   390.5 MB/sec    1.00     41.2±0.12ms   388.6 MB/sec
struct_non_null/zstd                               1.00     40.9±0.13ms   391.0 MB/sec    1.00     40.8±0.11ms   392.6 MB/sec
struct_non_null/zstd_parquet_2                     1.00     55.0±0.13ms   291.0 MB/sec    1.01     55.6±0.13ms   287.7 MB/sec
struct_sparse_99pct_null/bloom_filter              1.02      7.8±0.05ms     2.0 GB/sec    1.00      7.6±0.08ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.08     15.5±0.13ms  1040.1 MB/sec    1.00     14.4±0.09ms  1118.5 MB/sec
struct_sparse_99pct_null/default                   1.00      7.0±0.04ms     2.2 GB/sec    1.00      7.1±0.05ms     2.2 GB/sec
struct_sparse_99pct_null/parquet_2                 1.01      7.1±0.05ms     2.2 GB/sec    1.00      7.0±0.04ms     2.2 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.4±0.05ms  1919.9 MB/sec    1.00      8.4±0.05ms  1917.6 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.8±0.07ms     2.0 GB/sec    1.00      7.8±0.06ms     2.0 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1975.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1897.0s
CPU sys 75.2s
Peak spill 0 B

branch

Metric Value
Wall time 1955.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1895.5s
CPU sys 58.3s
Peak spill 0 B


…umns

When writing a nullable leaf (primitive) Arrow array, `write_leaf` built the
definition-level buffer one element at a time, mapping each null bit to a
level. For columns that are mostly null this does ~num_rows of branchy work
and allocates a num_rows level buffer even though almost every level is the
same value.

Add a length-gated bulk-fill path: when the column is majority-null and the
sub-range is large enough to amortize the gate's per-call cost, build the
definition levels by bulk-filling the null level (a vectorized memset) and
overwriting only the non-null positions found via `NullBuffer::valid_indices()`.
The per-row path is kept for non-majority-null arrays and for the small
sub-ranges produced by list/struct write paths, so those shapes are not
regressed.

Contributes to apache#9731. Complements apache#9954's all-null fast path by covering the
sparse (mostly-but-not-entirely-null) case it does not handle.

Threshold sweep on Ryzen 9 9950X (parquet/arrow_writer benches, /default
variant, vs main):

  T   primitive  list_primitive  primitive_sparse  list_primitive_sparse
  ----------------------------------------------------------------------
  0    -3.0%      +2.6%           -36.1%            +7.8%
  16   -1.4%      +1.8%           -34.8%            +2.8%
  32   -1.1%      -0.1%           -35.1%            +1.7%
  64   -1.1%      +0.7%           -34.5%            +1.7%   <- chosen
  128  -1.0%      +1.5%           -35.1%            +2.4%
  256  -1.4%      +1.4%           -35.1%            +2.7%

T=0 reproduces the per-call slice/popcount regression on
list_primitive_sparse_99pct_null (+7.8%, matches the criterion bot's
original measurement on the unguarded version). The +1.7% floor at T>=32
is the structural cost of evaluating the gate itself across ~10K small
write_leaf calls in the list path; reducing it further would require
hoisting the decision into the caller. T=64 matches T=32 on every shape
and gives ~12x margin over the avg list length of ~5.

Final benchmarks vs main on Ryzen 9 9950X (T=64, /default variants):

  primitive/default                            -1.5%
  primitive_non_null/default                   -2.8%
  primitive_sparse_99pct_null/default         -35.1%
  primitive_all_null/default                  -66.4%
  list_primitive/default                       +1.8%  (within noise)
  list_primitive_non_null/default              -0.7%  (no change, p=0.45)
  list_primitive_sparse_99pct_null/default     +3.0%  (gate-check floor)
  struct_sparse_99pct_null/default             -4.9%
  bool/default                                 +2.2%
@RyanJamesStewart RyanJamesStewart force-pushed the perf/arrow-parquet-null-heavy branch from 7341370 to d825a1b Compare May 13, 2026 11:32
@RyanJamesStewart
Author

Thanks — applied both. Switched the fast path to nulls.valid_indices() to drop the unsafe, and fixed the reserve to valid_in_range.

Also caught a regression the benchmark bot surfaced that I had missed in my own measurement: I hadn't benchmarked the list paths. list_primitive and list_primitive_sparse_99pct_null were ~6-12% slower because the per-range count_set_bits_offset and the under-allocated reserve(len - valid_in_range) were both being paid on every write_leaf call from write_list to write_non_null_slice, where call counts are high (~10K) and per-call ranges are tiny (~5 elements avg). The bulk-fill payoff doesn't apply at that range size.

Added a length gate on entering the new path: len >= 64 && nulls.null_count() * 2 >= nulls.len(). The null_count() check uses the cached field (O(1)) so there's no per-range popcount when the global density is low. I swept T = {0, 16, 32, 64, 128, 256} on list_primitive_sparse_99pct_null to justify the choice:

  T    list_primitive_sparse_99pct_null
  -------------------------------------
  0    +7.8%   (reproduces the bot's original measurement)
  16   +2.8%
  32   +1.7%
  64   +1.7%   <- chosen
  128  +2.4%
  256  +2.7%

Breakeven for the list-sparse case is between T=0 and T=32. The +1.7% floor at T≥32 is the structural cost of evaluating the gate across ~10K calls, not the fast-path execution; reducing it further would require hoisting the decision into write_list. T=64 matches T=32 on every shape with 12x margin over the avg list length of ~5 and keeps the wins intact: −1.5% on primitive, −35.1% on primitive_sparse_99pct_null, −66.4% on primitive_all_null vs main on Ryzen 9 9950X.

Re-triggering the bench.

@RyanJamesStewart
Author

run benchmark arrow_writer

@RyanJamesStewart
Author

RyanJamesStewart commented May 13, 2026

Pushed a fmt fixup (the BitIndexIterator::new(...) call on the new path was wrapped wider than rustfmt wanted). Rust workflow should go green now.

On the bench: the run on 7341370 flagged list_primitive / list_primitive_sparse_99pct_null at +8-12%. That was the per-range count_set_bits_offset plus the under-reserved scatter being paid on every write_leaf call from write_list to write_non_null_slice, where ranges average ~5 elements. The follow-up commit gates the new path on len >= 64 && null_count() * 2 >= len (using the cached null_count so there's no per-range popcount when global density is low), which on my sweep cleared the list regression to ~+1.7% (structural cost of evaluating the gate across ~10K calls; further reduction would need hoisting the decision into write_list).

On the bench re-trigger: would be good to have the post-gate numbers on record before further review, if anyone on the whitelist is willing to re-run run benchmark arrow_writer once CI goes green.

Comment on lines +688 to +698
    let bits = nulls.inner();
    info.def_levels.extend_from_iter(range.clone().map(|i| {
        // Safety: range.end was asserted to be in bounds earlier
        let valid = unsafe { bits.value_unchecked(i) };
        max_def_level - (!valid as i16)
    }));
    info.non_null_indices.reserve(len);
    info.non_null_indices.extend(
        BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len)
            .map(|i| i + range.start),
    );
Contributor


Suggested change
-   let bits = nulls.inner();
-   info.def_levels.extend_from_iter(range.clone().map(|i| {
-       // Safety: range.end was asserted to be in bounds earlier
-       let valid = unsafe { bits.value_unchecked(i) };
-       max_def_level - (!valid as i16)
-   }));
-   info.non_null_indices.reserve(len);
-   info.non_null_indices.extend(
-       BitIndexIterator::new(bits.inner(), bits.offset() + range.start, len)
-           .map(|i| i + range.start),
-   );
+   info.def_levels.extend_from_iter(
+       nulls.iter().map(|valid| max_def_level - (!valid as i16)),
+   );

No?

Author

Want to make sure I'm reading the suggestion right. nulls here is the full-array NullBuffer (line 953, unsliced; we just assert range.end <= nulls.len()), so nulls.iter() would produce nulls.len() levels rather than len. Did you have in mind slicing first the way the bulk-fill branch a few lines up does, i.e. let range_nulls = nulls.slice(range.start, len) and then iterating range_nulls? That would also let non_null_indices use range_nulls.valid_indices().map(|i| i + range.start), which I think still needs to happen for the downstream value path. Or were you proposing a broader restructure where non_null_indices isn't populated on this branch?
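To make the slicing concern concrete, here is a minimal std-only sketch. It models the validity mask as a `&[bool]` rather than a real NullBuffer, and the function name `levels_for_range` is hypothetical. It shows that slicing first yields exactly `len` levels, and that indices taken from the slice have to be re-based by `range.start`.

```rust
use std::ops::Range;

/// Hypothetical stand-in for the slow path: `validity` plays the role of the
/// full-array null mask (true = valid). Returns (def_levels, non_null_indices).
fn levels_for_range(
    validity: &[bool],
    range: Range<usize>,
    max_def_level: i16,
) -> (Vec<i16>, Vec<usize>) {
    // Slice first, analogous to `nulls.slice(range.start, len)`.
    let range_validity = &validity[range.clone()];

    // One level per element of the slice, not of the whole array.
    let def_levels: Vec<i16> = range_validity
        .iter()
        .map(|&valid| max_def_level - (!valid as i16))
        .collect();

    // Positions are relative to the slice, so shift them back by range.start.
    let non_null_indices: Vec<usize> = range_validity
        .iter()
        .enumerate()
        .filter_map(|(i, &valid)| if valid { Some(i + range.start) } else { None })
        .collect();

    (def_levels, non_null_indices)
}

fn main() {
    let validity = [true, false, false, true, false, false];
    let (levels, indices) = levels_for_range(&validity, 2..5, 1);
    // Three levels for a three-element range; the valid slot is array index 3.
    assert_eq!(levels, vec![0, 1, 0]);
    assert_eq!(indices, vec![3]);
}
```

Iterating the unsliced mask in this model would instead produce six levels for a three-element range, which is the mismatch the comment above asks about.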

Contributor

> Did you have in mind slicing first the way the bulk-fill branch a few lines up does, i.e. let range_nulls = nulls.slice(range.start, len) and then iterating range_nulls? That would also let non_null_indices use range_nulls.valid_indices().map(|i| i + range.start), which I think still needs to happen for the downstream value path

I think this.

@etseidl
Contributor

etseidl commented May 13, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4442664982-36-m8w5r 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing perf/arrow-parquet-null-heavy (a717169) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              main                                   perf_arrow-parquet-null-heavy
-----                                              ----                                   -----------------------------
bool/bloom_filter                                  1.00     13.1±0.04ms    19.1 MB/sec    1.01     13.3±0.18ms    18.8 MB/sec
bool/cdc                                           1.00     15.8±0.05ms    15.9 MB/sec    1.03     16.2±0.10ms    15.5 MB/sec
bool/default                                       1.00     11.0±0.04ms    22.8 MB/sec    1.01     11.1±0.11ms    22.5 MB/sec
bool/parquet_2                                     1.00     14.7±0.05ms    17.0 MB/sec    1.01     14.9±0.11ms    16.8 MB/sec
bool/zstd                                          1.00     11.5±0.04ms    21.7 MB/sec    1.01     11.6±0.11ms    21.5 MB/sec
bool/zstd_parquet_2                                1.00     15.1±0.04ms    16.5 MB/sec    1.01     15.2±0.12ms    16.4 MB/sec
bool_non_null/bloom_filter                         1.00      7.0±0.03ms    17.8 MB/sec    1.01      7.1±0.03ms    17.6 MB/sec
bool_non_null/cdc                                  1.00      6.8±0.03ms    18.3 MB/sec    1.01      6.9±0.06ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.03ms    29.2 MB/sec    1.01      4.3±0.03ms    28.9 MB/sec
bool_non_null/parquet_2                            1.01      9.1±0.04ms    13.8 MB/sec    1.00      9.0±0.04ms    13.9 MB/sec
bool_non_null/zstd                                 1.00      4.6±0.03ms    27.1 MB/sec    1.01      4.7±0.02ms    26.8 MB/sec
bool_non_null/zstd_parquet_2                       1.01      9.5±0.03ms    13.2 MB/sec    1.00      9.4±0.04ms    13.3 MB/sec
float_with_nans/bloom_filter                       1.00     93.6±1.98ms   149.6 MB/sec    1.00     93.3±0.73ms   150.1 MB/sec
float_with_nans/cdc                                1.00     81.6±0.54ms   171.5 MB/sec    1.00     82.0±0.26ms   170.7 MB/sec
float_with_nans/default                            1.01     75.1±1.86ms   186.5 MB/sec    1.00     74.3±0.21ms   188.4 MB/sec
float_with_nans/parquet_2                          1.00     95.1±1.97ms   147.2 MB/sec    1.00     95.1±0.41ms   147.2 MB/sec
float_with_nans/zstd                               1.00    112.6±1.84ms   124.3 MB/sec    1.00    112.2±0.22ms   124.8 MB/sec
float_with_nans/zstd_parquet_2                     1.00    132.3±1.93ms   105.8 MB/sec    1.00    132.5±0.52ms   105.7 MB/sec
list_primitive/bloom_filter                        1.01    325.9±2.32ms  1673.6 MB/sec    1.00    322.7±0.86ms  1690.1 MB/sec
list_primitive/cdc                                 1.00    357.4±2.67ms  1525.8 MB/sec    1.00    356.5±0.57ms  1529.8 MB/sec
list_primitive/default                             1.00    247.1±0.82ms     2.2 GB/sec    1.00    247.5±1.17ms     2.2 GB/sec
list_primitive/parquet_2                           1.00    268.2±0.57ms  2033.1 MB/sec    1.00    267.8±0.54ms  2036.1 MB/sec
list_primitive/zstd                                1.00    498.2±2.16ms  1094.7 MB/sec    1.00    496.4±1.08ms  1098.6 MB/sec
list_primitive/zstd_parquet_2                      1.00    490.7±2.51ms  1111.4 MB/sec    1.00    490.4±0.44ms  1112.0 MB/sec
list_primitive_non_null/bloom_filter               1.00    419.3±4.15ms  1298.0 MB/sec    1.00    421.3±3.66ms  1291.9 MB/sec
list_primitive_non_null/cdc                        1.01    438.2±7.97ms  1242.0 MB/sec    1.00    435.3±6.98ms  1250.3 MB/sec
list_primitive_non_null/default                    1.00    286.9±3.47ms  1897.0 MB/sec    1.00    288.2±3.14ms  1888.3 MB/sec
list_primitive_non_null/parquet_2                  1.01   308.4±13.05ms  1764.8 MB/sec    1.00   305.1±15.36ms  1783.9 MB/sec
list_primitive_non_null/zstd                       1.01    715.8±4.58ms   760.3 MB/sec    1.00    707.2±9.05ms   769.6 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    683.4±4.38ms   796.3 MB/sec    1.01    687.0±0.46ms   792.2 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.1±0.22ms     3.3 GB/sec    1.00     11.1±0.05ms     3.3 GB/sec
list_primitive_sparse_99pct_null/cdc               1.01     22.4±0.10ms  1665.6 MB/sec    1.00     22.3±0.08ms  1675.3 MB/sec
list_primitive_sparse_99pct_null/default           1.00     10.8±0.05ms     3.4 GB/sec    1.00     10.8±0.05ms     3.4 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     10.7±0.04ms     3.4 GB/sec    1.01     10.8±0.04ms     3.4 GB/sec
list_primitive_sparse_99pct_null/zstd              1.00     12.5±0.04ms     2.9 GB/sec    1.01     12.6±0.04ms     2.9 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00     10.8±0.03ms     3.4 GB/sec    1.01     10.9±0.03ms     3.3 GB/sec
primitive/bloom_filter                             1.00    148.6±0.67ms   301.9 MB/sec    1.00    149.2±0.68ms   300.7 MB/sec
primitive/cdc                                      1.00    158.8±0.66ms   282.6 MB/sec    1.00    158.4±0.65ms   283.4 MB/sec
primitive/default                                  1.00    117.9±0.40ms   380.8 MB/sec    1.00    118.1±0.63ms   379.8 MB/sec
primitive/parquet_2                                1.00    132.4±0.33ms   339.0 MB/sec    1.01    133.1±0.58ms   337.2 MB/sec
primitive/zstd                                     1.00    146.7±0.34ms   305.9 MB/sec    1.01    148.1±0.77ms   303.0 MB/sec
primitive/zstd_parquet_2                           1.00    165.5±0.58ms   271.1 MB/sec    1.00    166.3±0.52ms   269.9 MB/sec
primitive_all_null/bloom_filter                    2.30     11.6±0.26ms     3.8 GB/sec    1.00      5.1±0.17ms     8.7 GB/sec
primitive_all_null/cdc                             1.38     30.6±0.51ms  1467.3 MB/sec    1.00     22.2±0.32ms  2023.7 MB/sec
primitive_all_null/default                         2.47     11.0±0.25ms     4.0 GB/sec    1.00      4.4±0.13ms     9.9 GB/sec
primitive_all_null/parquet_2                       2.48     11.0±0.26ms     4.0 GB/sec    1.00      4.4±0.11ms     9.9 GB/sec
primitive_all_null/zstd                            2.46     11.1±0.22ms     3.9 GB/sec    1.00      4.5±0.08ms     9.7 GB/sec
primitive_all_null/zstd_parquet_2                  2.45     11.0±0.22ms     4.0 GB/sec    1.00      4.5±0.08ms     9.8 GB/sec
primitive_non_null/bloom_filter                    1.00    111.2±1.30ms   395.8 MB/sec    1.00    111.0±0.68ms   396.3 MB/sec
primitive_non_null/cdc                             1.00     90.0±0.55ms   489.1 MB/sec    1.00     90.1±0.34ms   488.5 MB/sec
primitive_non_null/default                         1.01     67.4±0.26ms   652.9 MB/sec    1.00     66.9±0.18ms   658.0 MB/sec
primitive_non_null/parquet_2                       1.00     89.2±0.30ms   493.2 MB/sec    1.00     88.8±0.18ms   495.6 MB/sec
primitive_non_null/zstd                            1.00    104.6±0.28ms   420.7 MB/sec    1.00    104.1±0.29ms   422.6 MB/sec
primitive_non_null/zstd_parquet_2                  1.06    129.6±1.68ms   339.6 MB/sec    1.00    122.8±0.63ms   358.4 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.54     18.3±0.18ms     2.4 GB/sec    1.00     11.9±0.09ms     3.7 GB/sec
primitive_sparse_99pct_null/cdc                    1.28     37.1±0.56ms  1208.7 MB/sec    1.00     29.1±0.33ms  1544.3 MB/sec
primitive_sparse_99pct_null/default                1.60     16.8±0.08ms     2.6 GB/sec    1.00     10.5±0.04ms     4.2 GB/sec
primitive_sparse_99pct_null/parquet_2              1.59     16.8±0.06ms     2.6 GB/sec    1.00     10.5±0.05ms     4.2 GB/sec
primitive_sparse_99pct_null/zstd                   1.46     20.1±0.08ms     2.2 GB/sec    1.00     13.8±0.05ms     3.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.50     18.6±0.08ms     2.4 GB/sec    1.00     12.5±0.06ms     3.5 GB/sec
string/bloom_filter                                1.11   228.8±25.21ms     2.2 GB/sec    1.00   206.6±16.03ms     2.5 GB/sec
string/cdc                                         1.01    221.6±5.74ms     2.3 GB/sec    1.00    219.6±4.18ms     2.3 GB/sec
string/default                                     1.16   142.7±24.98ms     3.6 GB/sec    1.00   123.6±15.87ms     4.1 GB/sec
string/parquet_2                                   1.01    125.5±0.27ms     4.1 GB/sec    1.00    123.9±0.98ms     4.1 GB/sec
string/zstd                                        1.00    423.9±2.50ms  1236.6 MB/sec    1.09   460.9±11.33ms  1137.4 MB/sec
string/zstd_parquet_2                              1.00    394.3±0.94ms  1329.5 MB/sec    1.03    404.9±8.83ms  1294.9 MB/sec
string_and_binary_view/bloom_filter                1.00     63.9±0.27ms   504.7 MB/sec    1.01     64.5±0.38ms   499.6 MB/sec
string_and_binary_view/cdc                         1.00     58.4±0.15ms   551.9 MB/sec    1.00     58.4±0.30ms   551.9 MB/sec
string_and_binary_view/default                     1.00     48.0±0.11ms   671.7 MB/sec    1.00     48.0±0.22ms   671.4 MB/sec
string_and_binary_view/parquet_2                   1.00     58.8±0.15ms   548.4 MB/sec    1.00     59.0±0.29ms   546.9 MB/sec
string_and_binary_view/zstd                        1.00     84.5±0.14ms   381.8 MB/sec    1.00     84.9±0.25ms   380.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     72.7±0.14ms   443.5 MB/sec    1.00     72.8±0.28ms   442.9 MB/sec
string_dictionary/bloom_filter                     1.00     89.0±0.77ms     2.9 GB/sec    1.02     90.4±0.27ms     2.9 GB/sec
string_dictionary/cdc                              1.56     84.3±0.74ms     3.1 GB/sec    1.00     53.9±0.45ms     4.8 GB/sec
string_dictionary/default                          1.00     48.7±0.35ms     5.3 GB/sec    1.01     49.2±0.22ms     5.2 GB/sec
string_dictionary/parquet_2                        1.00     53.8±0.21ms     4.8 GB/sec    1.01     54.4±0.24ms     4.7 GB/sec
string_dictionary/zstd                             1.00    208.6±0.77ms  1266.5 MB/sec    1.00    209.2±0.60ms  1262.6 MB/sec
string_dictionary/zstd_parquet_2                   1.00    197.5±0.13ms  1337.1 MB/sec    1.01    198.9±0.30ms  1327.7 MB/sec
string_non_null/bloom_filter                       1.00   249.4±14.94ms     2.1 GB/sec    1.00   250.3±11.34ms     2.0 GB/sec
string_non_null/cdc                                1.00    268.6±9.27ms  1951.1 MB/sec    1.01    270.9±9.06ms  1934.4 MB/sec
string_non_null/default                            1.00   125.7±12.66ms     4.1 GB/sec    1.10   138.9±11.90ms     3.7 GB/sec
string_non_null/parquet_2                          1.00   140.0±11.59ms     3.7 GB/sec    1.01    141.0±7.81ms     3.6 GB/sec
string_non_null/zstd                               1.00    531.0±2.34ms   986.8 MB/sec    1.05    559.7±6.69ms   936.2 MB/sec
string_non_null/zstd_parquet_2                     1.00    505.0±2.21ms  1037.6 MB/sec    1.03    518.6±5.39ms  1010.5 MB/sec
struct_all_null/bloom_filter                       1.00      2.5±0.00ms     6.3 GB/sec    1.00      2.5±0.00ms     6.3 GB/sec
struct_all_null/cdc                                1.00      9.8±0.15ms  1638.0 MB/sec    1.01      9.9±0.09ms  1627.3 MB/sec
struct_all_null/default                            1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.2±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.3±0.01ms     7.0 GB/sec
struct_all_null/zstd                               1.00      2.3±0.00ms     6.9 GB/sec    1.00      2.3±0.00ms     6.9 GB/sec
struct_all_null/zstd_parquet_2                     1.00      2.3±0.00ms     6.9 GB/sec    1.00      2.3±0.01ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.00     46.0±0.17ms   347.8 MB/sec    1.01     46.4±0.15ms   344.7 MB/sec
struct_non_null/cdc                                1.00     45.2±0.15ms   353.8 MB/sec    1.00     45.2±0.18ms   353.8 MB/sec
struct_non_null/default                            1.00     31.8±0.09ms   502.9 MB/sec    1.00     31.9±0.11ms   501.7 MB/sec
struct_non_null/parquet_2                          1.00     40.6±0.48ms   394.5 MB/sec    1.00     40.7±0.12ms   393.5 MB/sec
struct_non_null/zstd                               1.00     40.6±0.11ms   394.5 MB/sec    1.00     40.6±0.09ms   393.8 MB/sec
struct_non_null/zstd_parquet_2                     1.00     54.5±0.13ms   293.6 MB/sec    1.00     54.7±0.12ms   292.4 MB/sec
struct_sparse_99pct_null/bloom_filter              1.01      7.5±0.09ms     2.1 GB/sec    1.00      7.4±0.03ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.07     15.3±0.12ms  1050.7 MB/sec    1.00     14.3±0.08ms  1128.9 MB/sec
struct_sparse_99pct_null/default                   1.01      6.9±0.09ms     2.3 GB/sec    1.00      6.9±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/parquet_2                 1.01      7.0±0.08ms     2.3 GB/sec    1.00      6.9±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.01      8.3±0.09ms  1944.3 MB/sec    1.00      8.3±0.02ms  1954.2 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.01      7.7±0.09ms     2.0 GB/sec    1.00      7.6±0.02ms     2.1 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1940.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1880.3s
CPU sys 56.6s
Peak spill 0 B

branch

Metric Value
Wall time 1935.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1876.7s
CPU sys 55.7s
Peak spill 0 B

Contributor

@etseidl etseidl left a comment

Thanks @RyanJamesStewart, this looks like a nice improvement. Just a few nits.

Comment on lines +648 to +667
// Bulk-fill fast path. Gated on:
// - `len >= 64`: per-call slice/popcount/iterator overhead only
// amortizes on sizable sub-ranges. List/struct paths call
// write_leaf many times with tiny ranges (avg list length 1-5);
// paying any per-call popcount there would regress them. A
// threshold sweep at T={0,16,32,64,128,256} on Ryzen 9 9950X
// shows the regression floor settles by T=32 and the choice of
// 64 gives ~12x margin over avg list length without losing the
// flat-primitive wins.
// - `nulls.null_count() * 2 >= nulls.len()`: cached `null_count()`
// is O(1), so this check is free. We use the buffer-level density
// as a heuristic for the sub-range; for full-array writes (the
// primary target — flat primitive columns) it's exact.
// Note: even when this gate skips the fast path, evaluating the gate
// itself across high-frequency call sites (~10K calls in some list
// benchmarks) is a small structural cost (~+1-2% on list-sparse
// cases). It's the price of having any gate at all on this hot path;
// reducing it further would require hoisting the decision into the
// caller. The wins on the targeted shapes (-35% sparse-primitive,
// -66% all-null primitive) far outweigh it.
Contributor

Please refer to https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md#ai-generated-submissions

I think this exposition is better in the PR than the code.

// reducing it further would require hoisting the decision into the
// caller. The wins on the targeted shapes (-35% sparse-primitive,
// -66% all-null primitive) far outweigh it.
if len >= 64 && nulls.null_count() * 2 >= nulls.len() {
Contributor

Please replace 64 with a constant. Documentation for the constant can point to this PR for an explanation as to how 64 was arrived at.

- Extract the `64` threshold into `BULK_FILL_MIN_LEN` const with a
  doc comment pointing to this PR for the sweep rationale.
- Trim the in-code comment block; threshold-sweep rationale moves to
  the PR description per CONTRIBUTING.md guidance on AI-generated
  submissions.
- Refactor the slow-path else branch to mirror the fast-path slice
  idiom: `let range_nulls = nulls.slice(range.start, len);` then
  `range_nulls.iter()` for `def_levels` and
  `range_nulls.valid_indices().map(|i| i + range.start)` for
  `non_null_indices`. Drops the `unsafe value_unchecked` and the
  manual `BitIndexIterator::new` offset arithmetic.
- Drop the now-unused `BitIndexIterator` import.

Behavior unchanged. `cargo test -p parquet --features arrow --lib
arrow_writer` green (136 tests); clippy and fmt clean.
@RyanJamesStewart RyanJamesStewart force-pushed the perf/arrow-parquet-null-heavy branch from a717169 to 4eaed80 Compare May 13, 2026 21:37
@RyanJamesStewart
Author

Pushed v2. Four changes from review:

  1. Slow-path else branch refactored to slice once and use range_nulls.iter() for def_levels plus range_nulls.valid_indices().map(|i| i + range.start) for non_null_indices, mirroring the fast-path idiom. Drops the unsafe value_unchecked and the manual BitIndexIterator::new offset arithmetic. BitIndexIterator import removed.
  2. 64 extracted into BULK_FILL_MIN_LEN const with a doc comment pointing to this PR for the sweep rationale.
  3. In-code comment block compressed to four lines; threshold-sweep exposition moved into the PR description under a new Threshold (BULK_FILL_MIN_LEN) section.
  4. AI-usage disclosure added near the top of the PR description per the CONTRIBUTING guidance.

Tests, clippy, and fmt clean locally; CI is in flight. The fast-path branch and gate are unchanged, so the targeted bench shape (the latest bot run showed primitive_all_null 2.47x and primitive_sparse_99pct_null 1.60x, list paths back in noise) should hold under the refactor.
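For readers following the thread, the two branches and the extracted constant can be sketched as below. This is a simplified std-only model, not the PR's code: validity is a `&[bool]` instead of a NullBuffer, the names `per_row`, `bulk_fill`, and `write_levels` are hypothetical, and the null count is recomputed here whereas the real gate reads a cached value. Only `BULK_FILL_MIN_LEN` and its value of 64 come from the discussion.

```rust
use std::ops::Range;

/// Minimum sub-range length for the bulk-fill path (value from the PR).
const BULK_FILL_MIN_LEN: usize = 64;

/// Per-row path: map every validity bit to a definition level.
fn per_row(validity: &[bool], range: Range<usize>, max_def: i16) -> (Vec<i16>, Vec<usize>) {
    let slice = &validity[range.clone()];
    let levels: Vec<i16> = slice.iter().map(|&v| max_def - (!v as i16)).collect();
    let indices: Vec<usize> = slice
        .iter()
        .enumerate()
        .filter_map(|(i, &v)| if v { Some(i + range.start) } else { None })
        .collect();
    (levels, indices)
}

/// Bulk-fill path: fill with the null level, then overwrite the valid slots.
/// This model scans every bool; the real code walks only the set bits.
fn bulk_fill(validity: &[bool], range: Range<usize>, max_def: i16) -> (Vec<i16>, Vec<usize>) {
    let slice = &validity[range.clone()];
    let mut levels: Vec<i16> = Vec::new();
    levels.resize(slice.len(), max_def - 1);
    let mut indices = Vec::new();
    for (i, &v) in slice.iter().enumerate() {
        if v {
            levels[i] = max_def;
            indices.push(i + range.start);
        }
    }
    (levels, indices)
}

/// Dispatch on the gate; both paths must produce identical output.
fn write_levels(validity: &[bool], range: Range<usize>, max_def: i16) -> (Vec<i16>, Vec<usize>) {
    // Assumption: recomputed here for the sketch only.
    let null_count = validity.iter().filter(|&&v| !v).count();
    if range.len() >= BULK_FILL_MIN_LEN && null_count * 2 >= validity.len() {
        bulk_fill(validity, range, max_def)
    } else {
        per_row(validity, range, max_def)
    }
}

fn main() {
    // 99% null: one valid value every 100 rows.
    let validity: Vec<bool> = (0..1000).map(|i| i % 100 == 0).collect();
    assert_eq!(per_row(&validity, 0..1000, 1), bulk_fill(&validity, 0..1000, 1));
    assert_eq!(write_levels(&validity, 0..1000, 1).1.len(), 10);
}
```

The equality assertion mirrors the claim that the output is unchanged: the two paths differ only in how the level buffer is produced.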

Contributor

@etseidl etseidl left a comment

Thanks @RyanJamesStewart, looking better. A few suggestions to reduce some code duplication.

Comment thread parquet/src/arrow/arrow_writer/levels.rs
Comment thread parquet/src/arrow/arrow_writer/levels.rs
info.non_null_indices
.extend(range_nulls.valid_indices().map(|i| i + range.start));
} else {
let range_nulls = nulls.slice(range.start, len);
Contributor

then remove this...

Comment on lines +681 to +682
info.non_null_indices
.extend(range_nulls.valid_indices().map(|i| i + range.start));
Contributor

and remove this (seems a reserve went missing as well).

Comment thread parquet/src/arrow/arrow_writer/levels.rs
RyanJamesStewart added a commit to RyanJamesStewart/arrow-rs that referenced this pull request May 13, 2026
…ulls arm

The `logical_nulls = None` arm of `write_leaf` extends `non_null_indices`
by `len` items but didn't pre-reserve capacity. Add the matching
`reserve(len)` before the `extend(range.clone())` so the no-nulls path
allocates once instead of relying on `Vec`'s amortized growth.

Behavior unchanged. Caught by @etseidl's review round on PR apache#9967.
@etseidl
Contributor

etseidl commented May 14, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4446167227-53-dghhw 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing perf/arrow-parquet-null-heavy (3f69dff) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              main                                   perf_arrow-parquet-null-heavy
-----                                              ----                                   -----------------------------
bool/bloom_filter                                  1.01     13.1±0.04ms    19.1 MB/sec    1.00     12.9±0.04ms    19.3 MB/sec
bool/cdc                                           1.00     15.7±0.05ms    16.0 MB/sec    1.00     15.7±0.05ms    15.9 MB/sec
bool/default                                       1.01     11.0±0.03ms    22.7 MB/sec    1.00     10.8±0.03ms    23.1 MB/sec
bool/parquet_2                                     1.01     14.7±0.04ms    17.0 MB/sec    1.00     14.6±0.03ms    17.2 MB/sec
bool/zstd                                          1.01     11.5±0.04ms    21.7 MB/sec    1.00     11.4±0.02ms    22.0 MB/sec
bool/zstd_parquet_2                                1.01     15.1±0.07ms    16.5 MB/sec    1.00     14.9±0.03ms    16.7 MB/sec
bool_non_null/bloom_filter                         1.00      7.0±0.02ms    17.8 MB/sec    1.01      7.1±0.03ms    17.7 MB/sec
bool_non_null/cdc                                  1.00      6.8±0.03ms    18.5 MB/sec    1.01      6.9±0.04ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    29.3 MB/sec    1.02      4.3±0.03ms    28.9 MB/sec
bool_non_null/parquet_2                            1.01      9.1±0.04ms    13.8 MB/sec    1.00      9.0±0.04ms    13.9 MB/sec
bool_non_null/zstd                                 1.00      4.6±0.02ms    27.2 MB/sec    1.02      4.7±0.03ms    26.7 MB/sec
bool_non_null/zstd_parquet_2                       1.01      9.5±0.04ms    13.2 MB/sec    1.00      9.4±0.05ms    13.3 MB/sec
float_with_nans/bloom_filter                       1.00     92.4±0.38ms   151.5 MB/sec    1.00     92.5±0.46ms   151.3 MB/sec
float_with_nans/cdc                                1.00     81.4±0.21ms   171.9 MB/sec    1.00     81.7±0.30ms   171.3 MB/sec
float_with_nans/default                            1.00     74.1±0.31ms   189.0 MB/sec    1.00     74.1±0.25ms   189.0 MB/sec
float_with_nans/parquet_2                          1.00     94.1±0.57ms   148.8 MB/sec    1.00     94.3±0.43ms   148.4 MB/sec
float_with_nans/zstd                               1.00    111.7±0.25ms   125.4 MB/sec    1.00    111.8±0.29ms   125.2 MB/sec
float_with_nans/zstd_parquet_2                     1.00    131.0±0.44ms   106.9 MB/sec    1.00    131.6±0.46ms   106.4 MB/sec
list_primitive/bloom_filter                        1.00    326.8±0.87ms  1668.6 MB/sec    1.16    380.0±1.55ms  1435.1 MB/sec
list_primitive/cdc                                 1.00    358.7±0.93ms  1520.2 MB/sec    1.16    417.2±2.19ms  1307.2 MB/sec
list_primitive/default                             1.00    246.7±0.85ms     2.2 GB/sec    1.24    305.3±2.01ms  1786.5 MB/sec
list_primitive/parquet_2                           1.00    268.8±0.92ms  2029.1 MB/sec    1.23    329.5±1.29ms  1655.2 MB/sec
list_primitive/zstd                                1.00    497.3±0.92ms  1096.6 MB/sec    1.11    553.7±2.10ms   985.0 MB/sec
list_primitive/zstd_parquet_2                      1.00    491.1±0.38ms  1110.6 MB/sec    1.12    552.1±1.26ms   987.8 MB/sec
list_primitive_non_null/bloom_filter               1.00    430.6±3.84ms  1263.8 MB/sec    1.11   477.5±25.74ms  1139.8 MB/sec
list_primitive_non_null/cdc                        1.01    439.3±8.83ms  1238.8 MB/sec    1.00    435.9±9.65ms  1248.6 MB/sec
list_primitive_non_null/default                    1.00    294.2±2.98ms  1850.0 MB/sec    1.12   328.8±23.62ms  1655.1 MB/sec
list_primitive_non_null/parquet_2                  1.04   310.2±13.88ms  1754.5 MB/sec    1.00   299.0±13.24ms  1820.1 MB/sec
list_primitive_non_null/zstd                       1.03    709.7±7.57ms   766.9 MB/sec    1.00    686.4±4.06ms   792.9 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    666.9±0.69ms   816.0 MB/sec    1.00    667.1±0.48ms   815.8 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.0±0.04ms     3.3 GB/sec    1.12     12.4±0.07ms     3.0 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     22.5±0.14ms  1663.2 MB/sec    1.05     23.6±0.11ms  1580.1 MB/sec
list_primitive_sparse_99pct_null/default           1.00     10.7±0.04ms     3.4 GB/sec    1.12     12.1±0.06ms     3.0 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     10.8±0.05ms     3.4 GB/sec    1.12     12.1±0.06ms     3.0 GB/sec
list_primitive_sparse_99pct_null/zstd              1.00     12.6±0.05ms     2.9 GB/sec    1.11     13.9±0.08ms     2.6 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00     10.9±0.05ms     3.4 GB/sec    1.12     12.2±0.06ms     3.0 GB/sec
primitive/bloom_filter                             1.02    154.0±0.48ms   291.4 MB/sec    1.00    151.2±1.13ms   296.8 MB/sec
primitive/cdc                                      1.02    161.6±1.18ms   277.8 MB/sec    1.00    157.7±0.79ms   284.5 MB/sec
primitive/default                                  1.00    118.5±0.52ms   378.7 MB/sec    1.00    118.6±1.13ms   378.5 MB/sec
primitive/parquet_2                                1.00    133.7±0.35ms   335.8 MB/sec    1.00    133.2±1.17ms   336.9 MB/sec
primitive/zstd                                     1.01    148.3±0.40ms   302.5 MB/sec    1.00    146.2±0.46ms   306.9 MB/sec
primitive/zstd_parquet_2                           1.00    166.8±0.35ms   269.1 MB/sec    1.01    168.7±2.41ms   265.9 MB/sec
primitive_all_null/bloom_filter                    2.31     11.6±0.20ms     3.8 GB/sec    1.00      5.0±0.13ms     8.8 GB/sec
primitive_all_null/cdc                             1.38     30.7±0.37ms  1463.9 MB/sec    1.00     22.2±0.50ms  2024.7 MB/sec
primitive_all_null/default                         2.50     11.0±0.25ms     4.0 GB/sec    1.00      4.4±0.10ms     9.9 GB/sec
primitive_all_null/parquet_2                       2.45     10.8±0.07ms     4.0 GB/sec    1.00      4.4±0.13ms     9.9 GB/sec
primitive_all_null/zstd                            2.42     11.0±0.16ms     4.0 GB/sec    1.00      4.6±0.12ms     9.6 GB/sec
primitive_all_null/zstd_parquet_2                  2.44     11.0±0.17ms     4.0 GB/sec    1.00      4.5±0.13ms     9.7 GB/sec
primitive_non_null/bloom_filter                    1.03    120.4±1.49ms   365.5 MB/sec    1.00    117.3±1.12ms   375.0 MB/sec
primitive_non_null/cdc                             1.01     91.3±0.54ms   482.0 MB/sec    1.00     90.2±0.73ms   487.9 MB/sec
primitive_non_null/default                         1.00     69.0±0.31ms   637.6 MB/sec    1.00     68.7±1.24ms   640.1 MB/sec
primitive_non_null/parquet_2                       1.00     90.1±0.28ms   488.2 MB/sec    1.06     95.6±1.92ms   460.3 MB/sec
primitive_non_null/zstd                            1.00    106.7±0.37ms   412.3 MB/sec    1.00    107.1±1.47ms   411.0 MB/sec
primitive_non_null/zstd_parquet_2                  1.06    130.9±1.81ms   336.2 MB/sec    1.00    124.0±1.11ms   354.8 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.52     18.1±0.23ms     2.4 GB/sec    1.00     12.0±0.17ms     3.7 GB/sec
primitive_sparse_99pct_null/cdc                    1.28     37.0±0.41ms  1212.2 MB/sec    1.00     29.0±0.32ms  1548.5 MB/sec
primitive_sparse_99pct_null/default                1.59     16.7±0.10ms     2.6 GB/sec    1.00     10.5±0.07ms     4.2 GB/sec
primitive_sparse_99pct_null/parquet_2              1.59     16.7±0.05ms     2.6 GB/sec    1.00     10.5±0.08ms     4.2 GB/sec
primitive_sparse_99pct_null/zstd                   1.44     20.0±0.14ms     2.2 GB/sec    1.00     13.9±0.16ms     3.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.50     18.6±0.11ms     2.4 GB/sec    1.00     12.4±0.10ms     3.5 GB/sec
string/bloom_filter                                1.05   228.3±26.42ms     2.2 GB/sec    1.00   218.0±21.27ms     2.3 GB/sec
string/cdc                                         1.00    221.2±6.09ms     2.3 GB/sec    1.00    222.0±5.06ms     2.3 GB/sec
string/default                                     1.06   144.6±26.65ms     3.5 GB/sec    1.00   136.2±19.94ms     3.8 GB/sec
string/parquet_2                                   1.01    126.2±0.43ms     4.1 GB/sec    1.00    124.9±0.38ms     4.1 GB/sec
string/zstd                                        1.00    424.2±2.92ms  1235.9 MB/sec    1.04   441.4±19.30ms  1187.7 MB/sec
string/zstd_parquet_2                              1.00    392.8±0.55ms  1334.5 MB/sec    1.02    402.2±3.50ms  1303.5 MB/sec
string_and_binary_view/bloom_filter                1.00     63.6±0.37ms   507.1 MB/sec    1.02     65.2±0.44ms   494.7 MB/sec
string_and_binary_view/cdc                         1.00     58.3±0.18ms   553.5 MB/sec    1.00     58.2±0.14ms   553.8 MB/sec
string_and_binary_view/default                     1.00     47.6±0.14ms   678.1 MB/sec    1.01     48.0±0.14ms   672.5 MB/sec
string_and_binary_view/parquet_2                   1.00     58.5±0.13ms   551.0 MB/sec    1.01     59.1±0.19ms   545.7 MB/sec
string_and_binary_view/zstd                        1.00     84.2±0.15ms   383.2 MB/sec    1.01     84.6±0.15ms   381.1 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     72.4±0.12ms   445.5 MB/sec    1.01     72.8±0.14ms   442.9 MB/sec
string_dictionary/bloom_filter                     1.00     89.1±0.82ms     2.9 GB/sec    1.02     90.5±0.65ms     2.8 GB/sec
string_dictionary/cdc                              1.42     85.8±0.75ms     3.0 GB/sec    1.00     60.6±2.27ms     4.3 GB/sec
string_dictionary/default                          1.00     48.3±0.33ms     5.3 GB/sec    1.01     48.9±0.72ms     5.3 GB/sec
string_dictionary/parquet_2                        1.00     53.8±0.16ms     4.8 GB/sec    1.00     54.1±0.17ms     4.8 GB/sec
string_dictionary/zstd                             1.00    208.7±0.78ms  1265.4 MB/sec    1.00    209.5±0.66ms  1260.9 MB/sec
string_dictionary/zstd_parquet_2                   1.00    197.5±0.29ms  1337.7 MB/sec    1.01    199.6±0.81ms  1323.5 MB/sec
string_non_null/bloom_filter                       1.08   253.6±16.24ms     2.0 GB/sec    1.00    234.7±6.49ms     2.2 GB/sec
string_non_null/cdc                                1.00    269.6±9.92ms  1943.9 MB/sec    1.00    268.8±9.28ms  1949.8 MB/sec
string_non_null/default                            1.00   128.5±13.53ms     4.0 GB/sec    1.06   136.5±12.94ms     3.7 GB/sec
string_non_null/parquet_2                          1.01   142.5±12.23ms     3.6 GB/sec    1.00   140.7±11.02ms     3.6 GB/sec
string_non_null/zstd                               1.00    529.4±1.66ms   989.8 MB/sec    1.00    529.1±1.60ms   990.3 MB/sec
string_non_null/zstd_parquet_2                     1.01    505.3±2.41ms  1037.0 MB/sec    1.00    502.5±0.54ms  1042.7 MB/sec
struct_all_null/bloom_filter                       1.00      2.5±0.00ms     6.2 GB/sec    1.01      2.5±0.01ms     6.2 GB/sec
struct_all_null/cdc                                1.00      9.9±0.15ms  1636.9 MB/sec    1.00      9.8±0.06ms  1642.6 MB/sec
struct_all_null/default                            1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.2±0.00ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.2±0.00ms     7.0 GB/sec
struct_all_null/zstd                               1.00      2.3±0.00ms     6.8 GB/sec    1.00      2.3±0.00ms     6.9 GB/sec
struct_all_null/zstd_parquet_2                     1.00      2.3±0.00ms     6.9 GB/sec    1.00      2.3±0.00ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.02     48.4±0.21ms   330.3 MB/sec    1.00     47.7±0.22ms   335.8 MB/sec
struct_non_null/cdc                                1.00     45.5±0.17ms   351.4 MB/sec    1.00     45.5±0.15ms   351.4 MB/sec
struct_non_null/default                            1.01     32.3±0.16ms   495.0 MB/sec    1.00     32.0±0.18ms   499.5 MB/sec
struct_non_null/parquet_2                          1.00     40.8±0.15ms   391.7 MB/sec    1.00     40.8±0.14ms   391.7 MB/sec
struct_non_null/zstd                               1.01     41.1±0.12ms   389.6 MB/sec    1.00     40.8±0.12ms   392.1 MB/sec
struct_non_null/zstd_parquet_2                     1.00     54.9±0.14ms   291.4 MB/sec    1.00     55.0±0.18ms   290.7 MB/sec
struct_sparse_99pct_null/bloom_filter              1.01      7.5±0.04ms     2.1 GB/sec    1.00      7.4±0.04ms     2.1 GB/sec
struct_sparse_99pct_null/cdc                       1.08     15.4±0.11ms  1046.1 MB/sec    1.00     14.3±0.15ms  1130.3 MB/sec
struct_sparse_99pct_null/default                   1.01      6.9±0.03ms     2.3 GB/sec    1.00      6.9±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/parquet_2                 1.01      6.9±0.02ms     2.3 GB/sec    1.00      6.9±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.3±0.04ms  1944.7 MB/sec    1.01      8.3±0.10ms  1932.3 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.7±0.02ms     2.1 GB/sec    1.00      7.7±0.03ms     2.1 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1940.4s
Peak memory 6.2 GiB
Avg memory 6.0 GiB
CPU user 1877.4s
CPU sys 57.8s
Peak spill 0 B

branch

Metric Value
Wall time 1955.4s
Peak memory 6.3 GiB
Avg memory 6.1 GiB
CPU user 1893.4s
CPU sys 59.4s
Peak spill 0 B

@etseidl
Contributor

etseidl commented May 14, 2026

run benchmark arrow_writer

env:
  BENCH_FILTER: list_prim

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4446798916-54-w9wrx 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing perf/arrow-parquet-null-heavy (3f69dff) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=list_prim
Results will be posted here when complete


Hoist the shared work in `write_leaf`'s `Some(nulls)` arm out of the
if/else where it can be done without changing the cost profile:

- `let range_nulls = nulls.slice(range.start, len)`: O(1) metadata
  adjustment, safe to compute once and share between branches.
- `info.non_null_indices.extend(range_nulls.valid_indices()...)`:
  a single iterator consumption after the branches, identical work
  in both paths.

The `range_nulls.null_count()` popcount and `reserve(valid_in_range)`
stay INSIDE the fast-path branch where bulk-fill needs the count
anyway. An earlier revision (59ced5b) hoisted the popcount out and
caused 11-24% regressions on `list_primitive*` benches because
`write_leaf` is called ~10K times with avg range ~5 from list paths;
paying the popcount per call dominated the saved hash probes.

Bench (`list_primitive_sparse_99pct_null/default`, local hardware,
this revision): 9.83 ms, in line with main's 10.7 ms on GKE
c4a-highmem-16, and well below the 12.1 ms the prior dedup
attempt showed.

Behavior unchanged. 136 arrow_writer tests green; clippy + fmt clean.
…ulls arm

The `logical_nulls = None` arm of `write_leaf` extends `non_null_indices`
by `len` items but didn't pre-reserve capacity. Add the matching
`reserve(len)` before the `extend(range.clone())` so the no-nulls path
allocates once instead of relying on `Vec`'s amortized growth.

Behavior unchanged. Caught by @etseidl's review round on PR apache#9967.
@RyanJamesStewart RyanJamesStewart force-pushed the perf/arrow-parquet-null-heavy branch from 3f69dff to df1f323 Compare May 14, 2026 02:32
@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              main                                   perf_arrow-parquet-null-heavy
-----                                              ----                                   -----------------------------
list_primitive/bloom_filter                        1.00    362.0±1.28ms  1506.6 MB/sec    1.15    417.1±2.04ms  1307.6 MB/sec
list_primitive/cdc                                 1.00    364.1±9.14ms  1497.8 MB/sec    1.17    425.1±8.23ms  1282.8 MB/sec
list_primitive/default                             1.00    279.8±1.89ms  1949.0 MB/sec    1.21    338.5±2.13ms  1610.9 MB/sec
list_primitive/parquet_2                           1.00    285.1±9.19ms  1912.9 MB/sec    1.24    354.2±9.02ms  1539.9 MB/sec
list_primitive/zstd                                1.00    514.6±3.91ms  1059.9 MB/sec    1.08    555.7±3.21ms   981.4 MB/sec
list_primitive/zstd_parquet_2                      1.00    490.6±2.45ms  1111.6 MB/sec    1.13    555.4±4.52ms   981.9 MB/sec
list_primitive_non_null/bloom_filter               1.02   436.7±15.15ms  1246.2 MB/sec    1.00   426.3±12.86ms  1276.8 MB/sec
list_primitive_non_null/cdc                        1.01    431.9±8.06ms  1260.2 MB/sec    1.00    428.3±5.18ms  1270.8 MB/sec
list_primitive_non_null/default                    1.05   310.3±15.42ms  1754.0 MB/sec    1.00   296.4±12.18ms  1836.3 MB/sec
list_primitive_non_null/parquet_2                  1.00   301.2±19.25ms  1807.2 MB/sec    1.03   310.8±23.41ms  1750.9 MB/sec
list_primitive_non_null/zstd                       1.00   718.7±20.02ms   757.2 MB/sec    1.00   715.5±21.15ms   760.6 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    679.5±2.35ms   800.9 MB/sec    1.01    685.6±5.13ms   793.9 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.0±0.03ms     3.3 GB/sec    1.13     12.4±0.19ms     3.0 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     22.4±0.07ms  1668.6 MB/sec    1.06     23.7±0.08ms  1578.4 MB/sec
list_primitive_sparse_99pct_null/default           1.00     10.7±0.02ms     3.4 GB/sec    1.13     12.0±0.04ms     3.0 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     10.7±0.02ms     3.4 GB/sec    1.13     12.0±0.04ms     3.0 GB/sec
list_primitive_sparse_99pct_null/zstd              1.00     12.5±0.04ms     2.9 GB/sec    1.11     13.9±0.05ms     2.6 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00     10.8±0.03ms     3.4 GB/sec    1.13     12.2±0.04ms     3.0 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 655.1s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 611.7s
CPU sys 41.6s
Peak spill 0 B

branch

Metric Value
Wall time 685.2s
Peak memory 6.6 GiB
Avg memory 6.3 GiB
CPU user 643.8s
CPU sys 36.5s
Peak spill 0 B

…erve

The partial dedup in 695f178 (hoisting `let range_nulls = nulls.slice(...)`
before the gate and `info.non_null_indices.extend(...)` after the if/else)
regressed `list_primitive/default` 21% and `list_primitive_sparse_99pct_null/default`
13% on the GKE c4a-highmem-16 bench. Same operations as the v2 shape, but the
compiler produced different code for the nullable-list path.

Restore the v2 structure (slice declared inside each branch, both extends
inline). Keep the `None`-arm `info.non_null_indices.reserve(len)` from df1f323
since that fixed `list_primitive_non_null/default` 1.12 -> 1.00 and addresses
etseidl's reserve nit substantively.

Bench shapes by revision on c4a-highmem-16:
- v2 (4eaed80): list_primitive/default 1.00, non_null 1.00, sparse_99pct 1.00
- v3 (695f178, full dedup): 1.24, 1.12, 1.12
- v4 (df1f323, partial dedup + reserve): 1.21, 1.00, 1.13
- v5 (this commit): expected back to v2's clean shape with reserve fix retained

136 arrow_writer tests green; fmt + clippy clean.
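
The ratios listed per revision can be reproduced from the posted timings. A minimal sketch, assuming the comparison table normalizes each row to its fastest run (which is how the 1.00 entries read); `relative_to_fastest` and `round2` are hypothetical helper names for illustration, not part of the benchmark tooling:

```rust
/// Returns (main, branch) times expressed relative to the faster of the two.
/// Assumption: each table row is normalized to its fastest run.
fn relative_to_fastest(main_ms: f64, branch_ms: f64) -> (f64, f64) {
    let fastest = main_ms.min(branch_ms);
    (main_ms / fastest, branch_ms / fastest)
}

/// Rounds to two decimal places, matching the precision shown in the table.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

For example, the `list_primitive/default` timings of 279.8 ms (main) and 338.5 ms (branch) give 338.5 / 279.8 ≈ 1.21, the figure quoted for v4.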
@etseidl
Contributor

etseidl commented May 14, 2026

Sorry @RyanJamesStewart I led you astray. I think the regression is due to the nulls.slice() call in the not-enough-nulls arm. If you revert that to the original, most of the regression goes away (at least on my machine).

                    } else {
                        let nulls = nulls.inner();
                        info.def_levels.extend_from_iter(range.clone().map(|i| {
                            // Safety: range.end was asserted to be in bounds earlier
                            let valid = unsafe { nulls.value_unchecked(i) };
                            max_def_level - (!valid as i16)
                        }));
                        info.non_null_indices.reserve(len);
                        info.non_null_indices.extend(
                            BitIndexIterator::new(nulls.inner(), nulls.offset() + range.start, len)
                                .map(|i| i + range.start),
                        );
                    }
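
The `.map(|i| i + range.start)` step in the snippet above matters because the set-bit iterator reports positions relative to where it starts scanning. A std-only sketch of that indexing, assuming an LSB-first packed bitmap; `set_bit_positions` and `non_null_indices` are hypothetical stand-ins for illustration, not the arrow-rs `BitIndexIterator` API:

```rust
use std::ops::Range;

/// Yields the positions of set bits within `offset..offset + len`,
/// counted from `offset` (so the first scanned bit is position 0).
fn set_bit_positions(bits: &[u8], offset: usize, len: usize) -> impl Iterator<Item = usize> + '_ {
    (0..len).filter(move |&i| {
        let bit = offset + i;
        (bits[bit / 8] >> (bit % 8)) & 1 == 1
    })
}

/// Collects the array-level indices of valid (non-null) slots in `range`.
/// `buffer_offset` is the bit offset of the array's first slot in `bits`.
fn non_null_indices(bits: &[u8], buffer_offset: usize, range: Range<usize>) -> Vec<usize> {
    set_bit_positions(bits, buffer_offset + range.start, range.len())
        // Positions are relative to the scan start; shift back to array indices.
        .map(|i| i + range.start)
        .collect()
}
```

With bits 2, 5, and 8 set and no buffer offset, the range `4..10` yields `[5, 8]`; dropping the final `map` would instead report the relative positions `[1, 4]`.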

@etseidl
Contributor

etseidl commented May 14, 2026

run benchmark arrow_writer

env:
  BENCH_FILTER: list_primitive/

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4447708141-59-z6mrc 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing perf/arrow-parquet-null-heavy (cd166cc) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=list_primitive/
Results will be posted here when complete


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                            main                                   perf_arrow-parquet-null-heavy
-----                            ----                                   -----------------------------
list_primitive/bloom_filter      1.00    361.9±1.71ms  1507.1 MB/sec    1.18    428.7±2.96ms  1272.1 MB/sec
list_primitive/cdc               1.00   365.1±11.14ms  1493.8 MB/sec    1.17    428.8±6.36ms  1272.0 MB/sec
list_primitive/default           1.00    275.4±1.99ms  1980.1 MB/sec    1.24    341.4±2.58ms  1597.2 MB/sec
list_primitive/parquet_2         1.00    291.1±7.87ms  1873.5 MB/sec    1.25    362.7±3.16ms  1503.6 MB/sec
list_primitive/zstd              1.00    516.3±4.89ms  1056.3 MB/sec    1.12    580.0±4.73ms   940.3 MB/sec
list_primitive/zstd_parquet_2    1.00    492.4±0.58ms  1107.6 MB/sec    1.15    568.5±9.32ms   959.2 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 275.1s
Peak memory 6.4 GiB
Avg memory 6.2 GiB
CPU user 258.0s
CPU sys 14.8s
Peak spill 0 B

branch

Metric Value
Wall time 315.1s
Peak memory 6.4 GiB
Avg memory 6.2 GiB
CPU user 299.5s
CPU sys 14.4s
Peak spill 0 B

Per etseidl's diagnosis on PR apache#9967: the `nulls.slice()` call in
write_leaf's not-enough-nulls arm is the regression source on
list_primitive paths, not the dedup structure. The list_primitive
paths call write_leaf ~10K times with avg range ~5, where
per-call NullBuffer slice + range_nulls.iter() / valid_indices
overhead dominates.

Restore the v1 (original) slow-arm pattern:
- `let bits = nulls.inner()` once
- `def_levels.extend_from_iter(range.clone().map(...))` with
  `bits.value_unchecked(i)` (safe: range.end asserted in bounds)
- `non_null_indices` via `BitIndexIterator::new(bits.inner(),
  bits.offset() + range.start, len)`

Fast path keeps `nulls.slice()` / `valid_indices()` since bulk-fill
needs them and `len >= BULK_FILL_MIN_LEN` (64) amortizes the cost.

136 arrow_writer tests green; fmt + clippy clean.
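
The two arms described in this commit message can be sketched with the standard library alone. This is a minimal illustration, not the code in the PR: the bitmap is modelled as a plain LSB-first `&[u8]`, the function names are hypothetical, the null level is assumed to be `max_def - 1` (a flat nullable leaf), and the majority-null test `valid * 2 < len` is an assumed form of the gate. Only `BULK_FILL_MIN_LEN` and its value 64 come from the discussion.

```rust
// Everything here is a hypothetical stand-in; only BULK_FILL_MIN_LEN (64)
// is a name taken from the PR discussion.
const BULK_FILL_MIN_LEN: usize = 64;

/// Reads bit `i` of an LSB-first validity bitmap (1 = valid, 0 = null).
fn is_valid(bits: &[u8], i: usize) -> bool {
    (bits[i / 8] >> (i % 8)) & 1 == 1
}

/// Slow arm: map each row's validity bit to a level, with no slicing.
fn def_levels_per_row(bits: &[u8], start: usize, len: usize, max_def: i16) -> Vec<i16> {
    (start..start + len)
        .map(|i| if is_valid(bits, i) { max_def } else { max_def - 1 })
        .collect()
}

/// Fast arm: fill the whole range with the null level, then overwrite
/// only the valid positions.
fn def_levels_bulk_fill(bits: &[u8], start: usize, len: usize, max_def: i16) -> Vec<i16> {
    let mut levels = vec![max_def - 1; len];
    // For brevity this filters bit by bit; a production version would
    // skip whole zero words instead of testing every bit.
    for i in (0..len).filter(|&i| is_valid(bits, start + i)) {
        levels[i] = max_def;
    }
    levels
}

/// Gate (assumed form): bulk-fill only long, majority-null ranges.
fn write_def_levels(bits: &[u8], start: usize, len: usize, max_def: i16) -> Vec<i16> {
    let valid = (start..start + len).filter(|&i| is_valid(bits, i)).count();
    if len >= BULK_FILL_MIN_LEN && valid * 2 < len {
        def_levels_bulk_fill(bits, start, len, max_def)
    } else {
        def_levels_per_row(bits, start, len, max_def)
    }
}

fn main() {
    // 128 rows, only rows 3 and 100 are valid.
    let mut bits = vec![0u8; 16];
    bits[0] |= 1 << 3;
    bits[12] |= 1 << 4;

    // Long, sparse range: takes the bulk-fill arm.
    let long = write_def_levels(&bits, 0, 128, 1);
    assert_eq!(long, def_levels_per_row(&bits, 0, 128, 1));

    // Short range (like the ~5-element list ranges): takes the per-row arm.
    let short = write_def_levels(&bits, 98, 5, 1);
    println!("{:?}", short);
}
```

Both arms produce the same levels for the same input; the gate only chooses which one does the work, which is why short list ranges stay on the per-row arm.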
@RyanJamesStewart
Author

Pushed 552c773 with that slow-arm pattern: a direct `BitIndexIterator` over `bits.inner()` starting at `bits.offset() + range.start`, and `value_unchecked` inside the `def_levels` extend. The slow arm no longer calls `nulls.slice()` on each invocation. The fast path keeps `nulls.slice()` + `valid_indices()`, since the `len >= 64` gate amortizes the slice cost there.

Tests, clippy, fmt clean locally on the branch. Ready when you can retrigger the bench.
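
The offset-based index walk mentioned above can be illustrated without the arrow crate. The sketch below is a hypothetical stand-in for that kind of iterator, not its implementation: `set_bit_indices` is an invented name, the bitmap is a plain LSB-first byte slice, and the walk goes byte by byte rather than in 64-bit words. It returns positions relative to the starting bit, so the caller adds `range_start` back, mirroring the `+ range.start` adjustment described in the PR.

```rust
/// Positions of set bits in the bit range [offset, offset + len) of an
/// LSB-first bitmap, reported relative to `offset`. Hypothetical helper.
fn set_bit_indices(bytes: &[u8], offset: usize, len: usize) -> Vec<usize> {
    let mut out = Vec::new();
    if len == 0 {
        return out;
    }
    let end = offset + len;
    let mut byte_idx = offset / 8;
    while byte_idx * 8 < end {
        let base = byte_idx * 8;
        let mut b = bytes[byte_idx];
        // Drop bits that lie before the start of the range.
        if base < offset {
            b &= 0xFFu8 << (offset - base);
        }
        // Drop bits that lie at or past the end of the range.
        if base + 8 > end {
            b &= 0xFFu8 >> (base + 8 - end);
        }
        // Visit only the set bits: find the lowest one, record it, clear it.
        while b != 0 {
            let bit = b.trailing_zeros() as usize;
            out.push(base + bit - offset);
            b &= b - 1;
        }
        byte_idx += 1;
    }
    out
}

fn main() {
    let bytes = [0b1010_0000u8, 0b0000_0101u8];
    let buffer_offset = 4; // plays the role of the buffer's own bit offset
    let range_start = 1; // start of the range within the array
    let len = 8;

    // Walk the underlying bytes directly, with no intermediate slice.
    let non_null_indices: Vec<usize> = set_bit_indices(&bytes, buffer_offset + range_start, len)
        .into_iter()
        .map(|i| i + range_start)
        .collect();
    println!("{:?}", non_null_indices);
}
```

Because the work is proportional to the number of set bits plus the number of bytes touched, a call over a five-element range costs almost nothing beyond the loop setup, which is the property the slow arm needs when it is invoked thousands of times.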

@etseidl
Contributor

etseidl commented May 14, 2026

run benchmark arrow_writer

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4451595571-89-r984l 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing perf/arrow-parquet-null-heavy (552c773) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete



@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              main                                   perf_arrow-parquet-null-heavy
-----                                              ----                                   -----------------------------
bool/bloom_filter                                  1.01     13.1±0.10ms    19.0 MB/sec    1.00     13.0±0.07ms    19.2 MB/sec
bool/cdc                                           1.00     15.8±0.14ms    15.8 MB/sec    1.01     16.0±0.20ms    15.6 MB/sec
bool/default                                       1.01     11.1±0.03ms    22.6 MB/sec    1.00     10.9±0.05ms    22.9 MB/sec
bool/parquet_2                                     1.00     14.8±0.04ms    16.9 MB/sec    1.00     14.7±0.07ms    17.0 MB/sec
bool/zstd                                          1.01     11.6±0.06ms    21.6 MB/sec    1.00     11.5±0.03ms    21.7 MB/sec
bool/zstd_parquet_2                                1.00     15.1±0.04ms    16.6 MB/sec    1.01     15.2±0.07ms    16.5 MB/sec
bool_non_null/bloom_filter                         1.00      7.1±0.03ms    17.7 MB/sec    1.00      7.1±0.06ms    17.7 MB/sec
bool_non_null/cdc                                  1.00      6.9±0.09ms    18.2 MB/sec    1.01      6.9±0.15ms    18.0 MB/sec
bool_non_null/default                              1.00      4.3±0.03ms    29.2 MB/sec    1.00      4.3±0.02ms    29.2 MB/sec
bool_non_null/parquet_2                            1.01      9.2±0.05ms    13.7 MB/sec    1.00      9.1±0.05ms    13.8 MB/sec
bool_non_null/zstd                                 1.00      4.6±0.03ms    27.0 MB/sec    1.00      4.7±0.02ms    26.9 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.03ms    13.2 MB/sec    1.00      9.5±0.04ms    13.2 MB/sec
float_with_nans/bloom_filter                       1.00     92.3±0.26ms   151.7 MB/sec    1.05     96.8±3.31ms   144.6 MB/sec
float_with_nans/cdc                                1.00     81.7±1.16ms   171.3 MB/sec    1.01     82.6±1.19ms   169.5 MB/sec
float_with_nans/default                            1.00     75.2±1.70ms   186.1 MB/sec    1.03     77.2±2.06ms   181.4 MB/sec
float_with_nans/parquet_2                          1.01     96.6±1.05ms   144.9 MB/sec    1.00     95.7±1.95ms   146.2 MB/sec
float_with_nans/zstd                               1.00    113.5±0.89ms   123.3 MB/sec    1.01    114.4±1.65ms   122.3 MB/sec
float_with_nans/zstd_parquet_2                     1.00    134.0±3.04ms   104.5 MB/sec    1.01    135.6±1.86ms   103.2 MB/sec
list_primitive/bloom_filter                        1.00   329.9±15.07ms  1653.4 MB/sec    1.02   334.9±11.50ms  1628.2 MB/sec
list_primitive/cdc                                 1.01    360.6±5.53ms  1512.4 MB/sec    1.00    358.4±4.23ms  1521.8 MB/sec
list_primitive/default                             1.01    250.4±4.11ms     2.1 GB/sec    1.00    248.3±3.29ms     2.1 GB/sec
list_primitive/parquet_2                           1.01    270.4±1.93ms  2017.0 MB/sec    1.00    268.0±2.08ms  2034.8 MB/sec
list_primitive/zstd                                1.00    503.2±6.40ms  1083.9 MB/sec    1.00    502.4±7.10ms  1085.5 MB/sec
list_primitive/zstd_parquet_2                      1.00    494.8±3.35ms  1102.2 MB/sec    1.00    494.0±3.54ms  1104.1 MB/sec
list_primitive_non_null/bloom_filter               1.00   426.4±18.19ms  1276.2 MB/sec    1.05   445.9±22.27ms  1220.6 MB/sec
list_primitive_non_null/cdc                        1.00   442.2±10.60ms  1230.6 MB/sec    1.00   441.1±11.33ms  1234.0 MB/sec
list_primitive_non_null/default                    1.00    292.0±9.91ms  1863.9 MB/sec    1.03    299.8±8.67ms  1815.2 MB/sec
list_primitive_non_null/parquet_2                  1.00    316.7±5.79ms  1718.7 MB/sec    1.02   324.4±17.07ms  1677.5 MB/sec
list_primitive_non_null/zstd                       1.00   712.5±12.94ms   763.9 MB/sec    1.01   721.2±19.01ms   754.7 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    685.5±7.87ms   793.9 MB/sec    1.02    699.0±7.44ms   778.6 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.0±0.03ms     3.3 GB/sec    1.06     11.7±0.06ms     3.1 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     22.8±0.31ms  1635.6 MB/sec    1.01     23.0±0.28ms  1624.9 MB/sec
list_primitive_sparse_99pct_null/default           1.02     11.2±0.28ms     3.3 GB/sec    1.00     10.9±0.18ms     3.3 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     10.8±0.15ms     3.4 GB/sec    1.03     11.2±0.32ms     3.3 GB/sec
list_primitive_sparse_99pct_null/zstd              1.01     13.1±0.04ms     2.8 GB/sec    1.00     12.9±0.29ms     2.8 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.02     11.2±0.29ms     3.3 GB/sec    1.00     11.0±0.19ms     3.3 GB/sec
primitive/bloom_filter                             1.00    150.7±2.22ms   297.9 MB/sec    1.01    152.6±4.72ms   294.0 MB/sec
primitive/cdc                                      1.00    161.2±1.65ms   278.4 MB/sec    1.00    161.3±1.87ms   278.3 MB/sec
primitive/default                                  1.00    118.1±2.17ms   380.0 MB/sec    1.02    120.5±0.85ms   372.4 MB/sec
primitive/parquet_2                                1.01    134.9±1.36ms   332.7 MB/sec    1.00    133.8±1.55ms   335.4 MB/sec
primitive/zstd                                     1.00    148.8±2.81ms   301.5 MB/sec    1.00    149.5±1.70ms   300.2 MB/sec
primitive/zstd_parquet_2                           1.00    166.8±1.61ms   269.0 MB/sec    1.01    168.5±2.26ms   266.3 MB/sec
primitive_all_null/bloom_filter                    2.13     11.5±0.13ms     3.8 GB/sec    1.00      5.4±0.25ms     8.1 GB/sec
primitive_all_null/cdc                             1.41     30.9±0.39ms  1452.1 MB/sec    1.00     22.0±0.28ms  2041.7 MB/sec
primitive_all_null/default                         2.38     11.0±0.25ms     4.0 GB/sec    1.00      4.6±0.21ms     9.5 GB/sec
primitive_all_null/parquet_2                       2.45     11.0±0.20ms     4.0 GB/sec    1.00      4.5±0.17ms     9.8 GB/sec
primitive_all_null/zstd                            2.49     11.2±0.28ms     3.9 GB/sec    1.00      4.5±0.07ms     9.8 GB/sec
primitive_all_null/zstd_parquet_2                  2.40     11.1±0.13ms     4.0 GB/sec    1.00      4.6±0.15ms     9.5 GB/sec
primitive_non_null/bloom_filter                    1.00    110.5±1.24ms   398.0 MB/sec    1.06    117.5±4.13ms   374.3 MB/sec
primitive_non_null/cdc                             1.00     90.0±0.49ms   489.0 MB/sec    1.02     91.6±2.06ms   480.4 MB/sec
primitive_non_null/default                         1.00     70.1±1.83ms   627.6 MB/sec    1.00     69.9±1.06ms   629.3 MB/sec
primitive_non_null/parquet_2                       1.02     91.0±0.39ms   483.3 MB/sec    1.00     89.2±0.17ms   493.5 MB/sec
primitive_non_null/zstd                            1.00    105.6±1.92ms   416.8 MB/sec    1.02    107.5±1.15ms   409.5 MB/sec
primitive_non_null/zstd_parquet_2                  1.05    132.2±2.85ms   332.9 MB/sec    1.00    125.4±1.29ms   350.9 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.48     19.4±0.70ms     2.3 GB/sec    1.00     13.1±0.13ms     3.3 GB/sec
primitive_sparse_99pct_null/cdc                    1.30     38.1±0.29ms  1178.8 MB/sec    1.00     29.3±0.50ms  1529.8 MB/sec
primitive_sparse_99pct_null/default                1.60     17.3±0.16ms     2.5 GB/sec    1.00     10.9±0.32ms     4.0 GB/sec
primitive_sparse_99pct_null/parquet_2              1.57     16.8±0.24ms     2.6 GB/sec    1.00     10.7±0.27ms     4.1 GB/sec
primitive_sparse_99pct_null/zstd                   1.41     20.1±0.24ms     2.2 GB/sec    1.00     14.3±0.44ms     3.1 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.53     19.2±0.23ms     2.3 GB/sec    1.00     12.5±0.05ms     3.5 GB/sec
string/bloom_filter                                1.06   239.4±33.57ms     2.1 GB/sec    1.00   225.4±17.26ms     2.3 GB/sec
string/cdc                                         1.00    223.0±8.52ms     2.3 GB/sec    1.01    225.0±6.91ms     2.3 GB/sec
string/default                                     1.20   147.2±26.45ms     3.5 GB/sec    1.00   123.1±15.53ms     4.2 GB/sec
string/parquet_2                                   1.00    126.4±2.12ms     4.0 GB/sec    1.00    126.9±3.82ms     4.0 GB/sec
string/zstd                                        1.00    428.6±7.91ms  1223.2 MB/sec    1.10   470.6±11.92ms  1114.1 MB/sec
string/zstd_parquet_2                              1.00    396.3±2.38ms  1322.8 MB/sec    1.03   406.7±10.57ms  1289.0 MB/sec
string_and_binary_view/bloom_filter                1.03     67.3±1.75ms   479.3 MB/sec    1.00     65.1±0.14ms   495.4 MB/sec
string_and_binary_view/cdc                         1.00     59.4±0.55ms   542.6 MB/sec    1.00     59.2±0.96ms   544.9 MB/sec
string_and_binary_view/default                     1.00     48.5±0.77ms   665.4 MB/sec    1.02     49.2±1.06ms   655.3 MB/sec
string_and_binary_view/parquet_2                   1.00     59.1±0.92ms   545.3 MB/sec    1.02     60.2±0.51ms   535.9 MB/sec
string_and_binary_view/zstd                        1.00     85.2±1.65ms   378.7 MB/sec    1.01     86.2±1.67ms   374.0 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     72.7±0.87ms   443.7 MB/sec    1.01     73.5±1.46ms   438.5 MB/sec
string_dictionary/bloom_filter                     1.13   109.7±10.08ms     2.4 GB/sec    1.00     96.7±6.66ms     2.7 GB/sec
string_dictionary/cdc                              1.26     73.7±1.93ms     3.5 GB/sec    1.00     58.6±3.26ms     4.4 GB/sec
string_dictionary/default                          1.21     64.9±2.65ms     4.0 GB/sec    1.00     53.5±0.96ms     4.8 GB/sec
string_dictionary/parquet_2                        1.18     66.2±1.81ms     3.9 GB/sec    1.00     56.0±0.61ms     4.6 GB/sec
string_dictionary/zstd                             1.02    217.1±4.47ms  1216.6 MB/sec    1.00    213.5±5.13ms  1237.2 MB/sec
string_dictionary/zstd_parquet_2                   1.00    200.3±2.11ms  1318.4 MB/sec    1.00    200.8±1.36ms  1315.2 MB/sec
string_non_null/bloom_filter                       1.00   261.6±20.52ms  2003.4 MB/sec    1.04   271.4±20.70ms  1930.6 MB/sec
string_non_null/cdc                                1.00    260.1±4.64ms  2014.7 MB/sec    1.06    275.6±9.63ms  1901.6 MB/sec
string_non_null/default                            1.00   140.1±15.44ms     3.7 GB/sec    1.01   142.1±16.40ms     3.6 GB/sec
string_non_null/parquet_2                          1.00    149.5±2.53ms     3.4 GB/sec    1.13    169.0±6.49ms     3.0 GB/sec
string_non_null/zstd                               1.05   600.0±27.35ms   873.3 MB/sec    1.00   571.6±13.71ms   916.7 MB/sec
string_non_null/zstd_parquet_2                     1.01   528.6±14.49ms   991.4 MB/sec    1.00    524.8±6.00ms   998.4 MB/sec
struct_all_null/bloom_filter                       1.00      2.5±0.03ms     6.2 GB/sec    1.01      2.6±0.05ms     6.1 GB/sec
struct_all_null/cdc                                1.06      9.9±0.21ms  1635.3 MB/sec    1.00      9.3±0.09ms  1739.3 MB/sec
struct_all_null/default                            1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.3±0.01ms     7.0 GB/sec
struct_all_null/parquet_2                          1.00      2.3±0.00ms     7.0 GB/sec    1.00      2.2±0.01ms     7.0 GB/sec
struct_all_null/zstd                               1.00      2.3±0.00ms     6.8 GB/sec    1.00      2.3±0.01ms     6.9 GB/sec
struct_all_null/zstd_parquet_2                     1.00      2.3±0.00ms     6.9 GB/sec    1.00      2.3±0.01ms     6.9 GB/sec
struct_non_null/bloom_filter                       1.03     48.8±0.95ms   328.1 MB/sec    1.00     47.4±1.12ms   337.7 MB/sec
struct_non_null/cdc                                1.00     45.9±0.53ms   348.9 MB/sec    1.00     46.1±0.45ms   347.4 MB/sec
struct_non_null/default                            1.00     32.0±0.10ms   499.9 MB/sec    1.02     32.7±0.40ms   489.2 MB/sec
struct_non_null/parquet_2                          1.02     41.9±0.53ms   382.1 MB/sec    1.00     41.1±0.80ms   389.2 MB/sec
struct_non_null/zstd                               1.03     42.3±0.97ms   378.3 MB/sec    1.00     41.1±0.57ms   389.3 MB/sec
struct_non_null/zstd_parquet_2                     1.01     55.9±1.10ms   286.1 MB/sec    1.00     55.1±0.43ms   290.1 MB/sec
struct_sparse_99pct_null/bloom_filter              1.01      7.9±0.35ms     2.0 GB/sec    1.00      7.8±0.38ms     2.0 GB/sec
struct_sparse_99pct_null/cdc                       1.04     15.4±0.06ms  1049.0 MB/sec    1.00     14.8±0.07ms  1088.6 MB/sec
struct_sparse_99pct_null/default                   1.04      7.3±0.11ms     2.2 GB/sec    1.00      7.0±0.18ms     2.2 GB/sec
struct_sparse_99pct_null/parquet_2                 1.05      7.2±0.15ms     2.2 GB/sec    1.00      6.9±0.02ms     2.3 GB/sec
struct_sparse_99pct_null/zstd                      1.00      8.4±0.16ms  1929.6 MB/sec    1.05      8.7±0.04ms  1844.7 MB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      7.7±0.02ms     2.1 GB/sec    1.01      7.7±0.02ms     2.0 GB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 1960.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1888.6s
CPU sys 68.8s
Peak spill 0 B

branch

Metric Value
Wall time 1945.4s
Peak memory 6.6 GiB
Avg memory 6.4 GiB
CPU user 1880.1s
CPU sys 62.2s
Peak spill 0 B


@etseidl
Contributor

etseidl commented May 14, 2026

run benchmark arrow_writer

env:
  BENCH_FILTER: string/zstd_parquet_2|string_non_null/parquet_2

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4452362004-93-zfwqn 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing perf/arrow-parquet-null-heavy (552c773) to 48fa8a7 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=string/zstd_parquet_2|string_non_null/parquet_2
Results will be posted here when complete



@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                        main                                   perf_arrow-parquet-null-heavy
-----                        ----                                   -----------------------------
string/zstd_parquet_2        1.00    394.3±0.76ms  1329.6 MB/sec    1.00    394.2±0.76ms  1329.8 MB/sec
string_non_null/parquet_2    1.11   144.8±11.49ms     3.5 GB/sec    1.00    130.8±2.77ms     3.9 GB/sec
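
For readers unfamiliar with this table layout: the leading number in each column appears to be that build's time divided by the faster of the two, so 1.00 marks the winner. That reading is an assumption based on the numbers shown, and the helper name `relative` below is invented for illustration. A quick check against the `string_non_null/parquet_2` row:

```rust
/// Time of each build relative to the faster one (1.00 = fastest).
/// Hypothetical helper; assumes the table normalizes to the minimum.
fn relative(main_ms: f64, branch_ms: f64) -> (f64, f64) {
    let fastest = main_ms.min(branch_ms);
    (main_ms / fastest, branch_ms / fastest)
}

fn main() {
    // Mean times from the string_non_null/parquet_2 row.
    let (main_rel, branch_rel) = relative(144.8, 130.8);
    println!("{:.2} {:.2}", main_rel, branch_rel); // prints "1.11 1.00"
}
```

Ratios recomputed from the rounded times in the table can differ from the printed ones in the last digit, and a gap smaller than the ± spread of either column (as with 144.8±11.49ms here) is best read as noise rather than a real difference.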

Resource Usage

base (merge-base)

Metric Value
Wall time 80.0s
Peak memory 6.5 GiB
Avg memory 6.1 GiB
CPU user 75.1s
CPU sys 4.2s
Peak spill 0 B

branch

Metric Value
Wall time 80.0s
Peak memory 6.5 GiB
Avg memory 6.1 GiB
CPU user 74.6s
CPU sys 2.0s
Peak spill 0 B


Contributor

@etseidl etseidl left a comment

Benches look good now. Thanks for working with me @RyanJamesStewart.

@HippoBaro any last thoughts?

Labels

parquet Changes to the parquet crate

4 participants