perf(parquet): O(1) estimated_value_bytes for byte arrays with contiguous indices
The previous patch made `ColumnValueEncoder::estimated_value_bytes` walk
every value to sum byte lengths, which added a measurable ~5 % overhead to
the short-string write bench (1M × 8 B strings) because every chunk ran a
~1024-entry loop calling `ArrayAccessor::value(idx).as_ref().len()`.
For the simple offset-buffer byte array types (Utf8 / LargeUtf8 / Binary /
LargeBinary), detect contiguous-and-sorted indices — true for every
non-null column written via `non_null_indices` — and compute the total
payload size as one subtraction on `value_offsets()`. For sparse
indices in the same family of types, look lengths up via the offsets
buffer directly rather than going through `ArrayAccessor::value`.
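A minimal sketch of the two paths described above, written against a plain
`&[i32]` offsets slice rather than the real arrow types. The function name
`estimated_value_bytes_sketch` is hypothetical, and the sketch assumes the
indices are strictly increasing (as they would be when produced from
non-null positions), so that a span check alone is enough to detect
contiguity. It is not the patch's actual code.

```rust
/// Hypothetical helper (not the arrow-rs API): total payload bytes of the
/// values selected by `indices`, given an Arrow-style offsets buffer where
/// `offsets.len() == num_values + 1`.
///
/// Assumption: `indices` is strictly increasing and in bounds.
fn estimated_value_bytes_sketch(offsets: &[i32], indices: &[usize]) -> usize {
    let (Some(&first), Some(&last)) = (indices.first(), indices.last()) else {
        return 0; // no values selected
    };

    // With strictly increasing indices, covering exactly `len` slots between
    // `first` and `last` means there are no gaps, so the run is contiguous.
    if last >= first && last - first + 1 == indices.len() {
        // Fast path: one subtraction on the offsets buffer.
        return (offsets[last + 1] - offsets[first]) as usize;
    }

    // Sparse path: still read lengths straight from the offsets buffer,
    // without materializing each value.
    indices
        .iter()
        .map(|&i| (offsets[i + 1] - offsets[i]) as usize)
        .sum()
}
```

For offsets `[0, 3, 3, 8, 10]` (value lengths 3, 0, 5, 2), the contiguous
indices `[1, 2]` take the fast path and the sparse indices `[0, 2]` take the
per-index path; both avoid touching the value bytes themselves.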
View / fixed-size / dictionary arrays keep the existing per-value walk
via `ArrayAccessor`. Dictionary-encoded data isn't on the hot path in
practice because the writer's `has_dictionary()` short-circuits
`estimated_value_bytes` while parquet dictionary encoding is active.
Bench delta after this change (5-run medians, `arrow_writer` bench):
- short_string_non_null/default (1M × 8 B): ±0 % (was +5–8 %)
- large_string_non_null/default (1024 × 256 KiB): +1 % (was +3 %)
- string_non_null/default (1M random Utf8/LargeUtf8): −2 % (was +2 %)
- string_dictionary/default: ±0 % (was +1 %)
All other benches within ±1 % of main.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>