Add wide-schema benchmark suite for measuring per-file metadata overhead
Adds a new sql_benchmarks suite that isolates the wide-schema scan
overhead in selective parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and that cost scales with the number of column chunks in
the dataset rather than with the number of columns the query touches.
The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):
- wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload
- narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal
baseline, only meaningful as a comparison point
Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups so
every wide number has a directly comparable narrow baseline.
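
To make the width difference concrete, here is a back-of-envelope sketch of how many Parquet column chunks each subgroup carries. The file and row counts come from the list above; the number of row groups per file is not stated in this commit, so `ROW_GROUPS_PER_FILE = 1` is an assumption. Because both subgroups share the same row-group structure, it cancels in the wide/narrow ratio.

```python
# Back-of-envelope count of Parquet column chunks per subgroup.
FILES = 256
ROWS_PER_FILE = 50_000
ROW_GROUPS_PER_FILE = 1  # assumption: not stated in the commit message


def column_chunks(columns: int,
                  files: int = FILES,
                  row_groups: int = ROW_GROUPS_PER_FILE) -> int:
    """Each row group stores exactly one column chunk per column."""
    return columns * files * row_groups


wide_chunks = column_chunks(1024)
narrow_chunks = column_chunks(8)
total_rows = FILES * ROWS_PER_FILE

print(wide_chunks, narrow_chunks, wide_chunks // narrow_chunks)  # 262144 2048 128
print(total_rows)  # 12800000
```

Under that assumption the wide subgroup has 128x as many column chunks as the narrow one, so a metadata cost that tracks column-chunk count should grow by roughly that factor regardless of how many columns a query projects.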
A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; copies 2..N,
produced by suffix-renaming the base columns, are zero-filled (zeros
rather than nulls, since reader-side null-array shortcuts mute the
slowdown by ~35 %).
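
The widening scheme can be sketched as follows. This is an illustration only, not the gen_wide_data implementation: the `_{copy}` suffix format and the `widened_columns` helper are hypothetical, since the commit message does not say how the renamed copies are spelled.

```python
# Sketch of widening an 8-column base schema to N columns by suffix-renaming.
BASE_COLUMNS = ["id", "value", "count", "ts", "category", "flag", "status", "text"]


def widened_columns(total: int, base: list[str] = BASE_COLUMNS) -> list[str]:
    """Return `total` column names: the base schema plus suffixed copies.

    Copy 1 keeps the original names (these carry the deterministic data);
    copies 2..N are the renamed, zero-filled duplicates.
    """
    if total <= 0 or total % len(base) != 0:
        raise ValueError("total must be a positive multiple of the base width")
    copies = total // len(base)
    names = list(base)
    for copy in range(2, copies + 1):
        # Hypothetical suffix format; the real generator may differ.
        names.extend(f"{name}_{copy}" for name in base)
    return names


wide = widened_columns(1024)
narrow = widened_columns(8)
print(len(wide), len(narrow))  # 1024 8
print(wide[8])                 # id_2
```

With 1024 target columns this yields 128 copies of the base schema, of which only the first carries non-zero data.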
Query coverage:
- Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02: project 1 column with tight filter + LIMIT 1
- Q03: tight filter + small projection, no sort
- Q04: two low-cardinality string filters + a non-stat-prunable
modulo predicate for tight selectivity (~0.005 % match rate),
project two columns, no LIMIT or ORDER BY
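
For a sense of scale, the Q04 match rate can be turned into expected row counts. This sketch assumes the ~0.005 % rate applies uniformly across all rows of a subgroup, which the commit message does not state explicitly.

```python
from fractions import Fraction

# Expected number of rows matching Q04 at a ~0.005 % match rate.
FILES = 256
ROWS_PER_FILE = 50_000
MATCH_RATE = Fraction(5, 100_000)  # 0.005 %, kept exact to avoid float rounding


def expected_matches(rows: int, rate: Fraction = MATCH_RATE) -> Fraction:
    """Expected matching rows, assuming a uniform match rate."""
    return rows * rate


total = expected_matches(FILES * ROWS_PER_FILE)
per_file = expected_matches(ROWS_PER_FILE)

print(total)            # 640
print(float(per_file))  # 2.5
```

At roughly 640 matching rows overall, or about 2.5 per file, very little row data needs to be returned, which is consistent with the regime this suite targets: per-file metadata work dominating the scan.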
For Q04 specifically, a cold-start datafusion-cli run shows a ~15x
slowdown, wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time
scaling 141x while bloom_filter_eval_time and statistics_eval_time
stay flat.
bench.sh adds:
- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow
baseline subgroup, for query-by-query comparison
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>