Skip to content

Commit ba4884f

Browse files
adriangbclaude
andcommitted
Add wide-schema benchmark suite for measuring per-file metadata overhead
Adds two new sql_benchmarks suites that isolate the wide-schema scan overhead in selective parquet queries: the regime where most of the work is loading footers / column-chunk metadata rather than reading row data, and that cost scales with the number of column chunks in the dataset rather than with the number of columns the query touches. The new wide_schema and narrow_schema suites run identical SQL against two synthesized parquet datasets that share row count, file count, and per-file row-group structure but differ only in schema width: - wide: 1024 cols x 256 files x 50k rows (~225 MB) - narrow: 8 cols x 256 files x 50k rows (~110 MB) A new gen_wide_data binary synthesizes both in ~60 s with no external data source. The 8-column base schema (id, value, count, ts, category, flag, status, text) carries deterministic data; copies 2..N from the suffix-renamed widening are zero-filled (zero rather than null since reader-side null-array shortcuts mute the slowdown by ~35 %). Headline query Q13 isolates the overhead clearly: filter on two low-cardinality string columns plus a non-stat-prunable modulo predicate for ~0.005 % selectivity, project two columns, no LIMIT or ORDER BY. Cold-start datafusion-cli shows ~15x slowdown wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x while bloom_filter_eval_time and statistics_eval_time stay flat. bench.sh adds: - data wide_schema: synthesizes both wide and narrow datasets - run wide_schema: small-projection queries on wide - run narrow_schema: same SQL on narrow (control) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9bbc28b commit ba4884f

20 files changed

Lines changed: 757 additions & 0 deletions

benchmarks/README.md

Lines changed: 44 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -620,6 +620,50 @@ This benchmarks is derived from the [TPC-H][1] version
620620
[2]: https://github.com/databricks/tpch-dbgen.git,
621621
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
622622

623+
## Wide-schema benchmark
624+
625+
A pair of benchmark suites for measuring the per-file metadata
626+
overhead of a wide schema in selective parquet scans — the regime
627+
where most of the work is loading footers / column-chunk metadata
628+
rather than reading row data, and that cost scales linearly with the
629+
number of column chunks in the dataset rather than with the number of
630+
columns the query references.
631+
632+
The data preparation step (`gen_wide_data`) synthesizes a generic
633+
8-column base schema (`id`, `value`, `count`, `ts`, `category`,
634+
`flag`, `status`, `text`) with deterministic data, then replicates it
635+
128× via suffix renaming (`_2`, `_3`, …) for 1024 columns total —
636+
written across 256 files at 50 k rows per file with one row group per
637+
file and ZSTD(1) compression. Copies 2..128 are zero-filled arrays so
638+
the schema is wide (every column still has its own footer / page
639+
index / column-chunk metadata) but the on-disk size stays around
640+
225 MB. A companion narrow dataset is written the same way without
641+
the suffix copies — same row count, same file count, same per-file
642+
row-group shape, just the 8 base columns. The only variable between
643+
wide and narrow is schema width, which is what controls the per-file
644+
metadata overhead.
645+
646+
```shell
647+
./benchmarks/bench.sh data wide_schema # synthesizes wide (1024 cols × 256 files) + narrow (8 cols × 256 files), ~60 s, ~335 MB
648+
./benchmarks/bench.sh run wide_schema # small-projection queries on the wide dataset
649+
./benchmarks/bench.sh run narrow_schema # same SQL on the narrow dataset (control)
650+
```
651+
652+
The queries are deliberately small-projection (touch ≤ 4 columns) so
653+
the wide-schema overhead is the dominant signal. Headline query is
654+
`Q13` — filter on two low-cardinality string columns plus a
655+
non-stat-prunable modulo predicate for tight selectivity, projecting
656+
two columns, no `LIMIT` or `ORDER BY`. Other queries cover predicates
657+
on duplicated columns far from the start of the schema (`Q02`), TopK
658+
(`Q01`/`Q11`), tight selectivity with limit (`Q03`/`Q10`), and an
659+
`expect_plan` regression guard for projection pushdown (`Q12`).
660+
661+
Compare `wide_schema` and `narrow_schema` Criterion outputs
662+
query-by-query for the slowdown ratio. For cold-start measurements
663+
that include planner setup (the regime where this overhead is most
664+
visible), invoke `datafusion-cli` directly against
665+
`data/wide_schema/{wide,narrow}/`.
666+
623667
## TPCDS
624668

625669
Run the tpcds benchmark.

benchmarks/bench.sh

Lines changed: 76 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,8 @@ sort_tpch: Benchmark of sorting speed for end-to-end sort queries o
100100
sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
101101
topk_tpch: Benchmark of top-k (sorting with limit) queries on TPC-H dataset (SF=1)
102102
external_aggr: External aggregation benchmark on TPC-H dataset (SF=1)
103+
wide_schema: Small-projection queries on a wide synthetic dataset (1024 cols × 256 files) — measures per-file metadata overhead
104+
narrow_schema: The same queries against an 8-col narrow dataset (same row/file/group shape) — wide-vs-narrow control
103105
104106
# ClickBench Benchmarks
105107
clickbench_1: ClickBench queries against a single parquet file
@@ -239,6 +241,9 @@ main() {
239241
tpch_csv10)
240242
data_tpch "10" "csv"
241243
;;
244+
wide_schema|narrow_schema)
245+
data_wide_schema
246+
;;
242247
tpcds)
243248
data_tpcds
244249
;;
@@ -444,6 +449,12 @@ main() {
444449
tpch_mem10)
445450
run_tpch_mem "10"
446451
;;
452+
wide_schema)
453+
run_wide_schema
454+
;;
455+
narrow_schema)
456+
run_narrow_schema
457+
;;
447458
tpcds)
448459
run_tpcds
449460
;;
@@ -698,6 +709,71 @@ run_tpch() {
698709
bash -c "$SQL_CARGO_COMMAND"
699710
}
700711

712+
# Synthesizes two parquet datasets used to measure per-file metadata
713+
# overhead of a wide schema:
714+
#
715+
# - data/wide_schema/wide/ 1024-col events × 256 files (~225 MB)
716+
# - data/wide_schema/narrow/ 8-col events × 256 files (~110 MB)
717+
#
718+
# Both share row count, file count, and per-file row-group shape; only
719+
# schema width differs. No external data source required — gen_wide_data
720+
# synthesizes everything from scratch in ~60 s.
721+
data_wide_schema() {
722+
NUM_FILES=256
723+
ROWS_PER_FILE=50000
724+
WIDTH_FACTOR=128
725+
726+
DST_DIR="${DATA_DIR}/wide_schema"
727+
WIDE_DIR="${DST_DIR}/wide"
728+
NARROW_DIR="${DST_DIR}/narrow"
729+
730+
if [ -d "${WIDE_DIR}" ] && [ "$(ls -A "${WIDE_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
731+
echo " wide parquet exists (${WIDE_DIR})."
732+
else
733+
mkdir -p "${WIDE_DIR}"
734+
echo " synthesizing wide -> ${WIDE_DIR} (factor ${WIDTH_FACTOR}, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
735+
debug_run $CARGO_COMMAND --bin gen_wide_data -- \
736+
--dst-dir "${WIDE_DIR}" \
737+
--width-factor ${WIDTH_FACTOR} \
738+
--num-files ${NUM_FILES} \
739+
--rows-per-file ${ROWS_PER_FILE}
740+
fi
741+
742+
if [ -d "${NARROW_DIR}" ] && [ "$(ls -A "${NARROW_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
743+
echo " narrow parquet exists (${NARROW_DIR})."
744+
else
745+
mkdir -p "${NARROW_DIR}"
746+
echo " synthesizing narrow -> ${NARROW_DIR} (8 base cols, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
747+
debug_run $CARGO_COMMAND --bin gen_wide_data -- \
748+
--dst-dir "${NARROW_DIR}" \
749+
--width-factor 1 \
750+
--num-files ${NUM_FILES} \
751+
--rows-per-file ${ROWS_PER_FILE}
752+
fi
753+
}
754+
755+
# Runs the wide_schema benchmark (small-projection queries on the wide dataset).
756+
run_wide_schema() {
757+
echo "Running wide_schema benchmark..."
758+
759+
debug_run env BENCH_NAME=wide_schema \
760+
PREFER_HASH_JOIN="${PREFER_HASH_JOIN}" \
761+
SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
762+
${QUERY:+BENCH_QUERY="${QUERY}"} \
763+
bash -c "$SQL_CARGO_COMMAND"
764+
}
765+
766+
# Runs the same SQL against the narrow dataset — wide-vs-narrow control.
767+
run_narrow_schema() {
768+
echo "Running narrow_schema (baseline) benchmark..."
769+
770+
debug_run env BENCH_NAME=narrow_schema \
771+
PREFER_HASH_JOIN="${PREFER_HASH_JOIN}" \
772+
SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
773+
${QUERY:+BENCH_QUERY="${QUERY}"} \
774+
bash -c "$SQL_CARGO_COMMAND"
775+
}
776+
701777
# Runs the tpch in memory (needs tpch parquet data)
702778
run_tpch_mem() {
703779
SCALE_FACTOR=$1

benchmarks/sql_benchmarks/README.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,8 @@ in the community:
4141
| `taxi` | NYC taxi dataset benchmark |
4242
| `tpcds` | TPC‑DS queries |
4343
| `tpch` | TPC‑H queries |
44+
| `wide_schema` | Small-projection queries on a wide (1024-col, 256-file) synthetic dataset |
45+
| `narrow_schema` | Same queries on an 8-col, 256-file dataset — wide-vs-narrow control |
4446

4547
# Running Benchmarks
4648

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
-- Companion to wide_schema/Q01 — same SQL against the narrow events
2+
-- dataset (8 cols vs 1024). Apples-to-apples baseline for measuring
3+
-- wide-schema metadata overhead.
4+
5+
name Q01
6+
group narrow_schema
7+
8+
init sql_benchmarks/wide_schema/init/set_config.sql
9+
10+
load sql_benchmarks/wide_schema/init/load_narrow.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT id, ts, value, text
19+
FROM events
20+
WHERE category = 'c0'
21+
AND flag = 'f0'
22+
AND status = 's0'
23+
ORDER BY ts DESC
24+
LIMIT 100;
25+
26+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
-- Companion to wide_schema/Q03 — same SQL against the narrow events
2+
-- dataset (8 cols vs 1024). Apples-to-apples baseline for measuring
3+
-- wide-schema metadata overhead.
4+
5+
name Q03
6+
group narrow_schema
7+
8+
init sql_benchmarks/wide_schema/init/set_config.sql
9+
10+
load sql_benchmarks/wide_schema/init/load_narrow.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT value
19+
FROM events
20+
WHERE id = 12345
21+
LIMIT 1;
22+
23+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
-- Companion to wide_schema/Q10 — same SQL against the narrow events
2+
-- dataset (8 cols vs 1024). Apples-to-apples baseline for measuring
3+
-- wide-schema metadata overhead.
4+
5+
name Q10
6+
group narrow_schema
7+
8+
init sql_benchmarks/wide_schema/init/set_config.sql
9+
10+
load sql_benchmarks/wide_schema/init/load_narrow.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT id, ts, value, text
19+
FROM events
20+
WHERE id = 12345
21+
AND category = 'c0';
22+
23+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
-- Companion to wide_schema/Q13: same SQL, same row/file/row-group
2+
-- shape, only the per-file schema width differs. Apples-to-apples
3+
-- baseline for measuring wide-schema metadata overhead.
4+
5+
name Q13
6+
group narrow_schema
7+
8+
init sql_benchmarks/wide_schema/init/set_config.sql
9+
10+
load sql_benchmarks/wide_schema/init/load_narrow.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT id, ts
19+
FROM events
20+
WHERE category = 'c0'
21+
AND flag = 'f0'
22+
AND id % 1000 = 0;
23+
24+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
-- Filter on three low-cardinality columns, project four columns,
2+
-- ORDER BY + LIMIT (TopK shortcut).
3+
4+
name Q01
5+
group wide_schema
6+
7+
init sql_benchmarks/wide_schema/init/set_config.sql
8+
9+
load sql_benchmarks/wide_schema/init/load_wide.sql
10+
11+
assert I
12+
SELECT COUNT(*) > 0 from events;
13+
----
14+
true
15+
16+
run
17+
SELECT id, ts, value, text
18+
FROM events
19+
WHERE category = 'c0'
20+
AND flag = 'f0'
21+
AND status = 's0'
22+
ORDER BY ts DESC
23+
LIMIT 100;
24+
25+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
-- Same shape as Q01 but filtering on duplicated columns far from the
2+
-- start of the schema (suffix copies _10). Tests whether planning /
3+
-- pruning cost depends on column position.
4+
5+
name Q02
6+
group wide_schema
7+
8+
init sql_benchmarks/wide_schema/init/set_config.sql
9+
10+
load sql_benchmarks/wide_schema/init/load_wide.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT id, ts, value, text
19+
FROM events
20+
WHERE category_10 = 'c0'
21+
AND flag_10 = 'f0'
22+
AND status_10 = 's0'
23+
ORDER BY ts DESC
24+
LIMIT 100;
25+
26+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
-- Project 1 column with a very tight filter. Stresses minimum-
2+
-- projection pushdown over a wide schema.
3+
4+
name Q03
5+
group wide_schema
6+
7+
init sql_benchmarks/wide_schema/init/set_config.sql
8+
9+
load sql_benchmarks/wide_schema/init/load_wide.sql
10+
11+
assert I
12+
SELECT COUNT(*) > 0 from events;
13+
----
14+
true
15+
16+
run
17+
SELECT value
18+
FROM events
19+
WHERE id = 12345
20+
LIMIT 1;
21+
22+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql

0 commit comments

Comments
 (0)