Skip to content

Commit abf3a94

Browse files
adriangbclaude
andcommitted
Add wide-schema benchmark suite for measuring per-file metadata overhead
Adds a new sql_benchmarks suite that isolates the wide-schema scan overhead in selective parquet queries: the regime where most of the work is loading footers / column-chunk metadata rather than reading row data, and that cost scales with the number of column chunks in the dataset rather than with the number of columns the query touches. The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP): - wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload - narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal baseline, only meaningful as a comparison point Both share row count, file count, and per-file row-group structure; only schema width differs. Most queries (Q01/Q03/Q10/Q13) run against both subgroups; Q02/Q08/Q11/Q12 reference suffix-renamed columns that only exist in the wide dataset and hardcode subgroup=wide. A new gen_wide_data binary synthesizes both datasets in ~60 s with no external data source. The 8-column base schema (id, value, count, ts, category, flag, status, text) carries deterministic data; copies 2..N from the suffix-renamed widening are zero-filled (zero rather than null since reader-side null-array shortcuts mute the slowdown by ~35 %). Headline query Q13 isolates the overhead clearly: filter on two low-cardinality string columns plus a non-stat-prunable modulo predicate for ~0.005 % selectivity, project two columns, no LIMIT or ORDER BY. Cold-start datafusion-cli shows ~15x slowdown wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x while bloom_filter_eval_time and statistics_eval_time stay flat. bench.sh adds: - data wide_schema: synthesizes both wide and narrow datasets - run wide_schema: runs the wide subgroup, then the narrow baseline subgroup, for query-by-query comparison Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 9bbc28b commit abf3a94

14 files changed

Lines changed: 655 additions & 0 deletions

File tree

benchmarks/README.md

Lines changed: 59 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -620,6 +620,65 @@ This benchmarks is derived from the [TPC-H][1] version
620620
[2]: https://github.com/databricks/tpch-dbgen.git,
621621
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf
622622

623+
## Wide-schema benchmark
624+
625+
`wide_schema` measures the per-file metadata overhead of a wide schema
626+
in selective parquet scans — the regime where most of the work is
627+
loading footers / column-chunk metadata rather than reading row data,
628+
and that cost scales linearly with the number of column chunks in the
629+
dataset rather than with the number of columns the query references.
630+
631+
The suite has two subgroups, selected via `BENCH_SUBGROUP`:
632+
633+
- **`wide`** — runs against a 1024-col synthetic dataset. This is the
634+
actual workload.
635+
- **`narrow`** — runs the same SQL against an 8-col version of the same
636+
dataset (same row count, file count, per-file row-group shape).
637+
This subgroup exists **only as a baseline for the wide subgroup**
638+
reading its numbers in isolation tells you very little. The
639+
per-query wide-vs-narrow ratio is what isolates the schema-width
640+
cost.
641+
642+
A few queries (Q02, Q08, Q11, Q12) reference suffix-renamed columns
643+
that only exist in the wide dataset, so they hardcode
644+
`subgroup wide` and are skipped on the narrow pass.
645+
646+
The data preparation step (`gen_wide_data`) synthesizes a generic
647+
8-column base schema (`id`, `value`, `count`, `ts`, `category`,
648+
`flag`, `status`, `text`) with deterministic data, then replicates it
649+
128× via suffix renaming (`_2`, `_3`, …) for 1024 columns total —
650+
written across 256 files at 50 k rows per file with one row group per
651+
file and ZSTD(1) compression. Copies 2..128 are zero-filled arrays so
652+
the schema is wide (every column still has its own footer / page
653+
index / column-chunk metadata) but the on-disk size stays around
654+
225 MB. The narrow dataset is written the same way without the suffix
655+
copies. The only variable between wide and narrow is schema width.
656+
657+
```shell
658+
./benchmarks/bench.sh data wide_schema # synthesizes wide (1024 cols × 256 files) + narrow (8 cols × 256 files), ~60 s, ~335 MB
659+
./benchmarks/bench.sh run wide_schema # runs both 'wide' and 'narrow' subgroups; compare the per-query times for the slowdown ratio
660+
```
661+
662+
The queries are deliberately small-projection (touch ≤ 4 columns) so
663+
the wide-schema overhead is the dominant signal. Coverage:
664+
665+
- `Q01` — filter + project + `ORDER BY` + `LIMIT` (TopK shortcut)
666+
- `Q02` — same shape as Q01 but filtering on suffix-renamed columns
667+
far from the start of the schema (wide-only)
668+
- `Q03` — project 1 column with a tight filter and `LIMIT 1`
669+
- `Q08``GROUP BY` a suffix-renamed column (wide-only)
670+
- `Q10` — tight filter + small projection, no sort
671+
- `Q11``ORDER BY` a suffix-renamed column with `LIMIT` (wide-only)
672+
- `Q12` — same as Q03 with an `expect_plan` regression guard for
673+
projection pushdown (wide-only)
674+
- `Q13` — two low-cardinality string filters + a non-stat-prunable
675+
modulo predicate for tight selectivity, project two columns, no
676+
`LIMIT` or `ORDER BY`
677+
678+
For cold-start measurements that include planner setup (the regime
679+
where this overhead is most visible), invoke `datafusion-cli`
680+
directly against `data/wide_schema/{wide,narrow}/`.
681+
623682
## TPCDS
624683

625684
Run the tpcds benchmark.

benchmarks/bench.sh

Lines changed: 70 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -100,6 +100,8 @@ sort_tpch: Benchmark of sorting speed for end-to-end sort queries o
100100
sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
101101
topk_tpch: Benchmark of top-k (sorting with limit) queries on TPC-H dataset (SF=1)
102102
external_aggr: External aggregation benchmark on TPC-H dataset (SF=1)
103+
wide_schema: Small-projection queries on a wide synthetic dataset (1024 cols × 256 files) — measures per-file metadata overhead
104+
(runs both 'wide' and 'narrow' subgroups: narrow is an internal baseline; the wide-vs-narrow ratio is the signal)
103105
104106
# ClickBench Benchmarks
105107
clickbench_1: ClickBench queries against a single parquet file
@@ -239,6 +241,9 @@ main() {
239241
tpch_csv10)
240242
data_tpch "10" "csv"
241243
;;
244+
wide_schema)
245+
data_wide_schema
246+
;;
242247
tpcds)
243248
data_tpcds
244249
;;
@@ -444,6 +449,9 @@ main() {
444449
tpch_mem10)
445450
run_tpch_mem "10"
446451
;;
452+
wide_schema)
453+
run_wide_schema
454+
;;
447455
tpcds)
448456
run_tpcds
449457
;;
@@ -698,6 +706,68 @@ run_tpch() {
698706
bash -c "$SQL_CARGO_COMMAND"
699707
}
700708

709+
# Synthesizes two parquet datasets used to measure per-file metadata
710+
# overhead of a wide schema:
711+
#
712+
# - data/wide_schema/wide/ 1024-col events × 256 files (~225 MB)
713+
# - data/wide_schema/narrow/ 8-col events × 256 files (~110 MB)
714+
#
715+
# Both share row count, file count, and per-file row-group shape; only
716+
# schema width differs. No external data source required — gen_wide_data
717+
# synthesizes everything from scratch in ~60 s.
718+
data_wide_schema() {
719+
NUM_FILES=256
720+
ROWS_PER_FILE=50000
721+
WIDTH_FACTOR=128
722+
723+
DST_DIR="${DATA_DIR}/wide_schema"
724+
WIDE_DIR="${DST_DIR}/wide"
725+
NARROW_DIR="${DST_DIR}/narrow"
726+
727+
if [ -d "${WIDE_DIR}" ] && [ "$(ls -A "${WIDE_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
728+
echo " wide parquet exists (${WIDE_DIR})."
729+
else
730+
mkdir -p "${WIDE_DIR}"
731+
echo " synthesizing wide -> ${WIDE_DIR} (factor ${WIDTH_FACTOR}, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
732+
debug_run $CARGO_COMMAND --bin gen_wide_data -- \
733+
--dst-dir "${WIDE_DIR}" \
734+
--width-factor ${WIDTH_FACTOR} \
735+
--num-files ${NUM_FILES} \
736+
--rows-per-file ${ROWS_PER_FILE}
737+
fi
738+
739+
if [ -d "${NARROW_DIR}" ] && [ "$(ls -A "${NARROW_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
740+
echo " narrow parquet exists (${NARROW_DIR})."
741+
else
742+
mkdir -p "${NARROW_DIR}"
743+
echo " synthesizing narrow -> ${NARROW_DIR} (8 base cols, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
744+
debug_run $CARGO_COMMAND --bin gen_wide_data -- \
745+
--dst-dir "${NARROW_DIR}" \
746+
--width-factor 1 \
747+
--num-files ${NUM_FILES} \
748+
--rows-per-file ${ROWS_PER_FILE}
749+
fi
750+
}
751+
752+
# Runs the wide_schema benchmark. Each query has a `subgroup`
753+
# directive that picks up BENCH_SUBGROUP, so we invoke the framework
754+
# twice — once with subgroup=wide (the actual workload) and once with
755+
# subgroup=narrow (the baseline). The wide-only queries (Q02/Q08/Q11/Q12)
756+
# hardcode `subgroup wide`, so they're skipped on the narrow pass.
757+
run_wide_schema() {
758+
echo "Running wide_schema benchmark (wide subgroup)..."
759+
debug_run env BENCH_NAME=wide_schema BENCH_SUBGROUP=wide \
760+
SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
761+
${QUERY:+BENCH_QUERY="${QUERY}"} \
762+
bash -c "$SQL_CARGO_COMMAND"
763+
764+
echo "Running wide_schema benchmark (narrow baseline subgroup)..."
765+
debug_run env BENCH_NAME=wide_schema BENCH_SUBGROUP=narrow \
766+
SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
767+
${QUERY:+BENCH_QUERY="${QUERY}"} \
768+
bash -c "$SQL_CARGO_COMMAND"
769+
}
770+
701771
# Runs the tpch in memory (needs tpch parquet data)
702772
run_tpch_mem() {
703773
SCALE_FACTOR=$1

benchmarks/sql_benchmarks/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,6 +41,7 @@ in the community:
4141
| `taxi` | NYC taxi dataset benchmark |
4242
| `tpcds` | TPC‑DS queries |
4343
| `tpch` | TPC‑H queries |
44+
| `wide_schema` | Small-projection queries on a wide (1024-col, 256-file) synthetic dataset; runs `wide` + `narrow` subgroups for comparison |
4445

4546
# Running Benchmarks
4647

Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
-- Filter on three low-cardinality columns, project four columns,
2+
-- ORDER BY + LIMIT (TopK shortcut). Runs on both wide and narrow
3+
-- datasets via BENCH_SUBGROUP.
4+
5+
name Q01
6+
group wide_schema
7+
subgroup ${BENCH_SUBGROUP:-wide}
8+
9+
load sql_benchmarks/wide_schema/init/load.sql
10+
11+
assert I
12+
SELECT COUNT(*) > 0 from events;
13+
----
14+
true
15+
16+
run
17+
SELECT id, ts, value, text
18+
FROM events
19+
WHERE category = 'c0'
20+
AND flag = 'f0'
21+
AND status = 's0'
22+
ORDER BY ts DESC
23+
LIMIT 100;
24+
25+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,26 @@
1+
-- Same shape as Q01 but filtering on duplicated columns far from the
2+
-- start of the schema (suffix copies _10). Tests whether planning /
3+
-- pruning cost depends on column position. Wide-only — the narrow
4+
-- dataset doesn't have suffix-renamed copies.
5+
6+
name Q02
7+
group wide_schema
8+
subgroup wide
9+
10+
load sql_benchmarks/wide_schema/init/load.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT id, ts, value, text
19+
FROM events
20+
WHERE category_10 = 'c0'
21+
AND flag_10 = 'f0'
22+
AND status_10 = 's0'
23+
ORDER BY ts DESC
24+
LIMIT 100;
25+
26+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
-- Project 1 column with a very tight filter. Stresses minimum-
2+
-- projection pushdown over a wide schema. Runs on both wide and
3+
-- narrow datasets via BENCH_SUBGROUP.
4+
5+
name Q03
6+
group wide_schema
7+
subgroup ${BENCH_SUBGROUP:-wide}
8+
9+
load sql_benchmarks/wide_schema/init/load.sql
10+
11+
assert I
12+
SELECT COUNT(*) > 0 from events;
13+
----
14+
true
15+
16+
run
17+
SELECT value
18+
FROM events
19+
WHERE id = 12345
20+
LIMIT 1;
21+
22+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
-- GROUP BY a duplicated column, single aggregate. Checks aggregator
2+
-- behaviour when the grouping column comes from late in a wide
3+
-- schema. Wide-only — the narrow dataset doesn't have suffix copies.
4+
5+
name Q08
6+
group wide_schema
7+
subgroup wide
8+
9+
load sql_benchmarks/wide_schema/init/load.sql
10+
11+
assert I
12+
SELECT COUNT(*) > 0 from events;
13+
----
14+
true
15+
16+
run
17+
SELECT category_5, COUNT(*) AS c
18+
FROM events
19+
GROUP BY category_5
20+
ORDER BY c DESC
21+
LIMIT 10;
22+
23+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,23 @@
1+
-- Tight filter + small projection without ORDER BY / LIMIT — measures
2+
-- pure filter+project throughput against a wide schema (no TopK
3+
-- shortcut). The filter is tight so the result set stays small.
4+
-- Runs on both wide and narrow datasets via BENCH_SUBGROUP.
5+
6+
name Q10
7+
group wide_schema
8+
subgroup ${BENCH_SUBGROUP:-wide}
9+
10+
load sql_benchmarks/wide_schema/init/load.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
run
18+
SELECT id, ts, value, text
19+
FROM events
20+
WHERE id = 12345
21+
AND category = 'c0';
22+
23+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
-- TopK on a duplicated sort column with LIMIT — tests sort/TopK
2+
-- physical operator over a wide schema. Wide-only — sorts on a
3+
-- suffix copy that doesn't exist in the narrow dataset.
4+
5+
name Q11
6+
group wide_schema
7+
subgroup wide
8+
9+
load sql_benchmarks/wide_schema/init/load.sql
10+
11+
assert I
12+
SELECT COUNT(*) > 0 from events;
13+
----
14+
true
15+
16+
run
17+
SELECT id, ts
18+
FROM events
19+
ORDER BY ts_8 DESC
20+
LIMIT 100;
21+
22+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 25 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,25 @@
1+
-- Same as Q03 but with `expect_plan` — guards against regressions
2+
-- where projection pushdown stops trimming the parquet read down to
3+
-- a single column. Wide-only since the projection-pushdown signal
4+
-- only matters when there are many columns to prune.
5+
6+
name Q12
7+
group wide_schema
8+
subgroup wide
9+
10+
load sql_benchmarks/wide_schema/init/load.sql
11+
12+
assert I
13+
SELECT COUNT(*) > 0 from events;
14+
----
15+
true
16+
17+
expect_plan projection=[value]
18+
19+
run
20+
SELECT value
21+
FROM events
22+
WHERE id = 12345
23+
LIMIT 1;
24+
25+
cleanup sql_benchmarks/wide_schema/init/cleanup.sql

0 commit comments

Comments
 (0)