
Commit 63eb746

adriangb and claude committed
Add wide-schema benchmark suite for measuring per-file metadata overhead
Adds a new sql_benchmarks suite that isolates the wide-schema scan
overhead in selective parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and where that cost scales with the number of column chunks
in the dataset rather than with the number of columns the query
touches.

The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):

- wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload
- narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal baseline,
  only meaningful as a comparison point

Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups, so
every wide number has a directly comparable narrow baseline.

A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; copies 2..N
produced by the suffix-renamed widening are zero-filled (zero rather
than null, since reader-side null-array shortcuts mute the slowdown
by ~35 %).

Query coverage:

- Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02: project 1 column with tight filter + LIMIT 1
- Q03: tight filter + small projection, no sort
- Q04: two low-cardinality string filters + a non-stat-prunable modulo
  predicate for tight selectivity (~0.005 % match rate), project two
  columns, no LIMIT or ORDER BY

For Q04 specifically, cold-start datafusion-cli shows a ~15x slowdown
wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x
while bloom_filter_eval_time and statistics_eval_time stay flat.

bench.sh adds:

- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow baseline
  subgroup, for query-by-query comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bb86364 commit 63eb746

10 files changed

Lines changed: 572 additions & 0 deletions


benchmarks/README.md

Lines changed: 53 additions & 0 deletions
@@ -620,6 +620,59 @@ This benchmark is derived from the [TPC-H][1] version
[2]: https://github.com/databricks/tpch-dbgen.git,
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

## Wide-schema benchmark

`wide_schema` measures the per-file metadata overhead of a wide schema
in selective parquet scans — the regime where most of the work is
loading footers / column-chunk metadata rather than reading row data,
and that cost scales linearly with the number of column chunks in the
dataset rather than with the number of columns the query references.

The suite has two subgroups, selected via `BENCH_SUBGROUP`:

- **`wide`** — runs against a 1024-col synthetic dataset. This is the
  actual workload.
- **`narrow`** — runs the same SQL against an 8-col version of the same
  dataset (same row count, file count, per-file row-group shape).
  This subgroup exists **only as a baseline for the wide subgroup**;
  reading its numbers in isolation tells you very little. The
  per-query wide-vs-narrow ratio is what isolates the schema-width
  cost.

All queries reference only base columns (no suffix-renamed copies),
so each one runs on both subgroups and produces a directly comparable
wide-vs-narrow pair.

The data preparation step (`gen_wide_data`) synthesizes a generic
8-column base schema (`id`, `value`, `count`, `ts`, `category`,
`flag`, `status`, `text`) with deterministic data, then replicates it
128× via suffix renaming (`_2`, `_3`, …) for 1024 columns total —
written across 256 files at 50 k rows per file with one row group per
file and ZSTD(1) compression. Copies 2..128 are zero-filled arrays so
the schema is wide (every column still has its own footer / page
index / column-chunk metadata) but the on-disk size stays around
225 MB. The narrow dataset is written the same way without the suffix
copies. The only variable between wide and narrow is schema width.
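
To confirm the two datasets really differ only in schema width, a
quick check can be run once the `data wide_schema` step below has
finished. A sketch, assuming `datafusion-cli` is on `PATH` (its `-c`
flag executes a statement list and exits, and `information_schema` is
enabled by default; the table names `w` and `n` are throwaway):

```shell
# Count the columns each dataset exposes: expect 1024 for wide, 8 for narrow.
datafusion-cli -c "
CREATE EXTERNAL TABLE w STORED AS PARQUET LOCATION 'data/wide_schema/wide/';
CREATE EXTERNAL TABLE n STORED AS PARQUET LOCATION 'data/wide_schema/narrow/';
SELECT table_name, COUNT(*) AS cols
FROM information_schema.columns
WHERE table_name IN ('w', 'n')
GROUP BY table_name;"
```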
```shell
./benchmarks/bench.sh data wide_schema # synthesizes wide (1024 cols × 256 files) + narrow (8 cols × 256 files), ~60 s, ~335 MB
./benchmarks/bench.sh run wide_schema  # runs both 'wide' and 'narrow' subgroups; compare the per-query times for the slowdown ratio
```
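
A single query can also be isolated for the comparison: the runner
forwards `QUERY` to the framework as `BENCH_QUERY` (see
`run_wide_schema` in `bench.sh`), so a usage sketch looks like:

```shell
# Run only Q04, first on the wide subgroup and then on the narrow baseline.
QUERY=Q04 ./benchmarks/bench.sh run wide_schema
```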
The queries are deliberately small-projection (touch ≤ 4 columns) so
the wide-schema overhead is the dominant signal. Coverage:

- `Q01` — filter + project + `ORDER BY` + `LIMIT` (TopK shortcut)
- `Q02` — project 1 column with a tight filter and `LIMIT 1`
- `Q03` — tight filter + small projection, no sort
- `Q04` — two low-cardinality string filters + a non-stat-prunable
  modulo predicate for tight selectivity, project two columns, no
  `LIMIT` or `ORDER BY`

For cold-start measurements that include planner setup (the regime
where this overhead is most visible), invoke `datafusion-cli`
directly against `data/wide_schema/{wide,narrow}/`.
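
For example, a cold-start sketch of Q04 (each invocation is a fresh
process, so every parquet footer is parsed from scratch; assumes
`datafusion-cli` accepts `-c` to execute a command and exit):

```shell
# Cold-start Q04 against the wide dataset; swap 'wide' for 'narrow' for the baseline.
datafusion-cli -c "
CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'data/wide_schema/wide/';
EXPLAIN ANALYZE SELECT id, ts FROM events
WHERE category = 'c0' AND flag = 'f0' AND id % 1000 = 0;"
```

In the `EXPLAIN ANALYZE` output, `metadata_load_time` is the counter
that separates the two subgroups, while `bloom_filter_eval_time` and
`statistics_eval_time` stay flat as the schema widens.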

## TPCDS

Run the tpcds benchmark.

benchmarks/bench.sh

Lines changed: 70 additions & 0 deletions
@@ -100,6 +100,8 @@ sort_tpch: Benchmark of sorting speed for end-to-end sort queries o
sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
topk_tpch: Benchmark of top-k (sorting with limit) queries on TPC-H dataset (SF=1)
external_aggr: External aggregation benchmark on TPC-H dataset (SF=1)
wide_schema: Small-projection queries on a wide synthetic dataset (1024 cols × 256 files) — measures per-file metadata overhead
             (runs both 'wide' and 'narrow' subgroups: narrow is an internal baseline; the wide-vs-narrow ratio is the signal)

# ClickBench Benchmarks
clickbench_1: ClickBench queries against a single parquet file
@@ -239,6 +241,9 @@ main() {
        tpch_csv10)
            data_tpch "10" "csv"
            ;;
        wide_schema)
            data_wide_schema
            ;;
        tpcds)
            data_tpcds
            ;;
@@ -444,6 +449,9 @@ main() {
        tpch_mem10)
            run_tpch_mem "10"
            ;;
        wide_schema)
            run_wide_schema
            ;;
        tpcds)
            run_tpcds
            ;;
@@ -698,6 +706,68 @@ run_tpch() {
    bash -c "$SQL_CARGO_COMMAND"
}

# Synthesizes two parquet datasets used to measure per-file metadata
# overhead of a wide schema:
#
#   - data/wide_schema/wide/    1024-col events × 256 files (~225 MB)
#   - data/wide_schema/narrow/  8-col events × 256 files (~110 MB)
#
# Both share row count, file count, and per-file row-group shape; only
# schema width differs. No external data source required — gen_wide_data
# synthesizes everything from scratch in ~60 s.
data_wide_schema() {
    NUM_FILES=256
    ROWS_PER_FILE=50000
    WIDTH_FACTOR=128

    DST_DIR="${DATA_DIR}/wide_schema"
    WIDE_DIR="${DST_DIR}/wide"
    NARROW_DIR="${DST_DIR}/narrow"

    if [ -d "${WIDE_DIR}" ] && [ "$(ls -A "${WIDE_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
        echo " wide parquet exists (${WIDE_DIR})."
    else
        mkdir -p "${WIDE_DIR}"
        echo " synthesizing wide -> ${WIDE_DIR} (factor ${WIDTH_FACTOR}, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
        debug_run $CARGO_COMMAND --bin gen_wide_data -- \
            --dst-dir "${WIDE_DIR}" \
            --width-factor ${WIDTH_FACTOR} \
            --num-files ${NUM_FILES} \
            --rows-per-file ${ROWS_PER_FILE}
    fi

    if [ -d "${NARROW_DIR}" ] && [ "$(ls -A "${NARROW_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
        echo " narrow parquet exists (${NARROW_DIR})."
    else
        mkdir -p "${NARROW_DIR}"
        echo " synthesizing narrow -> ${NARROW_DIR} (8 base cols, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
        debug_run $CARGO_COMMAND --bin gen_wide_data -- \
            --dst-dir "${NARROW_DIR}" \
            --width-factor 1 \
            --num-files ${NUM_FILES} \
            --rows-per-file ${ROWS_PER_FILE}
    fi
}

# Runs the wide_schema benchmark. Each query has a `subgroup`
# directive that picks up BENCH_SUBGROUP, so we invoke the framework
# twice — once with subgroup=wide (the actual workload) and once with
# subgroup=narrow (the baseline). All four queries default to
# `subgroup ${BENCH_SUBGROUP:-wide}`, so each runs in both passes.
run_wide_schema() {
    echo "Running wide_schema benchmark (wide subgroup)..."
    debug_run env BENCH_NAME=wide_schema BENCH_SUBGROUP=wide \
        SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
        ${QUERY:+BENCH_QUERY="${QUERY}"} \
        bash -c "$SQL_CARGO_COMMAND"

    echo "Running wide_schema benchmark (narrow baseline subgroup)..."
    debug_run env BENCH_NAME=wide_schema BENCH_SUBGROUP=narrow \
        SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
        ${QUERY:+BENCH_QUERY="${QUERY}"} \
        bash -c "$SQL_CARGO_COMMAND"
}

# Runs the tpch in memory (needs tpch parquet data)
run_tpch_mem() {
    SCALE_FACTOR=$1

benchmarks/sql_benchmarks/README.md

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ in the community:
| `taxi` | NYC taxi dataset benchmark |
| `tpcds` | TPC‑DS queries |
| `tpch` | TPC‑H queries |
| `wide_schema` | Small-projection queries on a wide (1024-col, 256-file) synthetic dataset; runs `wide` + `narrow` subgroups for comparison |

# Running Benchmarks

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
-- Filter on three low-cardinality columns, project four columns,
-- ORDER BY + LIMIT (TopK shortcut). Runs on both wide and narrow
-- datasets via BENCH_SUBGROUP.

name Q01
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT id, ts, value, text
FROM events
WHERE category = 'c0'
  AND flag = 'f0'
  AND status = 's0'
ORDER BY ts DESC
LIMIT 100;

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
-- Project 1 column with a very tight filter. Stresses minimum-
-- projection pushdown over a wide schema. Runs on both wide and
-- narrow datasets via BENCH_SUBGROUP.

name Q02
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT value
FROM events
WHERE id = 12345
LIMIT 1;

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
-- Tight filter + small projection without ORDER BY / LIMIT — measures
-- pure filter+project throughput against a wide schema (no TopK
-- shortcut). The filter is tight so the result set stays small.
-- Runs on both wide and narrow datasets via BENCH_SUBGROUP.

name Q03
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT id, ts, value, text
FROM events
WHERE id = 12345
  AND category = 'c0';

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
-- Two low-cardinality string filters + a non-stat-prunable modulo
-- predicate for tight selectivity (~0.005 % match rate), project two
-- columns, no LIMIT, no ORDER BY. Runs on both wide and narrow
-- datasets via BENCH_SUBGROUP.

name Q04
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT id, ts
FROM events
WHERE category = 'c0'
  AND flag = 'f0'
  AND id % 1000 = 0;

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
DROP TABLE IF EXISTS events;
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
-- Registers the events table, picking the dataset based on BENCH_SUBGROUP:
--
--   BENCH_SUBGROUP=wide   → 1024-col synthetic dataset (the actual benchmark)
--   BENCH_SUBGROUP=narrow → 8-col baseline (companion only — meaningful
--                           only when compared to the wide numbers)
CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'data/wide_schema/${BENCH_SUBGROUP:-wide}/';
