
Commit 63eb746

adriangb and claude committed
Add wide-schema benchmark suite for measuring per-file metadata overhead
Adds a new sql_benchmarks suite that isolates the wide-schema scan
overhead in selective parquet queries: the regime where most of the
work is loading footers / column-chunk metadata rather than reading
row data, and where that cost scales with the number of column chunks
in the dataset rather than with the number of columns the query
touches.

The wide_schema suite has two subgroups (selected via BENCH_SUBGROUP):

- wide: 1024 cols x 256 files x 50k rows (~225 MB) — the workload
- narrow: 8 cols x 256 files x 50k rows (~110 MB) — internal baseline,
  only meaningful as a comparison point

Both share row count, file count, and per-file row-group structure;
only schema width differs. All 4 queries run on both subgroups, so
every wide number has a directly comparable narrow baseline.

A new gen_wide_data binary synthesizes both datasets in ~60 s with no
external data source. The 8-column base schema (id, value, count, ts,
category, flag, status, text) carries deterministic data; copies 2..N
produced by the suffix-renamed widening are zero-filled (zero rather
than null, since reader-side null-array shortcuts mute the slowdown
by ~35 %).

Query coverage:

- Q01: filter + project + ORDER BY + LIMIT (TopK shortcut)
- Q02: project 1 column with tight filter + LIMIT 1
- Q03: tight filter + small projection, no sort
- Q04: two low-cardinality string filters + a non-stat-prunable modulo
  predicate for tight selectivity (~0.005 % match rate), project two
  columns, no LIMIT or ORDER BY

For Q04 specifically, cold-start datafusion-cli shows a ~15x slowdown
wide vs narrow; EXPLAIN ANALYZE shows metadata_load_time scaling 141x
while bloom_filter_eval_time and statistics_eval_time stay flat.

bench.sh adds:

- data wide_schema: synthesizes both wide and narrow datasets
- run wide_schema: runs the wide subgroup, then the narrow baseline
  subgroup, for query-by-query comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bb86364 commit 63eb746

10 files changed

Lines changed: 572 additions & 0 deletions


benchmarks/README.md

Lines changed: 53 additions & 0 deletions
@@ -620,6 +620,59 @@ This benchmark is derived from the [TPC-H][1] version
[2]: https://github.com/databricks/tpch-dbgen.git,
[2.17.1]: https://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.1.pdf

## Wide-schema benchmark

`wide_schema` measures the per-file metadata overhead of a wide schema
in selective parquet scans — the regime where most of the work is
loading footers / column-chunk metadata rather than reading row data,
and that cost scales linearly with the number of column chunks in the
dataset rather than with the number of columns the query references.

The suite has two subgroups, selected via `BENCH_SUBGROUP`:

- **`wide`** — runs against a 1024-col synthetic dataset. This is the
  actual workload.
- **`narrow`** — runs the same SQL against an 8-col version of the same
  dataset (same row count, file count, per-file row-group shape).
  This subgroup exists **only as a baseline for the wide subgroup**;
  reading its numbers in isolation tells you very little. The
  per-query wide-vs-narrow ratio is what isolates the schema-width
  cost.

All queries reference only base columns (no suffix-renamed copies),
so each one runs on both subgroups and produces a directly comparable
wide-vs-narrow pair.

The data preparation step (`gen_wide_data`) synthesizes a generic
8-column base schema (`id`, `value`, `count`, `ts`, `category`,
`flag`, `status`, `text`) with deterministic data, then replicates it
128× via suffix renaming (`_2`, `_3`, …) for 1024 columns total —
written across 256 files at 50 k rows per file with one row group per
file and ZSTD(1) compression. Copies 2..128 are zero-filled arrays so
the schema is wide (every column still has its own footer / page
index / column-chunk metadata) but the on-disk size stays around
225 MB. The narrow dataset is written the same way without the suffix
copies. The only variable between wide and narrow is schema width.
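
To confirm the two datasets really differ only in schema width, a
quick check can be run once the `data wide_schema` step below has
finished. A sketch, assuming `datafusion-cli` is on `PATH` (its `-c`
flag executes a statement list and exits, and `information_schema` is
enabled by default; the table names `w` and `n` are throwaway):

```shell
# Count the columns each dataset exposes: expect 1024 for wide, 8 for narrow.
datafusion-cli -c "
CREATE EXTERNAL TABLE w STORED AS PARQUET LOCATION 'data/wide_schema/wide/';
CREATE EXTERNAL TABLE n STORED AS PARQUET LOCATION 'data/wide_schema/narrow/';
SELECT table_name, COUNT(*) AS cols
FROM information_schema.columns
WHERE table_name IN ('w', 'n')
GROUP BY table_name;"
```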
```shell
./benchmarks/bench.sh data wide_schema # synthesizes wide (1024 cols × 256 files) + narrow (8 cols × 256 files), ~60 s, ~335 MB
./benchmarks/bench.sh run wide_schema  # runs both 'wide' and 'narrow' subgroups; compare the per-query times for the slowdown ratio
```
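
A single query can also be isolated for the comparison: the runner
forwards `QUERY` to the framework as `BENCH_QUERY` (see
`run_wide_schema` in `bench.sh`), so a usage sketch looks like:

```shell
# Run only Q04, first on the wide subgroup and then on the narrow baseline.
QUERY=Q04 ./benchmarks/bench.sh run wide_schema
```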
The queries are deliberately small-projection (touch ≤ 4 columns) so
the wide-schema overhead is the dominant signal. Coverage:

- `Q01` — filter + project + `ORDER BY` + `LIMIT` (TopK shortcut)
- `Q02` — project 1 column with a tight filter and `LIMIT 1`
- `Q03` — tight filter + small projection, no sort
- `Q04` — two low-cardinality string filters + a non-stat-prunable
  modulo predicate for tight selectivity, project two columns, no
  `LIMIT` or `ORDER BY`

For cold-start measurements that include planner setup (the regime
where this overhead is most visible), invoke `datafusion-cli`
directly against `data/wide_schema/{wide,narrow}/`.
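
For example, a cold-start sketch of Q04 (each invocation is a fresh
process, so every parquet footer is parsed from scratch; assumes
`datafusion-cli` accepts `-c` to execute a command and exit):

```shell
# Cold-start Q04 against the wide dataset; swap 'wide' for 'narrow' for the baseline.
datafusion-cli -c "
CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'data/wide_schema/wide/';
EXPLAIN ANALYZE SELECT id, ts FROM events
WHERE category = 'c0' AND flag = 'f0' AND id % 1000 = 0;"
```

In the `EXPLAIN ANALYZE` output, `metadata_load_time` is the counter
that separates the two subgroups, while `bloom_filter_eval_time` and
`statistics_eval_time` stay flat as the schema widens.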

## TPCDS

Run the tpcds benchmark.

benchmarks/bench.sh

Lines changed: 70 additions & 0 deletions
@@ -100,6 +100,8 @@ sort_tpch: Benchmark of sorting speed for end-to-end sort queries o
sort_tpch10: Benchmark of sorting speed for end-to-end sort queries on TPC-H dataset (SF=10)
topk_tpch: Benchmark of top-k (sorting with limit) queries on TPC-H dataset (SF=1)
external_aggr: External aggregation benchmark on TPC-H dataset (SF=1)
wide_schema: Small-projection queries on a wide synthetic dataset (1024 cols × 256 files) — measures per-file metadata overhead
             (runs both 'wide' and 'narrow' subgroups: narrow is an internal baseline; the wide-vs-narrow ratio is the signal)

# ClickBench Benchmarks
clickbench_1: ClickBench queries against a single parquet file
@@ -239,6 +241,9 @@ main() {
        tpch_csv10)
            data_tpch "10" "csv"
            ;;
        wide_schema)
            data_wide_schema
            ;;
        tpcds)
            data_tpcds
            ;;
@@ -444,6 +449,9 @@ main() {
        tpch_mem10)
            run_tpch_mem "10"
            ;;
        wide_schema)
            run_wide_schema
            ;;
        tpcds)
            run_tpcds
            ;;
@@ -698,6 +706,68 @@ run_tpch() {
    bash -c "$SQL_CARGO_COMMAND"
}

# Synthesizes two parquet datasets used to measure per-file metadata
# overhead of a wide schema:
#
#   - data/wide_schema/wide/    1024-col events × 256 files (~225 MB)
#   - data/wide_schema/narrow/  8-col events × 256 files (~110 MB)
#
# Both share row count, file count, and per-file row-group shape; only
# schema width differs. No external data source required — gen_wide_data
# synthesizes everything from scratch in ~60 s.
data_wide_schema() {
    NUM_FILES=256
    ROWS_PER_FILE=50000
    WIDTH_FACTOR=128

    DST_DIR="${DATA_DIR}/wide_schema"
    WIDE_DIR="${DST_DIR}/wide"
    NARROW_DIR="${DST_DIR}/narrow"

    if [ -d "${WIDE_DIR}" ] && [ "$(ls -A "${WIDE_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
        echo " wide parquet exists (${WIDE_DIR})."
    else
        mkdir -p "${WIDE_DIR}"
        echo " synthesizing wide -> ${WIDE_DIR} (factor ${WIDTH_FACTOR}, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
        debug_run $CARGO_COMMAND --bin gen_wide_data -- \
            --dst-dir "${WIDE_DIR}" \
            --width-factor ${WIDTH_FACTOR} \
            --num-files ${NUM_FILES} \
            --rows-per-file ${ROWS_PER_FILE}
    fi

    if [ -d "${NARROW_DIR}" ] && [ "$(ls -A "${NARROW_DIR}" 2>/dev/null | wc -l)" -ge ${NUM_FILES} ]; then
        echo " narrow parquet exists (${NARROW_DIR})."
    else
        mkdir -p "${NARROW_DIR}"
        echo " synthesizing narrow -> ${NARROW_DIR} (8 base cols, ${NUM_FILES} files × ${ROWS_PER_FILE} rows) ..."
        debug_run $CARGO_COMMAND --bin gen_wide_data -- \
            --dst-dir "${NARROW_DIR}" \
            --width-factor 1 \
            --num-files ${NUM_FILES} \
            --rows-per-file ${ROWS_PER_FILE}
    fi
}

# Runs the wide_schema benchmark. Each query has a `subgroup`
# directive that picks up BENCH_SUBGROUP, so we invoke the framework
# twice — once with subgroup=wide (the actual workload) and once with
# subgroup=narrow (the baseline). All four queries default to
# `subgroup ${BENCH_SUBGROUP:-wide}`, so each runs in both passes.
run_wide_schema() {
    echo "Running wide_schema benchmark (wide subgroup)..."
    debug_run env BENCH_NAME=wide_schema BENCH_SUBGROUP=wide \
        SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
        ${QUERY:+BENCH_QUERY="${QUERY}"} \
        bash -c "$SQL_CARGO_COMMAND"

    echo "Running wide_schema benchmark (narrow baseline subgroup)..."
    debug_run env BENCH_NAME=wide_schema BENCH_SUBGROUP=narrow \
        SIMULATE_LATENCY="${SIMULATE_LATENCY}" \
        ${QUERY:+BENCH_QUERY="${QUERY}"} \
        bash -c "$SQL_CARGO_COMMAND"
}

# Runs the tpch in memory (needs tpch parquet data)
run_tpch_mem() {
    SCALE_FACTOR=$1

benchmarks/sql_benchmarks/README.md

Lines changed: 1 addition & 0 deletions
@@ -41,6 +41,7 @@ in the community:
| `taxi` | NYC taxi dataset benchmark |
| `tpcds` | TPC‑DS queries |
| `tpch` | TPC‑H queries |
| `wide_schema` | Small-projection queries on a wide (1024-col, 256-file) synthetic dataset; runs `wide` + `narrow` subgroups for comparison |

# Running Benchmarks

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
-- Filter on three low-cardinality columns, project four columns,
-- ORDER BY + LIMIT (TopK shortcut). Runs on both wide and narrow
-- datasets via BENCH_SUBGROUP.

name Q01
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT id, ts, value, text
FROM events
WHERE category = 'c0'
  AND flag = 'f0'
  AND status = 's0'
ORDER BY ts DESC
LIMIT 100;

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
-- Project 1 column with a very tight filter. Stresses minimum-
-- projection pushdown over a wide schema. Runs on both wide and
-- narrow datasets via BENCH_SUBGROUP.

name Q02
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT value
FROM events
WHERE id = 12345
LIMIT 1;

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
-- Tight filter + small projection without ORDER BY / LIMIT — measures
-- pure filter+project throughput against a wide schema (no TopK
-- shortcut). The filter is tight so the result set stays small.
-- Runs on both wide and narrow datasets via BENCH_SUBGROUP.

name Q03
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT id, ts, value, text
FROM events
WHERE id = 12345
  AND category = 'c0';

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
-- Two low-cardinality string filters + a non-stat-prunable modulo
-- predicate for tight selectivity (~0.005 % match rate), project two
-- columns, no LIMIT, no ORDER BY. Runs on both wide and narrow
-- datasets via BENCH_SUBGROUP.

name Q04
group wide_schema
subgroup ${BENCH_SUBGROUP:-wide}

load sql_benchmarks/wide_schema/init/load.sql

assert I
SELECT COUNT(*) > 0 from events;
----
true

run
SELECT id, ts
FROM events
WHERE category = 'c0'
  AND flag = 'f0'
  AND id % 1000 = 0;

cleanup sql_benchmarks/wide_schema/init/cleanup.sql
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
DROP TABLE IF EXISTS events;
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
-- Registers the events table, picking the dataset based on BENCH_SUBGROUP:
--
--   BENCH_SUBGROUP=wide   → 1024-col synthetic dataset (the actual benchmark)
--   BENCH_SUBGROUP=narrow → 8-col baseline (companion only — meaningful
--                           only when compared to the wide numbers)
CREATE EXTERNAL TABLE events STORED AS PARQUET LOCATION 'data/wide_schema/${BENCH_SUBGROUP:-wide}/';
