feat: add sort pushdown benchmark and SLT tests #21213
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -106,6 +106,10 @@ clickbench_partitioned: ClickBench queries against partitioned (100 files) parqu | |
| clickbench_pushdown: ClickBench queries against partitioned (100 files) parquet w/ filter_pushdown enabled | ||
| clickbench_extended: ClickBench \"inspired\" queries against a single parquet (DataFusion specific) | ||
|
|
||
| # Sort Pushdown Benchmarks | ||
| sort_pushdown: Sort pushdown baseline (no WITH ORDER) on TPC-H data (SF=1) | ||
| sort_pushdown_sorted: Sort pushdown with WITH ORDER — tests sort elimination on non-overlapping files | ||
|
|
||
| # Sorted Data Benchmarks (ORDER BY Optimization) | ||
| clickbench_sorted: ClickBench queries on pre-sorted data using prefer_existing_sort (tests sort elimination optimization) | ||
|
|
||
|
|
@@ -309,6 +313,10 @@ main() { | |
| # same data as for tpch | ||
| data_tpch "1" "parquet" | ||
| ;; | ||
| sort_pushdown|sort_pushdown_sorted) | ||
| # same data as for tpch | ||
| data_tpch "1" "parquet" | ||
| ;; | ||
| sort_tpch) | ||
| # same data as for tpch | ||
| data_tpch "1" "parquet" | ||
|
|
@@ -509,6 +517,12 @@ main() { | |
| external_aggr) | ||
| run_external_aggr | ||
| ;; | ||
| sort_pushdown) | ||
| run_sort_pushdown | ||
| ;; | ||
| sort_pushdown_sorted) | ||
| run_sort_pushdown_sorted | ||
| ;; | ||
| sort_tpch) | ||
| run_sort_tpch "1" | ||
| ;; | ||
|
|
@@ -1070,6 +1084,22 @@ run_external_aggr() { | |
| debug_run $CARGO_COMMAND --bin external_aggr -- benchmark --partitions 4 --iterations 5 --path "${TPCH_DIR}" -o "${RESULTS_FILE}" ${QUERY_ARG} | ||
| } | ||
|
|
||
| # Runs the sort pushdown benchmark (without WITH ORDER) | ||
| run_sort_pushdown() { | ||
| TPCH_DIR="${DATA_DIR}/tpch_sf1" | ||
| RESULTS_FILE="${RESULTS_DIR}/sort_pushdown.json" | ||
| echo "Running sort pushdown benchmark (no WITH ORDER)..." | ||
| debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --iterations 5 --path "${TPCH_DIR}" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG} | ||
| } | ||
|
Comment on lines +1090 to +1093
|
||
|
|
||
| # Runs the sort pushdown benchmark with WITH ORDER (enables sort elimination) | ||
| run_sort_pushdown_sorted() { | ||
| TPCH_DIR="${DATA_DIR}/tpch_sf1" | ||
| RESULTS_FILE="${RESULTS_DIR}/sort_pushdown_sorted.json" | ||
| echo "Running sort pushdown benchmark (with WITH ORDER)..." | ||
| debug_run $CARGO_COMMAND --bin dfbench -- sort-pushdown --sorted --iterations 5 --path "${TPCH_DIR}" -o "${RESULTS_FILE}" ${QUERY_ARG} ${LATENCY_ARG} | ||
| } | ||
|
|
||
| # Runs the sort integration benchmark | ||
| run_sort_tpch() { | ||
| SCALE_FACTOR=$1 | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| -- Sort elimination: ORDER BY sort key ASC (full scan) | ||
| -- With --sorted: SortExec removed, sequential scan in file order | ||
| -- Without --sorted: full SortExec required | ||
| SELECT l_orderkey, l_partkey, l_suppkey | ||
| FROM lineitem | ||
| ORDER BY l_orderkey |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| -- Sort elimination + limit pushdown | ||
| -- With --sorted: SortExec removed + limit pushed to DataSourceExec | ||
| -- Without --sorted: TopK sort over all data | ||
| SELECT l_orderkey, l_partkey, l_suppkey | ||
| FROM lineitem | ||
| ORDER BY l_orderkey | ||
| LIMIT 100 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| -- Sort elimination: wide projection (all columns) | ||
| -- Tests sort elimination benefit with larger row payload | ||
| SELECT * | ||
| FROM lineitem | ||
| ORDER BY l_orderkey |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| -- Sort elimination + limit: wide projection | ||
| SELECT * | ||
| FROM lineitem | ||
| ORDER BY l_orderkey | ||
| LIMIT 100 |
`sort_pushdown|sort_pushdown_sorted` reuses `data_tpch "1" "parquet"`, which currently generates parquet with `--parts=1`. That means the benchmark will typically run against a single lineitem parquet file, not multiple non-overlapping files as described. If the intent is to benchmark cross-file sort elimination, consider generating TPCH parquet with multiple parts for this benchmark (and documenting that the files are expected to be sorted by `l_orderkey`).
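One way to make this concern visible at run time is a guard that counts the parquet files backing `lineitem` before the benchmark starts. The sketch below is only an illustration: `count_parquet_files` and `check_multi_file_table` are hypothetical helpers that do not exist in `bench.sh`, and the `${TPCH_DIR}/lineitem` directory layout is an assumption about how the generated data is laid out on disk.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of bench.sh): prints the number of
# *.parquet files found under the given table directory.
count_parquet_files() {
    local table_dir="$1"
    # find emits one line per matching file; wc -l counts those lines.
    # A missing directory yields 0 rather than an error message.
    find "${table_dir}" -type f -name '*.parquet' 2>/dev/null | wc -l | tr -d ' '
}

# Hypothetical helper: returns 0 when the table is split across more than
# one parquet file, and 1 (with a warning on stderr) otherwise.
check_multi_file_table() {
    local table_dir="$1"
    local n
    n=$(count_parquet_files "${table_dir}")
    if [ "${n}" -le 1 ]; then
        echo "warning: ${table_dir} has ${n} parquet file(s); cross-file sort elimination is not exercised" >&2
        return 1
    fi
    echo "${table_dir}: ${n} parquet files"
}
```

A call such as `check_multi_file_table "${TPCH_DIR}/lineitem"` could be placed at the top of `run_sort_pushdown_sorted`, assuming that directory layout holds. The check only counts files; it does not verify that their `l_orderkey` ranges are sorted or non-overlapping.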