Perf: Window topn optimisation by SubhamSinghal · Pull Request #21479 · apache/datafusion

SubhamSinghal · 2026-04-08T17:50:08Z

Which issue does this PR close?

Related to Optimize "per partition" top-k : ROW_NUMBER < 5 / TopK #6899.

Rationale for this change

Queries like SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY pk ORDER BY val) AS rn FROM t) WHERE rn <= K are extremely common in analytics ("top N per group"). The current plan sorts the entire dataset in O(N log N), computes ROW_NUMBER for all rows, then filters. With 10M rows, 1K partitions, and K=3, we sort all 10M rows but keep only 3K.

This PR introduces a PartitionedTopKExec operator that replaces the SortExec and maintains a per-partition TopK heap (reusing DataFusion's existing TopK implementation). Cost drops to O(N log K) time and O(K × P × row_size) memory, where P is the number of distinct partition keys.
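To make the per-partition heap idea concrete, here is a minimal standalone sketch using only the Rust standard library. It is not the PR's PartitionedTopKExec: the function name `top_k_per_partition` and the `(u32, i64)` row shape are assumptions for illustration, and the real operator works on Arrow record batches through DataFusion's TopK.

```rust
use std::collections::{BinaryHeap, HashMap};

/// Keep the `k` smallest `val`s for each partition key in one pass over
/// unsorted input. Hypothetical helper, not DataFusion's API.
fn top_k_per_partition(rows: &[(u32, i64)], k: usize) -> Vec<(u32, i64)> {
    if k == 0 {
        return Vec::new();
    }
    // One max-heap per partition key; the root is the worst row kept so far.
    let mut heaps: HashMap<u32, BinaryHeap<i64>> = HashMap::new();
    for &(pk, val) in rows {
        let heap = heaps.entry(pk).or_default();
        if heap.len() < k {
            heap.push(val);
        } else if let Some(&worst) = heap.peek() {
            // Evict the current worst only when the new row beats it.
            if val < worst {
                heap.pop();
                heap.push(val);
            }
        }
    }
    // Emit in (partition key, order key) order, as the operator is described to do.
    let mut out: Vec<(u32, i64)> = heaps
        .into_iter()
        .flat_map(|(pk, heap)| heap.into_iter().map(move |v| (pk, v)))
        .collect();
    out.sort();
    out
}
```

Each row costs at most one heap push and one pop, so the work per row is O(log K), and at most K values are held per partition key. Only the final output of K × P rows is sorted, rather than all N input rows.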

What changes are included in this PR?

New physical operator: PartitionedTopKExec (physical-plan/src/sorts/partitioned_topk.rs)

Reads unsorted input, groups rows by partition key using RowConverter, feeds sub-batches to a per-partition TopK heap
Emits only the top-K rows per partition in sorted (partition_keys, order_keys) order
Reuses the existing TopK implementation for heap management, sort key comparison, eviction, and batch compaction

New optimizer rule: WindowTopN (physical-optimizer/src/window_topn.rs)

Detects the pattern:

FilterExec(rn <= K)
  [optional ProjectionExec]
    BoundedWindowAggExec(ROW_NUMBER PARTITION BY ... ORDER BY ...)
      SortExec(partition_keys, order_keys)

And replaces it with:

[optional ProjectionExec]
  BoundedWindowAggExec(ROW_NUMBER PARTITION BY ... ORDER BY ...)
    PartitionedTopKExec(fetch=K)

Both FilterExec and SortExec are removed.

Supported predicates: rn <= K, rn < K, K >= rn, K > rn.
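The four predicate shapes can be reduced to a single fetch value. The sketch below shows one way to write that mapping. The `CmpOp` and `RnSide` enums and the function `fetch_for_predicate` are hypothetical names, not the types used in window_topn.rs. Returning `None` for K = 0 under a strict comparison is also an assumption of this sketch.

```rust
/// Comparison operators relevant to the rule (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum CmpOp {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Which side of the comparison holds the ROW_NUMBER column (hypothetical enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum RnSide {
    Left,
    Right,
}

/// Map a predicate between `rn` and the literal `k` to a per-partition fetch.
/// Returns None when the predicate is not an upper bound on `rn`.
fn fetch_for_predicate(op: CmpOp, rn_side: RnSide, k: u64) -> Option<u64> {
    match (rn_side, op) {
        // rn <= K  and  K >= rn  keep K rows per partition.
        (RnSide::Left, CmpOp::LtEq) | (RnSide::Right, CmpOp::GtEq) => Some(k),
        // rn < K  and  K > rn  keep K - 1 rows; None if K is 0.
        (RnSide::Left, CmpOp::Lt) | (RnSide::Right, CmpOp::Gt) => k.checked_sub(1),
        // Lower bounds such as rn > K cannot be served by a top-K heap.
        _ => None,
    }
}
```

For example, `fetch_for_predicate(CmpOp::Lt, RnSide::Left, 3)` yields `Some(2)`, since `rn < 3` keeps row numbers 1 and 2.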

The rule only fires for ROW_NUMBER with a PARTITION BY clause. Global top-K (no PARTITION BY) is already handled by SortExec with fetch.

Config flag: datafusion.optimizer.enable_window_topn (default: true)

Benchmark results (H2O groupby Q8, 10M rows, top-2 per partition):

cargo run --release --example h2o_window_topn_bench

Scenario	Enabled (ms)	Disabled (ms)	Speedup
100 partitions (100K rows/part)	43	174	4.0x
1K partitions (10K rows/part)	71	146	2.1x
10K partitions (1K rows/part)	619	128	0.2x (regression)
100K partitions (100 rows/part)	4368	135	0.03x (regression)

The regression at high partition counts (10K and 100K partitions in the table above) is expected: per-partition TopK overhead (RowConverter, MemoryReservation per instance) dominates when partitions are very numerous with few rows each. For the common case (moderate partition cardinality, 100 to 1K partitions in this benchmark), the optimization provides a 2-4x speedup.
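The Speedup column is simply disabled time divided by enabled time, so any value below 1.0 is a regression. The short sketch below recomputes it from the table's figures. The constant `RESULTS` and both function names are illustrative, not part of the benchmark example in the PR.

```rust
/// (scenario, enabled_ms, disabled_ms), copied from the benchmark table.
const RESULTS: [(&str, f64, f64); 4] = [
    ("100 partitions", 43.0, 174.0),
    ("1K partitions", 71.0, 146.0),
    ("10K partitions", 619.0, 128.0),
    ("100K partitions", 4368.0, 135.0),
];

/// Speedup of the optimized plan; values below 1.0 mean it got slower.
fn speedup(enabled_ms: f64, disabled_ms: f64) -> f64 {
    disabled_ms / enabled_ms
}

/// Scenarios where enabling the rule made the query slower.
fn regressions() -> Vec<&'static str> {
    RESULTS
        .iter()
        .filter(|(_, on, off)| speedup(*on, *off) < 1.0)
        .map(|(name, _, _)| *name)
        .collect()
}
```

With these numbers the crossover lies somewhere between 1K and 10K partitions for a 10M-row input. The table does not pin it down more precisely.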

Are these changes tested?

Yes:

7 unit tests (core/tests/physical_optimizer/window_topn.rs): basic ROW_NUMBER, rn < K, flipped predicates, non-window column filter, config disabled, no partition by, projection between filter and window
5 SLT tests (sqllogictest/test_files/window_topn.slt): correctness verification, EXPLAIN plan validation, rn < K, no-partition-by case, config disabled fallback

Are there any user-facing changes?

No breaking API changes. The optimization is enabled by default and transparent to users. It can be disabled via:

SET datafusion.optimizer.enable_window_topn = false;

2010YOUY01

Thank you — this PR looks really nice.

I took a quick look and left a few suggestions. I’ll review the optimizer rewrite and execution side more carefully later.

2010YOUY01 · 2026-04-09T04:16:48Z

datafusion/common/src/config.rs

+        /// Filter(rn<=K) → Window(ROW_NUMBER) → Sort patterns with a
+        /// PartitionedTopKExec that maintains per-partition heaps, avoiding
+        /// a full sort of the input.
+        pub enable_window_topn: bool, default = true


I suggest defaulting it to false: for large partition counts, the regression seems significant.
As a follow-up, we could detect the input cardinality and automatically choose the right plan.

2010YOUY01 · 2026-04-09T04:23:25Z

datafusion/core/examples/h2o_window_topn_bench.rs

+// specific language governing permissions and limitations
+// under the License.
+
+// Standalone H2O groupby Q8 benchmark: PartitionedTopKExec enabled vs disabled


We could keep this benchmark in this PR, but it would be great to clean it up later.
To make benchmark maintenance easier, we could add queries representing this workload directly to the h2o window benchmark, so that similar benchmarks don't get scattered across multiple places.

datafusion/benchmarks/bench.sh

Line 123 in e1ad871

h2o_small_window: Extended h2oai benchmark with small dataset (1e7 rows) for window, default file format is csv

The issue, though, is that the h2o benchmark currently counts the dataset loading time, so we can't isolate the target executor's processing time. We could add an option to exclude the data loading time later 🤔

2010YOUY01 · 2026-04-09T04:31:12Z

datafusion/physical-optimizer/src/window_topn.rs

+/// - `K >= rn` (flipped) → fetch = K
+/// - `K > rn` (flipped) → fetch = K - 1
+///
+/// # When the Rule Does NOT Fire


It would be great to describe when this rule does apply, rather than focusing on when it does not. This optimization should only trigger for a fairly small set of cases.

2010YOUY01 · 2026-04-09T04:34:08Z

datafusion/physical-optimizer/src/window_topn.rs

+        // Step 1: Match FilterExec at the top
+        let filter = plan.downcast_ref::<FilterExec>()?;
+
+        // Don't handle filters with projections


I'm curious why this case is skipped.

SubhamSinghal · 2026-04-09

The filter's column indices would point to the projected schema, not the window exec's output schema, so our index-based matching for the ROW_NUMBER column would be wrong without resolving the projection mapping. This case is skipped for simplicity for now.
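A rough sketch of what resolving the projection mapping could look like, under a simplifying assumption: the projection is modelled as a slice where entry `i` is `Some(j)` when output column `i` is a plain reference to input column `j`, and `None` when it is a computed expression. Both function names are hypothetical, and this is not how FilterExec stores its projection.

```rust
/// Translate a column index in the projected schema back to the
/// projection's input schema. None means "not a plain column reference".
fn resolve_through_projection(projection: &[Option<usize>], filter_col: usize) -> Option<usize> {
    projection.get(filter_col).copied().flatten()
}

/// Check whether the filter's column really is the ROW_NUMBER column of the
/// window operator, with or without a projection in between.
fn filter_targets_row_number(
    projection: Option<&[Option<usize>]>,
    filter_col: usize,
    rn_col_in_window: usize,
) -> bool {
    let resolved = match projection {
        Some(p) => resolve_through_projection(p, filter_col),
        // No projection: the filter index already refers to the window schema.
        None => Some(filter_col),
    };
    resolved == Some(rn_col_in_window)
}
```

Suppose the window output is `[id, pk, val, rn]` (rn at index 3) and the projection emits `[rn, pk]`. A filter on projected column 0 then resolves to index 3 and matches. Comparing the raw index 0 against 3 would wrongly reject it, and in other layouts could wrongly accept a different column.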

2010YOUY01 · 2026-04-09T04:36:18Z

datafusion/physical-plan/src/sorts/partitioned_topk.rs

+            }
+            DisplayFormatType::TreeRender => {
+                writeln!(f, "fetch={}", self.fetch)?;
+                writeln!(f, "{}", self.expr)


Tree format should also display partition/order expr, and we could also add simple tests for it in sqllogictests like

set datafusion.explain.format = tree; explain ...

2010YOUY01 · 2026-04-09T04:37:13Z

datafusion/physical-plan/src/sorts/partitioned_topk.rs

+    }
+
+    fn required_input_distribution(&self) -> Vec<Distribution> {
+        vec![Distribution::UnspecifiedDistribution]


I think this should require a hash partitioning scheme on the window partition key. The optimizer would use this API for sanity checks during optimization.
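One way to see why the input distribution matters: if rows with the same window partition key are spread over several input streams, a top-K computed independently in each stream can emit more than K rows for that key overall. The standalone simulation below illustrates this with hypothetical helpers (`stream_top_k`, `rows_emitted`). It does not use DataFusion's `Distribution` type and makes no claim about how the surrounding plan actually partitions data.

```rust
use std::collections::BTreeMap;

/// Top-k smallest values per key within ONE input stream
/// (sort-based for brevity; a heap would be used in practice).
fn stream_top_k(stream: &[(u32, i64)], k: usize) -> Vec<(u32, i64)> {
    let mut by_key: BTreeMap<u32, Vec<i64>> = BTreeMap::new();
    for &(pk, val) in stream {
        by_key.entry(pk).or_default().push(val);
    }
    let mut out = Vec::new();
    for (pk, mut vals) in by_key {
        vals.sort();
        vals.truncate(k);
        out.extend(vals.into_iter().map(|v| (pk, v)));
    }
    out
}

/// Route rows into `n` streams with `route(row_index, key)`, run the
/// per-stream top-k, and count the rows emitted across all streams.
fn rows_emitted(
    rows: &[(u32, i64)],
    n: usize,
    k: usize,
    route: impl Fn(usize, u32) -> usize,
) -> usize {
    let mut streams = vec![Vec::new(); n];
    for (i, &(pk, val)) in rows.iter().enumerate() {
        streams[route(i, pk) % n].push((pk, val));
    }
    streams.iter().map(|s| stream_top_k(s, k).len()).sum()
}
```

With two keys of four rows each, two streams, and k = 2, routing by row index (round-robin) emits 8 rows, while routing by key emits the expected 4. Since the rewritten plan no longer has the FilterExec above the window, extra rows like these would not be trimmed later.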

2010YOUY01 · 2026-04-09T04:39:24Z

datafusion/physical-plan/src/sorts/partitioned_topk.rs

+        )?))
+    }
+
+    fn apply_expressions(


Not related to this PR, but I’m curious why this is a required ExecutionPlan API and when it is used, given that different operators can hold expressions for very different purposes 🤔

2010YOUY01 · 2026-04-09T04:48:11Z

datafusion/sqllogictest/test_files/window_topn.slt

+# Tests for Window TopN optimization: PartitionedTopKExec
+
+statement ok
+CREATE TABLE window_topn_t (id INT, pk INT, val INT) AS VALUES


I suggest moving the main test coverage here, instead of keeping it in unit tests across different layers such as optimizer tests. Once we have solid coverage here, it is less likely to get lost during local refactors.

We can also extend the coverage with more edge cases, for example:

predicates such as rn < 2, 2 > rn, etc.

mixing other window expressions with row_number()

empty or overlapping partition / order keys, such as ... OVER (ORDER BY id) or ... OVER (PARTITION BY id ORDER BY id, customer)

different sort options such as ASC, DESC, and NULLS FIRST

the QUALIFY clause https://datafusion.apache.org/user-guide/sql/select.html#qualify-clause

and more

Subham Singhal added 2 commits April 8, 2026 22:42

Benchmark window topn optimisation

38fa07a

Lint fix

52147dd

github-actions bot added the labels optimizer (Optimizer rules), core (Core DataFusion crate), sqllogictest (SQL Logic Tests (.slt)), common (Related to common crate), and physical-plan (Changes to the physical-plan crate) on Apr 8, 2026

2010YOUY01 reviewed Apr 9, 2026

2010YOUY01 changed the title from "Benchmark: Window topn optimisation" to "Perf: Window topn optimisation" on Apr 9, 2026

SubhamSinghal wants to merge 2 commits into apache:main from SubhamSinghal:window-topn-partitioned-topk-exec