[DISCUSSION] Extending Partitioning to Support More Variants

I’d like to restart the partitioning side of this discussion separately from the dynamic-filter PRs and threads (see #21207 for more background)

The main issue seems to be that DataFusion cannot always represent the physical partitioning that some data sources actually have. In our case, range-partitioned data has had to claim `Partitioning::Hash`, which lets us avoid repartitioning but makes later optimizer and dynamic-filter decisions brittle due to this false advertisement.

Today we are in this scenario:

```text
Data is actually range-partitioned:
- partition 0: key < 100
- partition 1: 100 <= key < 200
- partition 2: 200 <= key < 300

To mimic this we tell DataFusion our partitioning properties are:
- Partitioning::Hash([key], 3)
```

Those are not the same partitioning scheme.

The general goal should be that DataFusion should be able to describe physical partitioning truthfully, be able to compare two partitionings for compatibility, and use that information only when compatibility is proven.

This is well shown in dynamic filtering:

```text
Build side partitioning:
  p0: key < 100
  p1: 100 <= key < 200
  p2: 200 <= key < 300

Probe side partitioning:
  p0: key < 100
  p1: 100 <= key < 200 
  p2: 200 <= key < 300
```

These are compatible as every partition X on the build side is compatible with partition X on the probe side. With this information, optimizations can safely use partition-local behavior:

```text
if partitioning is compatible:
  use partition-local filter (more selective)
else:
  use global/safe fallback
```

A possible direction is to evolve Partitioning toward an extensible abstraction, such as a `PhysicalPartitioning` trait, and add range partitioning as the first use case. Longer term, built-ins like hash partitioning could move behind the same abstraction.

This would give us a cleaner foundation for:
- range/file partitioned sources
- partition-local dynamic filters when both sides share the same partition map #21832

Does this direction seem interesting to others in the community nd any thoughts on this proposal?

cc: @NGA-TRAN @alamb @jayshrivastava @gabotechs @adriangb 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSSION] Extending Partitioning to Support More Variants #21992

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[DISCUSSION] Extending Partitioning to Support More Variants #21992

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions