Skip to content

[DISCUSSION] Extending Partitioning to Support More Variants #21992

@gene-bordegaray

Description

@gene-bordegaray

I’d like to restart the partitioning side of this discussion separately from the dynamic-filter PRs and threads (see #21207 for more background)

The main issue seems to be that DataFusion cannot always represent the physical partitioning that some data sources actually have. In our case, range-partitioned data has had to claim Partitioning::Hash, which lets us avoid repartitioning but makes later optimizer and dynamic-filter decisions brittle due to this false advertisement.

Today we are in this scenario:

Data is actually range-partitioned:
- partition 0: key < 100
- partition 1: 100 <= key < 200
- partition 2: 200 <= key < 300

To mimic this we tell DataFusion our partitioning properties are:
- Partitioning::Hash([key], 3)

Those are not the same partitioning scheme.

The general goal should be that DataFusion should be able to describe physical partitioning truthfully, be able to compare two partitionings for compatibility, and use that information only when compatibility is proven.

This is well shown in dynamic filtering:

Build side partitioning:
  p0: key < 100
  p1: 100 <= key < 200
  p2: 200 <= key < 300

Probe side partitioning:
  p0: key < 100
  p1: 100 <= key < 200 
  p2: 200 <= key < 300

These are compatible as every partition X on the build side is compatible with partition X on the probe side. With this information, optimizations can safely use partition-local behavior:

if partitioning is compatible:
  use partition-local filter (more selective)
else:
  use global/safe fallback

A possible direction is to evolve Partitioning toward an extensible abstraction, such as a PhysicalPartitioning trait, and add range partitioning as the first use case. Longer term, built-ins like hash partitioning could move behind the same abstraction.

This would give us a cleaner foundation for:

Does this direction seem interesting to others in the community nd any thoughts on this proposal?

cc: @NGA-TRAN @alamb @jayshrivastava @gabotechs @adriangb

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions