Skip to content

Minimal query-aware statistics request hooks (extracted from #21996)#22300

Open
adriangb wants to merge 5 commits into
apache:mainfrom
pydantic:statistics-request-hooks
Open

Minimal query-aware statistics request hooks (extracted from #21996)#22300
adriangb wants to merge 5 commits into
apache:mainfrom
pydantic:statistics-request-hooks

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

@adriangb adriangb commented May 17, 2026

Which issue does this PR close?

This is not a replacement for #21996 — it is a minimal subset of it, carved out so the feature can be discussed/merged in smaller pieces.

Rationale for this change

#21996 ("Query-aware statistics requests via ScanArgs / ScanResult") is a full vertical slice: new statistics types, request threading optimizer → planner → provider, a built-in RequestStatistics optimizer rule, and a consumer integration (FilePruner / ListingTable).

This PR extracts only the framework hooks — just enough that the rest can be implemented entirely outside of DataFusion. A third party can write their own optimizer rule to derive statistics requests, and their own TableProvider to consume them, without DataFusion shipping any rule or consumer of its own.

In stock DataFusion nothing observable changes: no rule populates the new field, and the built-in providers ignore it.

What changes are included in this PR?

Five small, independently-reviewable commits:

  1. refactor: add TableScanBuilder, deprecate TableScan::try_newTableScan::try_new takes five positional args and bare TableScan { .. } literals are fragile to field additions. Introduce TableScanBuilder (with From<TableScan>), move schema derivation into build(), deprecate try_new (delegates to the builder), migrate all in-tree callers. Pure refactor.
  2. feat: add StatisticsRequest type — a new public StatisticsRequest enum in datafusion-expr-common::statistics (Min/Max/NullCount/DistinctCount/Sum/ByteSize per column, plus RowCount/TotalByteSize). Nothing consumes it yet.
  3. feat: add TableScan::statistics_requests field — an advisory BTreeSet<StatisticsRequest> on TableScan. A set so request-deriving optimizer rules stay idempotent under fixpoint iteration (a rule inserts its requests; re-running is a no-op and composes with other rules — no per-rule dedup). Empty by default; DataFusion's own rules never populate it.
  4. feat: thread statistics requests into ScanArgsScanArgs gains statistics_requests; the physical planner threads TableScan::statistics_requests into it so the request reaches TableProvider::scan_with_args.
  5. test: e2e statistics-request flow via a custom optimizer rule — an integration test playing both external roles.

Deliberately left out vs #21996, since this PR is request-side only: the built-in RequestStatistics optimizer rule, the FilePruner / ListingTable consumer integration, the PartitionedFile::satisfied_stats per-file response field, and the response-side types (StatisticsValue / SatisfiedStatistics). Those belong with whatever actually wires the response side.

Are these changes tested?

Yes:

  • datafusion/core/tests/user_defined/statistics_requests.rs: an end-to-end integration test where a custom OptimizerRule annotates TableScan and a custom TableProvider asserts the requests reach scan_with_args — plus a test that without such a rule the provider sees an empty request list.
  • All existing datafusion-expr / datafusion-optimizer / datafusion-proto tests pass against the TableScanBuilder refactor.

Are there any user-facing changes?

Yes — this needs the api change label:

  • New public type StatisticsRequest (re-exported via datafusion_expr::statistics).
  • New TableScanBuilder; TableScan::try_new is deprecated (still works, delegates to the builder).
  • TableScan gains a new public field statistics_requests — this breaks exhaustive TableScan { .. } struct literals downstream (the recommended fix is TableScanBuilder).
  • ScanArgs gains with_statistics_requests / statistics_requests.

🤖 Generated with Claude Code

`TableScan::try_new` takes five positional arguments and bare
`TableScan { .. }` struct literals are scattered across the codebase,
making both fragile to field additions.

Introduce `TableScanBuilder` (with `From<TableScan>`, so an existing
scan can be decomposed, tweaked, and rebuilt) and move the
schema-derivation logic into `build()`. `TableScan::try_new` is now
deprecated and delegates to the builder; all in-tree callers are
migrated to the builder. Pure refactor, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate catalog Related to the catalog crate proto Related to proto crate labels May 17, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 17, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  98.843s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.034s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  95.325s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.039s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.615s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 196.839s] datafusion
    Building datafusion-catalog v53.1.0 (current)
       Built [  36.460s] (current)
     Parsing datafusion-catalog v53.1.0 (current)
      Parsed [   0.027s] (current)
    Building datafusion-catalog v53.1.0 (baseline)
       Built [  37.035s] (baseline)
     Parsing datafusion-catalog v53.1.0 (baseline)
      Parsed [   0.026s] (baseline)
    Checking datafusion-catalog v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.141s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  75.534s] datafusion-catalog
    Building datafusion-expr v53.1.0 (current)
       Built [  26.463s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.072s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  26.425s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.072s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.222s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2796
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2796

--- failure type_method_marked_deprecated: type method #[deprecated] added ---

Description:
A type method is now #[deprecated]. Downstream crates will get a compiler warning when using this method.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-deprecated-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/type_method_marked_deprecated.ron

Failed in:
  method datafusion_expr::logical_plan::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2872
  method datafusion_expr::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2872

     Summary semver requires new major version: 1 major and 1 minor checks failed
    Finished [  55.647s] datafusion-expr
    Building datafusion-expr-common v53.1.0 (current)
       Built [  18.826s] (current)
     Parsing datafusion-expr-common v53.1.0 (current)
      Parsed [   0.017s] (current)
    Building datafusion-expr-common v53.1.0 (baseline)
       Built [  18.595s] (baseline)
     Parsing datafusion-expr-common v53.1.0 (baseline)
      Parsed [   0.018s] (baseline)
    Checking datafusion-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.209s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  38.825s] datafusion-expr-common
    Building datafusion-optimizer v53.1.0 (current)
       Built [  26.075s] (current)
     Parsing datafusion-optimizer v53.1.0 (current)
      Parsed [   0.026s] (current)
    Building datafusion-optimizer v53.1.0 (baseline)
       Built [  26.581s] (baseline)
     Parsing datafusion-optimizer v53.1.0 (baseline)
      Parsed [   0.029s] (baseline)
    Checking datafusion-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.172s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  53.904s] datafusion-optimizer
    Building datafusion-proto v53.1.0 (current)
       Built [  53.768s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.132s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  53.670s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.136s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.667s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 111.506s] datafusion-proto

@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 17, 2026
@adriangb adriangb added the api change Changes the API exposed to users of the crate label May 17, 2026
@adriangb adriangb force-pushed the statistics-request-hooks branch from 0ef815a to e2838e8 Compare May 17, 2026 15:40
@adriangb adriangb requested a review from alamb May 17, 2026 15:48
@adriangb
Copy link
Copy Markdown
Contributor Author

cc @asolimando

@adriangb adriangb force-pushed the statistics-request-hooks branch 2 times, most recently from 070700d to ae549d8 Compare May 17, 2026 16:26
adriangb and others added 3 commits May 17, 2026 09:32
Add `StatisticsRequest` to `datafusion-expr-common::statistics` — a small
vocabulary for query-aware statistics: a caller can ask a provider for a
specific statistic (Min/Max/NullCount/DistinctCount/Sum/ByteSize per
column, plus RowCount and TotalByteSize) instead of for a dense
`Statistics` covering every column.

It derives `Ord` so requests can be held in a deduplicating,
deterministically-ordered `BTreeSet` (see the next commit).

It is intentionally just a vocabulary; nothing in DataFusion populates or
consumes it yet. Re-exported via `datafusion_expr::statistics`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add an advisory `statistics_requests: BTreeSet<StatisticsRequest>` field
to `TableScan`. A custom optimizer rule can record the statistics the
surrounding plan shape would benefit from (e.g. Min/Max for sort keys);
the physical planner threads them into the table provider (next commit).

A `BTreeSet` rather than a `Vec`: the optimizer runs rules to a fixpoint,
so the request collection must be idempotent under re-derivation. With a
set a rule simply `insert`s its requests — re-running is a no-op and
composes with other rules — instead of every rule having to dedupe a
`Vec` itself. `BTree` ordering also keeps plans deterministic.

The field is empty by default and DataFusion's own rules never populate
it. It can be set when building a scan via
`TableScanBuilder::with_statistics_requests`, or mutated directly (it is
`pub`, like `TableScan`'s other fields). `Debug`/`PartialEq`/`Eq`/`Hash`/
`PartialOrd` for `TableScan` are left unchanged — it is advisory
metadata, not part of plan identity.

`map_expressions` in `tree_node.rs` is rewritten to rebuild `TableScan`
via `..scan` instead of an exhaustive destructure, so it carries this
(and any future) field through untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a `statistics_requests` field to `ScanArgs` (with
`with_statistics_requests` / `statistics_requests` accessors) and have
the physical planner thread `TableScan::statistics_requests` into it.

This completes the request-side path: a custom optimizer rule annotates
`TableScan`, and the request reaches a custom `TableProvider` in
`scan_with_args`. DataFusion's own providers ignore the field; the
default `ScanArgs` value is an empty slice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the statistics-request-hooks branch from ae549d8 to 9c6bbc9 Compare May 17, 2026 16:32
Add a self-contained integration test that plays both external roles:
a custom `OptimizerRule` annotates each `TableScan` with
`StatisticsRequest`s, and a custom `TableProvider` records the
`ScanArgs::statistics_requests` it receives in `scan_with_args`.

This demonstrates the request-side hooks are sufficient to build the
feature entirely outside of DataFusion. A second test confirms that
without such a rule the provider sees an empty request list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the statistics-request-hooks branch from 9c6bbc9 to d20c5aa Compare May 17, 2026 16:41
@adriangb adriangb marked this pull request as ready for review May 17, 2026 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate auto detected api change Auto detected API change catalog Related to the catalog crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant