Minimal query-aware statistics request hooks (extracted from #21996)#22300

Draft
adriangb wants to merge 6 commits into
apache:mainfrom
pydantic:statistics-request-hooks

@adriangb
Contributor

Which issue does this PR close?

This is not a replacement for #21996 — it is a minimal subset of it, carved out so the feature can be discussed/merged in smaller pieces.

Rationale for this change

#21996 ("Query-aware statistics requests via ScanArgs / ScanResult") is a full vertical slice: new statistics types, request threading optimizer → planner → provider, a built-in RequestStatistics optimizer rule, and a consumer integration (FilePruner / ListingTable).

This PR extracts only the framework hooks — just enough that the rest can be implemented entirely outside of DataFusion. A third party can write their own optimizer rule to derive statistics requests, and their own TableProvider to consume them, without DataFusion shipping any rule or consumer of its own.

In stock DataFusion nothing observable changes: no rule populates the new field, and the built-in providers ignore it.
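
The request path described above can be modelled in a few lines. This is a minimal sketch using stand-in types, not DataFusion's real definitions: the variants of `StatisticsRequest`, the shape of `ScanArgs`, and the helper names `request_sort_key_bounds`, `plan_scan`, and `RecordingProvider` are all assumptions made for illustration. It only shows the division of labor: an external rule annotates the scan, the planner forwards the annotation, and an external provider reads it.

```rust
// Stand-in types only: they model the shape of the hooks, not the real API.

/// Hypothetical subset of the request vocabulary.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum StatisticsRequest {
    RowCount,
    Min(String),
    Max(String),
}

/// Stand-in for the logical scan node.
#[derive(Debug, Default)]
pub struct TableScan {
    pub table_name: String,
    /// Advisory; stays empty unless a rule fills it in.
    pub statistics_requests: Vec<StatisticsRequest>,
}

/// Stand-in for `ScanArgs`: borrows the requests for the scan call.
#[derive(Debug, Default)]
pub struct ScanArgs<'a> {
    statistics_requests: &'a [StatisticsRequest],
}

impl<'a> ScanArgs<'a> {
    pub fn with_statistics_requests(mut self, requests: &'a [StatisticsRequest]) -> Self {
        self.statistics_requests = requests;
        self
    }

    pub fn statistics_requests(&self) -> &'a [StatisticsRequest] {
        self.statistics_requests
    }
}

/// Role 1 (external optimizer rule): ask for Min/Max of a sort key.
pub fn request_sort_key_bounds(mut scan: TableScan, sort_key: &str) -> TableScan {
    scan.statistics_requests = vec![
        StatisticsRequest::Min(sort_key.to_string()),
        StatisticsRequest::Max(sort_key.to_string()),
    ];
    scan
}

/// Role 2 (external provider): records whatever requests it receives.
#[derive(Debug, Default)]
pub struct RecordingProvider {
    pub seen: Vec<StatisticsRequest>,
}

impl RecordingProvider {
    pub fn scan_with_args(&mut self, args: ScanArgs<'_>) {
        self.seen = args.statistics_requests().to_vec();
    }
}

/// The planner's only job in this model: copy the annotation into the scan args.
pub fn plan_scan(scan: &TableScan, provider: &mut RecordingProvider) {
    let args = ScanArgs::default().with_statistics_requests(&scan.statistics_requests);
    provider.scan_with_args(args);
}
```

In this model, planning an unannotated scan leaves `RecordingProvider::seen` empty, which mirrors the "nothing observable changes" claim for stock behavior.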

What changes are included in this PR?

Five small, independently reviewable commits:

  1. refactor: add TableScanBuilder, deprecate TableScan::try_new. TableScan::try_new takes five positional args, and bare TableScan { .. } literals are fragile to field additions. Introduce TableScanBuilder (with From<TableScan>), move schema derivation into build(), deprecate try_new (it now delegates to the builder), and migrate all in-tree callers. Pure refactor.
  2. feat: add StatisticsRequest / StatisticsValue / SatisfiedStatistics. New public vocabulary types in datafusion-expr-common::statistics. Nothing consumes them yet.
  3. feat: add TableScan::statistics_requests field. An advisory Vec<StatisticsRequest> on TableScan, settable via TableScan::with_statistics_requests / TableScanBuilder. Empty by default; DataFusion's own rules never populate it.
  4. feat: thread statistics requests into ScanArgs. ScanArgs gains statistics_requests; the physical planner threads TableScan::statistics_requests into it so the request reaches TableProvider::scan_with_args.
  5. test: e2e statistics-request flow via a custom optimizer rule. An integration test playing both external roles.
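
To make the builder refactor concrete, here is a minimal sketch of the pattern on a stand-in type. `Scan`, `ScanBuilder`, and their fields are hypothetical; the real `TableScanBuilder` signature is not shown in this description, so nothing below should be read as its API. The sketch only illustrates the three ideas named in the first commit: named setters instead of positional arguments, derived state computed in `build()`, and `From<Scan>` so an existing node can be decomposed, tweaked, and rebuilt.

```rust
/// Stand-in for a logical scan node; not DataFusion's real `TableScan`.
#[derive(Debug, Clone, PartialEq)]
pub struct Scan {
    pub table_name: String,
    pub source_columns: Vec<String>,
    pub projection: Option<Vec<usize>>,
    pub fetch: Option<usize>,
    /// Derived in `build()`: names of the projected columns.
    pub output_columns: Vec<String>,
}

/// Hypothetical builder: named setters instead of positional arguments.
#[derive(Debug, Clone)]
pub struct ScanBuilder {
    table_name: String,
    source_columns: Vec<String>,
    projection: Option<Vec<usize>>,
    fetch: Option<usize>,
}

impl ScanBuilder {
    pub fn new(table_name: impl Into<String>, source_columns: Vec<String>) -> Self {
        Self {
            table_name: table_name.into(),
            source_columns,
            projection: None,
            fetch: None,
        }
    }

    pub fn with_projection(mut self, projection: Option<Vec<usize>>) -> Self {
        self.projection = projection;
        self
    }

    pub fn with_fetch(mut self, fetch: Option<usize>) -> Self {
        self.fetch = fetch;
        self
    }

    /// All derived state (here, the output column names) is computed in one place.
    pub fn build(self) -> Result<Scan, String> {
        let output_columns = match &self.projection {
            Some(indices) => indices
                .iter()
                .map(|&i| {
                    self.source_columns
                        .get(i)
                        .cloned()
                        .ok_or_else(|| format!("projection index {i} out of bounds"))
                })
                .collect::<Result<Vec<_>, _>>()?,
            None => self.source_columns.clone(),
        };
        Ok(Scan {
            table_name: self.table_name,
            source_columns: self.source_columns,
            projection: self.projection,
            fetch: self.fetch,
            output_columns,
        })
    }
}

/// Decompose an existing scan so it can be tweaked and rebuilt.
impl From<Scan> for ScanBuilder {
    fn from(scan: Scan) -> Self {
        Self {
            table_name: scan.table_name,
            source_columns: scan.source_columns,
            projection: scan.projection,
            fetch: scan.fetch,
        }
    }
}
```

A caller that only wants to change one property can write `ScanBuilder::from(scan).with_fetch(Some(10)).build()`, and adding a field to the builder later does not break that call site.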

Deliberately left out vs #21996: the built-in RequestStatistics optimizer rule, the FilePruner / ListingTable consumer integration, the PartitionedFile::satisfied_stats per-file response field, and StatisticsValue::Distribution (which would depend on the now-deprecated Distribution type).

Are these changes tested?

Yes:

  • datafusion-expr-common: a unit test that StatisticsRequest is hashable / usable as a HashMap key.
  • datafusion/core/tests/user_defined/statistics_requests.rs: an end-to-end integration test where a custom OptimizerRule annotates TableScan and a custom TableProvider asserts the requests reach scan_with_args — plus a test that without such a rule the provider sees an empty request list.
  • All existing datafusion-expr / datafusion-optimizer / datafusion-proto tests pass against the TableScanBuilder refactor.
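
The hashability property checked by the unit test matters because the response side is keyed by request. The sketch below shows the idea with stand-in types: the variants, the `StatisticsValue` payloads, the type alias, and the `answer_cheaply` helper are assumptions for illustration, not the definitions added by this PR. A provider answers only what it can supply cheaply and leaves the rest out, so the map stays sparse.

```rust
use std::collections::HashMap;

/// Hypothetical request vocabulary; deriving `Eq` and `Hash` makes it usable as a map key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum StatisticsRequest {
    RowCount,
    Min(String),
    Max(String),
    NullCount(String),
}

/// Hypothetical response values.
#[derive(Debug, Clone, PartialEq)]
pub enum StatisticsValue {
    Count(u64),
    Int(i64),
}

/// Sparse answers: a request with no entry was simply not satisfied.
pub type SatisfiedStatistics = HashMap<StatisticsRequest, StatisticsValue>;

/// A toy provider that knows the row count and the bounds of column `id` only.
pub fn answer_cheaply(requests: &[StatisticsRequest]) -> SatisfiedStatistics {
    let mut satisfied = SatisfiedStatistics::new();
    for request in requests {
        let value = match request {
            StatisticsRequest::RowCount => Some(StatisticsValue::Count(3)),
            StatisticsRequest::Min(col) if col.as_str() == "id" => Some(StatisticsValue::Int(1)),
            StatisticsRequest::Max(col) if col.as_str() == "id" => Some(StatisticsValue::Int(3)),
            // Anything else would require reading data, so it is left unanswered.
            _ => None,
        };
        if let Some(value) = value {
            satisfied.insert(request.clone(), value);
        }
    }
    satisfied
}
```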

Are there any user-facing changes?

Yes — this needs the api change label:

  • New public types StatisticsRequest, StatisticsValue, SatisfiedStatistics (re-exported via datafusion_expr::statistics).
  • New TableScanBuilder; TableScan::try_new is deprecated (still works, delegates to the builder).
  • TableScan gains a new public field statistics_requests — this breaks exhaustive TableScan { .. } struct literals downstream (the recommended fix is TableScanBuilder).
  • ScanArgs gains with_statistics_requests / statistics_requests.
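
For downstream authors, the struct-literal break looks like this in miniature. The `Scan` type and the `with_fetch` helper are hypothetical stand-ins rather than DataFusion code; the point is only that a literal listing every field stops compiling when a public field is added, while struct update syntax (`..scan`) carries new fields through unchanged.

```rust
/// Stand-in for a public struct that downstream code constructs directly.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Scan {
    pub table_name: String,
    pub fetch: Option<usize>,
    // Newly added field. A literal written before it existed,
    // `Scan { table_name, fetch }`, now fails with error E0063 (missing field).
    pub statistics_requests: Vec<String>,
}

/// Rebuild a scan with a different fetch limit using struct update syntax.
/// Fields not named here, including ones added in the future, are moved
/// over from `scan` without this function having to list them.
pub fn with_fetch(scan: Scan, fetch: Option<usize>) -> Scan {
    Scan { fetch, ..scan }
}
```

A builder gives the same protection to code that constructs a scan from scratch, which is why it is the recommended migration path.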

🤖 Generated with Claude Code

adriangb and others added 5 commits May 17, 2026 00:42
`TableScan::try_new` takes five positional arguments and bare
`TableScan { .. }` struct literals are scattered across the codebase,
making both fragile to field additions.

Introduce `TableScanBuilder` (with `From<TableScan>`, so an existing
scan can be decomposed, tweaked, and rebuilt) and move the
schema-derivation logic into `build()`. `TableScan::try_new` is now
deprecated and delegates to the builder; all in-tree callers are
migrated to the builder. Pure refactor, no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a query-aware statistics vocabulary to `datafusion-expr-common`:

- `StatisticsRequest` — a statistic a caller would like a provider to
  supply if it can do so cheaply (Min/Max/NullCount/DistinctCount/Sum/
  ByteSize per column, plus RowCount and TotalByteSize).
- `StatisticsValue` — the response value paired 1:1 with a request.
- `SatisfiedStatistics` — a sparse `HashMap<StatisticsRequest,
  StatisticsValue>` of provider answers.

These are intentionally just a vocabulary; nothing in DataFusion
populates or consumes them yet. They are re-exported via
`datafusion_expr::statistics`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add an advisory `statistics_requests: Vec<StatisticsRequest>` field to
`TableScan`. A custom optimizer rule can attach the statistics the
surrounding plan shape would benefit from (e.g. Min/Max for sort keys)
via `TableScan::with_statistics_requests` or the new
`TableScanBuilder::with_statistics_requests`; the physical planner will
thread them into the table provider (next commit).

The field is empty by default and DataFusion's own rules never populate
it. `Debug`/`PartialEq`/`Eq`/`Hash`/`PartialOrd` for `TableScan` are
left unchanged — it is advisory metadata, not part of plan identity.
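
As an illustration of "advisory metadata, not part of plan identity", the sketch below hand-writes `PartialEq` and `Hash` for a stand-in struct so that the advisory field is skipped. The `Scan` type, its fields, and `hash_of` are hypothetical; this is one way to get the described behavior, not a copy of the actual `TableScan` impls.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in scan node with one advisory field.
#[derive(Debug, Clone)]
pub struct Scan {
    pub table_name: String,
    pub fetch: Option<usize>,
    /// Advisory only: deliberately excluded from equality and hashing below.
    pub statistics_requests: Vec<String>,
}

impl PartialEq for Scan {
    fn eq(&self, other: &Self) -> bool {
        // Compare identity-bearing fields only.
        self.table_name == other.table_name && self.fetch == other.fetch
    }
}

impl Eq for Scan {}

impl Hash for Scan {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Must hash the same fields that `eq` compares.
        self.table_name.hash(state);
        self.fetch.hash(state);
    }
}

/// Helper to observe the hash of a scan.
pub fn hash_of(scan: &Scan) -> u64 {
    let mut hasher = DefaultHasher::new();
    scan.hash(&mut hasher);
    hasher.finish()
}
```

With impls like these, two scans that differ only in their requests compare equal and hash identically, so annotating a scan does not make it look like a different plan node.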

`map_expressions` in `tree_node.rs` is rewritten to rebuild `TableScan`
via `..scan` instead of an exhaustive destructure, so it carries this
(and any future) field through untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a `statistics_requests` field to `ScanArgs` (with
`with_statistics_requests` / `statistics_requests` accessors) and have
the physical planner thread `TableScan::statistics_requests` into it.

This completes the request-side path: a custom optimizer rule annotates
`TableScan`, and the request reaches a custom `TableProvider` in
`scan_with_args`. DataFusion's own providers ignore the field; the
default `ScanArgs` value is an empty slice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a self-contained integration test that plays both external roles:
a custom `OptimizerRule` annotates each `TableScan` with
`StatisticsRequest`s, and a custom `TableProvider` records the
`ScanArgs::statistics_requests` it receives in `scan_with_args`.

This demonstrates the request-side hooks are sufficient to build the
feature entirely outside of DataFusion. A second test confirms that
without such a rule the provider sees an empty request list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@github-actions github-actions Bot added the logical-expr, optimizer, core, catalog, and proto labels May 17, 2026
@github-actions

github-actions Bot commented May 17, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion v53.1.0 (current)
       Built [  95.374s] (current)
     Parsing datafusion v53.1.0 (current)
      Parsed [   0.032s] (current)
    Building datafusion v53.1.0 (baseline)
       Built [  92.196s] (baseline)
     Parsing datafusion v53.1.0 (baseline)
      Parsed [   0.032s] (baseline)
    Checking datafusion v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.657s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 190.197s] datafusion
    Building datafusion-catalog v53.1.0 (current)
       Built [  36.293s] (current)
     Parsing datafusion-catalog v53.1.0 (current)
      Parsed [   0.025s] (current)
    Building datafusion-catalog v53.1.0 (baseline)
       Built [  36.403s] (baseline)
     Parsing datafusion-catalog v53.1.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-catalog v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.160s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  74.383s] datafusion-catalog
    Building datafusion-expr v53.1.0 (current)
       Built [  25.860s] (current)
     Parsing datafusion-expr v53.1.0 (current)
      Parsed [   0.069s] (current)
    Building datafusion-expr v53.1.0 (baseline)
       Built [  25.753s] (baseline)
     Parsing datafusion-expr v53.1.0 (baseline)
      Parsed [   0.070s] (baseline)
    Checking datafusion-expr v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.387s] 222 checks: 220 pass, 2 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2790
  field TableScan.statistics_requests in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2790

--- failure type_method_marked_deprecated: type method #[deprecated] added ---

Description:
A type method is now #[deprecated]. Downstream crates will get a compiler warning when using this method.
        ref: https://doc.rust-lang.org/reference/attributes/diagnostics.html#the-deprecated-attribute
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/type_method_marked_deprecated.ron

Failed in:
  method datafusion_expr::logical_plan::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2866
  method datafusion_expr::TableScan::try_new in /home/runner/work/datafusion/datafusion/datafusion/expr/src/logical_plan/plan.rs:2866

     Summary semver requires new major version: 1 major and 1 minor checks failed
    Finished [  54.570s] datafusion-expr
    Building datafusion-expr-common v53.1.0 (current)
       Built [  18.464s] (current)
     Parsing datafusion-expr-common v53.1.0 (current)
      Parsed [   0.017s] (current)
    Building datafusion-expr-common v53.1.0 (baseline)
       Built [  18.285s] (baseline)
     Parsing datafusion-expr-common v53.1.0 (baseline)
      Parsed [   0.017s] (baseline)
    Checking datafusion-expr-common v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.219s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  38.144s] datafusion-expr-common
    Building datafusion-optimizer v53.1.0 (current)
       Built [  26.166s] (current)
     Parsing datafusion-optimizer v53.1.0 (current)
      Parsed [   0.027s] (current)
    Building datafusion-optimizer v53.1.0 (baseline)
       Built [  25.869s] (baseline)
     Parsing datafusion-optimizer v53.1.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.191s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  53.487s] datafusion-optimizer
    Building datafusion-proto v53.1.0 (current)
       Built [  52.588s] (current)
     Parsing datafusion-proto v53.1.0 (current)
      Parsed [   0.130s] (current)
    Building datafusion-proto v53.1.0 (baseline)
       Built [  52.596s] (baseline)
     Parsing datafusion-proto v53.1.0 (baseline)
      Parsed [   0.130s] (baseline)
    Checking datafusion-proto v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   1.963s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 109.509s] datafusion-proto

@github-actions github-actions Bot added the auto detected api change label May 17, 2026
@adriangb adriangb added the api change label May 17, 2026

`cargo doc -D warnings` (CI) rejected two intra-doc links:

- `crate::TableProvider` in `datafusion-expr` — `TableProvider` lives in
  `datafusion-catalog` and is not reachable from `datafusion-expr`;
  demoted to a plain code span.
- `Self::statistics_requests` on `ScanArgs` was ambiguous between the
  private field and the public accessor; disambiguated to
  `Self::statistics_requests()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>