Skip to content

feat: Bump DataFusion to 53.1.0#160

Open
mwiewior wants to merge 6 commits into
masterfrom
upgrade-datafusion-53-1
Open

feat: Bump DataFusion to 53.1.0#160
mwiewior wants to merge 6 commits into
masterfrom
upgrade-datafusion-53-1

Conversation

@mwiewior
Copy link
Copy Markdown
Contributor

No description provided.

@mwiewior
Copy link
Copy Markdown
Contributor Author

@codex pls review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@mwiewior
Copy link
Copy Markdown
Contributor Author

@claude pls review

@claude
Copy link
Copy Markdown

claude Bot commented May 17, 2026

Claude finished @mwiewior's task in 2m 12s —— View job


PR Review: feat: Bump DataFusion to 53.1.0

Todo List:

  • Gather context and read changed files
  • Review Cargo.toml dependency changes
  • Review API changes in physical planners and exec files
  • Review example file changes
  • Review test changes
  • Summarize findings

Summary

This PR upgrades DataFusion from 50.3.0 → 53.1.0 (a 3-major-version jump), along with Arrow 56.1.0 → 58.3.0. The changes are largely mechanical API adaptations required by breaking changes in the DataFusion 51–53 release series, plus a pinned rev update for all datafusion-bio-formats git dependencies to a single consistent commit (1848d26). The migration appears correct and complete. A few items are worth noting.


Observations

✅ Correct mechanical adaptations

ExecutionPlan::properties() now returns &Arc<PlanProperties> — all 7 custom exec nodes updated consistently (cluster, complement, count_overlaps, merge, nearest, subtract, pileup, interval_join, annotate, kv_cache, variant_lookup). The pattern is uniform and correct.

statistics()partition_statistics(partition: Option<usize>) (interval_join.rs:569):
The rename is correct, and the added defensive guard is appropriate:

if stats.column_statistics.len() != self.join_schema.fields().len() {
    stats.column_statistics = Statistics::unknown_column(self.join_schema.as_ref());
}

estimate_join_statistics can return mismatched column counts for certain join types; this prevents a panic downstream in DataFusion's stats machinery.

Parquet API changes (cache_builder.rs):

  • SortingColumn moved from parquet::format to parquet::file::metadata — correct
  • SortingColumn::new(...) replaced with struct literal — correct (upstream removed the constructor)
  • set_max_row_group_size(n)set_max_row_group_row_count(Some(n)) — correct API rename
  • Test assertions updated to match (max_row_group_row_count() returning Option<usize>)

can_project API change: .as_ref().as_deref() correctly adapts to the new signature Option<&[usize]>.

hashes_buffer: &mut [u64] (interval_join.rs:910): Good Clippy improvement; &mut [T] is strictly more general than &mut Vec<T>.


⚠️ Items worth attention

1. Hidden behavior fix in from_nested_loop_join (bio_physical_planner.rs:138)

Before this PR, the NestedLoopJoin path silently dropped projections by always passing None:

// before
None,

After, it correctly reads the projection:

// after
join_exec.projection().clone().map(|projection| projection.to_vec()),

This is a correctness fix bundled with the DataFusion upgrade. It's a good change, but it's worth verifying that existing integration tests exercise the nested loop join path with projections — previously failing projections would have silently returned extra columns, and now the behavior is correct but different. Fix this →

2. default-features = false in root Cargo.toml (Cargo.toml:14)

The new dependency spec disables defaults and enables a specific feature set:

datafusion = { version = "53.1.0", default-features = false, features = [
    "nested_expressions", "crypto_expressions", "datetime_expressions",
    "encoding_expressions", "regex_expressions", "string_expressions",
    "unicode_expressions", "parquet", "recursive_protection", "sql"
]}

Features not included that were previously enabled by default: avro, backtrace, compression, yaml. This should be fine for this project, but confirm that downstream consumers (polars-bio) don't rely on the DataFusion compression feature being pulled in transitively. The recursive_protection feature is a good defensive inclusion.

3. New magic-number arguments to BamTableProvider::new()

Every call site now passes three new positional arguments:

BamTableProvider::new(bam_path, None, zero_based, None, true, true, 100, None)
//                                                              ^^^^ ^^^  ^^^
//                                                              new args from datafusion-bio-formats update

The true and 100 are opaque at the call site. Since these come from the upstream datafusion-bio-formats API, documentation there should clarify what they mean (likely some filter/batch-size parameters). The hardcoded 100 as a default in table_function.rs:130 is the most impactful since it affects the public depth() table function. Confirm this is an appropriate default.

4. CLAUDE.md version reference is stale

CLAUDE.md still documents:

- DataFusion: 50.3.0
- Arrow: 56.1.0

These should be updated to 53.1.0 and 58.3.0 respectively. This is low priority but will cause confusion for future contributors.

5. let mut reservationlet reservation (interval_join.rs:612)

This suppresses a legitimate unused-mut warning. Trivial but correct.


Verdict

The migration is mechanically correct and covers all required API surface changes consistently. The partition_statistics defensive guard and the projection fix in from_nested_loop_join are welcome improvements beyond the bare minimum. The main open question is around the BamTableProvider::new() default parameters and the default-features = false feature selection, both of which are worth a quick sanity check before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant