feat: Bump DataFusion to 53.1.0 by mwiewior · Pull Request #160 · biodatageeks/datafusion-bio-functions

mwiewior · 2026-05-17T10:56:49Z

No description provided.

mwiewior · 2026-05-17T11:27:56Z

@codex pls review

chatgpt-codex-connector · 2026-05-17T11:31:48Z

Codex Review: Didn't find any major issues. Swish!

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

mwiewior · 2026-05-17T11:32:46Z

@claude pls review

claude · 2026-05-17T11:32:58Z

Claude finished @mwiewior's task in 2m 12s —— View job

PR Review: feat: Bump DataFusion to 53.1.0

Todo List:

Gather context and read changed files
Review Cargo.toml dependency changes
Review API changes in physical planners and exec files
Review example file changes
Review test changes
Summarize findings

Summary

This PR upgrades DataFusion from 50.3.0 → 53.1.0 (a 3-major-version jump), along with Arrow 56.1.0 → 58.3.0. The changes are largely mechanical API adaptations required by breaking changes in the DataFusion 51–53 release series, plus a pinned rev update for all datafusion-bio-formats git dependencies to a single consistent commit (1848d26). The migration appears correct and complete. A few items are worth noting.

Observations

✅ Correct mechanical adaptations

ExecutionPlan::properties() now returns &Arc<PlanProperties> — all 7 custom exec nodes updated consistently (cluster, complement, count_overlaps, merge, nearest, subtract, pileup, interval_join, annotate, kv_cache, variant_lookup). The pattern is uniform and correct.

statistics() → partition_statistics(partition: Option<usize>) (interval_join.rs:569):
The rename is correct, and the added defensive guard is appropriate:

if stats.column_statistics.len() != self.join_schema.fields().len() {
    stats.column_statistics = Statistics::unknown_column(self.join_schema.as_ref());
}

estimate_join_statistics can return mismatched column counts for certain join types; this prevents a panic downstream in DataFusion's stats machinery.

Parquet API changes (cache_builder.rs):

SortingColumn moved from parquet::format to parquet::file::metadata — correct
SortingColumn::new(...) replaced with struct literal — correct (upstream removed the constructor)
set_max_row_group_size(n) → set_max_row_group_row_count(Some(n)) — correct API rename
Test assertions updated to match (max_row_group_row_count() returning Option<usize>)

can_project API change: .as_ref() → .as_deref() correctly adapts to the new signature Option<&[usize]>.

hashes_buffer: &mut [u64] (interval_join.rs:910): Good Clippy improvement; &mut [T] is strictly more general than &mut Vec<T>.

⚠️ Items worth attention

1. Hidden behavior fix in from_nested_loop_join (bio_physical_planner.rs:138)

Before this PR, the NestedLoopJoin path silently dropped projections by always passing None:

// before
None,

After, it correctly reads the projection:

// after
join_exec.projection().clone().map(|projection| projection.to_vec()),

This is a correctness fix bundled with the DataFusion upgrade. It's a good change, but it's worth verifying that existing integration tests exercise the nested loop join path with projections — previously failing projections would have silently returned extra columns, and now the behavior is correct but different. Fix this →

2. default-features = false in root Cargo.toml (Cargo.toml:14)

The new dependency spec disables defaults and enables a specific feature set:

datafusion = { version = "53.1.0", default-features = false, features = [
    "nested_expressions", "crypto_expressions", "datetime_expressions",
    "encoding_expressions", "regex_expressions", "string_expressions",
    "unicode_expressions", "parquet", "recursive_protection", "sql"
]}

Features not included that were previously enabled by default: avro, backtrace, compression, yaml. This should be fine for this project, but confirm that downstream consumers (polars-bio) don't rely on the DataFusion compression feature being pulled in transitively. The recursive_protection feature is a good defensive inclusion.

3. New magic-number arguments to BamTableProvider::new()

Every call site now passes three new positional arguments:

BamTableProvider::new(bam_path, None, zero_based, None, true, true, 100, None)
//                                                              ^^^^ ^^^  ^^^
//                                                              new args from datafusion-bio-formats update

The true and 100 are opaque at the call site. Since these come from the upstream datafusion-bio-formats API, documentation there should clarify what they mean (likely some filter/batch-size parameters). The hardcoded 100 as a default in table_function.rs:130 is the most impactful since it affects the public depth() table function. Confirm this is an appropriate default.

4. CLAUDE.md version reference is stale

CLAUDE.md still documents:

- DataFusion: 50.3.0
- Arrow: 56.1.0

These should be updated to 53.1.0 and 58.3.0 respectively. This is low priority but will cause confusion for future contributors.

5. let mut reservation → let reservation (interval_join.rs:612)

This suppresses a legitimate unused-mut warning. Trivial but correct.

Verdict

The migration is mechanically correct and covers all required API surface changes consistently. The partition_statistics defensive guard and the projection fix in from_nested_loop_join are welcome improvements beyond the bare minimum. The main open question is around the BamTableProvider::new() default parameters and the default-features = false feature selection, both of which are worth a quick sanity check before merge.

Bump DataFusion to 53.1.0

d0dded2

mwiewior added 5 commits May 17, 2026 13:46

Clarify BAM provider defaults

8e7fd4c

Update bio-formats dependency

9b07dfd

Use bio-formats without aggregate noodles

2171d88

Coalesce complement input partitions

45c0f29

Preserve no-view complement partitioning

35a6a6e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Bump DataFusion to 53.1.0#160

feat: Bump DataFusion to 53.1.0#160
mwiewior wants to merge 6 commits into
masterfrom
upgrade-datafusion-53-1

mwiewior commented May 17, 2026

Uh oh!

mwiewior commented May 17, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 17, 2026

Uh oh!

mwiewior commented May 17, 2026

Uh oh!

claude Bot commented May 17, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mwiewior commented May 17, 2026

Uh oh!

mwiewior commented May 17, 2026

Uh oh!

chatgpt-codex-connector Bot commented May 17, 2026

Uh oh!

mwiewior commented May 17, 2026

Uh oh!

claude Bot commented May 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: feat: Bump DataFusion to 53.1.0

Summary

Observations

✅ Correct mechanical adaptations

⚠️ Items worth attention

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented May 17, 2026 •

edited

Loading