Skip to content

fix: NaN pushdown correctly pushes down NaNs correctness issue #2351

Merged
blackmwk merged 8 commits intoapache:mainfrom
xanderbailey:xb/fix_nan_pushdown
Apr 22, 2026
Merged

fix: NaN pushdown correctly pushes down NaNs correctness issue #2351
blackmwk merged 8 commits intoapache:mainfrom
xanderbailey:xb/fix_nan_pushdown

Conversation

@xanderbailey
Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey commented Apr 20, 2026

Which issue does this PR close?

Not tied to a specific issue - found during an audit of pushdown filter gaps.

What changes are included in this PR?

PredicateConverter::is_nan and not_nan were never actually implemented - is_nan returned always_true (matches every row) and not_nan returned always_false (matches no rows).

Every other predicate in PredicateConverter projects the column from the batch and runs an arrow compute kernel, but these two just returned constants.

This adds a compute_is_nan helper that downcasts to Float32Array/Float64Array and checks each value with f.is_nan(), preserving nulls. Non-float types return all false. is_nan and not_nan now use it the same way is_null/not_null use arrow::is_null/is_not_null.

Are these changes tested?

Yes, test added

@xanderbailey
Copy link
Copy Markdown
Contributor Author

Could be a candidate for #2303

@mbutrovich mbutrovich self-requested a review April 20, 2026 21:48
@mbutrovich
Copy link
Copy Markdown
Collaborator

Thanks for catching this, @xanderbailey! The always_true/always_false stubs were a real correctness gap, and the fix is well-structured — mirrors the is_null/not_null pattern nicely.

A few things I noticed:

NULL propagation semantics

BooleanArray::from_unary preserves the null bitmap from the input array (source). This means for a NULL input value, the output is NULL rather than a concrete boolean.

The Iceberg spec (and the Java implementation in NaNUtil.isNaN + Evaluator) treats these as non-null-propagating:

Input is_nan not_nan
NaN true false
value false true
NULL false true

With null propagation, not_nan(NULL) produces NULL, which gets filtered out. But per spec it should produce true, meaning NULL rows should pass through a not_nan filter. This matches Spark's behavior too — IsNaN declares nullable = false and returns false for NULL input.

One approach: build the array without null propagation, e.g. iterate and treat null slots explicitly as "not NaN". Or apply from_unary and then override null positions afterward.

Non-float fallback

The _ => Ok(BooleanArray::from(vec![false; array.len()])) arm silently produces all-false for non-float types. Since the Iceberg type system only allows float and double, and is_nan can only bind against those types, this branch should be unreachable. The codebase uses unreachable! elsewhere for similar invariants (e.g., crates/iceberg/src/arrow/schema.rs). Might be worth using unreachable!("is_nan is only valid for float types") here to make the invariant explicit.

Test assertions

The test currently spot-checks indices 0, 1, and 3 but skips the NULL at index 2. It might be worth asserting the full truth table from the spec — something like:

// Input: [Some(1.0), Some(NaN), None, Some(0.0)]

// is_nan: NaN→true, value→false, NULL→false (not null-propagating per spec)
let result = apply_predicate_to_batch(Reference::new("qux").is_nan(), schema.clone(), batch);
assert!(!result.is_null(0));
assert!(!result.value(0));     // 1.0 is not NaN
assert!(!result.is_null(1));
assert!(result.value(1));      // NaN is NaN
assert!(!result.is_null(2));
assert!(!result.value(2));     // NULL → false, not NULL
assert!(!result.is_null(3));
assert!(!result.value(3));     // 0.0 is not NaN

// not_nan: NaN→false, value→true, NULL→true (not null-propagating per spec)
let result = apply_predicate_to_batch(Reference::new("qux").is_not_nan(), schema, batch);
assert!(!result.is_null(0));
assert!(result.value(0));      // 1.0 is not NaN
assert!(!result.is_null(1));
assert!(!result.value(1));     // NaN is NaN
assert!(!result.is_null(2));
assert!(result.value(2));      // NULL → true, not NULL
assert!(!result.is_null(3));
assert!(result.value(3));      // 0.0 is not NaN

This locks in the non-null-propagating semantics from the spec table above, and would catch the current from_unary null propagation issue.

Comment thread crates/iceberg/src/arrow/reader.rs Outdated
};

let values = match nulls {
Some(nulls) => &values & nulls.inner(),
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nulls.inner() is an "is not null" mask

Copy link
Copy Markdown
Contributor

@blackmwk blackmwk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @xanderbailey for this fix!

@blackmwk blackmwk merged commit 1bed7b6 into apache:main Apr 22, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants