Skip to content

Migrate UnKnownColumn proto hooks#22464

Open
koopatroopa787 wants to merge 8 commits into
apache:mainfrom
koopatroopa787:fix/unknown-column-proto-hooks
Open

Migrate UnKnownColumn proto hooks#22464
koopatroopa787 wants to merge 8 commits into
apache:mainfrom
koopatroopa787:fix/unknown-column-proto-hooks

Conversation

@koopatroopa787
Copy link
Copy Markdown

@koopatroopa787 koopatroopa787 commented May 22, 2026

Which issue does this PR close?

Rationale for this change

This issue is part of the migration to the new proto hooks ( ry_to_proto / ry_from_proto). UnKnownColumn was still handled in the legacy match arms, so this ports it to the hook-based path for consistency with other migrated expressions (e.g. Column, BinaryExpr).

What changes are included in this PR?

  • Add ry_to_proto / ry_from_proto implementations for UnKnownColumn in physical-expr.
  • Remove the legacy UnKnownColumn match arms in datafusion/proto physical plan conversion.

Are these changes tested?

  • cargo test -p datafusion-proto --lib

Are there any user-facing changes?

No user-facing changes.

Copilot AI review requested due to automatic review settings May 22, 2026 17:03
@github-actions github-actions Bot added physical-expr Changes to the physical-expr crates proto Related to proto crate labels May 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Moves protobuf (de)serialization for UnKnownColumn out of the generic physical-plan proto glue and into the expression type itself (proto feature-gated), aligning with the “each expr owns its own proto” direction.

Changes:

  • Remove UnKnownColumn special-casing from serialize_physical_expr_with_converter in to_proto.rs.
  • Route UnknownColumn decoding through UnKnownColumn::try_from_proto in from_proto.rs.
  • Implement try_to_proto / try_from_proto for UnKnownColumn behind the proto feature.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
datafusion/proto/src/physical_plan/to_proto.rs Removes inlined UnKnownColumn serialization path.
datafusion/proto/src/physical_plan/from_proto.rs Switches decoding to UnKnownColumn::try_from_proto.
datafusion/physical-expr/src/expressions/unknown_column.rs Adds proto encode/decode implementations for UnKnownColumn.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +96 to +103
Ok(Some(protobuf::PhysicalExprNode {
expr_id: None,
expr_type: Some(protobuf::physical_expr_node::ExprType::UnknownColumn(
protobuf::UnknownColumn {
name: self.name.clone(),
},
)),
}))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while the previous inlined serialization path in to_proto.rs populated expr_id from the serializer’s expr_id

I believe this variable was an option and was None since only dynamic filters set expr_id

Comment on lines +114 to +117
let protobuf::UnknownColumn { name } = match &node.expr_type {
Some(protobuf::physical_expr_node::ExprType::UnknownColumn(c)) => c,
_ => return internal_err!("PhysicalExprNode is not an UnKnownColumn"),
};
@kumarUjjawal
Copy link
Copy Markdown
Contributor

Hi @koopatroopa787 Thank you for picking this up. Can you please update the PR body with more details. Use the pr template for this.

@koopatroopa787
Copy link
Copy Markdown
Author

Hi @koopatroopa787 Thank you for picking this up. Can you please update the PR body with more details. Use the pr template for this.

Hii @kumarUjjawal, updated based on the PR template, please have a look and let me know if any changes needed.

Copy link
Copy Markdown
Contributor

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

use datafusion_common::{Result, internal_err};

#[cfg(feature = "proto")]
use datafusion_proto_models::protobuf;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you can move this use statement into the try_to_proto and try_from_proto methods

@koopatroopa787 koopatroopa787 force-pushed the fix/unknown-column-proto-hooks branch from c6b0b8b to bfa248a Compare May 24, 2026 02:05
@koopatroopa787
Copy link
Copy Markdown
Author

koopatroopa787 commented May 24, 2026

@jayshrivastava Thanks — I updated the decode path to include the actual xpr_type and xpr_id in the error, so mismatched proto nodes are easier to diagnose. I also moved the proto import into the ry_to_proto / ry_from_proto methods as suggested, and rebased the branch onto the latest main.

@kumarUjjawal
Copy link
Copy Markdown
Contributor

@koopatroopa787 we should add a datafusion-proto roundtrip test that serializes and deserializes a partitioning or plan containing UnKnownColumn.

@koopatroopa787
Copy link
Copy Markdown
Author

Moved the #[cfg(test)] mod tests to the end of unknown_column.rs so impl Hash/PartialEq come first (resolves clippy 'items-after-test-module'). Ran cargo clippy locally and the lint is fixed. Pushed the change to fix/unknown-column-proto-hooks. Next I'll add the requested datafusion-proto roundtrip test and re-run CI; let me know if you'd prefer using #[allow(clippy::items_after_test_module)] instead.

@koopatroopa787
Copy link
Copy Markdown
Author

Added a proto roundtrip unit test for UnKnownColumn (unknown_column_proto_roundtrip) and a small stabilization fix for a flaky statistics test (compareBernoulli p with tolerance). Ran cargo test and cargo clippy locally; all checks for datafusion-physical-expr pass. Pushed updates to fix/unknown-column-proto-hooks.

@koopatroopa787
Copy link
Copy Markdown
Author

@kumarUjjawal Addressed. I added unknown_column_proto_roundtrip to verify deserialize behavior for UnKnownColumn via the new proto hook path, and pushed it on fix/unknown-column-proto-hooks (commit 9e6fce9). I also fixed the clippy failure (items-after-test-module) by moving the test module after impls. Local cargo clippy -p datafusion-physical-expr -- -D warnings and cargo test -p datafusion-physical-expr --lib --features proto pass.

@koopatroopa787
Copy link
Copy Markdown
Author

All CI checks are passing and the PR has been approved by @jayshrivastava. Could a committer please take a look when you get a chance? Happy to make any changes needed.

@kumarUjjawal
Copy link
Copy Markdown
Contributor

I think the "roundtrip" test doesn't round-trip. It builds a proto node by hand and only calls try_from_proto. It never calls try_to_proto and never goes through the proto crate's central serialize/parse path?

Comment on lines +148 to +154
use std::sync::Arc;

#[cfg(feature = "proto")]
use arrow::datatypes::Schema;

#[cfg(feature = "proto")]
use datafusion_common::{Result, internal_err};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need these re-imports since we are already using super?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these explicit imports are still needed. use super::* only brings in items defined in the parent module (like UnKnownColumn itself) — it does NOT re-export items that are only privately used in the parent scope (like Arc, Schema, Result, internal_err). You can see the same pattern in binary.rs's test module, which also uses use super::* but still explicitly imports use arrow::array::BooleanArray and others. Removing these would cause compile errors when the proto feature is enabled.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion/datafusion/physical-expr/src/expressions/like.rs declares its own module for proto tests. We could follow that pattern

#[cfg(all(test, feature = "proto"))]
mod proto_tests {

}

Distribution::new_bernoulli(ScalarValue::from(15.0 / 16.0))?
);
let got = neq.evaluate_statistics(&[left_stat, right_stat])?;
let expected = Distribution::new_bernoulli(ScalarValue::from(15.0 / 16.0))?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this test change?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a floating-point stabilization fix I added while debugging a local CI run, but it doesn't belong here. I've reverted it in the latest commit and restored the original assert_eq!. The PR now only contains the UnKnownColumn proto hook migration as intended.

@koopatroopa787 koopatroopa787 force-pushed the fix/unknown-column-proto-hooks branch from 6272761 to 0ca8aa3 Compare May 25, 2026 13:21

let decoded = UnKnownColumn::try_from_proto(&node, &ctx).unwrap();
let col = decoded.as_ref().downcast_ref::<UnKnownColumn>().unwrap();
assert_eq!(col.name(), name.as_str());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what Kumar means by roundtrip is something like

    use crate::proto_test_util::{
        StubDecoder, StubEncoder
    };

    #[test]
    fn unknown_column_proto_roundtrip() {
        let expr = UnKnownColumn::new("col_a");
        let encoder = StubEncoder::ok();
        let ctx = PhysicalExprEncodeCtx::new(&encoder);
        let encoded = expr.try_to_proto(&ctx).unwrap().unwrap();

        let decode_ctx = PhysicalExprDecodeCtx::new(&Schema::empty(), &UnreachableDecoder {});
        let decoded = UnKnownColumn::try_from_proto(&encoded, &decode_ctx).unwrap();
        let roundtripped_expr = decoded.as_ref().downcast_ref::<UnKnownColumn>().unwrap();
        assert_eq!(&expr, roundtripped_expr);
    }

I approved the PR bc I figured there would already be a roundtrip test in datafusion/datafusion/proto/tests/cases/roundtrip_physical_plan.rs but that doesn't seem to be the case.

I agree with Kumar that we should add some here.

)),
};
let schema = Schema::empty();
let decoder = DummyDecode;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koopatroopa787
Copy link
Copy Markdown
Author

Alright so after doing another review I realized the issue with roundtrip. It only called try_from_proto on a hand-built node, never going through try_to_proto. Rewrote the whole test module to follow the like.rs pattern:

try_to_proto_encodes_unknown_column — calls try_to_proto, destructures the result, checks the name is encoded correctly and expr_id is None
try_from_proto_decodes_name — builds a proto node by hand and verifies decoding
try_from_proto_rejects_non_unknown_column_node — passes a wrong-type node, checks the error is DataFusionError::Internal with the actual expr_type in the message
unknown_column_proto_roundtrip — the actual roundtrip, try_to_proto output fed directly into try_from_proto
Also replaced the hand-rolled DummyDecode with UnreachableDecoder and StubEncoder from crate::proto_test_util, and moved the module to #[cfg(all(test, feature = "proto"))] mod proto_tests to match like.rs.

On the imports: use super::* only brings in items defined in the parent module, not private use aliases like Arc, Schema, Result, internal_err. The new proto_tests module imports everything explicitly now anyway, so it's consistent with like.rs regardless.

Kanishk Sachan and others added 7 commits May 25, 2026 16:44
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Kanishk Sachan <koopatroopa787>
…ems-after-test-module lint

Closes: address clippy failure in CI

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…equality in binary statistics test

Closes: address requested roundtrip test and stabilize failing test

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@koopatroopa787 koopatroopa787 force-pushed the fix/unknown-column-proto-hooks branch from 23dcf74 to 127d075 Compare May 25, 2026 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Port UnKnownColumn to use try_to_proto / try_from_proto

4 participants