Skip to content

Add lambda substrait support#21193

Open
gstvg wants to merge 22 commits into
apache:mainfrom
gstvg:lambda_substrait
Open

Add lambda substrait support#21193
gstvg wants to merge 22 commits into
apache:mainfrom
gstvg:lambda_substrait

Conversation

@gstvg
Copy link
Copy Markdown
Contributor

@gstvg gstvg commented Mar 27, 2026

Which issue does this PR close?

Part of #21172

Rationale for this change

Substrait support wasn't implemented in the core lambda support to reduce PR size

What changes are included in this PR?

Substrait consuming and producing of higher-order functions, lambdas and lambda variables

Are these changes tested?

Unit tests added to datafusion/substrait/tests/cases/roundtrip_logical_plan.rs

Are there any user-facing changes?

None

@github-actions github-actions Bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate ffi Changes to the ffi crate labels Mar 27, 2026
@gstvg gstvg mentioned this pull request Mar 27, 2026
27 tasks
@benbellick
Copy link
Copy Markdown
Contributor

👋 Hello from @substrait-io. Great to see the core lambda PR has gotten through! Once this PR is in a ready to review state and is rebased off of main, I will be more than happy to help review it 🙂

@gstvg
Copy link
Copy Markdown
Contributor Author

gstvg commented Apr 29, 2026

Thanks @benbellick, I will open this tonight. Besides rebasing, I believe it misses some tests (I tested with sqllogictests only)

@github-actions github-actions Bot added documentation Improvements or additions to documentation spark and removed documentation Improvements or additions to documentation sql SQL Planner logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) catalog Related to the catalog crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation labels Apr 29, 2026
@github-actions github-actions Bot added the auto detected api change Auto detected API change label May 8, 2026
Comment on lines +532 to +539
/// Returns a new instance of this consumer which includes the given `lambda_parameters` and the names they got assigned
///
/// Note for custom implementations it's possible to embed a [DefaultSubstraitLambdaConsumer] and forward this method to it
fn with_lambda_parameters(
&self,
lambda_parameters: &[Type],
input_schema: &DFSchema,
) -> datafusion::common::Result<(Vec<String>, Self)>;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbellick I usually follows the pattern of existing methods, like push/pop_outer_schema from #20439. This is my first time dealing with substrait, so I may be wrong, but I didn't followed this pattern (using push/pop_lambda_parameter) because it modifies &self via a RwLock and I'm note sure this couldn't lead to conflicts if the same consumer is used to consume two different plans at the same time in different threads. If that's not the case I can change this to use push/pop_lambda_parameter as well.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that is an interesting point... Is it expected/supported for the same SubstraitConsumer instance to be used concurrently?

If not, then I think it would be simpler and consistent to model lambda scope the same way as outer schemas.

If yes, then the scoped-consumer approach here makes sense, but it seems like the existing push_outer_schema / pop_outer_schema stack may have the same interleaving issue and should probably be addressed separately.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll confirm with the maintainer who ends up reviewing this, but SubstraitConsumer is both Send + Sync and all it's methods take &self. Since the default consumer is cheap to create, and I expect most/all custom ones to be cheap as well, I guess it's mostly due to async and to easily embed it into other structures which should also implement Send + Sync than to allow efficient concurrent usage.
I won't expect any consumer to be used concurrently for performance, but, since it can be used, I think it's possible that it's/will be used concurrently incidentally as the easier/natural way within a given codebase

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If yes, then the scoped-consumer approach here makes sense, but it seems like the existing push_outer_schema / pop_outer_schema stack may have the same interleaving issue and should probably be addressed separately.

@benbellick is right on point here, I'd aim for consistency in this PR, and address separately the multiple threads dealing with the same consumer in a different PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that makes sense, done at 44384dd thanks

Comment thread datafusion/substrait/tests/cases/roundtrip_logical_plan.rs Outdated
@github-actions github-actions Bot removed the auto detected api change Auto detected API change label May 9, 2026
Copy link
Copy Markdown
Contributor

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few more comments but on the whole this looks good to me! Thanks

Comment thread datafusion/substrait/tests/testdata/test_plans/higher_order_function.json Outdated
Comment thread datafusion/substrait/tests/cases/serialize.rs Outdated
Comment thread datafusion/substrait/src/logical_plan/consumer/expr/lambda.rs
Copy link
Copy Markdown
Contributor

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, great work! Excited to get this in 🚀

Feel free to summon the maintainers now.

Comment thread datafusion/substrait/tests/cases/serialize.rs
@gstvg
Copy link
Copy Markdown
Contributor Author

gstvg commented May 14, 2026

LGTM, great work! Excited to get this in 🚀

Thanks for the review @benbellick
cc @rluvaton @comphead this is ready for the final review

@rluvaton
Copy link
Copy Markdown
Member

I'm unfamiliar with substrait except for the idea, so I don't know if I can have meaningful review

@rluvaton
Copy link
Copy Markdown
Member

@gstvg according to the breaking change detector there are no breaking change api wise in this pr, is your commemt still correct?

Are there any user-facing changes?

Yes, there are breaking changes, new methods without default implementation have been added to SubstraitConsumer and SubstraitProducer

@gstvg
Copy link
Copy Markdown
Contributor Author

gstvg commented May 15, 2026

@rluvaton, indeed, after applying @benbellick suggestions there's no breaking changes anymore, description updated, thanks

@gstvg
Copy link
Copy Markdown
Contributor Author

gstvg commented May 18, 2026

Hey @gabotechs, I see you worked on some substrait PRs, bringing you here in case you have interest in this. thanks

Copy link
Copy Markdown
Contributor

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! gave a shallow pass and just made one suggestion, but I trust @benbellick in his review (he's a work colleague). Thanks @gstvg for your work with lambdas 🙏

Comment thread datafusion/substrait/src/extensions.rs Outdated
@gstvg
Copy link
Copy Markdown
Contributor Author

gstvg commented May 19, 2026

Thanks @gabotechs. The derive(Clone) is removed. There's only two comments from @benbellick that were outside substrait translation and weren't fully addressed, #21193 (review) and #21193 (comment)

@gabotechs
Copy link
Copy Markdown
Contributor

There's only two comments from @benbellick that were outside substrait translation and weren't fully addressed, #21193 (review) and #21193 (comment)

Answered there

plan,
@"
Projection: array_transform2(make_array(make_array(data3.p1)), (p0, p2) -> array_concat(array_transform2(p0, (p3, p4) -> p3 * p2 * p4), array_transform2(p0, (p3, p4) -> p3 * p2 * p4))) AS array_transform2(make_array(make_array(data3.p1)),(v, i) -> array_concat(array_transform2(v,(v, j) -> v * i * j),array_transform2(v,(v, j) -> v * i * j)))
Projection: array_transform2(make_array(make_array(data3.p1)), (p0, p2) -> array_concat(array_transform2(p0, (p3, p4) -> p3 * p2 * p4), array_transform2(p0, (p5, p6) -> p5 * p2 * p6))) AS array_transform2(make_array(make_array(data3.p1)),(v, i) -> array_concat(array_transform2(v,(v, j) -> v * i * j),array_transform2(v,(v, j) -> v * i * j)))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gabotechs that's expected since now there's one global next_lambda_parameter_name, before it was one per lambda scope (applies to the other test as well)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api change Changes the API exposed to users of the crate substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants