Add lambda substrait support#21193
Conversation
|
👋 Hello from @substrait-io. Great to see the core lambda PR has gotten through! Once this PR is in a ready to review state and is rebased off of main, I will be more than happy to help review it 🙂 |
|
Thanks @benbellick, I will open this tonight. Besides rebasing, I believe it misses some tests (I tested with sqllogictests only) |
| /// Returns a new instance of this consumer which includes the given `lambda_parameters` and the names they got assigned | ||
| /// | ||
| /// Note for custom implementations it's possible to embed a [DefaultSubstraitLambdaConsumer] and forward this method to it | ||
| fn with_lambda_parameters( | ||
| &self, | ||
| lambda_parameters: &[Type], | ||
| input_schema: &DFSchema, | ||
| ) -> datafusion::common::Result<(Vec<String>, Self)>; |
There was a problem hiding this comment.
@benbellick I usually follows the pattern of existing methods, like push/pop_outer_schema from #20439. This is my first time dealing with substrait, so I may be wrong, but I didn't followed this pattern (using push/pop_lambda_parameter) because it modifies &self via a RwLock and I'm note sure this couldn't lead to conflicts if the same consumer is used to consume two different plans at the same time in different threads. If that's not the case I can change this to use push/pop_lambda_parameter as well.
There was a problem hiding this comment.
Ah, that is an interesting point... Is it expected/supported for the same SubstraitConsumer instance to be used concurrently?
If not, then I think it would be simpler and consistent to model lambda scope the same way as outer schemas.
If yes, then the scoped-consumer approach here makes sense, but it seems like the existing push_outer_schema / pop_outer_schema stack may have the same interleaving issue and should probably be addressed separately.
There was a problem hiding this comment.
I'll confirm with the maintainer who ends up reviewing this, but SubstraitConsumer is both Send + Sync and all it's methods take &self. Since the default consumer is cheap to create, and I expect most/all custom ones to be cheap as well, I guess it's mostly due to async and to easily embed it into other structures which should also implement Send + Sync than to allow efficient concurrent usage.
I won't expect any consumer to be used concurrently for performance, but, since it can be used, I think it's possible that it's/will be used concurrently incidentally as the easier/natural way within a given codebase
There was a problem hiding this comment.
If yes, then the scoped-consumer approach here makes sense, but it seems like the existing push_outer_schema / pop_outer_schema stack may have the same interleaving issue and should probably be addressed separately.
@benbellick is right on point here, I'd aim for consistency in this PR, and address separately the multiple threads dealing with the same consumer in a different PR.
There was a problem hiding this comment.
that makes sense, done at 44384dd thanks
benbellick
left a comment
There was a problem hiding this comment.
Few more comments but on the whole this looks good to me! Thanks
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
benbellick
left a comment
There was a problem hiding this comment.
LGTM, great work! Excited to get this in 🚀
Feel free to summon the maintainers now.
Thanks for the review @benbellick |
|
I'm unfamiliar with substrait except for the idea, so I don't know if I can have meaningful review |
|
@gstvg according to the breaking change detector there are no breaking change api wise in this pr, is your commemt still correct?
|
|
@rluvaton, indeed, after applying @benbellick suggestions there's no breaking changes anymore, description updated, thanks |
|
Hey @gabotechs, I see you worked on some substrait PRs, bringing you here in case you have interest in this. thanks |
There was a problem hiding this comment.
Looks good! gave a shallow pass and just made one suggestion, but I trust @benbellick in his review (he's a work colleague). Thanks @gstvg for your work with lambdas 🙏
|
Thanks @gabotechs. The |
Answered there |
| plan, | ||
| @" | ||
| Projection: array_transform2(make_array(make_array(data3.p1)), (p0, p2) -> array_concat(array_transform2(p0, (p3, p4) -> p3 * p2 * p4), array_transform2(p0, (p3, p4) -> p3 * p2 * p4))) AS array_transform2(make_array(make_array(data3.p1)),(v, i) -> array_concat(array_transform2(v,(v, j) -> v * i * j),array_transform2(v,(v, j) -> v * i * j))) | ||
| Projection: array_transform2(make_array(make_array(data3.p1)), (p0, p2) -> array_concat(array_transform2(p0, (p3, p4) -> p3 * p2 * p4), array_transform2(p0, (p5, p6) -> p5 * p2 * p6))) AS array_transform2(make_array(make_array(data3.p1)),(v, i) -> array_concat(array_transform2(v,(v, j) -> v * i * j),array_transform2(v,(v, j) -> v * i * j))) |
There was a problem hiding this comment.
@gabotechs that's expected since now there's one global next_lambda_parameter_name, before it was one per lambda scope (applies to the other test as well)
Which issue does this PR close?
Part of #21172
Rationale for this change
Substrait support wasn't implemented in the core lambda support to reduce PR size
What changes are included in this PR?
Substrait consuming and producing of higher-order functions, lambdas and lambda variables
Are these changes tested?
Unit tests added to
datafusion/substrait/tests/cases/roundtrip_logical_plan.rsAre there any user-facing changes?
None