[Enhancement] Support bare-field join criteria join on <field> shorthand#5517
Conversation
PR Reviewer Guide 🔍(Review updated until commit 19e6693)Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Latest suggestions up to 19e6693
Previous suggestionsSuggestions up to commit 374d412
|
join on <field> shorthandjoin on <field> shorthand
| new Field(QualifiedName.of(rightQual, f)))) | ||
| .reduce((l, r) -> new And(l, r)) | ||
| .orElseThrow( | ||
| () -> new SemanticCheckException("join 'on' shorthand requires at least one field")); |
There was a problem hiding this comment.
is this the only error case that can throw here?
There was a problem hiding this comment.
There were two (this one, and an empty-list guard above it). Both are removed in the redesign, so neither throws anymore.
| private static String qualifierOf(UnresolvedPlan plan, String side) { | ||
| Relation relation = findRelation(plan); | ||
| if (relation == null) { | ||
| throw new SemanticCheckException( |
There was a problem hiding this comment.
nice detail on the error message!
consider using the ReportBuilder to split out suggestions from the problem
There was a problem hiding this comment.
That error path is going away — once matching is done by position in the planner, there's no "must be a named table" restriction, so the exception it was attached to is deleted.
| if (rightAlias.isEmpty() && this.right instanceof SubqueryAlias alias) { | ||
| rightAlias = Optional.of(alias.getAlias()); | ||
| } | ||
| rewriteBareFieldCriteria(); |
There was a problem hiding this comment.
Is putting it here the right boundary over resolving these fields in the constructor (or even frontloading it to the AST builder)?
Putting the rewrite here introduces a weird boundary, where now Join plays the role of dynamically parsing itself during attachment based directly on AST grammar choices. That multiple places comment "we actually resolve these fields in attach()" also hints that this is unintuitive.
It might be clearer to put this in the constructor. Alternatively, it's a little more work, but we could consider moving all of leftAlias, rightAlias, joinFields and similar into a dedicated JoinClause data type that supports operations like clause.expandBareFields() and clause.resolveAliases()
There was a problem hiding this comment.
Yes — that's the fix I'm making. I'm going to move the expansion into visitJoin instead of attach(), so the AST keeps query how wrote it and the analysis happens in the planner where it belongs. Explain output will now match the original query too.
Yeah, I'll make the change Ryan suggested to fix this.
| | `source=table1 \| join right=tt on table1.id=t2.id [ source=table2 as t2 \| eval b = id ] \| eval a = 1` | `table1.id, tt.id, tt.b, a` | | ||
|
|
||
| * **Field deduplication in extended syntax** – When using the extended syntax with a field list, duplicate field names in the output are deduplicated according to the `overwrite` option. | ||
| * **Field deduplication in extended syntax** – When using the extended syntax with a field list (`<join-field-list>`), duplicate field names are deduplicated according to the `overwrite` option. The bare-field `on`/`where` shorthand keeps both key columns, unlike the field-list form. |
There was a problem hiding this comment.
the query grammar at the top of the doc wasn't updated to show what join-field-list is
There was a problem hiding this comment.
Good catch! I'll cut it because it isn't adding anything valuable to the docs.
| if (rightAlias.isEmpty() && this.right instanceof SubqueryAlias alias) { | ||
| rightAlias = Optional.of(alias.getAlias()); | ||
| } | ||
| rewriteBareFieldCriteria(); |
There was a problem hiding this comment.
Performing semantic rewriting inside the AST node's attach() mixes tree construction with analysis, which usually lives in the analyzer layer here. The unresolved tree no longer reflects what the user wrote, which can make explain output and debugging confusing. Was an analyzer pass considered?
There was a problem hiding this comment.
Yes — that's the fix I'm making. I'm going to move the expansion into visitJoin instead of attach(), so the AST
keeps query how wrote it and the analysis happens in the planner where it belongs. Explain output will now match the original query too.
Let `on`/`where` join criteria accept a bare field name (or `AND` of bare fields) as shorthand for the qualified equality on that common field: source=t1 | inner join on a t2 == on t1.a = t2.a source=t1 | inner join on a AND b t2 == on t1.a = t2.a AND t1.b = t2.b source=t1 | join on a t2 == on t1.a = t2.a (no prefix/alias) The shorthand behaves exactly like the explicit `on l.f = r.f` criteria: it KEEPS BOTH key columns (the right key renamed `<rightAlias>.f`, or `<rightTable>.f` when no alias is given). This differs from the field-list join (`join f t2`), which merges the duplicate key to a single column. Implementation: - Grammar: reorder the joinCommand alternatives so the criteria alternative is listed first. This lets the no-prefix form `join on a <right>` parse `on` as the criteria keyword instead of the field-list alternative greedily consuming `on`/`where` as a field name. ANTLR ALL(*) picks the first-listed alternative on genuine ambiguity. - AstBuilder: a bare single-part field (or AND-chain of them) is left as the join condition verbatim -- the unresolved AST reflects what the user wrote. - Planner (CalciteRelNodeVisitor.visitJoin): the criteria branch detects a bare-field condition and builds the equi-join from it, resolving each field by stack position (LEFT.f = RIGHT.f) via the existing buildJoinConditionByFieldName helper. No new planner code, and no AST mutation. Because resolution is positional rather than by qualifier, a self-join `join on f t` is a genuine cross-scan equi-join rather than the `f = f` tautology a name-based rewrite would collapse to. Adds AstBuilder, Calcite plan, anonymizer, and integration tests; updates the join command user manual. Signed-off-by: Hanyu Wei <weihanyu@amazon.com>
374d412 to
19e6693
Compare
|
Persistent review updated to latest commit 19e6693 |
|
📝 Implementation note — the description above predates the redesign. Heads up for anyone reading top-to-bottom: after the review feedback, I moved the bare-field expansion out of the AST ( Design (as committed):
Files actually changed (
Tests (re-run locally just now, all green):
|
Follow-ups from the second review round on opensearch-project#5517: - collectBareFields: replace the boolean-return + out-param pattern with an Optional<List<String>> return (RyanL1997). The name list is now assembled only when every node in the AND-tree is a bare field, so the "valid only when true" partial-state footgun is gone. Update the visitJoin caller. - Add JoinAndLookupUtilsTest covering single field, AND-chain, qualified field, mixed condition, and the no-partial-state case. - Add testJoinMixedConditionFallsThrough / testJoinMixedConditionWithInequality confirming a mixed condition (e.g. `on a AND b.x = c.y`) is not detected as bare and falls through to analyzeJoinCondition unchanged (RyanL1997). - Add testJoinWithImplicitFieldMaxGreaterThanZero and testAntiJoinWithImplicitField for the max=N dedup wrapper and the anti-join early-return branch. - CalcitePPLJoinIT.testJoinWithImplicitField: assert the sorted result with verifyDataRowsInOrder so the `| sort name, occupation` is actually validated instead of being silently order-independent. - Tighten the not-on-both-sides assertion to IllegalArgumentException with the field-not-found substring; correct the comment to note the phrase is Calcite's own wording and may shift on a Calcite upgrade. Signed-off-by: Hanyu Wei <weihanyu@amazon.com>
Description
Lets
joinon/wherecriteria accept a bare field name (or anANDchain of bare fields) as shorthand for the qualified equality on that common field. Before this change, the only way to join on a shared column was to spell out both qualifiers:The shorthand behaves exactly like the explicit
on l.f = r.fit sugars: it keeps both key columns (the right key surfaced as<rightAlias>.f, or<rightTable>.fwhen no alias is given). This is intentionally different from the field-list joinjoin f t2, which merges the duplicate key into a single column.Two things made this more than a one-line grammar tweak, and shaped the design:
onis a keyword that can also be an identifier, so the no-prefix formjoin on a t2was greedily consumed by the field-list alternative —onparsed as a field name, failing withfield [on] not found. Reordering thejoinCommandalternatives so the criteria alternative is listed first makes ANTLR ALL(*) prefer readingonas the criteria keyword on genuine ambiguity, with zero change to the field-list form's behavior.on acarries no table qualifier, but the keep-both planner branch needsleft.a = right.a. Resolution is done in the planner (CalciteRelNodeVisitor.visitJoin), where both inputs are already on the RelBuilder stack: each bare field is resolved by stack position (LEFT.f = RIGHT.f) via the existingbuildJoinConditionByFieldNamehelper. The unresolved AST keeps the criteria exactly as the user wrote it (on a), so explain/anonymizer reflect the original query. Because matching is positional rather than by qualifier, a self-joinjoin on f tis a genuine cross-scan equi-join (=($7, $15)), not thef = ftautology a name-based rewrite would collapse to.The keep-both column-rename logic is unchanged — it works off the output row types, independent of how the join condition was built. A mixed condition that isn't all bare fields (e.g.
on a AND b.x = c.y) is detected as not-bare and falls through to the normalanalyzeJoinConditionpath untouched.Resolves #5484.
Grammar precedence change
This PR reorders the two
joinCommandalternatives so the criteria alternative is listed first:This is broader than the new shorthand. ANTLR's ALL(*) picks the first-listed alternative on genuine ambiguity, so the reorder changes the parse path of all join queries, not just the new bare-field form. Without it, the no-prefix form
join on a <right>would let the field-list alternative greedily consumeonas a field name instead of as the criteria keyword.Existing behavior was re-verified, not assumed. Two guard tests pin the unchanged paths:
testJoinFieldListStillParsesAsFieldList—source=t1 | join a t2still parses as a field-list join (getJoinFields() == [a], no condition).testJoinPrefixWithoutCriteriaKeywordIsSyntaxError—source=t1 | inner join a t2(a prefix with a bare field but noon/where) is still aSyntaxCheckException.Edge case (verified).
ONandWHEREare both reachable as bare-field identifiers viakeywordsCanBeId, andjoinCriteria: (ON | WHERE) logicalExpressiontreats them identically. So a field literally namedonorwherein the no-prefix form (join on t2/join where t2) shifts from a field-list parse to a criteria parse. This is acceptable:on/whereare keyword-like, and their use as bare field-list identifiers in a join prefix was already ambiguous/undocumented.Changes
ppl/src/main/antlr/OpenSearchPPLParser.g4joinCommandso the criteria alternative precedes the field-list alternative, sojoin on areadsonas the criteria keyword instead of a fieldppl/.../ppl/parser/AstBuilder.javaANDchain) as the join condition rather than treating it as a field list — the unresolved AST reflects what the user wrotecore/.../calcite/CalciteRelNodeVisitor.javavisitJoincriteria branch detects a bare-field condition and builds the equi-join by stack position (LEFT.f = RIGHT.f) via the existingbuildJoinConditionByFieldName; otherwise falls through toanalyzeJoinConditionunchangedcore/.../calcite/utils/JoinAndLookupUtils.javacollectBareFieldsdetects a bare single-part field / AND-chain and returnsOptional<List<String>>(the name list is assembled only when every node is a bare field)docs/user/ppl/cmd/join.md<joinCriteria>parameter description, and the field-dedup limitation noteppl/.../calcite/CalcitePPLJoinTest.java,ppl/.../parser/AstBuilderTest.java,ppl/.../utils/PPLQueryDataAnonymizerTest.java,core/.../calcite/utils/JoinAndLookupUtilsTest.javainteg-test/.../calcite/remote/CalcitePPLJoinIT.javaTesting
CalcitePPLJoinITon a live cluster, exercising the new shorthand against pre-change behavior:testJoinWithImplicitFieldinner join on name— keep-both; asserts shorthand output is identical to expliciton l.name = r.name; result asserted in sort order viaverifyDataRowsInOrdertestJoinNoPrefixImplicitFieldjoin on name— no prefix;onparsed as the criteria keyword, identical to the explicit form (not the field-list merge)testLeftJoinWithImplicitFieldleft join on name— unmatched-left rows keep the left key (no NULL key; keep-both does not coalesce)testAliasedSelfJoinWithImplicitFieldUnit suites (all green, run locally):
CalcitePPLJoinTest: 57 — plan shape for single/multi-field, alias,where, no-prefix, left/semi/anti, self-join (real equi-join),max=Ndedup wrapper, mixed-condition fall-through, and not-on-both-sides cases.AstBuilderTest: 158 — rewrite-vs-passthrough decisions (bare field kept; qualified/comparison criteria untouched; field-list still parses as a field list; prefix-without-criteria is a syntax error).PPLQueryDataAnonymizerTest: 119 — shorthand anonymizes correctly (renderson identifier).JoinAndLookupUtilsTest: 5 —collectBareFieldsover single field, AND-chain, qualified field, mixed condition, and the no-partial-state case.Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.