Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18 by kongfanshen-0801 · Pull Request #1799 · apache/cloudberry

kongfanshen-0801 · 2026-06-03T08:26:36Z

What

Backports two upstream PostgreSQL commits that add Hash Right Semi Join:

aa86129e1 — Support "Right Semi Join" plan shapes
5668a857d — Fix right-semi-joins in HashJoin rescans

This lets the planner build the hash table on the smaller (LHS) side of an
IN/EXISTS semijoin instead of always hashing the inner relation.

Cloudberry/GPDB-specific adaptations

nodes.h: JOIN_RIGHT_SEMI is appended at the end of the JoinType
enum rather than in upstream's mid-list position. Inserting it mid-list
shifts the integer values of the GPDB-only JOIN_DEDUP_SEMI/_REVERSE and
JOIN_UNIQUE_* codes, which corrupts MPP motion planning and crashes during
dispatch (SIGSEGV in setupCdbProcessList). Appending keeps every existing
value stable.
cdbpath.c (cdbpath_motion_for_join, serial + parallel switches):
handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI — the inner
(build) side must not be replicated, since a right-semi join emits
build-side rows.
joinrels.c: add the JOIN_RIGHT_SEMI path alongside JOIN_SEMI while
preserving the existing GPDB JOIN_DEDUP_SEMI handling.

Testing

Functional: Hash Right Semi Join is chosen for small-build-side semijoins;
results verified correct (dedup semantics, rescan correctness, MPP execution
across segments).
Regression: the join test expected output is still being reconciled
against CI's canonical environment — hence this PR is opened as a draft.

Notes

Only affects the PostgreSQL planner (optimizer=off); GPORCA does not
generate JOIN_RIGHT_SEMI.

my-ship-it · 2026-06-03T08:41:08Z

Thanks for the back porting. Could we keep the original commit history, thanks

Hash joins can support semijoin with the LHS input on the right, using the existing logic for inner join, combined with the assurance that only the first match for each inner tuple is considered, which can be achieved by leveraging the HEAP_TUPLE_HAS_MATCH flag. This can be very useful in some cases since we may now have the option to hash the smaller table instead of the larger. Merge join could likely support "Right Semi Join" too. However, the benefit of swapping inputs tends to be small here, so we do not address that in this patch. Note that this patch also modifies a test query in join.sql to ensure it continues testing as intended. With this patch the original query would result in a right-semi-join rather than semi-join, compromising its original purpose of testing the fix for neqjoinsel's behavior for semi-joins. Author: Richard Guo Reviewed-by: wenhui qiu, Alena Rybakina, Japin Li Discussion: https://postgr.es/m/CAMbWs4_X1mN=ic+SxcyymUqFx9bB8pqSLTGJ-F=MHy4PW3eRXw@mail.gmail.com (cherry picked from commit aa86129e19d704afb93cb84ab9638f33d266ee9d)

When resetting a HashJoin node for rescans, if it is a single-batch join and there are no parameter changes for the inner subnode, we can just reuse the existing hash table without rebuilding it. However, for join types that depend on the inner-tuple match flags in the hash table, we need to reset these match flags to avoid incorrect results. This applies to right, right-anti, right-semi, and full joins. When I introduced "Right Semi Join" plan shapes in aa86129e1, I failed to reset the match flags in the hash table for right-semi joins in rescans. This oversight has been shown to produce incorrect results. This patch fixes it. Author: Richard Guo Discussion: https://postgr.es/m/CAMbWs4-nQF9io2WL2SkD0eXvfPdyBc9Q=hRwfQHCGV2usa0jyA@mail.gmail.com (cherry picked from commit 5668a857de4f3f12066b2bbc626b77be4fc95ee5)

This commit carries the Cloudberry/Greenplum-specific changes needed on top of the two cherry-picked upstream commits (aa86129e1, 5668a857d), which only touch the upstream PostgreSQL planner/executor files. - nodes.h: move JOIN_RIGHT_SEMI to the END of the JoinType enum. Upstream places it next to JOIN_RIGHT_ANTI, but in the Cloudberry tree that shifts the integer values of the GPDB-specific JOIN_DEDUP_SEMI/REVERSE and JOIN_UNIQUE_* codes. Value-dependent code then corrupts MPP motion planning, producing a degenerate plan ("Gather Motion 0:1" / "Redistribute Motion 1:0") that crashes with SIGSEGV in setupCdbProcessList() during dispatch. Appending keeps every pre-existing enum value stable. - cdbpath.c (cdbpath_motion_for_join, both the serial and parallel switch): handle JOIN_RIGHT_SEMI like JOIN_RIGHT/JOIN_RIGHT_ANTI. A right-semi join emits inner (build-side) rows, so the inner side must not be replicated, otherwise matched inner rows could be emitted more than once. Without this the new join type would hit the switch default and elog(ERROR, "unexpected join type") at plan time. Note: this feature is only exercised by the PostgreSQL planner (optimizer=off); GPORCA does not generate JOIN_RIGHT_SEMI.

kongfanshen-0801 · 2026-06-04T02:05:27Z

Thanks for the review! I've restructured the branch to preserve the original commit history:

The two upstream PostgreSQL commits are now cherry-picked with their original author (Richard Guo), original commit messages, and (cherry picked from commit ...) provenance lines preserved:
- Support "Right Semi Join" plan shapes (aa86129e1)
- Fix right-semi-joins in HashJoin rescans (5668a857d)
The Cloudberry/Greenplum-specific adaptations (moving JOIN_RIGHT_SEMI to the end of the JoinType enum to keep the GPDB enum values ABI-stable, and handling the new join type in cdbpath_motion_for_join) are isolated in a separate follow-up commit.

The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks!

my-ship-it · 2026-06-04T04:58:42Z

Thanks for the review! I've restructured the branch to preserve the original commit history:

The two upstream PostgreSQL commits are now cherry-picked with their original author (Richard Guo), original commit messages, and (cherry picked from commit ...) provenance lines preserved:

Support "Right Semi Join" plan shapes (aa86129e1)

Fix right-semi-joins in HashJoin rescans (5668a857d)

The Cloudberry/Greenplum-specific adaptations (moving JOIN_RIGHT_SEMI to the end of the JoinType enum to keep the GPDB enum values ABI-stable, and handling the new join type in cdbpath_motion_for_join) are isolated in a separate follow-up commit.

The code logic is unchanged from what I verified locally (Hash Right Semi Join chosen and correct via EXPLAIN ANALYZE + cross-validation). PTAL, thanks!

Thanks, if have a chance, we could do some comparison of before and after for TPC-H Q 21

kongfanshen-0801 · 2026-06-17T10:49:36Z

Thanks, if have a chance, we could do some comparison of before and after for TPC-H Q 21

Done — ran TPC-H Q21 before/after on the optimizer_enable_right_semi_join GUC.

Setup: SF=100 (lineitem 600,037,902 rows), GPORCA, 3-segment single-host demo cluster, 3 timed runs each.

`optimizer_enable_right_semi_join`	semijoin plan	runs (s)	best
on	Hash Right Semi Join	355.8 / 340.1 / 339.2	339.2
off	Hash Semi Join	353.0 / 349.3 / 352.2	349.3

Q21's EXISTS (l2 ...) semijoin is where the two plans diverge:

on — Hash Right Semi Join: builds the hash on the small LHS (448,837 rows, 2 batches, ~17 MB) and streams lineitem l2 (200M rows).
off — Hash Semi Join: builds the hash on lineitem l2 (200,042,924 rows, 512 batches → heavy spill), Work_mem wanted: 8.6 GB.

So the right-semi shape shrinks the build side from ~200M to ~449K rows, cuts the semijoin hash spill from 512 → 2 batches, and drops requested work_mem from ~8.6 GB → ~14 MB.

End-to-end wall-clock is ~3% better here because on this small cluster the total is dominated by the three lineitem scans (l1/l2/l3) and the l3 anti-join spill that are common to both plans; the build-side / memory win should grow on larger, properly-sized clusters and with tighter memory settings.

I also added an ORCA regression test (src/test/regress/sql/rightsemijoin.sql, with both planner and _optimizer answer files) asserting the GUC flips the plan between right-semi/anti and the regular semi/anti shapes with identical results, and wrote the Q21 numbers into the commit message. PTAL, thanks!

kongfanshen-0801 · 2026-06-18T02:03:18Z

Follow-up on the Q21 numbers above — clarifying why the end-to-end gain is only ~3%, and showing the isolated win.

Where Q21's time actually goes (row executor)

Q21 scans lineitem three times and the cost is dominated by parts that are identical with the GUC on or off:

Hash Right Semi Join (l1 ⋈ l2)        <- the ONLY node right-semi changes
├─ Seq Scan lineitem l2 = 200,042,924
└─ Hash Anti Join (l1 ⋈ l3)
    ├─ Hash Join (l1 ⋈ supplier)
    │   └─ Seq Scan lineitem l1 = 126,464,414
    └─ Hash(build l3) = 126,464,414, Batches: 256 (spills)
        └─ Seq Scan lineitem l3 = 126,464,414

Three lineitem scans + per-row tuple deform ≈ 452M rows (~3 × 74 GB I/O plus deforming 16 columns/row in the row executor) — the single biggest chunk.
The l3 anti-join builds a 126M-row hash that spills to 256 batches — a separate large hash join that right-semi does not touch; identical on both sides.
The two big joins' probe work (hashing/comparing 200M + 126M rows + the l_suppkey <> l_suppkey join filter).

Right-semi only changes the build side of the outer semijoin (l1 ⋈ l2):

	semijoin build side	spill	work_mem wanted
on	small LHS, 448,837 rows, 2 batches, in-mem	none	14 MB
off	`lineitem l2`, 200,042,924 rows, 512 batches	~heavy	8.6 GB

The 200M l2 rows are scanned either way (as probe when on, as build when off), so the only net delta is the cost of building + spilling that 200M-row hash. Relative to the 452M-row scans and the l3 256-batch spill, that delta is small → ~3% wall-clock, even though total "Memory wanted" drops 60 GB → 38 GB.

Isolated semijoin: ~4× when it's the dominant cost

Removing the 3× full scans (small 10k-row LHS vs a unique 150M-row RHS, orders.o_orderkey, GPORCA), so the semijoin is the cost:

`optimizer_enable_right_semi_join`	plan	exec time (2 runs)
on	`Hash Right Semi Join` — build on 3,385-row LHS, no spill	19.7 / 19.7 s
off	fallback `HashAggregate(dedup 50M) + Hash Join` — 641 batches, 1.5 GB spill, 4.6 GB wanted	77.1 / 78.4 s

→ ~3.9× faster (≈20 s vs ≈77 s) once the build-side choice is the bottleneck.

Row vs vectorized

In the row executor, steps 1 + 3 (scanning/deforming ~452M rows and row-at-a-time hash/probe) dominate and dilute the win. In a vectorized/columnar engine those steps get much cheaper (column-pruned, batched), so the relative weight of avoiding the 200M-row build+spill grows and the Q21 wall-clock gain becomes more pronounced — consistent with the larger Q21 improvement seen on the vectorized build this was ported from.

Summary: the right-semi plan is genuinely chosen and the per-operator win is large (~4× isolated, 200M→449K build, 512→2 batches, 8.6 GB→14 MB); it just sits behind scan/deform-bound work in Q21 on the row executor, hence ~3% end-to-end here. optimizer_enable_right_semi_join defaults to on.

Teaches GPORCA to produce JOIN_RIGHT_SEMI / JOIN_RIGHT_ANTI hash joins, so the "Right Semi Join" optimization is available under the GPORCA optimizer (optimizer=on), not just the PostgreSQL planner. ORCA can now build the hash table on the smaller LHS of an IN/EXISTS/NOT EXISTS semijoin and emit the matched (semi) or unmatched (anti) LHS rows. Pieces: * Physical operators CPhysicalRightSemiHashJoin / CPhysicalRightAntiSemiHashJoin extending CPhysicalHashJoin; registered in COperator.h. * Implementation xforms CXformLeftSemiJoin2RightSemiHashJoin / CXformLeftAntiSemiJoin2RightAntiSemiHashJoin, registered in CXformFactory and added to the candidate sets of CLogicalLeftSemiJoin / CLogicalLeftAntiSemiJoin. New xform IDs are appended at the END of the CXform enum to keep the CXformSet (CEnumSet<EXformId, ExfSentinel>) ABI stable (mid-insertion crashes CSearchStage unless every TU is rebuilt; same lesson as the PG JoinType enum). * DXL: EdxljtRightSemijoin / EdxljtRightAntiSemijoin join types + tokens, with Expr->DXL and DXL->PlStmt / CTranslatorUtils mappings to the GPDB JOIN_RIGHT_SEMI / JOIN_RIGHT_ANTI executor join types. * Cost model: CostRightSemiHashJoin / CostRightAntiSemiHashJoin reuse the CostHashJoin formula with the build/probe roles swapped. Build-side swap in CTranslatorDXLToPlStmt (the key adaptation to the PG executor): the GPDB executor always builds the hash table on the plan's inner (right) child, but ORCA places the semantically-preserved LHS as the DXL left child. For JOIN_RIGHT_SEMI / JOIN_RIGHT_ANTI we therefore (a) translate the DXL right child as the executor's outer/probe and the DXL left child (LHS) as the inner/Hash build side, (b) keep the child contexts in outer-then-inner order so target list, quals and hash clauses get the correct OUTER_VAR/INNER_VAR, and (c) take the outer/inner hash keys from the opposite hash-clause operands to match the swapped sides. This produces exactly the plan shape the PostgreSQL planner emits (inner = LHS), so the existing PG JOIN_RIGHT_SEMI/ANTI executor support (from the preceding backport commits) runs it correctly -- this replaces Lightning's vec-executor PostBuildVecPlan swap, which Apache Cloudberry does not have. Adapted from hashdata-lightning commit c3be66315ee (by GongXun); the Arrow/Acero vectorized executor, parallel-hash variants, and vec-engine GUC cost gating were dropped (not applicable here). Verified against the PostgreSQL planner (optimizer=off) on a 25-case cross-optimizer correctness battery -- IN / EXISTS / NOT IN / NOT EXISTS, multi-column / text / correlated keys, hash / replicated / random distribution, empty / all-NULL / duplicate-key inputs, spill, self-semijoin, three-way and projection cases -- every result set matches, and EXPLAIN shows "Hash Right Semi Join" / "Hash Right Anti Join" building on the smaller LHS.

The cherry-picked upstream tests in join.sql produce optimizer- and MPP-dependent plan shapes, which made the ic-(orca-)parallel CI jobs fail: join_optimizer.out had no entry for the new hash-right-semi (tbl_rs) test, and the modified "semijoin selectivity for <>" EXPLAIN differs from upstream. - Wrap the two plan-sensitive EXPLAINs (the <> semijoin selectivity query and the tbl_rs right-semi rescan query) in -- start_ignore / -- end_ignore so the plan shape is not compared; the deterministic query results are still checked. - join_optimizer.out (GPORCA): add the tbl_rs section -- ORCA produces a Hash Right Semi Join and returns the 10 expected rows -- and refresh the <> section for the new query. - join.out (Postgres planner): the tbl_rs correlated query hits the planner's "skip-level correlations not supported" limitation, so the error is recorded as the expected output; ORCA handles it. Only these two backport-related sections change; the rest of the answer files are left at their canonical content.

…H Q21 results Adds a GUC (default on) to enable/disable GPORCA's right semi/anti hash join xforms (CXformLeftSemiJoin2RightSemiHashJoin / CXformLeftAntiSemiJoin2RightAntiSemiHashJoin). When turned off, the xforms are disabled via the standard CConfigParamMapping traceflag mechanism (GPOPT_DISABLE_XFORM_TF), so ORCA falls back to its regular left-semi / anti plans. Serves both as a kill-switch for the new plan shape and as the on/off toggle for before/after performance comparisons (e.g. TPC-H Q21). The GUC is registered in unsync_guc_name.h (per-segment, no QD/QE sync required), matching the other optimizer_enable_* developer GUCs. SET optimizer_enable_right_semi_join = off; -- ORCA uses plain Hash Semi Join SET optimizer_enable_right_semi_join = on; -- ORCA may use Hash Right Semi Join Regression test --------------- Adds src/test/regress/sql/rightsemijoin.sql (registered in greenplum_schedule) with both planner (rightsemijoin.out) and GPORCA (rightsemijoin_optimizer.out) answer files. Under GPORCA it asserts that the GUC flips the plan between Hash Right Semi/Anti Join (build on the small LHS) and the regular Hash Semi/Anti Join, and that results are identical either way. TPC-H Q21 before/after (SF=100) ------------------------------- TPC-H Q21 contains an EXISTS (semijoin) and a NOT EXISTS (anti join) over lineitem, so it exercises both new plan shapes. Measured on a 3-segment single-host demo cluster, SF=100 (lineitem 600,037,902 rows), GPORCA, 3 timed runs each: optimizer_enable_right_semi_join = on 355.8 / 340.1 / 339.2 s (best 339.2) optimizer_enable_right_semi_join = off 353.0 / 349.3 / 352.2 s (best 349.3) The semijoin (l1 EXISTS l2) is where the two plans differ: ON : Hash Right Semi Join - builds the hash on the small LHS (448,837 rows, 2 batches, ~17 MB); streams lineitem l2 (200M rows). OFF : Hash Semi Join - builds the hash on lineitem l2 (200,042,924 rows, 512 batches -> heavy spill); Work_mem wanted 8.6 GB. So the right-semi plan shrinks the build side from ~200M rows to ~449K rows, cuts the semijoin hash spill from 512 batches to 2, and drops the requested work_mem from ~8.6 GB to ~14 MB. End-to-end wall-clock is ~3% better here because total runtime on this small cluster is dominated by the three lineitem scans (l1/l2/l3) and the l3 anti-join spill, which are common to both plans; the build-side/memory win grows on larger, properly-sized clusters and tighter memory settings.

kongfanshen-0801 marked this pull request as ready for review June 3, 2026 08:26

Richard Guo and others added 3 commits June 4, 2026 10:01

kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from a869ab3 to 4328d2f Compare June 4, 2026 02:04

kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch 4 times, most recently from 5dc8186 to 8ad6a81 Compare June 17, 2026 10:49

kongfanshen added 3 commits June 18, 2026 10:23

kongfanshen-0801 force-pushed the feature/right-semi-join-backport branch from 8ad6a81 to 34dfda2 Compare June 18, 2026 02:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799

Backport "Right Semi Join" (Hash Right Semi Join) from PostgreSQL 18#1799
kongfanshen-0801 wants to merge 6 commits into
apache:mainfrom
kongfanshen-0801:feature/right-semi-join-backport

kongfanshen-0801 commented Jun 3, 2026

Uh oh!

my-ship-it commented Jun 3, 2026

Uh oh!

kongfanshen-0801 commented Jun 4, 2026

Uh oh!

my-ship-it commented Jun 4, 2026

Uh oh!

kongfanshen-0801 commented Jun 17, 2026

Uh oh!

kongfanshen-0801 commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kongfanshen-0801 commented Jun 3, 2026

What

Cloudberry/GPDB-specific adaptations

Testing

Notes

Uh oh!

my-ship-it commented Jun 3, 2026

Uh oh!

kongfanshen-0801 commented Jun 4, 2026

Uh oh!

my-ship-it commented Jun 4, 2026

Uh oh!

kongfanshen-0801 commented Jun 17, 2026

Uh oh!

kongfanshen-0801 commented Jun 18, 2026

Where Q21's time actually goes (row executor)

Isolated semijoin: ~4× when it's the dominant cost

Row vs vectorized

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants