Add partitioned probe support for hash joins #22108
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.

/ok to test f10754d

/ok to test c14cbba

/ok to test f9a260f

/ok to test 5b5e0af
shrshi
left a comment
Two non-blocking nits, but looks great otherwise! :)
/ok to test 6cd86fb
wence-
left a comment
Requesting a few clarifications in the docstrings because every time I read the join docstrings I get confused.
>  *
>  * @return A pair of device vectors [`left_indices`, `right_indices`] for this partition
>  */
> [[nodiscard]] std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
I am trying to understand how to use this function, and I must admit I am struggling with this docstring.
Suppose I am doing an inner join left ∩ right. I think what I do is:

1. Create hash join object `hasher` with the build table `right`
2. Call `hasher.inner_join_match_context` with the complete probe table `left`
3. Having observed that the output size is "too big", call `hasher.partitioned_inner_join(partition_ctx)` to obtain gather indices for the `left` and `right` tables (probe and build). The `partition_ctx` is the `join_match_context` plus slice bounds of the probe (left) table.
4. Gather from `left` and `right` (unsliced)

What does the `join_partition_context` buy me over just `cudf::slice`ing the left table once I have determined that the join output will be too big?
Is it that `partitioned_left_join` and `partitioned_full_join` need to manage more internal state so you want a symmetric interface?
join_partition_context holds per-row match counts computed once by inner_join_match_context() over the full probe table. Each partitioned_inner_join uses a slice of it to size outputs and compute offsets, avoiding re-probing for counts. Calling inner_join per partition would redo that work. The full-join variant also uses the same context to carry build-side complement info, hence the symmetric design.
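The sizing-and-offsets reuse described above can be illustrated with a small host-side sketch. This is plain C++, not the cudf implementation; the helper names `partition_output_size` and `partition_offsets` are hypothetical, and real code does this work on device vectors:

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: given per-probe-row match counts (computed once over the
// whole probe table) and a partition [start, end), return the total number of
// join-output rows that partition produces. This is the sizing work that a
// join_partition_context lets each partitioned join call reuse instead of
// re-probing the hash table for counts.
inline std::int64_t partition_output_size(std::vector<int> const& match_counts,
                                          std::size_t start,
                                          std::size_t end)
{
  return std::accumulate(
    match_counts.begin() + start, match_counts.begin() + end, std::int64_t{0});
}

// An exclusive prefix sum over a partition's counts gives each probe row's
// write offset into that partition's output buffer.
inline std::vector<std::int64_t> partition_offsets(std::vector<int> const& match_counts,
                                                   std::size_t start,
                                                   std::size_t end)
{
  std::vector<std::int64_t> offsets(end - start, 0);
  std::int64_t running = 0;
  for (std::size_t i = start; i < end; ++i) {
    offsets[i - start] = running;
    running += match_counts[i];
  }
  return offsets;
}
```

Slicing the probe table instead would force each partition to probe twice (once to count, once to gather); keeping the counts in the context does the counting pass exactly once.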
>  * Call this method after calling `partitioned_full_join()` for every partition. It combines
>  * the per-partition probe indices with the unmatched build row indices (a global property
>  * across all partitions) and returns a single `(left_indices, right_indices)` pair equivalent
>  * to the output of `full_join()`.
question: Could we use this same facility to perform a left-join (say) where the left table is the "build" table and we want to keep track of which rows in the left table have no match in any right table so we can emit them with nulls appropriately?
We don't have the utility to support the left table as the build table for full join, so it's not supported in the current setup.
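For intuition, the combining step the docstring describes can be modeled in host-side C++. This is an illustrative sketch, not the cudf implementation; `kNoMatch` and `finalize_full_join_sketch` are hypothetical names, the real API operates on device spans, and cudf represents "no match" with an out-of-bounds gather index that a nullifying gather turns into null:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sentinel standing in for "no matching probe row".
constexpr int kNoMatch = -1;

// Hypothetical finalize: concatenate the per-partition (left, right) index
// pairs, then append one (kNoMatch, build_row) pair for every build row that
// no partition matched -- the "global property across all partitions".
inline std::pair<std::vector<int>, std::vector<int>> finalize_full_join_sketch(
  std::vector<std::vector<int>> const& left_partials,
  std::vector<std::vector<int>> const& right_partials,
  int build_table_num_rows)
{
  std::vector<int> left_out, right_out;
  std::vector<bool> build_matched(build_table_num_rows, false);
  for (std::size_t p = 0; p < left_partials.size(); ++p) {
    for (std::size_t i = 0; i < left_partials[p].size(); ++i) {
      left_out.push_back(left_partials[p][i]);
      right_out.push_back(right_partials[p][i]);
      build_matched[right_partials[p][i]] = true;  // mark build row as seen
    }
  }
  for (int b = 0; b < build_table_num_rows; ++b) {
    if (!build_matched[b]) {  // build row unmatched by every partition
      left_out.push_back(kNoMatch);
      right_out.push_back(b);
    }
  }
  return {left_out, right_out};
}
```

The key point is that "unmatched build rows" can only be decided after all partitions have run, which is why this lives in a separate finalize call rather than in each per-partition join.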
> auto count_matches = [&](auto equality, auto d_hasher) {
>   auto const iter = cudf::detail::make_counting_transform_iterator(0, pair_fn{d_hasher});
>   // Precompute probe keys: {hash(row_idx), row_idx} for each probe row.
>   auto const n = static_cast<cuda::std::int64_t>(probe_table_num_rows);
```diff
-  auto const n = static_cast<cuda::std::int64_t>(probe_table_num_rows);
+  auto const n = static_cast<cuda::std::size_t>(probe_table_num_rows);
```

?
This is intentional, as we use the signed int64_t for all kernel thread indices to avoid overflow with large size_type.

cudf/cpp/include/cudf/types.hpp, line 87 in aa3cdee

Let me update it to `thread_index_type` for clarity.
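The overflow concern can be seen without a GPU. This host-side sketch simulates the global-index arithmetic (the helper names are hypothetical; the grid numbers are only illustrative of one large enough to exceed INT32_MAX):

```cpp
#include <cstdint>

// Widening to a signed 64-bit index type before multiplying keeps the
// arithmetic exact for grids covering more than 2^31 elements.
inline std::int64_t global_index_64(std::int64_t block_idx,
                                    std::int64_t block_dim,
                                    std::int64_t thread_idx)
{
  return block_idx * block_dim + thread_idx;
}

// Deliberately 32-bit: the product wraps for large grids. (The unsigned
// casts make the wraparound well-defined so the effect can be observed.)
inline std::int32_t global_index_32(std::int32_t block_idx,
                                    std::int32_t block_dim,
                                    std::int32_t thread_idx)
{
  return static_cast<std::int32_t>(
    static_cast<std::uint32_t>(block_idx) * static_cast<std::uint32_t>(block_dim) +
    static_cast<std::uint32_t>(thread_idx));
}
```

With 10,000,000 blocks of 256 threads, the true index of the first thread in the last block is 2,560,000,000, which does not fit in a signed 32-bit integer; the 64-bit version is exact while the 32-bit version wraps negative.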
Actionable comments posted: 5
🧹 Nitpick comments (3)
cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh (1)

15-16: ⚡ Quick win: Use `cudf::detail::device_scalar` for the temporary counter.

This temporary counter should follow the repo convention instead of using `rmm::device_scalar` directly. As per coding guidelines, "Use cudf::detail::device_scalar instead of rmm::device_scalar."

Suggested fix:

```diff
-#include <rmm/device_scalar.hpp>
+#include <cudf/detail/utilities/device_scalar.hpp>
@@
-  rmm::device_scalar<size_type> output_counter(size_type{0}, stream);
+  cudf::detail::device_scalar<size_type> output_counter(size_type{0}, stream);
```

Also applies to: 243-243

cpp/src/join/hash_join/hash_join.cu (1)

262-287: ⚡ Quick win: Add `CUDF_FUNC_RANGE()` to the new public hash-join entry points.

These methods are public API shims that delegate directly to `_impl`, so they should open an NVTX range like the other public entry points under `cpp/src/`. As per coding guidelines for `cpp/src/**/*.{cu,cpp}`: "Use CUDF_FUNC_RANGE() in public functions before delegating to detail:: functions."

Suggested fix:

```diff
 std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
           std::unique_ptr<rmm::device_uvector<size_type>>>
 hash_join::partitioned_inner_join(cudf::join_partition_context const& context,
                                   rmm::cuda_stream_view stream,
                                   rmm::device_async_resource_ref mr) const
 {
+  CUDF_FUNC_RANGE();
   return _impl->partitioned_inner_join(context, stream, mr);
 }

 std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
           std::unique_ptr<rmm::device_uvector<size_type>>>
 hash_join::partitioned_left_join(cudf::join_partition_context const& context,
                                  rmm::cuda_stream_view stream,
                                  rmm::device_async_resource_ref mr) const
 {
+  CUDF_FUNC_RANGE();
   return _impl->partitioned_left_join(context, stream, mr);
 }

 std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
           std::unique_ptr<rmm::device_uvector<size_type>>>
 hash_join::partitioned_full_join(cudf::join_partition_context const& context,
                                  rmm::cuda_stream_view stream,
                                  rmm::device_async_resource_ref mr) const
 {
+  CUDF_FUNC_RANGE();
   return _impl->partitioned_full_join(context, stream, mr);
 }
```

cpp/src/join/hash_join/finalize_partitioned_full_join.cpp (1)

19-30: ⚡ Quick win: Instrument this public finalizer with `CUDF_FUNC_RANGE()`.

This is another public API wrapper that immediately forwards into `detail::finalize_full_join`, so it should open an NVTX range as well. As per coding guidelines for `cpp/src/**/*.{cu,cpp}`: "Use CUDF_FUNC_RANGE() in public functions before delegating to detail:: functions."

Suggested fix:

```diff
 #include "join/join_common_utils.hpp"

+#include <cudf/detail/nvtx/ranges.hpp>
 #include <cudf/join/hash_join.hpp>
 #include <cudf/join/join.hpp>
 #include <cudf/types.hpp>
 #include <cudf/utilities/memory_resource.hpp>
 #include <cudf/utilities/span.hpp>
@@
 hash_join::finalize_partitioned_full_join(
   cudf::host_span<cudf::device_span<size_type const> const> left_partials,
   cudf::host_span<cudf::device_span<size_type const> const> right_partials,
   size_type probe_table_num_rows,
   size_type build_table_num_rows,
   rmm::cuda_stream_view stream,
   rmm::device_async_resource_ref mr)
 {
+  CUDF_FUNC_RANGE();
   return cudf::detail::finalize_full_join(
     left_partials, right_partials, probe_table_num_rows, build_table_num_rows, stream, mr);
 }
```
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/include/cudf/join/hash_join.hpp`:
- Around line 312-410: Update the Doxygen for the four new public
APIs—partitioned_inner_join, partitioned_left_join, partitioned_full_join, and
finalize_partitioned_full_join—to include `@throw` tags that enumerate the
precondition/invalid-context failures and the exception types users should
expect; specifically document cases such as: context not created by the
corresponding *_match_context, context partition bounds outside the probe table,
mismatched sizes or inconsistent left_partials/right_partials spans, invalid
probe_table_num_rows or build_table_num_rows, and any std::invalid_argument or
std::logic_error (or specific cudf exception types used by these functions)
thrown on those violations so callers know what to catch. Ensure each function’s
comment block contains a brief `@throw` line describing the condition and the
exception type and reference the function names (partitioned_inner_join,
partitioned_left_join, partitioned_full_join, finalize_partitioned_full_join) so
the reader can map failures to the correct API.
In `@cpp/src/join/hash_join/partitioned_count_kernels.cuh`:
- Around line 89-90: After launching partitioned_count_kernel<IsOuter> add an
immediate kernel launch error check by calling
CUDF_CUDA_TRY(cudaGetLastError()); right after the <<<...>>> invocation, and add
the required include <cudf/utilities/error.hpp> at the top of the file so
CUDF_CUDA_TRY is available; this ensures any launch failures are caught and
reported immediately for the partitioned_count_kernel<IsOuter> invocation.
In `@cpp/src/join/hash_join/partitioned_join_retrieve.cu`:
- Around line 69-105: Validate the join_partition_context (ensure
context.left_table_context is non-null and that left_start_idx and left_end_idx
form a valid range within the left table and within match_counts size) before
dereferencing match_ctx or slicing; if the context is malformed (null
left_table_context, left_start_idx > left_end_idx, or either index outside the
left table row count or match_counts length) return the appropriate empty-result
(same as the current empty-partition/empty-build branches) or throw a clear
error, and only then proceed to use match_ctx,
cudf::slice(probe_partition_view), _match_counts (partition_counts),
validate_hash_join_probe, and
cudf::detail::row::equality::preprocessed_table::create.
In `@cpp/src/join/hash_join/ref_types.cuh`:
- Around line 12-16: Add a direct include of <utility> to ref_types.cuh to bring
in std::declval (used in the decltype expressions around the ref type
detection), removing the transitive dependency on dispatch.cuh; update the top
of ref_types.cuh (alongside the existing includes of "dispatch.cuh",
"hash_join_impl.cuh", and <cuco/static_multiset.cuh>) to include <utility> so
declval is available even if dispatch.cuh no longer exposes it.
In `@cpp/tests/join/join_tests.cpp`:
- Around line 3206-3490: Add two new partitioned-probe tests: one that uses a
sliced probe table/view (non-zero row offset) to exercise offset handling and
one that uses a larger partition size that spans multiple device blocks to
exercise kernel-boundary logic. Implement the sliced-probe test alongside
HashJoinPartitionedInnerJoin/LeftJoin/FullJoin by creating a table with an
offset slice for the probe side and calling hash_joiner.partitioned_*_join with
part_ctx.left_start_idx/left_end_idx that reference non-zero offsets; validate
by comparing against the corresponding full join
(inner_join/left_join/full_join) sorted/equivalenced results. Implement the
multi-block-partition test similar to HashJoinPartitionedWholeTable but
construct probe and build tables large enough (or set
left_start_idx/left_end_idx large) so the partitioned join produces indices
spanning multiple GPU blocks, then gather and compare to the full join result.
Use the existing helpers: cudf::hash_join,
partitioned_inner_join/left_join/full_join, finalize_partitioned_full_join, and
CUDF_TEST_EXPECT_TABLES_EQUIVALENT to validate behavior.
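The partition-bounds precondition called out above for `partitioned_join_retrieve.cu` reduces to a simple range check. A host-side sketch (the helper name `valid_partition` is hypothetical; the field names follow the `left_start_idx`/`left_end_idx` mentioned in the comment):

```cpp
#include <cstdint>

// A partition [left_start_idx, left_end_idx) is well-formed when it is a
// non-inverted range that lies entirely within the probe table (and hence
// within the match_counts array, which has one entry per probe row).
inline bool valid_partition(std::int64_t left_start_idx,
                            std::int64_t left_end_idx,
                            std::int64_t probe_num_rows)
{
  return 0 <= left_start_idx && left_start_idx <= left_end_idx &&
         left_end_idx <= probe_num_rows;
}
```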
---
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 266e6c51-490c-4ecc-b78f-88e91fe12a8c
📒 Files selected for processing (28)
- cpp/CMakeLists.txt
- cpp/benchmarks/join/join.cu
- cpp/include/cudf/detail/join/hash_join.hpp
- cpp/include/cudf/join/hash_join.hpp
- cpp/include/cudf/join/join.hpp
- cpp/src/join/conditional_join.cu
- cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
- cpp/src/join/hash_join/hash_join.cu
- cpp/src/join/hash_join/kernels_common.cuh
- cpp/src/join/hash_join/match_context.cu
- cpp/src/join/hash_join/partitioned_count.cu
- cpp/src/join/hash_join/partitioned_count_kernels.cuh
- cpp/src/join/hash_join/partitioned_count_kernels.hpp
- cpp/src/join/hash_join/partitioned_count_outer.cu
- cpp/src/join/hash_join/partitioned_full_join.cu
- cpp/src/join/hash_join/partitioned_inner_join.cu
- cpp/src/join/hash_join/partitioned_join_retrieve.cu
- cpp/src/join/hash_join/partitioned_left_join.cu
- cpp/src/join/hash_join/partitioned_retrieve.cu
- cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
- cpp/src/join/hash_join/partitioned_retrieve_kernels.hpp
- cpp/src/join/hash_join/partitioned_retrieve_outer.cu
- cpp/src/join/hash_join/ref_types.cuh
- cpp/src/join/hash_join/retrieve_impl.cuh
- cpp/src/join/join_common_utils.hpp
- cpp/src/join/join_utils.cu
- cpp/src/join/mixed_join.cu
- cpp/tests/join/join_tests.cpp
- Validate join_partition_context bounds in partitioned_join_retrieve
- Document throw conditions on new partitioned_*_join APIs
- Add CUDF_FUNC_RANGE to new partitioned shims and finalize_*_full_join
- Use cudf::detail::device_scalar and direct <utility> include
- Add sliced + multi-block partitioned probe test
🧹 Nitpick comments (1)
cpp/src/join/hash_join/partitioned_join_retrieve.cu (1)
124-131: 💤 Low value: Specify memory resource for temporary `left_keys` allocation.

Per coding guidelines, temporary memory should explicitly use `cudf::get_current_device_resource_ref()` rather than relying on the implicit default.

Suggested fix:

```diff
-  rmm::device_uvector<probe_key_type> left_keys(n, stream);
+  rmm::device_uvector<probe_key_type> left_keys(n, stream, cudf::get_current_device_resource_ref());
```

As per coding guidelines: "Allocate temporary memory using cudf::get_current_device_resource_ref(), not the passed-in MR parameter"
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1f736ad3-e0fa-412b-85a5-042a49fda753
📒 Files selected for processing (7)
- cpp/include/cudf/join/hash_join.hpp
- cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
- cpp/src/join/hash_join/hash_join.cu
- cpp/src/join/hash_join/partitioned_join_retrieve.cu
- cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
- cpp/src/join/hash_join/ref_types.cuh
- cpp/tests/join/join_tests.cpp
✅ Files skipped from review due to trivial changes (2)
- cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
- cpp/src/join/hash_join/hash_join.cu
🚧 Files skipped from review as they are similar to previous changes (3)
- cpp/include/cudf/join/hash_join.hpp
- cpp/tests/join/join_tests.cpp
- cpp/src/join/hash_join/ref_types.cuh
/ok to test 0a266a8
Description
Closes #18677
This introduces partitioned probe support for hash joins and refactors the join internals to reduce duplication.
Checklist