Add partitioned probe support for hash joins #22108

Open
PointKernel wants to merge 38 commits into rapidsai:main from PointKernel:chunked-hash-join-probe

Conversation

@PointKernel PointKernel commented Apr 10, 2026

Description

Closes #18677

This introduces partitioned probe support for hash joins and refactors the join internals to reduce duplication.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot Bot commented Apr 10, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions Bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels Apr 10, 2026
@PointKernel PointKernel changed the title Chunked hash join probe Add partitioned probe support for hash joins Apr 10, 2026
@PointKernel PointKernel added the feature request (New feature or request) and non-breaking (Non-breaking change) labels Apr 10, 2026
@PointKernel (Member Author)

/ok to test f10754d

@PointKernel (Member Author)

/ok to test c14cbba

@PointKernel PointKernel marked this pull request as ready for review April 17, 2026 23:48
@PointKernel PointKernel requested review from a team as code owners April 17, 2026 23:48
@PointKernel PointKernel requested a review from mythrocks April 17, 2026 23:48

copy-pr-bot Bot commented Apr 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@PointKernel (Member Author)

/ok to test f9a260f

@PointKernel (Member Author)

/ok to test 5b5e0af

@shrshi (Contributor) left a comment


Two non-blocking nits, but looks great otherwise! :)

Comment thread cpp/src/join/hash_join/partitioned_count_kernels.cuh Outdated
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
@PointKernel (Member Author)

/ok to test 6cd86fb

@wence- (Contributor) left a comment


Requesting a few clarifications in the docstrings because every time I read the join docstrings I get confused.

*
* @return A pair of device vectors [`left_indices`, `right_indices`] for this partition
*/
[[nodiscard]] std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
Contributor:

I am trying to understand how to use this function, and I must admit I am struggling with this docstring.

Suppose I am doing an inner join left \cap right. I think what I do is:

  • Create hash join object hasher with the build table right
  • call hasher.inner_join_match_context with the complete probe table left
  • Having observed that the output size is "too big", call hasher.partitioned_inner_join(partition_ctx) to obtain gather indices for the left and right tables (probe and build). The partition_ctx is the join_match_context plus slice bounds of the probe (left) table.
  • Gather from left and right (unsliced)

What does the join_partition_context buy me over simply calling cudf::slice on the left table once I have determined that the join output will be too big?

Is it that partitioned_left_join and partitioned_full_join need to manage more internal state so you want a symmetric interface?

Member Author:

join_partition_context holds per-row match counts computed once by inner_join_match_context() over the full probe table. Each partitioned_inner_join uses a slice of it to size outputs and compute offsets, avoiding re-probing for counts. Calling inner_join per partition would redo that work. The full-join variant also uses the same context to carry build-side complement info, hence the symmetric design.
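The count-then-retrieve bookkeeping described here can be modeled on the host in plain C++. This is a toy sketch, not the cudf implementation: `partition_ctx` and `size_and_offsets` are hypothetical stand-ins showing how precomputed per-row match counts let each partition size its output (sum over the slice) and assign per-row write offsets (exclusive scan over the slice) without re-probing.

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Toy model of a join_partition_context: per-row match counts for the whole
// probe table (computed once up front), plus a [start, end) slice describing
// one partition of probe rows.
struct partition_ctx {
  std::vector<std::int64_t> match_counts;
  std::size_t start;
  std::size_t end;
};

// The output size for a partition is the sum of its slice of counts; per-row
// write offsets are the exclusive scan of the same slice. No re-probing needed.
std::pair<std::int64_t, std::vector<std::int64_t>> size_and_offsets(partition_ctx const& ctx)
{
  auto first = ctx.match_counts.begin() + ctx.start;
  auto last  = ctx.match_counts.begin() + ctx.end;
  std::int64_t const total = std::accumulate(first, last, std::int64_t{0});
  std::vector<std::int64_t> offsets(ctx.end - ctx.start);
  std::exclusive_scan(first, last, offsets.begin(), std::int64_t{0});
  return {total, offsets};
}
```

With counts {2, 0, 3, 1, 4, 0} and the partition [2, 5), the output holds 3 + 1 + 4 = 8 index pairs and rows write at offsets {0, 3, 4}.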

Comment thread cpp/include/cudf/join/hash_join.hpp
Comment thread cpp/include/cudf/join/hash_join.hpp
* Call this method after calling `partitioned_full_join()` for every partition. It combines
* the per-partition probe indices with the unmatched build row indices (a global property
* across all partitions) and returns a single `(left_indices, right_indices)` pair equivalent
* to the output of `full_join()`.
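The combining step this docstring describes can be modeled on the host with plain C++. This is a hypothetical toy sketch, not the device-side implementation: `no_match` stands in for the join sentinel, and `finalize_full_join_toy` shows why unmatched build rows are a global property — a build row is unmatched only if no partition matched it.

```cpp
#include <set>
#include <utility>
#include <vector>

// Hypothetical sentinel for "no matching row" in this toy model.
constexpr int no_match = -1;

// Concatenate per-partition (probe_idx, build_idx) partials, then append the
// build rows that no partition matched, paired with the sentinel.
std::vector<std::pair<int, int>> finalize_full_join_toy(
  std::vector<std::vector<std::pair<int, int>>> const& partials, int build_num_rows)
{
  std::vector<std::pair<int, int>> result;
  std::set<int> matched;  // matched-ness is global across all partitions
  for (auto const& part : partials) {
    for (auto const& [l, r] : part) {
      result.emplace_back(l, r);
      if (r != no_match) matched.insert(r);
    }
  }
  for (int b = 0; b < build_num_rows; ++b) {
    if (!matched.count(b)) result.emplace_back(no_match, b);
  }
  return result;
}
```

For partials {(0,2), (1,no_match)} and {(2,0)} against 4 build rows, build rows 1 and 3 are unmatched everywhere, so the final result is the 3 partial pairs plus (no_match,1) and (no_match,3).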
Contributor:

question: Could we use this same facility to perform a left-join (say) where the left table is the "build" table and we want to keep track of which rows in the left table have no match in any right table so we can emit them with nulls appropriately?

Member Author:

We don't have the utility to support the left table as the build table for full join, so it's not supported in the current setup.

Comment thread cpp/src/join/hash_join/match_context.cu Outdated
auto count_matches = [&](auto equality, auto d_hasher) {
auto const iter = cudf::detail::make_counting_transform_iterator(0, pair_fn{d_hasher});
// Precompute probe keys: {hash(row_idx), row_idx} for each probe row.
auto const n = static_cast<cuda::std::int64_t>(probe_table_num_rows);
Contributor:

Suggested change
auto const n = static_cast<cuda::std::int64_t>(probe_table_num_rows);
auto const n = static_cast<cuda::std::size_t>(probe_table_num_rows);

?

Member Author:

This is intentional: we use the signed int64_t for all kernel thread indices to avoid overflow with large size_type.

using thread_index_type = int64_t; ///< Thread index type in kernels

Let me update it to thread_index_type for clarity.
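As a host-side illustration of why the widened index type matters (a sketch; `global_thread_index` is a hypothetical helper mirroring the quoted `thread_index_type` alias, not a libcudf function):

```cpp
#include <cstdint>
#include <limits>

// Global thread index with every term widened to a signed 64-bit type before
// the multiply, so block_idx * block_dim cannot overflow 32-bit arithmetic
// even for grids large enough to cover a size_type-sized table.
using thread_index_type = std::int64_t;

thread_index_type global_thread_index(unsigned block_idx, unsigned block_dim, unsigned thread_idx)
{
  return static_cast<thread_index_type>(block_idx) * block_dim + thread_idx;
}
```

For example, the first thread of the ten-millionth block of 256 threads has global index 2,560,000,000, which already exceeds INT32_MAX (2,147,483,647); computing it in 32-bit arithmetic would overflow.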

Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh Outdated
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh Outdated
Comment thread cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh Outdated
Comment thread cpp/src/join/join_utils.cu
@PointKernel PointKernel requested a review from wence- May 8, 2026 23:54

coderabbitai Bot commented May 9, 2026

Review Change Stack
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: cc6efd0b-aa04-4fc8-bc13-6eb8250130a2

📥 Commits

Reviewing files that changed from the base of the PR and between be8c452 and 0a266a8.

📒 Files selected for processing (1)
  • cpp/CMakeLists.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/CMakeLists.txt

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • Partitioned hash-join execution added for inner, left, and full joins.
    • Benchmark mode to compare partitioned vs. standard join paths.
  • Improvements

    • Optimized partitioned probe/count/retrieve flow and a unified, more efficient full-join finalization.
  • Tests

    • Expanded join test coverage with parameterized cases validating partitioned workflows and edge conditions.

Walkthrough

Implements partitioned (count+retrieve) hash-join: new count/retrieve kernels and launchers, partition-level inner/left/full APIs, per-partition finalization, refactor of full-join finalization, explicit template instantiations, tests, benchmark mode, and CMake additions.

Changes

Partitioned Hash Join Feature

Layer / File(s) Summary
Public API Contracts & Semantics
cpp/include/cudf/join/join.hpp, cpp/include/cudf/join/hash_join.hpp, cpp/include/cudf/detail/join/hash_join.hpp, cpp/src/join/hash_join/hash_join.cu, cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
Adds partitioned join declarations and public wrappers (partitioned_inner_join, partitioned_left_join, partitioned_full_join) and finalize_partitioned_full_join; makes join_match_context non-copyable and movable.
Kernel Type & Reference Infrastructure
cpp/src/join/hash_join/kernels_common.cuh, cpp/src/join/hash_join/ref_types.cuh
Introduces probe_key_type and centralized equality/count-ref type aliases for primitive/nested/flat dispatch paths used by kernels.
Count Phase
cpp/src/join/hash_join/partitioned_count_kernels.hpp, cpp/src/join/hash_join/partitioned_count_kernels.cuh, cpp/src/join/hash_join/partitioned_count.cu, cpp/src/join/hash_join/partitioned_count_outer.cu, cpp/src/join/hash_join/match_context.cu
Adds partitioned match-count kernel and host launcher with warp/tile reduction and outer-join clamping; match_context now precomputes probe keys and calls launch_partitioned_count.
Retrieve Phase
cpp/src/join/hash_join/partitioned_retrieve_kernels.hpp, cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh, cpp/src/join/hash_join/partitioned_retrieve.cu, cpp/src/join/hash_join/partitioned_retrieve_outer.cu
Adds partitioned retrieve kernel with shared buffering and atomic-amortized global writes; launcher sizes and allocates output vectors from match counts and supports outer-join JoinNoMatch.
Partitioned Join Orchestration
cpp/src/join/hash_join/partitioned_join_retrieve.cu, cpp/src/join/hash_join/partitioned_inner_join.cu, cpp/src/join/hash_join/partitioned_left_join.cu, cpp/src/join/hash_join/partitioned_full_join.cu
Orchestrates per-partition work: empty-left/build cases, probe slicing, key preprocessing, comparator dispatch, and launches count+retrieve; thin wrappers per join kind.
Full-Join Finalization Consolidation
cpp/src/join/join_common_utils.hpp, cpp/src/join/join_utils.cu, cpp/src/join/hash_join/retrieve_impl.cuh, cpp/src/join/conditional_join.cu, cpp/src/join/mixed_join.cu
Centralizes finalize_full_join overloads handling single-probe and per-partition partials, replacing earlier complement/concatenate helpers with scatter/copy_if-based unmatched-row emission.
Build, Benchmarks & Tests
cpp/CMakeLists.txt, cpp/benchmarks/join/join.cu, cpp/tests/join/join_tests.cpp
Registers new sources in CMake; benchmark nvbench_inner_join gains mode axis (normal,partitioned); tests extended with partitioned inner/left/full and edge-case coverage.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • mhaseeb123
  • mythrocks
  • wence-
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 16.22%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Title check — ✅ Passed: the PR title 'Add partitioned probe support for hash joins' directly and clearly describes the main change: introducing partitioned probe functionality for hash joins.
  • Description check — ✅ Passed: the PR description relates to the changeset by mentioning partitioned probe support, referencing the related issue #18677, and indicating tests and documentation were updated.
  • Linked Issues check — ✅ Passed: the PR implements partitioned probe support for hash joins as requested in #18677, adding new partitioned_inner_join, partitioned_left_join, and partitioned_full_join APIs that enable chunked probe processing with count-then-retrieve phases.
  • Out of Scope Changes check — ✅ Passed: all changes support the core objective of adding partitioned probe APIs and associated infrastructure (kernels, utilities, tests). Refactoring of full-join finalization logic consolidates duplication and directly supports the partitioned implementation.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai Bot left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh (1)

15-16: ⚡ Quick win

Use cudf::detail::device_scalar for the temporary counter.

This temporary counter should follow the repo convention instead of using rmm::device_scalar directly.

As per coding guidelines, "Use cudf::detail::device_scalar instead of rmm::device_scalar."

Suggested fix
-#include <rmm/device_scalar.hpp>
+#include <cudf/detail/utilities/device_scalar.hpp>
@@
-  rmm::device_scalar<size_type> output_counter(size_type{0}, stream);
+  cudf::detail::device_scalar<size_type> output_counter(size_type{0}, stream);

Also applies to: 243-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh` around lines 15 -
16, Replace usage of rmm::device_scalar with cudf::detail::device_scalar for the
temporary counter: update the includes and any declarations where
rmm::device_scalar<T> is used (e.g., the include/usage around device_scalar in
partitioned_retrieve_kernels.cuh and the instance referenced near line 243) to
cudf::detail::device_scalar<T>, and adjust or add the appropriate cudf header if
needed so the symbol resolves; ensure any constructor and .value() calls remain
compatible with cudf::detail::device_scalar.
cpp/src/join/hash_join/hash_join.cu (1)

262-287: ⚡ Quick win

Add CUDF_FUNC_RANGE() to the new public hash-join entry points.

These methods are public API shims that delegate directly to _impl, so they should open an NVTX range like the other public entry points under cpp/src/.

Suggested fix
 std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
           std::unique_ptr<rmm::device_uvector<size_type>>>
 hash_join::partitioned_inner_join(cudf::join_partition_context const& context,
                                   rmm::cuda_stream_view stream,
                                   rmm::device_async_resource_ref mr) const
 {
+  CUDF_FUNC_RANGE();
   return _impl->partitioned_inner_join(context, stream, mr);
 }
 
 std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
           std::unique_ptr<rmm::device_uvector<size_type>>>
 hash_join::partitioned_left_join(cudf::join_partition_context const& context,
                                  rmm::cuda_stream_view stream,
                                  rmm::device_async_resource_ref mr) const
 {
+  CUDF_FUNC_RANGE();
   return _impl->partitioned_left_join(context, stream, mr);
 }
 
 std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
           std::unique_ptr<rmm::device_uvector<size_type>>>
 hash_join::partitioned_full_join(cudf::join_partition_context const& context,
                                  rmm::cuda_stream_view stream,
                                  rmm::device_async_resource_ref mr) const
 {
+  CUDF_FUNC_RANGE();
   return _impl->partitioned_full_join(context, stream, mr);
 }

As per coding guidelines, cpp/src/**/*.{cu,cpp}: Use CUDF_FUNC_RANGE() in public functions before delegating to detail:: functions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/join/hash_join/hash_join.cu` around lines 262 - 287, The three public
shim functions partitioned_inner_join, partitioned_left_join, and
partitioned_full_join should open an NVTX range before delegating to _impl; add
a CUDF_FUNC_RANGE() call as the first statement in each function (i.e., inside
hash_join::partitioned_inner_join, hash_join::partitioned_left_join, and
hash_join::partitioned_full_join) and then return _impl->... as now so the
public entry points mirror other cpp/src/* public functions.
cpp/src/join/hash_join/finalize_partitioned_full_join.cpp (1)

19-30: ⚡ Quick win

Instrument this public finalizer with CUDF_FUNC_RANGE().

This is another public API wrapper that immediately forwards into detail::finalize_full_join, so it should open an NVTX range as well.

Suggested fix
 #include "join/join_common_utils.hpp"
 
+#include <cudf/detail/nvtx/ranges.hpp>
 #include <cudf/join/hash_join.hpp>
 #include <cudf/join/join.hpp>
 #include <cudf/types.hpp>
 #include <cudf/utilities/memory_resource.hpp>
 #include <cudf/utilities/span.hpp>
@@
 hash_join::finalize_partitioned_full_join(
   cudf::host_span<cudf::device_span<size_type const> const> left_partials,
   cudf::host_span<cudf::device_span<size_type const> const> right_partials,
   size_type probe_table_num_rows,
   size_type build_table_num_rows,
   rmm::cuda_stream_view stream,
   rmm::device_async_resource_ref mr)
 {
+  CUDF_FUNC_RANGE();
   return cudf::detail::finalize_full_join(
     left_partials, right_partials, probe_table_num_rows, build_table_num_rows, stream, mr);
 }

As per coding guidelines, cpp/src/**/*.{cu,cpp}: Use CUDF_FUNC_RANGE() in public functions before delegating to detail:: functions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/join/hash_join/finalize_partitioned_full_join.cpp` around lines 19 -
30, The public wrapper function hash_join::finalize_partitioned_full_join should
open an NVTX range: add a CUDF_FUNC_RANGE() invocation as the first statement
inside finalize_partitioned_full_join (before calling
cudf::detail::finalize_full_join) so the public API is instrumented; ensure the
macro is placed in the body of finalize_partitioned_full_join around the
delegation to finalize_full_join.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/include/cudf/join/hash_join.hpp`:
- Around line 312-410: Update the Doxygen for the four new public
APIs—partitioned_inner_join, partitioned_left_join, partitioned_full_join, and
finalize_partitioned_full_join—to include `@throw` tags that enumerate the
precondition/invalid-context failures and the exception types users should
expect; specifically document cases such as: context not created by the
corresponding *_match_context, context partition bounds outside the probe table,
mismatched sizes or inconsistent left_partials/right_partials spans, invalid
probe_table_num_rows or build_table_num_rows, and any std::invalid_argument or
std::logic_error (or specific cudf exception types used by these functions)
thrown on those violations so callers know what to catch. Ensure each function’s
comment block contains a brief `@throw` line describing the condition and the
exception type and reference the function names (partitioned_inner_join,
partitioned_left_join, partitioned_full_join, finalize_partitioned_full_join) so
the reader can map failures to the correct API.

In `@cpp/src/join/hash_join/partitioned_count_kernels.cuh`:
- Around line 89-90: After launching partitioned_count_kernel<IsOuter> add an
immediate kernel launch error check by calling
CUDF_CUDA_TRY(cudaGetLastError()); right after the <<<...>>> invocation, and add
the required include <cudf/utilities/error.hpp> at the top of the file so
CUDF_CUDA_TRY is available; this ensures any launch failures are caught and
reported immediately for the partitioned_count_kernel<IsOuter> invocation.

In `@cpp/src/join/hash_join/partitioned_join_retrieve.cu`:
- Around line 69-105: Validate the join_partition_context (ensure
context.left_table_context is non-null and that left_start_idx and left_end_idx
form a valid range within the left table and within match_counts size) before
dereferencing match_ctx or slicing; if the context is malformed (null
left_table_context, left_start_idx > left_end_idx, or either index outside the
left table row count or match_counts length) return the appropriate empty-result
(same as the current empty-partition/empty-build branches) or throw a clear
error, and only then proceed to use match_ctx,
cudf::slice(probe_partition_view), _match_counts (partition_counts),
validate_hash_join_probe, and
cudf::detail::row::equality::preprocessed_table::create.

In `@cpp/src/join/hash_join/ref_types.cuh`:
- Around line 12-16: Add a direct include of <utility> to ref_types.cuh to bring
in std::declval (used in the decltype expressions around the ref type
detection), removing the transitive dependency on dispatch.cuh; update the top
of ref_types.cuh (alongside the existing includes of "dispatch.cuh",
"hash_join_impl.cuh", and <cuco/static_multiset.cuh>) to include <utility> so
declval is available even if dispatch.cuh no longer exposes it.

In `@cpp/tests/join/join_tests.cpp`:
- Around line 3206-3490: Add two new partitioned-probe tests: one that uses a
sliced probe table/view (non-zero row offset) to exercise offset handling and
one that uses a larger partition size that spans multiple device blocks to
exercise kernel-boundary logic. Implement the sliced-probe test alongside
HashJoinPartitionedInnerJoin/LeftJoin/FullJoin by creating a table with an
offset slice for the probe side and calling hash_joiner.partitioned_*_join with
part_ctx.left_start_idx/left_end_idx that reference non-zero offsets; validate
by comparing against the corresponding full join
(inner_join/left_join/full_join) sorted/equivalenced results. Implement the
multi-block-partition test similar to HashJoinPartitionedWholeTable but
construct probe and build tables large enough (or set
left_start_idx/left_end_idx large) so the partitioned join produces indices
spanning multiple GPU blocks, then gather and compare to the full join result.
Use the existing helpers: cudf::hash_join,
partitioned_inner_join/left_join/full_join, finalize_partitioned_full_join, and
CUDF_TEST_EXPECT_TABLES_EQUIVALENT to validate behavior.

---

Nitpick comments:
In `@cpp/src/join/hash_join/finalize_partitioned_full_join.cpp`:
- Around line 19-30: The public wrapper function
hash_join::finalize_partitioned_full_join should open an NVTX range: add a
CUDF_FUNC_RANGE() invocation as the first statement inside
finalize_partitioned_full_join (before calling cudf::detail::finalize_full_join)
so the public API is instrumented; ensure the macro is placed in the body of
finalize_partitioned_full_join around the delegation to finalize_full_join.

In `@cpp/src/join/hash_join/hash_join.cu`:
- Around line 262-287: The three public shim functions partitioned_inner_join,
partitioned_left_join, and partitioned_full_join should open an NVTX range
before delegating to _impl; add a CUDF_FUNC_RANGE() call as the first statement
in each function (i.e., inside hash_join::partitioned_inner_join,
hash_join::partitioned_left_join, and hash_join::partitioned_full_join) and then
return _impl->... as now so the public entry points mirror other cpp/src/*
public functions.

In `@cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh`:
- Around line 15-16: Replace usage of rmm::device_scalar with
cudf::detail::device_scalar for the temporary counter: update the includes and
any declarations where rmm::device_scalar<T> is used (e.g., the include/usage
around device_scalar in partitioned_retrieve_kernels.cuh and the instance
referenced near line 243) to cudf::detail::device_scalar<T>, and adjust or add
the appropriate cudf header if needed so the symbol resolves; ensure any
constructor and .value() calls remain compatible with
cudf::detail::device_scalar.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 266e6c51-490c-4ecc-b78f-88e91fe12a8c

📥 Commits

Reviewing files that changed from the base of the PR and between 0a1620e and 185f5b1.

📒 Files selected for processing (28)
  • cpp/CMakeLists.txt
  • cpp/benchmarks/join/join.cu
  • cpp/include/cudf/detail/join/hash_join.hpp
  • cpp/include/cudf/join/hash_join.hpp
  • cpp/include/cudf/join/join.hpp
  • cpp/src/join/conditional_join.cu
  • cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
  • cpp/src/join/hash_join/hash_join.cu
  • cpp/src/join/hash_join/kernels_common.cuh
  • cpp/src/join/hash_join/match_context.cu
  • cpp/src/join/hash_join/partitioned_count.cu
  • cpp/src/join/hash_join/partitioned_count_kernels.cuh
  • cpp/src/join/hash_join/partitioned_count_kernels.hpp
  • cpp/src/join/hash_join/partitioned_count_outer.cu
  • cpp/src/join/hash_join/partitioned_full_join.cu
  • cpp/src/join/hash_join/partitioned_inner_join.cu
  • cpp/src/join/hash_join/partitioned_join_retrieve.cu
  • cpp/src/join/hash_join/partitioned_left_join.cu
  • cpp/src/join/hash_join/partitioned_retrieve.cu
  • cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
  • cpp/src/join/hash_join/partitioned_retrieve_kernels.hpp
  • cpp/src/join/hash_join/partitioned_retrieve_outer.cu
  • cpp/src/join/hash_join/ref_types.cuh
  • cpp/src/join/hash_join/retrieve_impl.cuh
  • cpp/src/join/join_common_utils.hpp
  • cpp/src/join/join_utils.cu
  • cpp/src/join/mixed_join.cu
  • cpp/tests/join/join_tests.cpp

Comment thread cpp/include/cudf/join/hash_join.hpp
Comment thread cpp/src/join/hash_join/partitioned_count_kernels.cuh
Comment thread cpp/src/join/hash_join/partitioned_join_retrieve.cu
Comment thread cpp/src/join/hash_join/ref_types.cuh
Comment thread cpp/tests/join/join_tests.cpp
- Validate join_partition_context bounds in partitioned_join_retrieve
- Document throw conditions on new partitioned_*_join APIs
- Add CUDF_FUNC_RANGE to new partitioned shims and finalize_*_full_join
- Use cudf::detail::device_scalar and direct <utility> include
- Add sliced + multi-block partitioned probe test
@coderabbitai Bot left a comment

🧹 Nitpick comments (1)
cpp/src/join/hash_join/partitioned_join_retrieve.cu (1)

124-131: 💤 Low value

Specify memory resource for temporary left_keys allocation.

Per coding guidelines, temporary memory should explicitly use cudf::get_current_device_resource_ref() rather than relying on the implicit default.

Suggested fix
-    rmm::device_uvector<probe_key_type> left_keys(n, stream);
+    rmm::device_uvector<probe_key_type> left_keys(n, stream, cudf::get_current_device_resource_ref());

As per coding guidelines: "Allocate temporary memory using cudf::get_current_device_resource_ref(), not the passed-in MR parameter"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/src/join/hash_join/partitioned_join_retrieve.cu` around lines 124 - 131,
The temporary device_uvector left_keys in the retrieve_partition lambda is being
allocated without an explicit memory resource; change the allocation of
left_keys (rmm::device_uvector<probe_key_type> left_keys(...)) to pass
cudf::get_current_device_resource_ref() as the MR argument so the temporary uses
the current device resource instead of the implicit default; update the
rmm::device_uvector constructor call inside retrieve_partition accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@cpp/src/join/hash_join/partitioned_join_retrieve.cu`:
- Around line 124-131: The temporary device_uvector left_keys in the
retrieve_partition lambda is being allocated without an explicit memory
resource; change the allocation of left_keys
(rmm::device_uvector<probe_key_type> left_keys(...)) to pass
cudf::get_current_device_resource_ref() as the MR argument so the temporary uses
the current device resource instead of the implicit default; update the
rmm::device_uvector constructor call inside retrieve_partition accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f736ad3-e0fa-412b-85a5-042a49fda753

📥 Commits

Reviewing files that changed from the base of the PR and between 5c24954 and be8c452.

📒 Files selected for processing (7)
  • cpp/include/cudf/join/hash_join.hpp
  • cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
  • cpp/src/join/hash_join/hash_join.cu
  • cpp/src/join/hash_join/partitioned_join_retrieve.cu
  • cpp/src/join/hash_join/partitioned_retrieve_kernels.cuh
  • cpp/src/join/hash_join/ref_types.cuh
  • cpp/tests/join/join_tests.cpp
✅ Files skipped from review due to trivial changes (2)
  • cpp/src/join/hash_join/finalize_partitioned_full_join.cpp
  • cpp/src/join/hash_join/hash_join.cu
🚧 Files skipped from review as they are similar to previous changes (3)
  • cpp/include/cudf/join/hash_join.hpp
  • cpp/tests/join/join_tests.cpp
  • cpp/src/join/hash_join/ref_types.cuh

@PointKernel (Member Author)

/ok to test 0a266a8


Labels

CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Spark (Functionality that helps Spark RAPIDS)

Projects

Status: Burndown

Development

Successfully merging this pull request may close these issues.

[FEA] Add support for chunked probe in hash joins

6 participants