Switch to public RAFT numpy_serialize APIs#2038
Switch to public RAFT numpy_serialize APIs#2038julianmi wants to merge 2 commits intorapidsai:mainfrom
Conversation
📝 WalkthroughSummary by CodeRabbit
WalkthroughMultiple C++ source and header files are updated to use the public Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~15 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@c/src/neighbors/brute_force.cpp`:
- Around line 242-244: The code reads 4 bytes into dtype_string using is.read
and immediately calls raft::numpy_serializer::parse_descr on those bytes;
however it doesn't check whether the read succeeded. Update the block around
dtype_string/is.read to verify the stream read (e.g., check is.gcount() == 4 or
is.fail()/is.good()) before calling raft::numpy_serializer::parse_descr, and on
failure handle the truncated header by returning an error/throwing an exception
or logging and aborting the parse so parse_descr never receives uninitialized
data.
In `@c/src/neighbors/cagra.cpp`:
- Around line 877-880: The code reads 4 bytes into dtype_string then calls
raft::numpy_serializer::parse_descr without validating the read; first check the
read succeeded (e.g., verify is.read(...) and that is.gcount() == 4 or that the
stream is in a good state) before constructing std::string(dtype_string, 4) and
calling raft::numpy_serializer::parse_descr; if the read fails/returns fewer
than 4 bytes, handle the error (throw, return an error code, or log and abort)
instead of passing potentially truncated data to parse_descr.
In `@c/src/neighbors/ivf_flat.cpp`:
- Around line 303-306: The code reads four bytes into dtype_string then calls
raft::numpy_serializer::parse_descr without verifying the read succeeded;
validate the read length immediately after is.read(dtype_string, 4) (e.g., check
is.gcount() == 4 or test stream state like if (!is || is.gcount() != 4)) and on
short reads/failure throw or return a clear error (or set an error status)
before calling raft::numpy_serializer::parse_descr, so malformed or truncated
files are handled safely.
In `@c/src/neighbors/mg_cagra.cpp`:
- Around line 403-406: The 4-byte dtype header read into dtype_string before
calling raft::numpy_serializer::parse_descr is not validated; check the istream
read result (e.g., is.read(...) and then is.gcount() == 4 or !is.fail()) and
handle short/truncated reads by logging/throwing/returning an error instead of
calling parse_descr with partial data; apply the same validation to the other
occurrence that reads 4 bytes so neither is.read -> parse_descr path can receive
garbage.
In `@c/src/neighbors/mg_ivf_flat.cpp`:
- Around line 400-403: The code reads 4 bytes into dtype_string and calls
raft::numpy_serializer::parse_descr without validating the read; update both the
deserialize and distribute code paths to check is.read(...) and the stream state
(e.g., verify gcount() == 4 or is.good()/is) before constructing
std::string(dtype_string,4) and calling parse_descr, and handle truncated reads
by returning/throwing an error or reporting via existing error path; reference
the dtype_string buffer, the is.read call, and the parse_descr invocation so you
change both occurrences (around the calls near deserialize and distribute).
In `@c/src/neighbors/mg_ivf_pq.cpp`:
- Around line 392-395: The code calls is.read into dtype_string and immediately
hands those bytes to raft::numpy_serializer::parse_descr without checking the
read result; update the logic around dtype_string/is.read to validate that 4
bytes were actually read (e.g., check is.gcount() == 4 and/or
is.good()/is.fail()) before calling parse_descr, and handle the short-read error
path (return an error, throw, or log and exit) so parse_descr is never called
with incomplete data; reference the dtype_string buffer, the is.read call, and
raft::numpy_serializer::parse_descr when making the change and ensure is.close()
still runs in all code paths.
In `@cpp/cmake/thirdparty/get_raft.cmake`:
- Around line 9-10: The CMake variables RAFT_FORK and RAFT_PINNED_TAG are
hard-pinned to a personal fork/branch; change the defaults to use the official
upstream (e.g., "rapidsai/raft") and a safe default tag (empty or a release
tag), and expose RAFT_FORK and RAFT_PINNED_TAG as CMake CACHE variables so they
can only be overridden explicitly via -D on the command line; additionally gate
any non-upstream usage behind an opt-in flag (e.g., USE_CUSTOM_RAFT or
RAFT_USE_FORK) so CI/releases use the official repo by default and personal-fork
overrides are explicitly documented and temporary.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 393379db-9f11-4936-953c-40a2df4cec4c
📒 Files selected for processing (15)
c/src/neighbors/brute_force.cppc/src/neighbors/cagra.cppc/src/neighbors/ivf_flat.cppc/src/neighbors/mg_cagra.cppc/src/neighbors/mg_ivf_flat.cppc/src/neighbors/mg_ivf_pq.cppcpp/cmake/thirdparty/get_raft.cmakecpp/include/cuvs/neighbors/cagra.hppcpp/include/cuvs/util/file_io.hppcpp/src/neighbors/brute_force_serialize.cucpp/src/neighbors/detail/cagra/cagra_build.cuhcpp/src/neighbors/detail/cagra/cagra_serialize.cuhcpp/src/neighbors/detail/hnsw.hppcpp/src/neighbors/ivf_flat/ivf_flat_serialize.cuhcpp/src/neighbors/mg/snmg.cuh
| char dtype_string[4]; | ||
| is.read(dtype_string, 4); | ||
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); |
There was a problem hiding this comment.
Guard against truncated header reads before parsing dtype.
Line 243 reads 4 bytes but never checks success. On short/corrupt files, Line 244 may parse uninitialized bytes.
💡 Proposed fix
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4]{};
+ if (!is.read(dtype_string, sizeof(dtype_string))) {
+ RAFT_FAIL("Invalid or truncated index header in file %s", filename);
+ }
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/brute_force.cpp` around lines 242 - 244, The code reads 4
bytes into dtype_string using is.read and immediately calls
raft::numpy_serializer::parse_descr on those bytes; however it doesn't check
whether the read succeeded. Update the block around dtype_string/is.read to
verify the stream read (e.g., check is.gcount() == 4 or is.fail()/is.good())
before calling raft::numpy_serializer::parse_descr, and on failure handle the
truncated header by returning an error/throwing an exception or logging and
aborting the parse so parse_descr never receives uninitialized data.
| char dtype_string[4]; | ||
| is.read(dtype_string, 4); | ||
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
|
|
There was a problem hiding this comment.
Guard the dtype prefix read before parse_descr.
Line 878 reads 4 bytes but does not validate read length before parsing on Line 879. Corrupt/truncated files can produce invalid dtype decoding.
Proposed fix
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4] = {};
+ is.read(dtype_string, sizeof(dtype_string));
+ RAFT_EXPECTS(is.gcount() == static_cast<std::streamsize>(sizeof(dtype_string)),
+ "Failed to read dtype header from %s",
+ filename);
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| char dtype_string[4]; | |
| is.read(dtype_string, 4); | |
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | |
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | |
| char dtype_string[4] = {}; | |
| is.read(dtype_string, sizeof(dtype_string)); | |
| RAFT_EXPECTS(is.gcount() == static_cast<std::streamsize>(sizeof(dtype_string)), | |
| "Failed to read dtype header from %s", | |
| filename); | |
| auto dtype = | |
| raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string))); |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/cagra.cpp` around lines 877 - 880, The code reads 4 bytes
into dtype_string then calls raft::numpy_serializer::parse_descr without
validating the read; first check the read succeeded (e.g., verify is.read(...)
and that is.gcount() == 4 or that the stream is in a good state) before
constructing std::string(dtype_string, 4) and calling
raft::numpy_serializer::parse_descr; if the read fails/returns fewer than 4
bytes, handle the error (throw, return an error code, or log and abort) instead
of passing potentially truncated data to parse_descr.
| char dtype_string[4]; | ||
| is.read(dtype_string, 4); | ||
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
|
|
There was a problem hiding this comment.
Fail early on short dtype-header reads.
Line 304 reads the dtype prefix without checking byte count before parsing on Line 305. This should be validated to handle malformed files safely.
Proposed fix
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4] = {};
+ is.read(dtype_string, sizeof(dtype_string));
+ RAFT_EXPECTS(is.gcount() == static_cast<std::streamsize>(sizeof(dtype_string)),
+ "Failed to read dtype header from %s",
+ filename);
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/ivf_flat.cpp` around lines 303 - 306, The code reads four
bytes into dtype_string then calls raft::numpy_serializer::parse_descr without
verifying the read succeeded; validate the read length immediately after
is.read(dtype_string, 4) (e.g., check is.gcount() == 4 or test stream state like
if (!is || is.gcount() != 4)) and on short reads/failure throw or return a clear
error (or set an error status) before calling
raft::numpy_serializer::parse_descr, so malformed or truncated files are handled
safely.
| char dtype_string[4]; | ||
| is.read(dtype_string, 4); | ||
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| is.close(); |
There was a problem hiding this comment.
Validate dtype header reads before parsing.
On Line 404 and Line 435, the 4-byte read is not validated. Truncated files can pass partial/garbage bytes into parse_descr, causing invalid dtype dispatch.
Proposed fix
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4] = {};
+ is.read(dtype_string, sizeof(dtype_string));
+ RAFT_EXPECTS(is.gcount() == static_cast<std::streamsize>(sizeof(dtype_string)),
+ "Failed to read dtype header from %s",
+ filename);
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));
@@
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4] = {};
+ is.read(dtype_string, sizeof(dtype_string));
+ RAFT_EXPECTS(is.gcount() == static_cast<std::streamsize>(sizeof(dtype_string)),
+ "Failed to read dtype header from %s",
+ filename);
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));Also applies to: 434-437
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/mg_cagra.cpp` around lines 403 - 406, The 4-byte dtype header
read into dtype_string before calling raft::numpy_serializer::parse_descr is not
validated; check the istream read result (e.g., is.read(...) and then
is.gcount() == 4 or !is.fail()) and handle short/truncated reads by
logging/throwing/returning an error instead of calling parse_descr with partial
data; apply the same validation to the other occurrence that reads 4 bytes so
neither is.read -> parse_descr path can receive garbage.
| char dtype_string[4]; | ||
| is.read(dtype_string, 4); | ||
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| is.close(); |
There was a problem hiding this comment.
Validate header read length before calling parse_descr in both paths.
Both blocks parse the dtype descriptor immediately after read(...) without checking stream state. Truncated files can lead to undefined behavior due to partially initialized buffers.
💡 Proposed fix (apply in both deserialize and distribute)
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4]{};
+ if (!is.read(dtype_string, sizeof(dtype_string))) {
+ RAFT_FAIL("Invalid or truncated index header in file %s", filename);
+ }
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));Also applies to: 431-434
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/mg_ivf_flat.cpp` around lines 400 - 403, The code reads 4
bytes into dtype_string and calls raft::numpy_serializer::parse_descr without
validating the read; update both the deserialize and distribute code paths to
check is.read(...) and the stream state (e.g., verify gcount() == 4 or
is.good()/is) before constructing std::string(dtype_string,4) and calling
parse_descr, and handle truncated reads by returning/throwing an error or
reporting via existing error path; reference the dtype_string buffer, the
is.read call, and the parse_descr invocation so you change both occurrences
(around the calls near deserialize and distribute).
| char dtype_string[4]; | ||
| is.read(dtype_string, 4); | ||
| auto dtype = raft::detail::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4)); | ||
| is.close(); |
There was a problem hiding this comment.
Validate read() result before dtype parsing.
Line 393 does not verify that 4 bytes were actually read. A short read can propagate invalid bytes into parse_descr on Line 394.
Proposed fix
- char dtype_string[4];
- is.read(dtype_string, 4);
- auto dtype = raft::numpy_serializer::parse_descr(std::string(dtype_string, 4));
+ char dtype_string[4] = {};
+ is.read(dtype_string, sizeof(dtype_string));
+ RAFT_EXPECTS(is.gcount() == static_cast<std::streamsize>(sizeof(dtype_string)),
+ "Failed to read dtype header from %s",
+ filename);
+ auto dtype =
+ raft::numpy_serializer::parse_descr(std::string(dtype_string, sizeof(dtype_string)));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@c/src/neighbors/mg_ivf_pq.cpp` around lines 392 - 395, The code calls is.read
into dtype_string and immediately hands those bytes to
raft::numpy_serializer::parse_descr without checking the read result; update the
logic around dtype_string/is.read to validate that 4 bytes were actually read
(e.g., check is.gcount() == 4 and/or is.good()/is.fail()) before calling
parse_descr, and handle the short-read error path (return an error, throw, or
log and exit) so parse_descr is never called with incomplete data; reference the
dtype_string buffer, the is.read call, and raft::numpy_serializer::parse_descr
when making the change and ensure is.close() still runs in all code paths.
| set(RAFT_FORK "julianmi") | ||
| set(RAFT_PINNED_TAG "expose-public-npy-helpers") |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "Checking upstream dependency status (rapidsai/raft PR `#3003`)..."
curl -s https://api.github.com/repos/rapidsai/raft/pulls/3003 \
| jq '{state, merged, merge_commit_sha, head_repo: .head.repo.full_name, head_ref: .head.ref, base_ref: .base.ref}'
echo
echo "Checking whether expose-public-npy-helpers exists in official and fork repos..."
git ls-remote --heads https://github.com/rapidsai/raft.git expose-public-npy-helpers || true
git ls-remote --tags https://github.com/rapidsai/raft.git expose-public-npy-helpers || true
git ls-remote --heads https://github.com/julianmi/raft.git expose-public-npy-helpers || true
git ls-remote --tags https://github.com/julianmi/raft.git expose-public-npy-helpers || trueRepository: rapidsai/cuvs
Length of output: 407
🏁 Script executed:
cat -n cpp/cmake/thirdparty/get_raft.cmakeRepository: rapidsai/cuvs
Length of output: 3417
🏁 Script executed:
# Search for any comments or related discussions about this fork usage
rg "expose-public-npy-helpers|RAFT_FORK|julianmi" --type cmake --type markdown -C 3Repository: rapidsai/cuvs
Length of output: 9046
🏁 Script executed:
# Check git history for when this fork was introduced and any commit messages
git log --oneline -20 -- cpp/cmake/thirdparty/get_raft.cmakeRepository: rapidsai/cuvs
Length of output: 109
🏁 Script executed:
# Search for any open or recent PRs that might explain the fork usage
git log --all --oneline --grep="raft\|RAFT" | head -30Repository: rapidsai/cuvs
Length of output: 1759
Avoid hard-pinning production builds to a personal RAFT fork/ref.
This introduces a supply-chain risk for CI and releases if that fork/ref is rewritten or removed. The expose-public-npy-helpers branch currently exists in julianmi/raft, but pinning to a personal fork without a documented integration path or timeline creates long-term dependency fragility. Prefer official rapidsai/raft by default, and gate fork/ref overrides behind explicit opt-in CMake cache vars for temporary cross-PR testing.
Suggested hardening
-set(RAFT_FORK "julianmi")
-set(RAFT_PINNED_TAG "expose-public-npy-helpers")
+set(RAFT_FORK "rapidsai")
+set(RAFT_PINNED_TAG "${rapids-cmake-checkout-tag}")
+
+set(CUVS_RAFT_FORK_OVERRIDE "" CACHE STRING "Optional RAFT fork override for temporary testing")
+set(CUVS_RAFT_TAG_OVERRIDE "" CACHE STRING "Optional RAFT tag/branch override for temporary testing")
+
+if(CUVS_RAFT_FORK_OVERRIDE)
+ set(RAFT_FORK "${CUVS_RAFT_FORK_OVERRIDE}")
+endif()
+if(CUVS_RAFT_TAG_OVERRIDE)
+ set(RAFT_PINNED_TAG "${CUVS_RAFT_TAG_OVERRIDE}")
+endif()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/cmake/thirdparty/get_raft.cmake` around lines 9 - 10, The CMake variables
RAFT_FORK and RAFT_PINNED_TAG are hard-pinned to a personal fork/branch; change
the defaults to use the official upstream (e.g., "rapidsai/raft") and a safe
default tag (empty or a release tag), and expose RAFT_FORK and RAFT_PINNED_TAG
as CMake CACHE variables so they can only be overridden explicitly via -D on the
command line; additionally gate any non-upstream usage behind an opt-in flag
(e.g., USE_CUSTOM_RAFT or RAFT_USE_FORK) so CI/releases use the official repo by
default and personal-fork overrides are explicitly documented and temporary.
cuVS uses
parse_descr,read_header,get_numpy_dtype,write_headerfromraft::detail::numpy_serializeras discussed here. This PR proposes to expose them publicly in RAFT instead and use them throughout cuVS.There are similar issues with the
raft/core/detail/macros.hppto be address in a separate PR.Depends on rapidsai/raft#3003