Scripts for Billion-Scale Synthetic Data Generation by jinsolp · Pull Request #2247 · rapidsai/cuvs

jinsolp · 2026-06-17T00:38:00Z

Adding billion-scale synthetic data generation scripts to cuvs_bench.

New Files in synthesize_dataset/:

__main__.py: fit/generate/verify CLI.
_fit.py: KMeans + per-cluster PCA fitting logic.
_fingerprint.py: Fingerprint class.
_generate.py: per-cluster data generation logic.
_ground_truth.py: exact (streaming) and nprobe GT computation.
_verify.py: compares nprobe GT vs exact GT to validate nprobes.
_io.py — dataset loading (support for npz and pkl files) + fingerprint NPZ save/load logic.
README.md + figures/: full workflow guide and synth-vs-real DiskANN validation on Falcon/BigANN/Wiki.

For Reviewers: It would be easier to read through the README.md and review code for each step in that order (starting with __main__.py.

copy-pr-bot · 2026-06-17T00:38:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

jinsolp · 2026-06-17T00:38:34Z

/ok to test

copy-pr-bot · 2026-06-17T00:38:37Z

/ok to test

@jinsolp, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

jinsolp · 2026-06-17T00:39:13Z

/ok to test 8fcb3b2

coderabbitai · 2026-06-17T00:48:07Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

New Features
- Added synthesize_dataset module with CLI tool for generating synthetic vector datasets and ground truth at scale for benchmarking
- Three-command interface: fit (learn dataset fingerprint), generate (create synthetic data/queries/ground truth), and verify (check accuracy)
- Support for random jitter query generation mode
- Ground truth computation in exact and approximate modes
Documentation
- Added comprehensive README with workflow, examples, and API reference for synthetic dataset generation

Walkthrough

Introduces a new cuvs_bench.synthesize_dataset package for generating GPU-accelerated billion-scale synthetic ANN benchmark datasets from fitted cluster fingerprints, including exact and nprobe ground-truth computation, a three-subcommand CLI, and full documentation. Also extends generate_groundtruth with a random-jitter query generation mode backed by new shared utility functions.

Changes

Billion-Scale Synthetic Dataset Generator and Jitter Query Support

Layer / File(s)	Summary
Shared jitter utilities and generate_groundtruth wiring `python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py`, `python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`	Adds `is_l2_normalized` and `add_jitter` to `utils.py`, then imports them into `__main__.py` to introduce `choose_random_queries_with_jitter`, extend the `--queries` CLI argument, and branch on the new `random-jitter` mode.
Fingerprint dataclass and package exports `python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`, `python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py`	Defines the `Fingerprint` dataclass storing all clustering/PCA fields with `__post_init__` density normalization, and re-exports all public symbols via `__all__` in the package `__init__`.
Fingerprint and dataset I/O `python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`	Implements `load_dataset` (multi-format with optional truncation), `save_fingerprint` (NPZ with empty-array diagonal-fallback markers), and `load_fingerprint` (NPZ deserialization reconstructing `None` PCA entries and returning a `Fingerprint`).
Cluster fingerprint fitting pipeline `python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`	Adds `fit_cluster_stats` running GPU KMeans and per-cluster PCA (with diagonal fallback for small clusters), computing per-cluster density, variance, and residual noise variance, backed by `_run_kmeans` and `_fit_cluster_pca` helpers.
GPU synthetic data and query generation `python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`	Adds `gen_cluster_gpu` (PCA-correlated Gaussian sampling with noise, variance rescaling, optional normalization), density-proportional cluster point allocation, in-memory and async double-buffered streaming `.fbin` generation, and jitter-based query sampling.
Ground truth computation and verification `python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`, `python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py`	Implements GPU brute-force k-NN helpers, `compute_groundtruth_exact` (streaming per-cluster k-NN merge), `compute_groundtruth_nprobe` (probe-limited batched k-NN), and `verify_groundtruth` (nprobe vs exact recall comparison with timing).
CLI entry point and README `python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`, `python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md`	Defines the `fit`, `generate`, and `verify` subcommand CLI with argument parsing, pipeline dispatch, default nprobes derivation, and output path conventions; adds the full README covering workflow steps, fingerprint schema, YAML registration, pre-fit experiments, caveats, glossary, and Python API list.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

improvement, non-breaking, benchmarking, doc

Suggested reviewers

jrbourbeau
cjnolet
lowener

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 61.76% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title clearly describes the primary change: adding scripts for billion-scale synthetic data generation to cuvs_bench.
Description check	✅ Passed	Description is well-related to the changeset, explaining the new synthesize_dataset modules, files added, and workflow.
Linked Issues check	✅ Passed	The PR successfully implements the objective from issue `#2208` to open-source the BSDG as part of cuvs_bench with complete fit/generate/verify workflow.
Out of Scope Changes check	✅ Passed	All changes are directly aligned with the goal of adding billion-scale synthetic data generation to cuvs_bench; no out-of-scope modifications detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`:
- Around line 116-133: The function `choose_random_queries_with_jitter` always
returns float32 jittered query data (via the return statement calling
`add_jitter` on the float32-casted `sampled` array), but the code that uses this
function's output (around line 389) determines the filename suffix using the
original `dataset.dtype` instead of the actual output dtype. This causes float32
data to be written to filenames with the wrong type suffix (e.g., `.u8bin` for
uint8 inputs), leading to incorrect decoding later. Update the code that
generates the output filename suffix to use float32 as the dtype instead of
`dataset.dtype`, since the jittered queries are always float32 regardless of the
input dataset type.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`:
- Around line 343-351: Add centralized validation logic in the main function
after the parser.parse_args(argv) call and before dispatching to the command
handlers (_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or
inline checks to verify that numeric arguments including total_rows, n_queries,
k, nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`:
- Around line 67-71: In the __post_init__ method, before normalizing
cluster_densities, add comprehensive validation to check not only that the sum
is positive but also that all individual density values are non-negative
(greater than or equal to zero), contain no NaN or infinite values, and match
the expected shape or dimensions as documented in the class contract. These
checks should occur after computing the total sum but before performing the
normalization division to prevent silent data corruption in downstream
allocation operations.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`:
- Around line 64-75: The function `_fit_cluster_pca` creates GPU memory
allocations with `residuals_gpu` and `out` but doesn't explicitly free them,
causing GPU memory to accumulate when called repeatedly in a loop (once per
cluster). After converting the results to NumPy arrays using `cp.asnumpy()`,
explicitly delete the GPU objects `residuals_gpu` and `out` using the `del`
statement, and optionally call
`cp.cuda.stream.get_current_stream().synchronize()` to ensure GPU memory is
properly released before the function returns.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`:
- Around line 219-230: The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`:
- Around line 54-60: The compute_groundtruth_exact function and related
functions (around lines 119-126) lack explicit bounds validation for the
parameters k, total_rows, and nprobes. When k exceeds total_rows or other
invalid values are passed, the functions prefill outputs with sentinel values
(-1) which then propagate to recall computations downstream. Add validation
checks at the beginning of compute_groundtruth_exact and the other affected
function to verify that k is less than or equal to total_rows, total_rows is
positive, and nprobes is within valid bounds. When validation fails, raise clear
exceptions that include both the expected constraints and the actual invalid
values provided, following the pattern of providing actionable error messages
with expected-vs-actual comparisons.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`:
- Around line 73-93: Add input validation at the start of the load_dataset
function to ensure sample_size is positive when provided (sample_size must be
greater than 0 to prevent silent data corruption from Python's negative
indexing). Additionally, after loading data in each branch (the .npy block, .pkl
block, and memmap_bin_file block), validate that the resulting numpy array is 2D
and numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 06365399-0de0-4fab-a427-5f11f7d266be

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae6f93 and 8fcb3b2.

⛔ Files ignored due to path filters (4)

python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_bigann.png is excluded by !**/*.png
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_falcon.png is excluded by !**/*.png
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_wiki.png is excluded by !**/*.png
python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/pipeline.png is excluded by !**/*.png

📒 Files selected for processing (11)

python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py
python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md
python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py
python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py

coderabbitai · 2026-06-17T00:48:11Z

+def choose_random_queries_with_jitter(dataset, n_queries, seed=12345):
+    """Pick ``n_queries`` random rows from ``dataset``, add Gaussian jitter at
+    scale ``0.1 * std(sample)``, and re-normalize to unit norm iff the
+    original dataset rows already are.
+    """
+    import numpy as _np
+
+    print("Choosing random vectors from dataset and jittering with noise")
+    rng = _np.random.default_rng(seed)
+    n_rows = dataset.shape[0]
+    # Sort indices so the memmap read is sequential rather than random-access.
+    query_idx = _np.sort(rng.choice(n_rows, size=n_queries, replace=False))
+    sampled = dataset[query_idx, :].astype(_np.float32, copy=True)
+
+    normalize = is_l2_normalized(sampled)
+
+    return add_jitter(sampled, rng, normalize)
+


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: random-jitter writes float32 query data under dataset-typed filename suffix.

Line 128/Line 132 produce float32 jittered queries, but Line 389 still picks suffix from dataset.dtype. For non-float datasets, this can write float32 payloads into .u8bin/.i8bin filenames, which later decode with the wrong dtype.

Proposed fix

- queries_filename = os.path.join( - args.output, "queries" + suffix_from_dtype(dtype) - ) + query_out_dtype = ( + queries.dtype if args.queries == "random-jitter" else dtype + ) + queries_filename = os.path.join( + args.output, "queries" + suffix_from_dtype(query_out_dtype) + )

As per coding guidelines, “Prevent silent data corruption from type coercion and validate that array type coercions are handled safely.”

Also applies to: 388-390

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py` around lines 116 - 133, The function `choose_random_queries_with_jitter` always returns float32 jittered query data (via the return statement calling `add_jitter` on the float32-casted `sampled` array), but the code that uses this function's output (around line 389) determines the filename suffix using the original `dataset.dtype` instead of the actual output dtype. This causes float32 data to be written to filenames with the wrong type suffix (e.g., `.u8bin` for uint8 inputs), leading to incorrect decoding later. Update the code that generates the output filename suffix to use float32 as the dtype instead of `dataset.dtype`, since the jittered queries are always float32 regardless of the input dataset type.

Source: Coding guidelines

coderabbitai · 2026-06-17T00:48:11Z

+    args = parser.parse_args(argv)
+    if args.command == "fit":
+        return _cmd_fit(args)
+    if args.command == "generate":
+        return _cmd_generate(args)
+    if args.command == "verify":
+        return _cmd_verify(args)
+    parser.print_help()
+    return 1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Add centralized CLI bounds validation before dispatch.

Several numeric arguments are unbounded (total_rows, n_queries, k, nprobes, sample_size, n_clusters, pca_components). Invalid values can crash downstream (or produce invalid GT silently), and nprobes is not enforced against config.nclusters despite the documented contract.

Proposed guardrail patch

def main(argv: list[str] | None = None) -> int: @@ args = parser.parse_args(argv) + + def _fail(msg: str) -> int: + parser.error(msg) + return 2 + + if args.command == "fit": + if args.sample_size is not None and args.sample_size <= 0: + return _fail( + f"--sample_size must be > 0 when provided (got {args.sample_size})." + ) + if args.n_clusters <= 0: + return _fail(f"--n_clusters must be > 0 (got {args.n_clusters}).") + if args.pca_components <= 0: + return _fail( + f"--pca_components must be > 0 (got {args.pca_components})." + ) + + if args.command in ("generate", "verify"): + if args.total_rows <= 0: + return _fail(f"--total_rows must be > 0 (got {args.total_rows}).") + if args.n_queries <= 0: + return _fail(f"--n_queries must be > 0 (got {args.n_queries}).") + if args.n_queries > args.total_rows: + return _fail( + f"--n_queries ({args.n_queries}) must be <= --total_rows ({args.total_rows}) " + "for sampling without replacement." + ) + if args.k <= 0: + return _fail(f"--k must be > 0 (got {args.k}).") + if args.nprobes is not None and args.nprobes <= 0: + return _fail(f"--nprobes must be > 0 (got {args.nprobes}).") + if args.command == "fit": return _cmd_fit(args) if args.command == "generate": + if args.nprobes is not None: + cfg = load_fingerprint(args.stats, seed=args.seed) + if args.nprobes > cfg.nclusters: + return _fail( + f"--nprobes ({args.nprobes}) must be <= number of clusters " + f"({cfg.nclusters}) in fingerprint." + ) return _cmd_generate(args) if args.command == "verify": + if args.nprobes is not None: + cfg = load_fingerprint(args.stats, seed=args.seed) + if args.nprobes > cfg.nclusters: + return _fail( + f"--nprobes ({args.nprobes}) must be <= number of clusters " + f"({cfg.nclusters}) in fingerprint." + ) return _cmd_verify(args)

As per coding guidelines, “Ensure missing validation does not cause crashes on invalid input through proper size/type checks” and “Provide clear and actionable error messages that include expected vs actual values where helpful.”

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py` around lines 343 - 351, Add centralized validation logic in the main function after the parser.parse_args(argv) call and before dispatching to the command handlers (_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or inline checks to verify that numeric arguments including total_rows, n_queries, k, nprobes, sample_size, n_clusters, and pca_components have valid bounds. Additionally, ensure that nprobes is validated against config.nclusters to enforce the documented contract. If any argument fails validation, print a clear error message that includes the argument name, expected valid range, and actual value provided, then return 1 to exit early. This prevents invalid values from reaching downstream processing.

Source: Coding guidelines

coderabbitai · 2026-06-17T00:48:11Z

+    def __post_init__(self):
+        total = float(self.cluster_densities.sum())
+        if total <= 0.0:
+            raise ValueError("cluster_densities must sum to a positive value")
+        self.cluster_densities = self.cluster_densities / total


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Locate and examine the fingerprint file find . -name "_fingerprint.py" -type f | head -5

Repository: rapidsai/cuvs

Length of output: 123

🏁 Script executed:

# Check the full structure of the file and surrounding context cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py

Repository: rapidsai/cuvs

Length of output: 3252

🏁 Script executed:

# Find how cluster_densities is used downstream rg "cluster_densities" --type py -A 3 -B 3

Repository: rapidsai/cuvs

Length of output: 3727

🏁 Script executed:

# Check for existing tests of Fingerprint class find . -name "test_*.py" -o -name "*_test.py" | xargs rg "Fingerprint" -l

Repository: rapidsai/cuvs

Length of output: 41

🏁 Script executed:

# Look at the full context of where cluster_densities is used in _generate.py sed -n '1,100p' python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py | head -80

Repository: rapidsai/cuvs

Length of output: 2602

🏁 Script executed:

# Check if there are any guards around the multiplication by cluster_densities rg "cluster_densities" -B 5 -A 5 python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py

Repository: rapidsai/cuvs

Length of output: 451

🏁 Script executed:

# Check what data["densities"] is and how it's constructed rg "densities" -B 3 -A 3 python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py | head -40

Repository: rapidsai/cuvs

Length of output: 1006

🏁 Script executed:

# Check if Fingerprint is directly instantiated in tests find . -path "*/test*" -name "*.py" -exec rg "Fingerprint" {} +

Repository: rapidsai/cuvs

Length of output: 39

🏁 Script executed:

# Check the full function where cluster_densities is used sed -n '1,150p' python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py

Repository: rapidsai/cuvs

Length of output: 5070

🏁 Script executed:

# Verify what happens with negative or non-finite values in the multiplication python3 << 'PY' import numpy as np # Simulate what happens if cluster_densities contains problematic values remaining = 100 total_points = 1000 nclusters = 3 # Case 1: negative density (not caught by current validation) densities_neg = np.array([-0.1, 0.5, 0.6]) # sums to 1.0, passes current check total = float(densities_neg.sum()) print(f"Negative case: sum={total}, passes check: {total > 0}") extra = (densities_neg * remaining).astype(np.int64) print(f"Extra points: {extra} - negative allocation!\n") # Case 2: NaN/Inf (current check may fail) densities_nan = np.array([np.nan, 0.5, 0.5]) total = float(densities_nan.sum()) print(f"NaN case: sum={total}, passes check: {total > 0}") print(f"NaN comparison result: {total <= 0.0}\n") # Case 3: shape mismatch (not caught at all) densities_wrong_shape = np.array([[0.3, 0.3, 0.4]]) # shape (1, 3) instead of (3,) print(f"Shape: {densities_wrong_shape.shape}, normalized: {densities_wrong_shape / densities_wrong_shape.sum()}") PY

Repository: rapidsai/cuvs

Length of output: 264

HIGH: Validate density invariants before normalization.

Only checking sum > 0 allows negative, NaN, Inf, and shape-mismatched densities to pass validation. Negative densities produce silent data corruption in downstream allocation ((config.cluster_densities * remaining).astype(np.int64)); NaN/Inf values can slip through comparisons; and shape mismatches violate the documented contract.

Suggested fix

def __post_init__(self): + if self.cluster_densities.shape != (self.nclusters,): + raise ValueError( + f"cluster_densities shape mismatch: expected ({self.nclusters},), " + f"got {self.cluster_densities.shape}" + ) + if not np.issubdtype(self.cluster_densities.dtype, np.floating): + self.cluster_densities = self.cluster_densities.astype(np.float64) + if not np.isfinite(self.cluster_densities).all(): + raise ValueError("cluster_densities must contain only finite values") + if (self.cluster_densities < 0).any(): + raise ValueError("cluster_densities must be non-negative") total = float(self.cluster_densities.sum()) if total <= 0.0: raise ValueError("cluster_densities must sum to a positive value") self.cluster_densities = self.cluster_densities / total

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py` around lines 67 - 71, In the __post_init__ method, before normalizing cluster_densities, add comprehensive validation to check not only that the sum is positive but also that all individual density values are non-negative (greater than or equal to zero), contain no NaN or infinite values, and match the expected shape or dimensions as documented in the class contract. These checks should occur after computing the total sum but before performing the normalization division to prevent silent data corruption in downstream allocation operations.

Source: Coding guidelines

coderabbitai · 2026-06-17T00:48:11Z

+def _fit_cluster_pca(
+    residuals: np.ndarray, n_components: int
+) -> tuple[np.ndarray, np.ndarray]:
+    """Fit a single PCA via cuvs and return ``(components, explained_var)``."""
+    residuals_gpu = cp.asarray(residuals, dtype=cp.float32)
+    params = cuvs_pca.Params(n_components=n_components, copy=True)
+    out = cuvs_pca.fit(params, residuals_gpu)
+    components = cp.asnumpy(cp.asarray(out.components)).astype(np.float32)
+    explained_var = cp.asnumpy(cp.asarray(out.explained_var)).astype(
+        np.float32
+    )
+    return components, explained_var


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: GPU memory not freed in loop-called helper

_fit_cluster_pca is called once per cluster (potentially thousands) but doesn't free residuals_gpu or the out object before returning. GPU memory can accumulate between Python GC cycles.

Suggested fix

def _fit_cluster_pca( residuals: np.ndarray, n_components: int ) -> tuple[np.ndarray, np.ndarray]: """Fit a single PCA via cuvs and return ``(components, explained_var)``.""" residuals_gpu = cp.asarray(residuals, dtype=cp.float32) params = cuvs_pca.Params(n_components=n_components, copy=True) out = cuvs_pca.fit(params, residuals_gpu) components = cp.asnumpy(cp.asarray(out.components)).astype(np.float32) explained_var = cp.asnumpy(cp.asarray(out.explained_var)).astype( np.float32 ) + del residuals_gpu, out + cp.get_default_memory_pool().free_all_blocks() return components, explained_var

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py` around lines 64 - 75, The function `_fit_cluster_pca` creates GPU memory allocations with `residuals_gpu` and `out` but doesn't explicitly free them, causing GPU memory to accumulate when called repeatedly in a loop (once per cluster). After converting the results to NumPy arrays using `cp.asnumpy()`, explicitly delete the GPU objects `residuals_gpu` and `out` using the `del` statement, and optionally call `cp.cuda.stream.get_current_stream().synchronize()` to ensure GPU memory is properly released before the function returns.

Source: Coding guidelines

coderabbitai · 2026-06-17T00:48:11Z

+    def _wait_for_write() -> None:
+        nonlocal write_thread
+        if write_thread is not None:
+            write_thread.join()
+            write_thread = None
+
+    def _flush_async(f, buf_view: np.ndarray) -> None:
+        nonlocal write_thread
+        write_thread = threading.Thread(
+            target=lambda: buf_view.tofile(f), daemon=True
+        )
+        write_thread.start()


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Exceptions in async write thread are silently lost

If buf_view.tofile(f) fails (disk full, I/O error), the exception is swallowed by the daemon thread. _wait_for_write() only calls join() without checking for errors, so the code continues and may produce a corrupt/truncated file without any indication.

Have you considered capturing and re-raising exceptions from the write thread?

Suggested fix using exception capture

+ write_exception: BaseException | None = None + def _wait_for_write() -> None: - nonlocal write_thread + nonlocal write_thread, write_exception if write_thread is not None: write_thread.join() write_thread = None + if write_exception is not None: + exc = write_exception + write_exception = None + raise exc + + def _do_write(buf_view: np.ndarray, f) -> None: + nonlocal write_exception + try: + buf_view.tofile(f) + except BaseException as e: + write_exception = e def _flush_async(f, buf_view: np.ndarray) -> None: nonlocal write_thread write_thread = threading.Thread( - target=lambda: buf_view.tofile(f), daemon=True + target=lambda: _do_write(buf_view, f), daemon=True ) write_thread.start()

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py` around lines 219 - 230, The daemon thread in _flush_async can fail silently if buf_view.tofile(f) throws an exception (disk full, I/O error), and _wait_for_write() only calls join() without checking for errors, causing the code to continue with a potentially corrupt file. Add a nonlocal variable to capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a try-except block to catch and store any exception, then modify _wait_for_write() to check for captured exceptions after joining the thread and re-raise them if they exist.

Source: Coding guidelines

coderabbitai · 2026-06-17T00:48:11Z

+def compute_groundtruth_exact(
+    queries: np.ndarray,
+    total_rows: int,
+    config: Fingerprint,
+    k: int,
+    metric: str = "sqeuclidean",
+) -> Tuple[np.ndarray, np.ndarray, dict]:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Add explicit bounds validation for k, total_rows, and nprobes.

At Line 70 and Line 139, outputs are prefilled with -1; if k is out of bounds (for example, k > total_rows), invalid sentinel IDs can leak into final GT and then into recall computation (used in python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py, Line 76). Please fail fast with clear expected-vs-actual validation.

Proposed fix

def compute_groundtruth_exact( queries: np.ndarray, total_rows: int, config: Fingerprint, k: int, metric: str = "sqeuclidean", ) -> Tuple[np.ndarray, np.ndarray, dict]: + if total_rows <= 0: + raise ValueError( + f"total_rows must be > 0, got total_rows={total_rows}" + ) + if k <= 0 or k > total_rows: + raise ValueError( + f"k must be in [1, total_rows], got k={k}, total_rows={total_rows}" + ) + """Exact brute-force GT, computed by streaming every cluster. @@ def compute_groundtruth_nprobe( @@ ) -> Tuple[np.ndarray, np.ndarray, dict]: + if total_rows <= 0: + raise ValueError( + f"total_rows must be > 0, got total_rows={total_rows}" + ) + if k <= 0 or k > total_rows: + raise ValueError( + f"k must be in [1, total_rows], got k={k}, total_rows={total_rows}" + ) + if nprobes <= 0 or nprobes > config.nclusters: + raise ValueError( + "nprobes must be in [1, config.nclusters], " + f"got nprobes={nprobes}, config.nclusters={config.nclusters}" + ) + """Cheap GT via cluster probing: only the ``nprobes`` nearest clusters

As per coding guidelines, “Ensure missing validation does not cause crashes on invalid input through proper size/type checks” and “Provide clear and actionable error messages that include expected vs actual values where helpful.”

Also applies to: 119-126

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py` around lines 54 - 60, The compute_groundtruth_exact function and related functions (around lines 119-126) lack explicit bounds validation for the parameters k, total_rows, and nprobes. When k exceeds total_rows or other invalid values are passed, the functions prefill outputs with sentinel values (-1) which then propagate to recall computations downstream. Add validation checks at the beginning of compute_groundtruth_exact and the other affected function to verify that k is less than or equal to total_rows, total_rows is positive, and nprobes is within valid bounds. When validation fails, raise clear exceptions that include both the expected constraints and the actual invalid values provided, following the pattern of providing actionable error messages with expected-vs-actual comparisons.

Source: Coding guidelines

coderabbitai · 2026-06-17T00:48:11Z

+    if ext == ".npy":
+        data = np.load(path)
+        if sample_size is not None and sample_size < len(data):
+            data = data[:sample_size]
+        return np.ascontiguousarray(data.astype(np.float32))
+
+    if ext == ".pkl":
+        with open(path, "rb") as f:
+            data = pickle.load(f)
+        if not isinstance(data, np.ndarray):
+            data = np.array(data, dtype=np.float32)
+        if sample_size is not None and sample_size < len(data):
+            data = data[:sample_size]
+        return np.ascontiguousarray(data.astype(np.float32))
+
+    # Default: treat as fbin (covers ".fbin" and unknown extensions).
+    # memmap_bin_file auto-detects the legacy uint32 / extended uint64 header.
+    mm = memmap_bin_file(path, dtype, mode="r")
+    if sample_size is not None:
+        mm = mm[:sample_size]
+    return np.ascontiguousarray(mm)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the _io.py file fd -t f "_io.py" python/

Repository: rapidsai/cuvs

Length of output: 112

🏁 Script executed:

# Get the full context of the file around lines 73-93 wc -l python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py

Repository: rapidsai/cuvs

Length of output: 116

🏁 Script executed:

# Read the full function to understand context cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py | head -120

Repository: rapidsai/cuvs

Length of output: 5227

🏁 Script executed:

# Check memmap_bin_file to understand what it returns rg -A 20 "def memmap_bin_file" python/

Repository: rapidsai/cuvs

Length of output: 2220

🏁 Script executed:

# Find callers of load_dataset rg -B 2 -A 2 "load_dataset" python/ --type py

Repository: rapidsai/cuvs

Length of output: 2539

🏁 Script executed:

# Check if there are tests for load_dataset fd -t f "test_*_io.py" python/

Repository: rapidsai/cuvs

Length of output: 39

🏁 Script executed:

# Get more context on the _cmd_fit function to see how data is used cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py | sed -n '1,150p'

Repository: rapidsai/cuvs

Length of output: 5596

🏁 Script executed:

# Check if there are any validation calls on the data after load_dataset rg -A 10 "_cmd_fit" python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py

Repository: rapidsai/cuvs

Length of output: 688

🏁 Script executed:

# Look for the actual fit function to see what contract it expects rg -B 3 -A 15 "def fit_cluster_stats" python/

Repository: rapidsai/cuvs

Length of output: 1585

🏁 Script executed:

# Check what happens in fit_cluster_stats with invalid data shapes cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py | sed -n '80,150p'

Repository: rapidsai/cuvs

Length of output: 3245

🏁 Script executed:

# Check if sample_size validation happens elsewhere (e.g., argparse) rg -B 5 -A 5 "sample_size" python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py | head -40

Repository: rapidsai/cuvs

Length of output: 1412

🏁 Script executed:

# Test what happens with negative sample_size python3 << 'PYEOF' import numpy as np # Simulate negative sample_size behavior data = np.arange(100).reshape(10, 10) sample_size = -3 # What the current code does if sample_size is not None and sample_size < len(data): result = data[:sample_size] print(f"sample_size={sample_size}, len(data)={len(data)}") print(f"Condition (sample_size < len(data)): {sample_size < len(data)}") print(f"Result shape: {result.shape}") print(f"Result (last 3 rows removed): {result[-3:]}") PYEOF

Repository: rapidsai/cuvs

Length of output: 280

🏁 Script executed:

# Verify what happens with 1D array in fit_cluster_stats python3 << 'PYEOF' import numpy as np # Simulate 1D array error data_1d = np.arange(100).astype(np.float32) try: n_dim = data_1d.shape[1] except IndexError as e: print(f"1D array error: {type(e).__name__}: {e}") # Simulate object dtype data_obj = np.array(["a", "b", "c"], dtype=object) try: ascontiguous = np.ascontiguousarray(data_obj.astype(np.float32)) print("Object dtype converted successfully (unexpected)") except (ValueError, TypeError) as e: print(f"Object dtype error: {type(e).__name__}: {e}") PYEOF

Repository: rapidsai/cuvs

Length of output: 181

🏁 Script executed:

# Check the actual behavior of the suggested patch - when validation happens relative to slicing python3 << 'PYEOF' import numpy as np # Simulate the current _npy branch behavior vs proposed fix path = "/tmp/test.npy" sample_size = -5 # Create test data test_data = np.arange(100).reshape(10, 10).astype(np.float32) np.save(path, test_data) # Current code (without validation) data = np.load(path) print(f"Loaded shape: {data.shape}") if sample_size is not None and sample_size < len(data): data = data[:sample_size] # NEGATIVE SLICE HAPPENS HERE print(f"After slice with sample_size={sample_size}: shape={data.shape}") print(f"Data was silently corrupted! Expected first {abs(sample_size)} rows, got last {10-abs(sample_size)} rows") # Now check when the suggested validation would run print(f"\nValidation check would run AFTER slicing, so it cannot prevent the corruption") if sample_size is not None and sample_size <= 0: print(f"Would catch: sample_size={sample_size} is invalid") else: print(f"Validation check: passed (data already corrupted)") PYEOF

Repository: rapidsai/cuvs

Length of output: 313

HIGH: Validate dataset shape and sample_size before processing, with sample_size check before slicing.

load_dataset accepts invalid inputs that cause silent data corruption (negative sample_size silently loads the last N rows instead of the first N due to Python's negative indexing) and defers shape errors to fit_cluster_stats (which crashes on 1D arrays with IndexError). Validate sample_size sign before any slicing, and enforce 2D numeric arrays at load boundary with actionable errors.

Suggested fix

+def _validate_sample_size(sample_size: int | None) -> None: + if sample_size is not None and sample_size <= 0: + raise ValueError( + f"sample_size must be > 0, got {sample_size}" + ) + +def _validate_loaded_dataset(data: np.ndarray) -> None: + if not isinstance(data, np.ndarray): + raise TypeError(f"expected numpy.ndarray, got {type(data).__name__}") + if data.ndim != 2: + raise ValueError(f"expected 2D array (n_rows, n_cols), got ndim={data.ndim}") + if data.shape[1] <= 0: + raise ValueError(f"expected n_cols > 0, got shape={data.shape}") + if not np.issubdtype(data.dtype, np.number): + raise TypeError(f"expected numeric dtype, got {data.dtype}") def load_dataset( path: str, sample_size: int | None = None, dtype: np.dtype = np.float32, ) -> np.ndarray: + _validate_sample_size(sample_size) ext = os.path.splitext(path)[1].lower() if ext == ".npy": data = np.load(path) if sample_size is not None and sample_size < len(data): data = data[:sample_size] + _validate_loaded_dataset(data) return np.ascontiguousarray(data.astype(np.float32)) if ext == ".pkl": with open(path, "rb") as f: data = pickle.load(f) if not isinstance(data, np.ndarray): data = np.array(data, dtype=np.float32) if sample_size is not None and sample_size < len(data): data = data[:sample_size] + _validate_loaded_dataset(data) return np.ascontiguousarray(data.astype(np.float32)) mm = memmap_bin_file(path, dtype, mode="r") if sample_size is not None: mm = mm[:sample_size] + _validate_loaded_dataset(mm) return np.ascontiguousarray(mm)

🧰 Tools

🪛 OpenGrep (1.22.0)

[ERROR] 81-81: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.

(coderabbit.deserialization.python-pickle)

🪛 Ruff (0.15.17)

[error] 81-81: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py` around lines 73 - 93, Add input validation at the start of the load_dataset function to ensure sample_size is positive when provided (sample_size must be greater than 0 to prevent silent data corruption from Python's negative indexing). Additionally, after loading data in each branch (the .npy block, .pkl block, and memmap_bin_file block), validate that the resulting numpy array is 2D and numeric before applying sample_size slicing or returning, raising clear and actionable errors if the data is 1D or of invalid type. This ensures shape and type errors are caught at the load boundary rather than deferred to downstream functions like fit_cluster_stats.

Source: Coding guidelines

jinsolp added 11 commits May 19, 2026 22:06

synthetic data gen

11a2b46

Merge branch 'rapidsai:main' into billion-scale-data-gen

c2151ee

uint64 header support based on size

def7526

uint64 support in synthesize data

1e31b47

add experiment result sin readme

788e885

resolve merge conflict

ebd00ea

rm norm flag

fd18762

Merge branch 'rapidsai:main' into billion-scale-data-gen

d4bf7a5

cleanup

c44a245

cleanup and rename

4dce8f9

cleanup readme

8fcb3b2

jinsolp self-assigned this Jun 17, 2026

jinsolp requested a review from a team as a code owner June 17, 2026 00:38

jinsolp added the feature request New feature or request label Jun 17, 2026

github-project-automation Bot added this to Unstructured Data Processing Jun 17, 2026

jinsolp added the non-breaking Introduces a non-breaking change label Jun 17, 2026

coderabbitai Bot reviewed Jun 17, 2026

View reviewed changes

Conversation

jinsolp commented Jun 17, 2026

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

jinsolp commented Jun 17, 2026

Uh oh!

copy-pr-bot Bot commented Jun 17, 2026

Uh oh!

jinsolp commented Jun 17, 2026

Uh oh!

coderabbitai Bot commented Jun 17, 2026

Summary by CodeRabbit

Release Notes

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant