Skip to content

Scripts for Billion-Scale Synthetic Data Generation#2247

Open
jinsolp wants to merge 11 commits into
rapidsai:mainfrom
jinsolp:billion-scale-data-gen
Open

Scripts for Billion-Scale Synthetic Data Generation#2247
jinsolp wants to merge 11 commits into
rapidsai:mainfrom
jinsolp:billion-scale-data-gen

Conversation

@jinsolp

@jinsolp jinsolp commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Closes #2208

Adding billion-scale synthetic data generation scripts to cuvs_bench.

New Files in synthesize_dataset/:

  • __main__.py: fit/generate/verify CLI.
  • _fit.py: KMeans + per-cluster PCA fitting logic.
  • _fingerprint.py: Fingerprint class.
  • _generate.py: per-cluster data generation logic.
  • _ground_truth.py: exact (streaming) and nprobe GT computation.
  • _verify.py: compares nprobe GT vs exact GT to validate nprobes.
  • _io.py — dataset loading (support for npz and pkl files) + fingerprint NPZ save/load logic.
  • README.md + figures/: full workflow guide and synth-vs-real DiskANN validation on Falcon/BigANN/Wiki.

For Reviewers: It would be easier to read through the README.md and review code for each step in that order (starting with __main__.py.

@jinsolp jinsolp self-assigned this Jun 17, 2026
@jinsolp jinsolp requested a review from a team as a code owner June 17, 2026 00:38
@jinsolp jinsolp added the feature request New feature or request label Jun 17, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jinsolp jinsolp added the non-breaking Introduces a non-breaking change label Jun 17, 2026
@jinsolp

jinsolp commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Jun 17, 2026

Copy link
Copy Markdown

/ok to test

@jinsolp, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@jinsolp

jinsolp commented Jun 17, 2026

Copy link
Copy Markdown
Contributor Author

/ok to test 8fcb3b2

@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • New Features

    • Added synthesize_dataset module with CLI tool for generating synthetic vector datasets and ground truth at scale for benchmarking
    • Three-command interface: fit (learn dataset fingerprint), generate (create synthetic data/queries/ground truth), and verify (check accuracy)
    • Support for random jitter query generation mode
    • Ground truth computation in exact and approximate modes
  • Documentation

    • Added comprehensive README with workflow, examples, and API reference for synthetic dataset generation

Walkthrough

Introduces a new cuvs_bench.synthesize_dataset package for generating GPU-accelerated billion-scale synthetic ANN benchmark datasets from fitted cluster fingerprints, including exact and nprobe ground-truth computation, a three-subcommand CLI, and full documentation. Also extends generate_groundtruth with a random-jitter query generation mode backed by new shared utility functions.

Changes

Billion-Scale Synthetic Dataset Generator and Jitter Query Support

Layer / File(s) Summary
Shared jitter utilities and generate_groundtruth wiring
python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py, python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py
Adds is_l2_normalized and add_jitter to utils.py, then imports them into __main__.py to introduce choose_random_queries_with_jitter, extend the --queries CLI argument, and branch on the new random-jitter mode.
Fingerprint dataclass and package exports
python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py, python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py
Defines the Fingerprint dataclass storing all clustering/PCA fields with __post_init__ density normalization, and re-exports all public symbols via __all__ in the package __init__.
Fingerprint and dataset I/O
python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py
Implements load_dataset (multi-format with optional truncation), save_fingerprint (NPZ with empty-array diagonal-fallback markers), and load_fingerprint (NPZ deserialization reconstructing None PCA entries and returning a Fingerprint).
Cluster fingerprint fitting pipeline
python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py
Adds fit_cluster_stats running GPU KMeans and per-cluster PCA (with diagonal fallback for small clusters), computing per-cluster density, variance, and residual noise variance, backed by _run_kmeans and _fit_cluster_pca helpers.
GPU synthetic data and query generation
python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py
Adds gen_cluster_gpu (PCA-correlated Gaussian sampling with noise, variance rescaling, optional normalization), density-proportional cluster point allocation, in-memory and async double-buffered streaming .fbin generation, and jitter-based query sampling.
Ground truth computation and verification
python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py, python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py
Implements GPU brute-force k-NN helpers, compute_groundtruth_exact (streaming per-cluster k-NN merge), compute_groundtruth_nprobe (probe-limited batched k-NN), and verify_groundtruth (nprobe vs exact recall comparison with timing).
CLI entry point and README
python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py, python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md
Defines the fit, generate, and verify subcommand CLI with argument parsing, pipeline dispatch, default nprobes derivation, and output path conventions; adds the full README covering workflow steps, fingerprint schema, YAML registration, pre-fit experiments, caveats, glossary, and Python API list.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

improvement, non-breaking, benchmarking, doc

Suggested reviewers

  • jrbourbeau
  • cjnolet
  • lowener
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 61.76% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly describes the primary change: adding scripts for billion-scale synthetic data generation to cuvs_bench.
Description check ✅ Passed Description is well-related to the changeset, explaining the new synthesize_dataset modules, files added, and workflow.
Linked Issues check ✅ Passed The PR successfully implements the objective from issue #2208 to open-source the BSDG as part of cuvs_bench with complete fit/generate/verify workflow.
Out of Scope Changes check ✅ Passed All changes are directly aligned with the goal of adding billion-scale synthetic data generation to cuvs_bench; no out-of-scope modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py`:
- Around line 116-133: The function `choose_random_queries_with_jitter` always
returns float32 jittered query data (via the return statement calling
`add_jitter` on the float32-casted `sampled` array), but the code that uses this
function's output (around line 389) determines the filename suffix using the
original `dataset.dtype` instead of the actual output dtype. This causes float32
data to be written to filenames with the wrong type suffix (e.g., `.u8bin` for
uint8 inputs), leading to incorrect decoding later. Update the code that
generates the output filename suffix to use float32 as the dtype instead of
`dataset.dtype`, since the jittered queries are always float32 regardless of the
input dataset type.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py`:
- Around line 343-351: Add centralized validation logic in the main function
after the parser.parse_args(argv) call and before dispatching to the command
handlers (_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or
inline checks to verify that numeric arguments including total_rows, n_queries,
k, nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py`:
- Around line 67-71: In the __post_init__ method, before normalizing
cluster_densities, add comprehensive validation to check not only that the sum
is positive but also that all individual density values are non-negative
(greater than or equal to zero), contain no NaN or infinite values, and match
the expected shape or dimensions as documented in the class contract. These
checks should occur after computing the total sum but before performing the
normalization division to prevent silent data corruption in downstream
allocation operations.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py`:
- Around line 64-75: The function `_fit_cluster_pca` creates GPU memory
allocations with `residuals_gpu` and `out` but doesn't explicitly free them,
causing GPU memory to accumulate when called repeatedly in a loop (once per
cluster). After converting the results to NumPy arrays using `cp.asnumpy()`,
explicitly delete the GPU objects `residuals_gpu` and `out` using the `del`
statement, and optionally call
`cp.cuda.stream.get_current_stream().synchronize()` to ensure GPU memory is
properly released before the function returns.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py`:
- Around line 219-230: The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py`:
- Around line 54-60: The compute_groundtruth_exact function and related
functions (around lines 119-126) lack explicit bounds validation for the
parameters k, total_rows, and nprobes. When k exceeds total_rows or other
invalid values are passed, the functions prefill outputs with sentinel values
(-1) which then propagate to recall computations downstream. Add validation
checks at the beginning of compute_groundtruth_exact and the other affected
function to verify that k is less than or equal to total_rows, total_rows is
positive, and nprobes is within valid bounds. When validation fails, raise clear
exceptions that include both the expected constraints and the actual invalid
values provided, following the pattern of providing actionable error messages
with expected-vs-actual comparisons.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py`:
- Around line 73-93: Add input validation at the start of the load_dataset
function to ensure sample_size is positive when provided (sample_size must be
greater than 0 to prevent silent data corruption from Python's negative
indexing). Additionally, after loading data in each branch (the .npy block, .pkl
block, and memmap_bin_file block), validate that the resulting numpy array is 2D
and numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 06365399-0de0-4fab-a427-5f11f7d266be

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae6f93 and 8fcb3b2.

⛔ Files ignored due to path filters (4)
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_bigann.png is excluded by !**/*.png
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_falcon.png is excluded by !**/*.png
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/diskann_wiki.png is excluded by !**/*.png
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/figures/pipeline.png is excluded by !**/*.png
📒 Files selected for processing (11)
  • python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py
  • python/cuvs_bench/cuvs_bench/generate_groundtruth/utils.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/README.md
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/__init__.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py
  • python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py

Comment on lines +116 to +133
def choose_random_queries_with_jitter(dataset, n_queries, seed=12345):
"""Pick ``n_queries`` random rows from ``dataset``, add Gaussian jitter at
scale ``0.1 * std(sample)``, and re-normalize to unit norm iff the
original dataset rows already are.
"""
import numpy as _np

print("Choosing random vectors from dataset and jittering with noise")
rng = _np.random.default_rng(seed)
n_rows = dataset.shape[0]
# Sort indices so the memmap read is sequential rather than random-access.
query_idx = _np.sort(rng.choice(n_rows, size=n_queries, replace=False))
sampled = dataset[query_idx, :].astype(_np.float32, copy=True)

normalize = is_l2_normalized(sampled)

return add_jitter(sampled, rng, normalize)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: random-jitter writes float32 query data under dataset-typed filename suffix.

Line 128/Line 132 produce float32 jittered queries, but Line 389 still picks suffix from dataset.dtype. For non-float datasets, this can write float32 payloads into .u8bin/.i8bin filenames, which later decode with the wrong dtype.

Proposed fix
-        queries_filename = os.path.join(
-            args.output, "queries" + suffix_from_dtype(dtype)
-        )
+        query_out_dtype = (
+            queries.dtype if args.queries == "random-jitter" else dtype
+        )
+        queries_filename = os.path.join(
+            args.output, "queries" + suffix_from_dtype(query_out_dtype)
+        )

As per coding guidelines, “Prevent silent data corruption from type coercion and validate that array type coercions are handled safely.”

Also applies to: 388-390

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/generate_groundtruth/__main__.py` around lines
116 - 133, The function `choose_random_queries_with_jitter` always returns
float32 jittered query data (via the return statement calling `add_jitter` on
the float32-casted `sampled` array), but the code that uses this function's
output (around line 389) determines the filename suffix using the original
`dataset.dtype` instead of the actual output dtype. This causes float32 data to
be written to filenames with the wrong type suffix (e.g., `.u8bin` for uint8
inputs), leading to incorrect decoding later. Update the code that generates the
output filename suffix to use float32 as the dtype instead of `dataset.dtype`,
since the jittered queries are always float32 regardless of the input dataset
type.

Source: Coding guidelines

Comment on lines +343 to +351
args = parser.parse_args(argv)
if args.command == "fit":
return _cmd_fit(args)
if args.command == "generate":
return _cmd_generate(args)
if args.command == "verify":
return _cmd_verify(args)
parser.print_help()
return 1

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Add centralized CLI bounds validation before dispatch.

Several numeric arguments are unbounded (total_rows, n_queries, k, nprobes, sample_size, n_clusters, pca_components). Invalid values can crash downstream (or produce invalid GT silently), and nprobes is not enforced against config.nclusters despite the documented contract.

Proposed guardrail patch
 def main(argv: list[str] | None = None) -> int:
@@
     args = parser.parse_args(argv)
+
+    def _fail(msg: str) -> int:
+        parser.error(msg)
+        return 2
+
+    if args.command == "fit":
+        if args.sample_size is not None and args.sample_size <= 0:
+            return _fail(
+                f"--sample_size must be > 0 when provided (got {args.sample_size})."
+            )
+        if args.n_clusters <= 0:
+            return _fail(f"--n_clusters must be > 0 (got {args.n_clusters}).")
+        if args.pca_components <= 0:
+            return _fail(
+                f"--pca_components must be > 0 (got {args.pca_components})."
+            )
+
+    if args.command in ("generate", "verify"):
+        if args.total_rows <= 0:
+            return _fail(f"--total_rows must be > 0 (got {args.total_rows}).")
+        if args.n_queries <= 0:
+            return _fail(f"--n_queries must be > 0 (got {args.n_queries}).")
+        if args.n_queries > args.total_rows:
+            return _fail(
+                f"--n_queries ({args.n_queries}) must be <= --total_rows ({args.total_rows}) "
+                "for sampling without replacement."
+            )
+        if args.k <= 0:
+            return _fail(f"--k must be > 0 (got {args.k}).")
+        if args.nprobes is not None and args.nprobes <= 0:
+            return _fail(f"--nprobes must be > 0 (got {args.nprobes}).")
+
     if args.command == "fit":
         return _cmd_fit(args)
     if args.command == "generate":
+        if args.nprobes is not None:
+            cfg = load_fingerprint(args.stats, seed=args.seed)
+            if args.nprobes > cfg.nclusters:
+                return _fail(
+                    f"--nprobes ({args.nprobes}) must be <= number of clusters "
+                    f"({cfg.nclusters}) in fingerprint."
+                )
         return _cmd_generate(args)
     if args.command == "verify":
+        if args.nprobes is not None:
+            cfg = load_fingerprint(args.stats, seed=args.seed)
+            if args.nprobes > cfg.nclusters:
+                return _fail(
+                    f"--nprobes ({args.nprobes}) must be <= number of clusters "
+                    f"({cfg.nclusters}) in fingerprint."
+                )
         return _cmd_verify(args)

As per coding guidelines, “Ensure missing validation does not cause crashes on invalid input through proper size/type checks” and “Provide clear and actionable error messages that include expected vs actual values where helpful.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py` around lines 343
- 351, Add centralized validation logic in the main function after the
parser.parse_args(argv) call and before dispatching to the command handlers
(_cmd_fit, _cmd_generate, _cmd_verify). Create a validation function or inline
checks to verify that numeric arguments including total_rows, n_queries, k,
nprobes, sample_size, n_clusters, and pca_components have valid bounds.
Additionally, ensure that nprobes is validated against config.nclusters to
enforce the documented contract. If any argument fails validation, print a clear
error message that includes the argument name, expected valid range, and actual
value provided, then return 1 to exit early. This prevents invalid values from
reaching downstream processing.

Source: Coding guidelines

Comment on lines +67 to +71
def __post_init__(self):
total = float(self.cluster_densities.sum())
if total <= 0.0:
raise ValueError("cluster_densities must sum to a positive value")
self.cluster_densities = self.cluster_densities / total

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# Locate and examine the fingerprint file
find . -name "_fingerprint.py" -type f | head -5

Repository: rapidsai/cuvs

Length of output: 123


🏁 Script executed:

# Check the full structure of the file and surrounding context
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py

Repository: rapidsai/cuvs

Length of output: 3252


🏁 Script executed:

# Find how cluster_densities is used downstream
rg "cluster_densities" --type py -A 3 -B 3

Repository: rapidsai/cuvs

Length of output: 3727


🏁 Script executed:

# Check for existing tests of Fingerprint class
find . -name "test_*.py" -o -name "*_test.py" | xargs rg "Fingerprint" -l

Repository: rapidsai/cuvs

Length of output: 41


🏁 Script executed:

# Look at the full context of where cluster_densities is used in _generate.py
sed -n '1,100p' python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py | head -80

Repository: rapidsai/cuvs

Length of output: 2602


🏁 Script executed:

# Check if there are any guards around the multiplication by cluster_densities
rg "cluster_densities" -B 5 -A 5 python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py

Repository: rapidsai/cuvs

Length of output: 451


🏁 Script executed:

# Check what data["densities"] is and how it's constructed
rg "densities" -B 3 -A 3 python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py | head -40

Repository: rapidsai/cuvs

Length of output: 1006


🏁 Script executed:

# Check if Fingerprint is directly instantiated in tests
find . -path "*/test*" -name "*.py" -exec rg "Fingerprint" {} +

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Check the full function where cluster_densities is used
sed -n '1,150p' python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py

Repository: rapidsai/cuvs

Length of output: 5070


🏁 Script executed:

# Verify what happens with negative or non-finite values in the multiplication
python3 << 'PY'
import numpy as np

# Simulate what happens if cluster_densities contains problematic values
remaining = 100
total_points = 1000
nclusters = 3

# Case 1: negative density (not caught by current validation)
densities_neg = np.array([-0.1, 0.5, 0.6])  # sums to 1.0, passes current check
total = float(densities_neg.sum())
print(f"Negative case: sum={total}, passes check: {total > 0}")
extra = (densities_neg * remaining).astype(np.int64)
print(f"Extra points: {extra} - negative allocation!\n")

# Case 2: NaN/Inf (current check may fail)
densities_nan = np.array([np.nan, 0.5, 0.5])
total = float(densities_nan.sum())
print(f"NaN case: sum={total}, passes check: {total > 0}")
print(f"NaN comparison result: {total <= 0.0}\n")

# Case 3: shape mismatch (not caught at all)
densities_wrong_shape = np.array([[0.3, 0.3, 0.4]])  # shape (1, 3) instead of (3,)
print(f"Shape: {densities_wrong_shape.shape}, normalized: {densities_wrong_shape / densities_wrong_shape.sum()}")
PY

Repository: rapidsai/cuvs

Length of output: 264


HIGH: Validate density invariants before normalization.

Only checking sum > 0 allows negative, NaN, Inf, and shape-mismatched densities to pass validation. Negative densities produce silent data corruption in downstream allocation ((config.cluster_densities * remaining).astype(np.int64)); NaN/Inf values can slip through comparisons; and shape mismatches violate the documented contract.

Suggested fix
 def __post_init__(self):
+    if self.cluster_densities.shape != (self.nclusters,):
+        raise ValueError(
+            f"cluster_densities shape mismatch: expected ({self.nclusters},), "
+            f"got {self.cluster_densities.shape}"
+        )
+    if not np.issubdtype(self.cluster_densities.dtype, np.floating):
+        self.cluster_densities = self.cluster_densities.astype(np.float64)
+    if not np.isfinite(self.cluster_densities).all():
+        raise ValueError("cluster_densities must contain only finite values")
+    if (self.cluster_densities < 0).any():
+        raise ValueError("cluster_densities must be non-negative")
     total = float(self.cluster_densities.sum())
     if total <= 0.0:
         raise ValueError("cluster_densities must sum to a positive value")
     self.cluster_densities = self.cluster_densities / total
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fingerprint.py` around lines
67 - 71, In the __post_init__ method, before normalizing cluster_densities, add
comprehensive validation to check not only that the sum is positive but also
that all individual density values are non-negative (greater than or equal to
zero), contain no NaN or infinite values, and match the expected shape or
dimensions as documented in the class contract. These checks should occur after
computing the total sum but before performing the normalization division to
prevent silent data corruption in downstream allocation operations.

Source: Coding guidelines

Comment on lines +64 to +75
def _fit_cluster_pca(
residuals: np.ndarray, n_components: int
) -> tuple[np.ndarray, np.ndarray]:
"""Fit a single PCA via cuvs and return ``(components, explained_var)``."""
residuals_gpu = cp.asarray(residuals, dtype=cp.float32)
params = cuvs_pca.Params(n_components=n_components, copy=True)
out = cuvs_pca.fit(params, residuals_gpu)
components = cp.asnumpy(cp.asarray(out.components)).astype(np.float32)
explained_var = cp.asnumpy(cp.asarray(out.explained_var)).astype(
np.float32
)
return components, explained_var

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: GPU memory not freed in loop-called helper

_fit_cluster_pca is called once per cluster (potentially thousands) but doesn't free residuals_gpu or the out object before returning. GPU memory can accumulate between Python GC cycles.

Suggested fix
 def _fit_cluster_pca(
     residuals: np.ndarray, n_components: int
 ) -> tuple[np.ndarray, np.ndarray]:
     """Fit a single PCA via cuvs and return ``(components, explained_var)``."""
     residuals_gpu = cp.asarray(residuals, dtype=cp.float32)
     params = cuvs_pca.Params(n_components=n_components, copy=True)
     out = cuvs_pca.fit(params, residuals_gpu)
     components = cp.asnumpy(cp.asarray(out.components)).astype(np.float32)
     explained_var = cp.asnumpy(cp.asarray(out.explained_var)).astype(
         np.float32
     )
+    del residuals_gpu, out
+    cp.get_default_memory_pool().free_all_blocks()
     return components, explained_var
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py` around lines 64 -
75, The function `_fit_cluster_pca` creates GPU memory allocations with
`residuals_gpu` and `out` but doesn't explicitly free them, causing GPU memory
to accumulate when called repeatedly in a loop (once per cluster). After
converting the results to NumPy arrays using `cp.asnumpy()`, explicitly delete
the GPU objects `residuals_gpu` and `out` using the `del` statement, and
optionally call `cp.cuda.stream.get_current_stream().synchronize()` to ensure
GPU memory is properly released before the function returns.

Source: Coding guidelines

Comment on lines +219 to +230
def _wait_for_write() -> None:
nonlocal write_thread
if write_thread is not None:
write_thread.join()
write_thread = None

def _flush_async(f, buf_view: np.ndarray) -> None:
nonlocal write_thread
write_thread = threading.Thread(
target=lambda: buf_view.tofile(f), daemon=True
)
write_thread.start()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Exceptions in async write thread are silently lost

If buf_view.tofile(f) fails (disk full, I/O error), the exception is swallowed by the daemon thread. _wait_for_write() only calls join() without checking for errors, so the code continues and may produce a corrupt/truncated file without any indication.

Have you considered capturing and re-raising exceptions from the write thread?

Suggested fix using exception capture
+    write_exception: BaseException | None = None
+
     def _wait_for_write() -> None:
-        nonlocal write_thread
+        nonlocal write_thread, write_exception
         if write_thread is not None:
             write_thread.join()
             write_thread = None
+            if write_exception is not None:
+                exc = write_exception
+                write_exception = None
+                raise exc
+
+    def _do_write(buf_view: np.ndarray, f) -> None:
+        nonlocal write_exception
+        try:
+            buf_view.tofile(f)
+        except BaseException as e:
+            write_exception = e

     def _flush_async(f, buf_view: np.ndarray) -> None:
         nonlocal write_thread
         write_thread = threading.Thread(
-            target=lambda: buf_view.tofile(f), daemon=True
+            target=lambda: _do_write(buf_view, f), daemon=True
         )
         write_thread.start()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_generate.py` around lines
219 - 230, The daemon thread in _flush_async can fail silently if
buf_view.tofile(f) throws an exception (disk full, I/O error), and
_wait_for_write() only calls join() without checking for errors, causing the
code to continue with a potentially corrupt file. Add a nonlocal variable to
capture exceptions from the write thread, wrap the buf_view.tofile(f) call in a
try-except block to catch and store any exception, then modify _wait_for_write()
to check for captured exceptions after joining the thread and re-raise them if
they exist.

Source: Coding guidelines

Comment on lines +54 to +60
def compute_groundtruth_exact(
queries: np.ndarray,
total_rows: int,
config: Fingerprint,
k: int,
metric: str = "sqeuclidean",
) -> Tuple[np.ndarray, np.ndarray, dict]:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

HIGH: Add explicit bounds validation for k, total_rows, and nprobes.

At Line 70 and Line 139, outputs are prefilled with -1; if k is out of bounds (for example, k > total_rows), invalid sentinel IDs can leak into final GT and then into recall computation (used in python/cuvs_bench/cuvs_bench/synthesize_dataset/_verify.py, Line 76). Please fail fast with clear expected-vs-actual validation.

Proposed fix
 def compute_groundtruth_exact(
     queries: np.ndarray,
     total_rows: int,
     config: Fingerprint,
     k: int,
     metric: str = "sqeuclidean",
 ) -> Tuple[np.ndarray, np.ndarray, dict]:
+    if total_rows <= 0:
+        raise ValueError(
+            f"total_rows must be > 0, got total_rows={total_rows}"
+        )
+    if k <= 0 or k > total_rows:
+        raise ValueError(
+            f"k must be in [1, total_rows], got k={k}, total_rows={total_rows}"
+        )
+
     """Exact brute-force GT, computed by streaming every cluster.
@@
 def compute_groundtruth_nprobe(
@@
 ) -> Tuple[np.ndarray, np.ndarray, dict]:
+    if total_rows <= 0:
+        raise ValueError(
+            f"total_rows must be > 0, got total_rows={total_rows}"
+        )
+    if k <= 0 or k > total_rows:
+        raise ValueError(
+            f"k must be in [1, total_rows], got k={k}, total_rows={total_rows}"
+        )
+    if nprobes <= 0 or nprobes > config.nclusters:
+        raise ValueError(
+            "nprobes must be in [1, config.nclusters], "
+            f"got nprobes={nprobes}, config.nclusters={config.nclusters}"
+        )
+
     """Cheap GT via cluster probing: only the ``nprobes`` nearest clusters

As per coding guidelines, “Ensure missing validation does not cause crashes on invalid input through proper size/type checks” and “Provide clear and actionable error messages that include expected vs actual values where helpful.”

Also applies to: 119-126

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_ground_truth.py` around
lines 54 - 60, The compute_groundtruth_exact function and related functions
(around lines 119-126) lack explicit bounds validation for the parameters k,
total_rows, and nprobes. When k exceeds total_rows or other invalid values are
passed, the functions prefill outputs with sentinel values (-1) which then
propagate to recall computations downstream. Add validation checks at the
beginning of compute_groundtruth_exact and the other affected function to verify
that k is less than or equal to total_rows, total_rows is positive, and nprobes
is within valid bounds. When validation fails, raise clear exceptions that
include both the expected constraints and the actual invalid values provided,
following the pattern of providing actionable error messages with
expected-vs-actual comparisons.

Source: Coding guidelines

Comment on lines +73 to +93
if ext == ".npy":
data = np.load(path)
if sample_size is not None and sample_size < len(data):
data = data[:sample_size]
return np.ascontiguousarray(data.astype(np.float32))

if ext == ".pkl":
with open(path, "rb") as f:
data = pickle.load(f)
if not isinstance(data, np.ndarray):
data = np.array(data, dtype=np.float32)
if sample_size is not None and sample_size < len(data):
data = data[:sample_size]
return np.ascontiguousarray(data.astype(np.float32))

# Default: treat as fbin (covers ".fbin" and unknown extensions).
# memmap_bin_file auto-detects the legacy uint32 / extended uint64 header.
mm = memmap_bin_file(path, dtype, mode="r")
if sample_size is not None:
mm = mm[:sample_size]
return np.ascontiguousarray(mm)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and examine the _io.py file
fd -t f "_io.py" python/

Repository: rapidsai/cuvs

Length of output: 112


🏁 Script executed:

# Get the full context of the file around lines 73-93
wc -l python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py

Repository: rapidsai/cuvs

Length of output: 116


🏁 Script executed:

# Read the full function to understand context
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py | head -120

Repository: rapidsai/cuvs

Length of output: 5227


🏁 Script executed:

# Check memmap_bin_file to understand what it returns
rg -A 20 "def memmap_bin_file" python/

Repository: rapidsai/cuvs

Length of output: 2220


🏁 Script executed:

# Find callers of load_dataset
rg -B 2 -A 2 "load_dataset" python/ --type py

Repository: rapidsai/cuvs

Length of output: 2539


🏁 Script executed:

# Check if there are tests for load_dataset
fd -t f "test_*_io.py" python/

Repository: rapidsai/cuvs

Length of output: 39


🏁 Script executed:

# Get more context on the _cmd_fit function to see how data is used
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py | sed -n '1,150p'

Repository: rapidsai/cuvs

Length of output: 5596


🏁 Script executed:

# Check if there are any validation calls on the data after load_dataset
rg -A 10 "_cmd_fit" python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py

Repository: rapidsai/cuvs

Length of output: 688


🏁 Script executed:

# Look for the actual fit function to see what contract it expects
rg -B 3 -A 15 "def fit_cluster_stats" python/

Repository: rapidsai/cuvs

Length of output: 1585


🏁 Script executed:

# Check what happens in fit_cluster_stats with invalid data shapes
cat -n python/cuvs_bench/cuvs_bench/synthesize_dataset/_fit.py | sed -n '80,150p'

Repository: rapidsai/cuvs

Length of output: 3245


🏁 Script executed:

# Check if sample_size validation happens elsewhere (e.g., argparse)
rg -B 5 -A 5 "sample_size" python/cuvs_bench/cuvs_bench/synthesize_dataset/__main__.py | head -40

Repository: rapidsai/cuvs

Length of output: 1412


🏁 Script executed:

# Test what happens with negative sample_size
python3 << 'PYEOF'
import numpy as np

# Simulate negative sample_size behavior
data = np.arange(100).reshape(10, 10)
sample_size = -3

# What the current code does
if sample_size is not None and sample_size < len(data):
    result = data[:sample_size]
    print(f"sample_size={sample_size}, len(data)={len(data)}")
    print(f"Condition (sample_size < len(data)): {sample_size < len(data)}")
    print(f"Result shape: {result.shape}")
    print(f"Result (last 3 rows removed): {result[-3:]}")
PYEOF

Repository: rapidsai/cuvs

Length of output: 280


🏁 Script executed:

# Verify what happens with 1D array in fit_cluster_stats
python3 << 'PYEOF'
import numpy as np

# Simulate 1D array error
data_1d = np.arange(100).astype(np.float32)
try:
    n_dim = data_1d.shape[1]
except IndexError as e:
    print(f"1D array error: {type(e).__name__}: {e}")

# Simulate object dtype
data_obj = np.array(["a", "b", "c"], dtype=object)
try:
    ascontiguous = np.ascontiguousarray(data_obj.astype(np.float32))
    print("Object dtype converted successfully (unexpected)")
except (ValueError, TypeError) as e:
    print(f"Object dtype error: {type(e).__name__}: {e}")
PYEOF

Repository: rapidsai/cuvs

Length of output: 181


🏁 Script executed:

# Check the actual behavior of the suggested patch - when validation happens relative to slicing
python3 << 'PYEOF'
import numpy as np

# Simulate the current _npy branch behavior vs proposed fix
path = "/tmp/test.npy"
sample_size = -5

# Create test data
test_data = np.arange(100).reshape(10, 10).astype(np.float32)
np.save(path, test_data)

# Current code (without validation)
data = np.load(path)
print(f"Loaded shape: {data.shape}")
if sample_size is not None and sample_size < len(data):
    data = data[:sample_size]  # NEGATIVE SLICE HAPPENS HERE
print(f"After slice with sample_size={sample_size}: shape={data.shape}")
print(f"Data was silently corrupted! Expected first {abs(sample_size)} rows, got last {10-abs(sample_size)} rows")

# Now check when the suggested validation would run
print(f"\nValidation check would run AFTER slicing, so it cannot prevent the corruption")
if sample_size is not None and sample_size <= 0:
    print(f"Would catch: sample_size={sample_size} is invalid")
else:
    print(f"Validation check: passed (data already corrupted)")
PYEOF

Repository: rapidsai/cuvs

Length of output: 313


HIGH: Validate dataset shape and sample_size before processing, with sample_size check before slicing.

load_dataset accepts invalid inputs that cause silent data corruption (negative sample_size silently loads the last N rows instead of the first N due to Python's negative indexing) and defers shape errors to fit_cluster_stats (which crashes on 1D arrays with IndexError). Validate sample_size sign before any slicing, and enforce 2D numeric arrays at load boundary with actionable errors.

Suggested fix
+def _validate_sample_size(sample_size: int | None) -> None:
+    if sample_size is not None and sample_size <= 0:
+        raise ValueError(
+            f"sample_size must be > 0, got {sample_size}"
+        )
+
+def _validate_loaded_dataset(data: np.ndarray) -> None:
+    if not isinstance(data, np.ndarray):
+        raise TypeError(f"expected numpy.ndarray, got {type(data).__name__}")
+    if data.ndim != 2:
+        raise ValueError(f"expected 2D array (n_rows, n_cols), got ndim={data.ndim}")
+    if data.shape[1] <= 0:
+        raise ValueError(f"expected n_cols > 0, got shape={data.shape}")
+    if not np.issubdtype(data.dtype, np.number):
+        raise TypeError(f"expected numeric dtype, got {data.dtype}")

 def load_dataset(
     path: str,
     sample_size: int | None = None,
     dtype: np.dtype = np.float32,
 ) -> np.ndarray:
+    _validate_sample_size(sample_size)
     ext = os.path.splitext(path)[1].lower()

     if ext == ".npy":
         data = np.load(path)
         if sample_size is not None and sample_size < len(data):
             data = data[:sample_size]
+        _validate_loaded_dataset(data)
         return np.ascontiguousarray(data.astype(np.float32))

     if ext == ".pkl":
         with open(path, "rb") as f:
             data = pickle.load(f)
         if not isinstance(data, np.ndarray):
             data = np.array(data, dtype=np.float32)
         if sample_size is not None and sample_size < len(data):
             data = data[:sample_size]
+        _validate_loaded_dataset(data)
         return np.ascontiguousarray(data.astype(np.float32))

     mm = memmap_bin_file(path, dtype, mode="r")
     if sample_size is not None:
         mm = mm[:sample_size]
+    _validate_loaded_dataset(mm)
     return np.ascontiguousarray(mm)
🧰 Tools
🪛 OpenGrep (1.22.0)

[ERROR] 81-81: pickle.load/loads deserializes arbitrary Python objects and can execute arbitrary code. Use a safe format like JSON instead.

(coderabbit.deserialization.python-pickle)

🪛 Ruff (0.15.17)

[error] 81-81: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue

(S301)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@python/cuvs_bench/cuvs_bench/synthesize_dataset/_io.py` around lines 73 - 93,
Add input validation at the start of the load_dataset function to ensure
sample_size is positive when provided (sample_size must be greater than 0 to
prevent silent data corruption from Python's negative indexing). Additionally,
after loading data in each branch (the .npy block, .pkl block, and
memmap_bin_file block), validate that the resulting numpy array is 2D and
numeric before applying sample_size slicing or returning, raising clear and
actionable errors if the data is 1D or of invalid type. This ensures shape and
type errors are caught at the load boundary rather than deferred to downstream
functions like fit_cluster_stats.

Source: Coding guidelines

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature request New feature or request non-breaking Introduces a non-breaking change

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[FEA] BSDG: Open source the billion-scale synthetic generator

1 participant