Centralize shared utilities across benchmark backends #2040
jnke2016 wants to merge 4 commits into rapidsai:main
Conversation
…ing across Python-native backends
…nfiguration, gather_algorithm_configs) from CppGBenchConfigLoader to base ConfigLoader class.
📝 Walkthrough

Introduces a backend utility function that loads vectors from binary files, inferring the NumPy dtype from the file extension, and refactors shared YAML/dataset/algorithm configuration logic from `CppGBenchConfigLoader` into the base `ConfigLoader` class.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ Passed checks (5 passed)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@python/cuvs_bench/cuvs_bench/backends/utils.py`:
- Around line 57-58: The code currently assigns dtype = _DTYPE_FOR_EXT.get(ext,
np.float32) which silently defaults unknown file extensions to float32; change
this to fail fast by looking up ext in _DTYPE_FOR_EXT and raising a clear
ValueError (including ext and path) when not found instead of defaulting. Update
the logic around ext and dtype (the ext variable and _DTYPE_FOR_EXT lookup) so
unknown suffixes raise an exception with a descriptive message to prevent
corrupted benchmark inputs.
- Around line 60-66: The code that reads binary matrices (reading n_rows/n_cols
and payload) must validate header and payload lengths and ensure subset_size is
non-negative and within bounds: check that f.read(4) calls returned 4 bytes
before converting to uint32 and raise a ValueError if not; validate subset_size
>= 0 and if provided clamp it to n_rows but error if it's > n_rows; compute
expected_bytes = n_rows * n_cols * np.dtype(dtype).itemsize and verify f.read
returns exactly that many bytes (or at least enough for the requested subset)
before calling np.frombuffer, raising a clear ValueError if sizes mismatch;
update the logic around n_rows, n_cols, subset_size, dtype, np.frombuffer and
f.read to use these checks so malformed files produce actionable errors rather
than cryptic crashes.
In `@python/cuvs_bench/cuvs_bench/orchestrator/config_loaders.py`:
- Around line 234-243: The code silently ignores invalid algorithm_configuration
values when they are neither files nor directories; update the block handling
algorithm_configuration (the variable algorithm_configuration that populates
algos_conf_fs) to detect the invalid path and raise a clear exception (e.g.,
raise ValueError) or log-and-raise with the provided path and its type so
callers know the override is invalid instead of silently falling back to
defaults; implement this in the same conditional after the
os.path.isdir/os.path.isfile checks where algos_conf_fs is modified.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4d1822e3-7adb-4f35-a054-0962f282eec1
📒 Files selected for processing (2)
- python/cuvs_bench/cuvs_bench/backends/utils.py
- python/cuvs_bench/cuvs_bench/orchestrator/config_loaders.py
```python
ext = os.path.splitext(path)[1].lower()
dtype = _DTYPE_FOR_EXT.get(ext, np.float32)
```
Fail fast on unsupported vector file extensions.
Line 58 silently defaults unknown suffixes to float32, which can corrupt benchmark inputs when a filename is mistyped.
Proposed fix

```diff
 def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
 @@
     ext = os.path.splitext(path)[1].lower()
-    dtype = _DTYPE_FOR_EXT.get(ext, np.float32)
+    if ext not in _DTYPE_FOR_EXT:
+        raise ValueError(
+            f"Unsupported vector file extension '{ext}'. "
+            f"Expected one of: {', '.join(sorted(_DTYPE_FOR_EXT.keys()))}"
+        )
+    dtype = _DTYPE_FOR_EXT[ext]
```
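For context, the `_DTYPE_FOR_EXT` table this suggestion keys off of presumably maps the extensions mentioned in this PR to NumPy dtypes along these lines; this is an inferred sketch based on the usual `*bin` conventions, not code copied from the PR, so verify against `backends/utils.py`:

```python
import numpy as np

# Assumed extension-to-dtype mapping (not taken from the PR); in particular,
# .ibin as int32 follows the common ground-truth/index-file convention.
_DTYPE_FOR_EXT = {
    ".fbin": np.float32,
    ".f16bin": np.float16,
    ".u8bin": np.uint8,
    ".i8bin": np.int8,
    ".ibin": np.int32,
}
```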
```python
n_rows = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
n_cols = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
if subset_size is not None:
    n_rows = min(n_rows, subset_size)
raw = f.read(n_rows * n_cols * np.dtype(dtype).itemsize)
data = np.frombuffer(raw, dtype=dtype)
return data.reshape(n_rows, n_cols)
```
Validate file header/payload size and subset_size bounds.
Lines 60-66 assume complete header/payload and accept negative subset_size, which leads to non-actionable errors on bad input.
Proposed fix

```diff
 def load_vectors(path: str, subset_size: Optional[int] = None) -> np.ndarray:
 @@
+    if subset_size is not None and subset_size < 0:
+        raise ValueError("subset_size must be >= 0")
+
     with open(path, "rb") as f:
-        n_rows = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
-        n_cols = int(np.frombuffer(f.read(4), dtype=np.uint32)[0])
+        header = f.read(8)
+        if len(header) != 8:
+            raise ValueError(f"Invalid vector file header in '{path}'")
+        n_rows, n_cols = np.frombuffer(header, dtype=np.uint32, count=2)
+        n_rows, n_cols = int(n_rows), int(n_cols)
         if subset_size is not None:
             n_rows = min(n_rows, subset_size)
-        raw = f.read(n_rows * n_cols * np.dtype(dtype).itemsize)
-        data = np.frombuffer(raw, dtype=dtype)
+        expected = n_rows * n_cols * np.dtype(dtype).itemsize
+        raw = f.read(expected)
+        if len(raw) != expected:
+            raise ValueError(
+                f"Unexpected payload size in '{path}': expected {expected} bytes, got {len(raw)}"
+            )
+        data = np.frombuffer(raw, dtype=dtype, count=n_rows * n_cols)
         return data.reshape(n_rows, n_cols)
```
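To make the expected layout concrete, here is a minimal round-trip sketch of the two-uint32-header format that the checks above guard; `write_fbin` is a hypothetical test helper and the `load_vectors` import path is assumed from the new file location:

```python
import numpy as np

from cuvs_bench.backends.utils import load_vectors  # import path assumed

def write_fbin(path: str, mat: np.ndarray) -> None:
    # Header: n_rows and n_cols as uint32, followed by the raw row-major float32 payload.
    mat = np.ascontiguousarray(mat, dtype=np.float32)
    with open(path, "wb") as f:
        f.write(np.asarray(mat.shape, dtype=np.uint32).tobytes())
        f.write(mat.tobytes())

write_fbin("tiny.fbin", np.arange(12, dtype=np.float32).reshape(4, 3))
assert load_vectors("tiny.fbin").shape == (4, 3)
assert load_vectors("tiny.fbin", subset_size=2).shape == (2, 3)  # first two rows only
```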
```python
if algorithm_configuration:
    if os.path.isdir(algorithm_configuration):
        algos_conf_fs += [
            os.path.join(algorithm_configuration, f)
            for f in os.listdir(algorithm_configuration)
            if ".json" not in f
        ]
    elif os.path.isfile(algorithm_configuration):
        algos_conf_fs.append(algorithm_configuration)
return algos_conf_fs
```
Do not silently ignore invalid algorithm_configuration paths.
Line 234 accepts an override input, but when it is neither a file nor a directory, the code currently falls back silently to defaults.
Proposed fix

```diff
 if algorithm_configuration:
     if os.path.isdir(algorithm_configuration):
         algos_conf_fs += [
             os.path.join(algorithm_configuration, f)
             for f in os.listdir(algorithm_configuration)
             if ".json" not in f
         ]
     elif os.path.isfile(algorithm_configuration):
         algos_conf_fs.append(algorithm_configuration)
+    else:
+        raise FileNotFoundError(
+            f"algorithm_configuration path does not exist: "
+            f"{algorithm_configuration}"
+        )
 return algos_conf_fs
```
…idate header/payload size, warn on invalid algorithm config paths.
Summary
This PR centralizes duplicated logic identified during review of the OpenSearch and Elasticsearch backend PRs. Both backends independently reimplemented common operations (vector file loading, YAML parsing, dataset lookup) that should be shared across all backends.
Motivation
With the pluggable backend architecture, each new Python-native backend needs to load binary vector files, parse dataset YAML configs, and look up datasets by name. Without centralized utilities, each backend reimplements these independently, leading to duplicated logic that has to be kept in sync by hand across backends.
Changes
1. **Shared vector loading utility (`backends/utils.py`)**

   Added a `load_vectors()` function that handles all supported binary formats (`.fbin`, `.f16bin`, `.u8bin`, `.i8bin`, `.ibin`) with optional `subset_size` support. Python-native backends (OpenSearch, Elasticsearch) can import this instead of writing their own file parsers. The C++ backend is unaffected since it passes file paths to the subprocess.

2. **Promoted config-loading methods to the base `ConfigLoader` class**

   Moved `load_yaml_file`, `get_dataset_configuration`, and `gather_algorithm_configs` from `CppGBenchConfigLoader` to the base `ConfigLoader` class. These were being reimplemented by the OpenSearch config loader, with `load_yaml_file` and `get_dataset_configuration` also duplicated in the Elasticsearch loader. They are now inherited automatically, so new config loaders can call `self.load_yaml_file()` without reimplementing it.
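For illustration, a minimal sketch of how a Python-native backend might consume both changes; `OpenSearchConfigLoader`, its `load_dataset_conf()` method, and the argument order of `get_dataset_configuration` are assumptions for the example, not code from this PR:

```python
import numpy as np

from cuvs_bench.backends.utils import load_vectors  # added by this PR
from cuvs_bench.orchestrator.config_loaders import ConfigLoader


class OpenSearchConfigLoader(ConfigLoader):
    """Hypothetical backend loader that reuses the promoted helpers."""

    def load_dataset_conf(self, dataset_yaml: str, dataset_name: str):
        conf = self.load_yaml_file(dataset_yaml)  # inherited from ConfigLoader
        # Signature assumed: look up a single dataset entry by name in the parsed YAML.
        return self.get_dataset_configuration(dataset_name, conf)


# Vector files go through the shared utility instead of a per-backend parser.
queries = load_vectors("datasets/glove-100-inner/queries.fbin", subset_size=1000)
assert queries.dtype == np.float32
```

The C++ backend keeps passing file paths to its benchmark subprocess, so it does not call `load_vectors` at all.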