Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling by scal444 · Pull Request #182 · NVIDIA-Digital-Bio/nvMolKit

scal444 · 2026-05-27T12:36:46Z

No description provided.

- load_smiles now shuffles its result deterministically with the seed so benches that consume a head slice get a representative cross-section rather than file-order bias. - substruct_bench.py drops its local load_pickle/load_smiles/load_smarts helpers in favor of the shared bench_utils versions. - butina_clustering_bench.py and cross_similarity_bench.py replace inline open()/pd.read_csv loaders with the shared load_smiles helper.

greptile-apps · 2026-05-27T12:40:37Z

Greptile Summary

This PR completes the loader consolidation started earlier by ensuring every molecule loader always shuffles its output, eliminating file-order bias even when the input has fewer molecules than max_count. It also migrates butina_clustering_bench.py and cross_similarity_bench.py to use the shared load_smiles loader, and removes the duplicate local load_pickle/load_smiles/load_smarts implementations from substruct_bench.py.

load_pickle, load_smiles, and load_sdf in bench_utils/loaders.py all gain an else: rng.shuffle(...) branch so benchmarks that consume a head slice always receive a representative, deterministically-seeded cross-section.
butina_clustering_bench.py and cross_similarity_bench.py replace ad-hoc CSV/SMILES parsing with load_smiles(), gaining reservoir sampling and the new shuffle guarantee.
~130 lines of duplicate loader code are deleted from substruct_bench.py and superseded by the shared bench_utils implementations.

Confidence Score: 5/5

Safe to merge; changes are confined to benchmark utilities and eliminate a reproducible source of input-ordering bias.

All three loaders correctly handle both the sampling and the no-sampling paths, and both previously-flagged gaps (load_sdf and load_pickle missing shuffle-in-else) are addressed in this diff. The benchmark consumer files are straightforward call-site migrations with no logic changes.

No files require special attention.

Important Files Changed

Filename	Overview
benchmarks/bench_utils/loaders.py	All three loaders (load_pickle, load_smiles, load_sdf) now shuffle in the else-branch when sampling is not triggered, ensuring file-order bias is removed regardless of input size vs max_count.
benchmarks/butina_clustering_bench.py	Replaced manual head-slice CSV parsing with load_smiles(), gaining reservoir sampling and shuffle behaviour.
benchmarks/cross_similarity_bench.py	Replaced pandas CSV loading + manual parsing with load_smiles(); removes pandas dependency and gains shuffle; duplicate-padding fallback loop unchanged.
benchmarks/substruct_bench.py	Removed local load_pickle, load_smiles, load_smarts duplicates and replaced with imports from bench_utils; no logic changes to benchmark code itself.

_{Reviews (2): Last reviewed commit: "Shuffle sdf and pickle too" | Re-trigger Greptile}

Shuffle sdf and pickle too

868b7be

scal444 requested a review from evasnow1992 May 27, 2026 12:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling#182

Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling#182
scal444 wants to merge 2 commits into
NVIDIA-Digital-Bio:mainfrom
scal444:split/loaders

scal444 commented May 27, 2026

Uh oh!

greptile-apps Bot commented May 27, 2026 •

edited

Loading

Greptile Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

scal444 commented May 27, 2026

Uh oh!

greptile-apps Bot commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 27, 2026 •

edited

Loading