Skip to content

Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling#182

Open
scal444 wants to merge 2 commits into
NVIDIA-Digital-Bio:mainfrom
scal444:split/loaders
Open

Consolidate smiles/smarts loaders further in benchmarks, ensure shuffling#182
scal444 wants to merge 2 commits into
NVIDIA-Digital-Bio:mainfrom
scal444:split/loaders

Conversation

@scal444
Copy link
Copy Markdown
Collaborator

@scal444 scal444 commented May 27, 2026

No description provided.

- load_smiles now shuffles its result deterministically with the seed so
  benches that consume a head slice get a representative cross-section
  rather than file-order bias.
- substruct_bench.py drops its local load_pickle/load_smiles/load_smarts
  helpers in favor of the shared bench_utils versions.
- butina_clustering_bench.py and cross_similarity_bench.py replace inline
  open()/pd.read_csv loaders with the shared load_smiles helper.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR completes the loader consolidation started earlier by ensuring every molecule loader always shuffles its output, eliminating file-order bias even when the input has fewer molecules than max_count. It also migrates butina_clustering_bench.py and cross_similarity_bench.py to use the shared load_smiles loader, and removes the duplicate local load_pickle/load_smiles/load_smarts implementations from substruct_bench.py.

  • load_pickle, load_smiles, and load_sdf in bench_utils/loaders.py all gain an else: rng.shuffle(...) branch so benchmarks that consume a head slice always receive a representative, deterministically-seeded cross-section.
  • butina_clustering_bench.py and cross_similarity_bench.py replace ad-hoc CSV/SMILES parsing with load_smiles(), gaining reservoir sampling and the new shuffle guarantee.
  • ~130 lines of duplicate loader code are deleted from substruct_bench.py and superseded by the shared bench_utils implementations.

Confidence Score: 5/5

Safe to merge; changes are confined to benchmark utilities and eliminate a reproducible source of input-ordering bias.

All three loaders correctly handle both the sampling and the no-sampling paths, and both previously-flagged gaps (load_sdf and load_pickle missing shuffle-in-else) are addressed in this diff. The benchmark consumer files are straightforward call-site migrations with no logic changes.

No files require special attention.

Important Files Changed

Filename Overview
benchmarks/bench_utils/loaders.py All three loaders (load_pickle, load_smiles, load_sdf) now shuffle in the else-branch when sampling is not triggered, ensuring file-order bias is removed regardless of input size vs max_count.
benchmarks/butina_clustering_bench.py Replaced manual head-slice CSV parsing with load_smiles(), gaining reservoir sampling and shuffle behaviour.
benchmarks/cross_similarity_bench.py Replaced pandas CSV loading + manual parsing with load_smiles(); removes pandas dependency and gains shuffle; duplicate-padding fallback loop unchanged.
benchmarks/substruct_bench.py Removed local load_pickle, load_smiles, load_smarts duplicates and replaced with imports from bench_utils; no logic changes to benchmark code itself.

Reviews (2): Last reviewed commit: "Shuffle sdf and pickle too" | Re-trigger Greptile

@scal444 scal444 requested a review from evasnow1992 May 27, 2026 12:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant