Commit b2125bf

docs: update all docs, notebook, and CLAUDE.md for class-based clustering API

- docs/guide/analysis.md: replace old function examples with class-based API, add clustering backend selection table
- CLAUDE.md: update clustering.py description in repo structure
- notebooks/apex_rmsd_rmsf.ipynb: convert all cells to class-based API
- tests: rename benchmark functions from cluster_conformations to gromos
- zero remaining references to removed functions across entire repo

1 parent 11f5827 commit b2125bf

4 files changed

Lines changed: 666 additions & 7 deletions

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ src/mdpp/
 │ ├── dssp.py # secondary structure (DSSP)
 │ ├── decomposition.py # PCA (with projection), TICA, backbone torsion featurization
 │ ├── fes.py # 2D free energy surfaces
-│ └── clustering.py # RMSD matrix, GROMOS clustering (thin wrapper)
+│ └── clustering.py # RMSD matrix, clustering methods (Gromos/DBSCAN/HDBSCAN/Hierarchical/KMeans/MiniBatchKMeans/RegularSpace)
 ├── chem/ # small-molecule cheminformatics (RDKit-based)
 │ ├── descriptors.py # molecular descriptor calculation and filtering
 │ ├── filters.py # Murcko scaffold extraction, PAINS filters

docs/guide/analysis.md

Lines changed: 53 additions & 4 deletions
@@ -266,15 +266,64 @@ fes = compute_fes_from_projection(pca_result.projections)
 
 ## Conformational Clustering
 
+Each clustering method is a frozen dataclass configured at construction
+time and called on the data matrix:
+
 ```python
-from mdpp.analysis.clustering import compute_rmsd_matrix, cluster_conformations
+from mdpp.analysis.clustering import (
+    Gromos, Hierarchical, DBSCAN, HDBSCAN,
+    KMeans, MiniBatchKMeans, RegularSpace,
+    compute_rmsd_matrix,
+)
+
+# --- Distance-matrix methods (take an RMSD matrix) ---
 
 rmsd_mat = compute_rmsd_matrix(traj, atom_selection="backbone")
-clusters = cluster_conformations(rmsd_mat.rmsd_matrix_nm, cutoff_nm=0.15)
-print(f"Found {clusters.n_clusters} clusters")
-print(f"Medoid frames: {clusters.medoid_frames}")
+
+# GROMOS (greedy largest-cluster-first, Numba JIT, O(n) aux memory)
+result = Gromos(cutoff_nm=0.15)(rmsd_mat.rmsd_matrix_nm)
+
+# Hierarchical agglomerative (scipy)
+result = Hierarchical(linkage_method="average", distance_threshold=0.2)(rmsd_mat.rmsd_matrix_nm)
+result = Hierarchical(linkage_method="average", n_clusters=10)(rmsd_mat.rmsd_matrix_nm)
+
+# DBSCAN (Numba JIT default, sklearn backend available)
+result = DBSCAN(eps=0.15, min_samples=5)(rmsd_mat.rmsd_matrix_nm)
+result = DBSCAN(eps=0.15, min_samples=5, backend="sklearn")(rmsd_mat.rmsd_matrix_nm)
+
+# HDBSCAN (sklearn)
+result = HDBSCAN(min_cluster_size=50, min_samples=5)(rmsd_mat.rmsd_matrix_nm)
+
+print(f"Found {result.n_clusters} clusters")
+print(f"Medoid frames: {result.medoid_frames}")
+
+# --- Feature-vector methods (take PCA/TICA projections) ---
+
+from mdpp.analysis.decomposition import featurize_backbone_torsions, compute_pca
+
+torsions = featurize_backbone_torsions(traj, atom_selection="protein")
+pca = compute_pca(torsions.values, n_components=10)
+
+result = KMeans(n_clusters=10)(pca.projections)
+result = MiniBatchKMeans(n_clusters=10, batch_size=1024)(pca.projections)
+result = RegularSpace(dmin=0.5)(pca.projections)
+
+# result.labels, result.n_clusters, result.cluster_centers,
+# result.medoid_frames, result.inertia
 ```
 
+### Clustering backend selection
+
+DBSCAN provides two backends:
+
+| Backend | Method | Aux memory |
+| ---------- | ---------------------------------- | ---------- |
+| `"numba"` | Custom Numba-JIT BFS (default) | O(n) |
+| `"sklearn"` | scikit-learn `DBSCAN(metric="precomputed")` | O(n^2) |
+
+The `"numba"` backend handles 120k+ frames without OOM by reading the
+RMSD matrix in place and using only O(n) auxiliary buffers.
+
 ### RMSD matrix backend selection
 
 `compute_rmsd_matrix` defaults to the **mdtraj** backend for API
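Note on the class-based pattern: the guide text added in this commit describes each method as a frozen dataclass that is configured at construction and then called on the data. The sketch below illustrates that shape with a greedy largest-cluster-first pass in plain NumPy. It is not mdpp's implementation: `GreedyCutoff`, `ClusterResult`, and `center_frames` are hypothetical names, and this version rebuilds an n x n boolean mask on every pass, so it does not have the O(n) auxiliary memory the commit attributes to the Numba code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ClusterResult:
    """Hypothetical result container (not mdpp's result type)."""
    labels: np.ndarray
    n_clusters: int
    center_frames: np.ndarray


@dataclass(frozen=True)
class GreedyCutoff:
    """Hypothetical clusterer: configure once, then call on a distance matrix."""
    cutoff: float

    def __call__(self, dist: np.ndarray) -> ClusterResult:
        n = dist.shape[0]
        labels = np.full(n, -1, dtype=np.int64)
        active = np.ones(n, dtype=bool)  # frames not yet assigned
        centers = []
        while active.any():
            # Neighbours within the cutoff, restricted to unassigned frames.
            within = (dist <= self.cutoff) & active[None, :]
            # Already-assigned frames get -1 so they can never be picked.
            counts = np.where(active, within.sum(axis=1), -1)
            center = int(np.argmax(counts))  # frame with the most neighbours
            members = within[center]
            labels[members] = len(centers)
            centers.append(center)
            active &= ~members  # remove the new cluster from the pool
        return ClusterResult(labels, len(centers), np.array(centers))


# Toy data: six points on a line, pairwise absolute distances.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 9.0])
dist = np.abs(x[:, None] - x[None, :])

result = GreedyCutoff(cutoff=0.15)(dist)
print(result.labels.tolist())         # [0, 0, 0, 1, 1, 2]
print(result.center_frames.tolist())  # [1, 3, 5]
```

Because the dataclass is frozen, a configured instance can be reused across matrices without its parameters changing after construction.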

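Note on the 120k-frame claim: the backend table contrasts O(n) with O(n^2) auxiliary memory. The arithmetic below shows roughly what that difference means at 120,000 frames. The dtypes are assumptions (a float32 RMSD matrix and int64 per-frame buffers); the commit does not state them, and the figures are plain arithmetic rather than measurements.

```python
def matrix_bytes(n_frames: int, itemsize: int = 4) -> int:
    """Bytes for a dense n x n matrix (itemsize=4 assumes float32)."""
    return n_frames * n_frames * itemsize


def vector_bytes(n_frames: int, itemsize: int = 8) -> int:
    """Bytes for one length-n buffer (itemsize=8 assumes int64)."""
    return n_frames * itemsize


n = 120_000
matrix_gb = matrix_bytes(n) / 1e9
vector_mb = vector_bytes(n) / 1e6

print(f"n x n float32 matrix: {matrix_gb:.1f} GB")   # 57.6 GB
print(f"one length-n int64 buffer: {vector_mb:.2f} MB")  # 0.96 MB
```

Under these assumptions, any backend that needs an additional O(n^2) working array would add tens of gigabytes on top of the matrix itself, while a handful of O(n) buffers costs a few megabytes.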