docs: add clustering example notebook with all 7 methods

FridrichMethod · FridrichMethod · commit e987f5470c1d · 2026-04-12T13:11:35.000-07:00
New examples/gromacs/clustering.ipynb demonstrates:
- Distance-matrix methods: Gromos, Hierarchical, DBSCAN (numba + sklearn),
  and HDBSCAN, with a comparison table and population plots
- Feature-vector methods: KMeans, MiniBatchKMeans, RegularSpace on
  PCA-projected backbone torsions with PC1/PC2 scatter plots
- Medoid structure extraction
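
For readers unfamiliar with the greedy largest-cluster-first scheme that the notebook attributes to `Gromos`, here is a minimal NumPy sketch of the idea. `gromos_sketch` is a hypothetical helper written for illustration only; it is not mdpp's Numba kernel. It assumes a square, symmetric distance matrix, and it keeps a full boolean neighbor matrix rather than O(n) auxiliary memory. Tie-breaking and the choice of representative frame in mdpp may differ.

```python
import numpy as np


def gromos_sketch(dist: np.ndarray, cutoff: float) -> tuple[np.ndarray, list[int]]:
    """Greedy largest-cluster-first clustering on a square distance matrix.

    Illustrative only: returns per-frame labels and the center frame of each cluster.
    """
    n = dist.shape[0]
    labels = np.full(n, -1, dtype=np.int64)
    centers: list[int] = []
    within = dist <= cutoff          # boolean neighbor matrix (O(n^2) memory)
    active = np.ones(n, dtype=bool)  # frames not yet assigned to a cluster

    while active.any():
        # Count, for every frame, its neighbors among the still-active frames.
        counts = (within & active).sum(axis=1)
        counts[~active] = -1  # assigned frames can no longer become centers
        center = int(np.argmax(counts))

        # The center and all of its active neighbors form the next cluster.
        members = within[center] & active
        labels[members] = len(centers)
        centers.append(center)
        active &= ~members

    return labels, centers


# Toy 1-D example: two well-separated groups of points.
positions = np.array([0.0, 0.1, 0.2, 1.0, 1.1])
dist = np.abs(positions[:, None] - positions[None, :])
labels, centers = gromos_sketch(dist, cutoff=0.15)
print(labels, centers)
```

Because each pass removes the most populated neighborhood first, cluster 0 is the largest cluster in this sketch, which is why the notebook can list the first few cluster indices as the dominant conformations.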
diff --git a/examples/gromacs/clustering.ipynb b/examples/gromacs/clustering.ipynb
@@ -0,0 +1,348 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "6c0add61",
+   "metadata": {},
+   "source": "# `mdpp` Example: Conformational Clustering\n\nThis notebook demonstrates all clustering methods available in `mdpp`:\n\n**Distance-matrix methods** (operate on a pairwise RMSD matrix):\n\n- `Gromos` -- greedy largest-cluster-first (Numba JIT, O(n) aux memory)\n- `Hierarchical` -- agglomerative clustering (scipy)\n- `DBSCAN` -- density-based with noise detection (Numba JIT or sklearn)\n- `HDBSCAN` -- hierarchical density-based (sklearn)\n\n**Feature-vector methods** (operate on PCA/TICA projections):\n\n- `KMeans` -- standard k-means (sklearn)\n- `MiniBatchKMeans` -- scalable mini-batch variant (sklearn)\n- `RegularSpace` -- regular-space discretization (deeptime)\n\nEach method is a frozen dataclass configured at construction and called on data:\n\n```python\nresult = Gromos(cutoff_nm=0.15)(rmsd_matrix)\nresult = KMeans(n_clusters=10)(pca.projections)\n```"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3c411c15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from __future__ import annotations\n",
+    "\n",
+    "from pathlib import Path\n",
+    "\n",
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "from mplplots.utils import auto_ticks\n",
+    "\n",
+    "from mdpp.analysis.clustering import (\n",
+    "    DBSCAN,\n",
+    "    HDBSCAN,\n",
+    "    Gromos,\n",
+    "    Hierarchical,\n",
+    "    KMeans,\n",
+    "    MiniBatchKMeans,\n",
+    "    RegularSpace,\n",
+    "    compute_rmsd_matrix,\n",
+    ")\n",
+    "from mdpp.analysis.decomposition import compute_pca, featurize_backbone_torsions\n",
+    "from mdpp.core.trajectory import align_trajectory, load_trajectory\n",
+    "\n",
+    "plt.style.use(\"mplplots.styles.GraphPadPrism\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "39804045",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "TOPOLOGY_PATH = Path(\"/path/to/topology.pdb\")\n",
+    "TRAJECTORY_PATH = Path(\"/path/to/trajectory.xtc\")\n",
+    "STRIDE = 5\n",
+    "CUTOFF_NM = 0.15\n",
+    "\n",
+    "if not TOPOLOGY_PATH.exists() or not TRAJECTORY_PATH.exists():\n",
+    "    raise FileNotFoundError(\n",
+    "        \"Update TOPOLOGY_PATH and TRAJECTORY_PATH before running analysis cells.\"\n",
+    "    )\n",
+    "\n",
+    "traj = load_trajectory(\n",
+    "    trajectory_path=TRAJECTORY_PATH,\n",
+    "    topology_path=TOPOLOGY_PATH,\n",
+    "    stride=STRIDE,\n",
+    "    atom_selection=\"protein\",\n",
+    ")\n",
+    "traj = align_trajectory(traj, atom_selection=\"name CA\")\n",
+    "\n",
+    "print(f\"Frames: {traj.n_frames}, Atoms: {traj.n_atoms}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce40bec9",
+   "metadata": {},
+   "source": "## Compute RMSD Matrix\n\nThe pairwise RMSD matrix is shared by all distance-matrix clustering methods.\nUse `backend=\"numba\"` or `backend=\"torch\"` for large trajectories."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "36584688",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "rmsd_mat = compute_rmsd_matrix(traj, atom_selection=\"backbone\", backend=\"numba\")\n",
+    "\n",
+    "print(f\"RMSD matrix: {rmsd_mat.rmsd_matrix_nm.shape}, dtype={rmsd_mat.rmsd_matrix_nm.dtype}\")\n",
+    "print(f\"Max RMSD: {rmsd_mat.rmsd_matrix_nm.max():.3f} nm\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "dd1d4045",
+   "metadata": {},
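+   "source": "## Distance-Matrix Methods\n\n### GROMOS\n\nGreedy largest-cluster-first assignment: the frame with the most neighbors within `cutoff_nm` defines a cluster, that frame and its neighbors are removed from the pool, and the process repeats until every frame is assigned. Custom Numba kernel with O(n) auxiliary memory -- handles 120k+ frames."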
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c5ffad58",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "gromos = Gromos(cutoff_nm=CUTOFF_NM)(rmsd_mat.rmsd_matrix_nm)\n",
+    "\n",
+    "print(f\"GROMOS: {gromos.n_clusters} clusters\")\n",
+    "for i in range(min(5, gromos.n_clusters)):\n",
+    "    count = int(np.sum(gromos.labels == i))\n",
+    "    print(f\"  Cluster {i}: {count} frames, medoid={gromos.medoid_frames[i]}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ceb1c9b1",
+   "metadata": {},
+   "source": "### Hierarchical\n\nAgglomerative clustering via scipy. Supports `distance_threshold` (default) or fixed `n_clusters`."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "92b829ce",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Distance-threshold mode (like GROMOS cutoff)\n",
+    "hier_dist = Hierarchical(\n",
+    "    linkage_method=\"average\",\n",
+    "    distance_threshold=CUTOFF_NM,\n",
+    ")(rmsd_mat.rmsd_matrix_nm)\n",
+    "\n",
+    "# Fixed cluster count mode\n",
+    "hier_k = Hierarchical(\n",
+    "    linkage_method=\"average\",\n",
+    "    n_clusters=5,\n",
+    ")(rmsd_mat.rmsd_matrix_nm)\n",
+    "\n",
+    "print(f\"Hierarchical (distance_threshold={CUTOFF_NM}): {hier_dist.n_clusters} clusters\")\n",
+    "print(f\"Hierarchical (n_clusters=5): {hier_k.n_clusters} clusters\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e7f99fa",
+   "metadata": {},
+   "source": "### DBSCAN\n\nDensity-based clustering with noise detection. Frames that don't belong to any dense region get label -1.\n\nThe default `backend=\"numba\"` uses a custom Numba kernel with O(n) auxiliary memory. Pass `backend=\"sklearn\"` for the official scikit-learn implementation."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7b330e5f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dbscan = DBSCAN(eps=CUTOFF_NM, min_samples=5)(rmsd_mat.rmsd_matrix_nm)\n",
+    "\n",
+    "noise = int(np.sum(dbscan.labels == -1))\n",
+    "print(f\"DBSCAN: {dbscan.n_clusters} clusters, {noise} noise frames\")\n",
+    "\n",
+    "# sklearn backend for comparison\n",
+    "dbscan_sk = DBSCAN(eps=CUTOFF_NM, min_samples=5, backend=\"sklearn\")(rmsd_mat.rmsd_matrix_nm)\n",
+    "print(f\"DBSCAN (sklearn): {dbscan_sk.n_clusters} clusters\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "390f9624",
+   "metadata": {},
+   "source": "### HDBSCAN\n\nHierarchical density-based clustering via sklearn. Handles clusters of varying density without an epsilon parameter."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b4d0516",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "hdbscan = HDBSCAN(min_cluster_size=10, min_samples=5)(rmsd_mat.rmsd_matrix_nm)\n",
+    "\n",
+    "noise = int(np.sum(hdbscan.labels == -1))\n",
+    "print(f\"HDBSCAN: {hdbscan.n_clusters} clusters, {noise} noise frames\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "0d3594cd",
+   "metadata": {},
+   "source": "### Compare Distance-Matrix Methods"
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70da02ff",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "results = {\n",
+    "    \"GROMOS\": gromos,\n",
+    "    \"Hierarchical\": hier_dist,\n",
+    "    \"DBSCAN\": dbscan,\n",
+    "    \"HDBSCAN\": hdbscan,\n",
+    "}\n",
+    "\n",
+    "n_total = traj.n_frames\n",
+    "print(f\"{'Method':<15s} {'Clusters':>10s} {'Noise':>8s} {'Largest':>10s}\")\n",
+    "print(\"-\" * 45)\n",
+    "for name, r in results.items():\n",
+    "    noise = int(np.sum(r.labels == -1))\n",
+    "    valid = r.labels[r.labels >= 0]\n",
+    "    largest = int(np.bincount(valid).max()) if len(valid) > 0 else 0\n",
+    "    pct = largest / n_total * 100\n",
+    "    print(f\"{name:<15s} {r.n_clusters:>10d} {noise:>8d} {largest:>6d} ({pct:.1f}%)\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "bde64aa1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, axes = plt.subplots(2, 2, figsize=(12, 8), dpi=120, sharey=True)\n",
+    "\n",
+    "for ax, (name, r) in zip(axes.ravel(), results.items()):\n",
+    "    valid = r.labels[r.labels >= 0]\n",
+    "    if len(valid) > 0:\n",
+    "        counts = np.sort(np.bincount(valid))[::-1]  # largest cluster first\n",
+    "        top_k = min(20, len(counts))\n",
+    "        ax.bar(range(top_k), counts[:top_k])\n",
+    "    ax.set_xlabel(\"Cluster rank (by population)\")\n",
+    "    ax.set_title(f\"{name} ({r.n_clusters} clusters)\")\n",
+    "    auto_ticks(ax)\n",
+    "\n",
+    "axes[0, 0].set_ylabel(\"Frames\")\n",
+    "axes[1, 0].set_ylabel(\"Frames\")\n",
+    "fig.suptitle(f\"Cluster Populations (cutoff = {CUTOFF_NM} nm)\", y=1.02)\n",
+    "fig.tight_layout()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "594c705e",
+   "metadata": {},
+   "source": "## Feature-Vector Methods\n\nBackbone torsions are featurized (sin/cos-embedded phi/psi) and then projected with PCA.\nFeature-based methods scale linearly with the number of frames N and don't require the RMSD matrix."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0928a15",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "torsions = featurize_backbone_torsions(traj, atom_selection=\"protein\")\n",
+    "pca = compute_pca(torsions.values, n_components=10)\n",
+    "\n",
+    "print(f\"Torsion features: {torsions.values.shape[1]}\")\n",
+    "print(\n",
+    "    f\"PCA: {pca.projections.shape[1]} components, \"\n",
+    "    f\"explained variance = {pca.explained_variance_ratio.sum():.1%}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "95842a0d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "N_CLUSTERS = 5\n",
+    "\n",
+    "km = KMeans(n_clusters=N_CLUSTERS)(pca.projections)\n",
+    "mb = MiniBatchKMeans(n_clusters=N_CLUSTERS, batch_size=256)(pca.projections)\n",
+    "rs = RegularSpace(dmin=1.0)(pca.projections)\n",
+    "\n",
+    "print(f\"KMeans:          {km.n_clusters} clusters, inertia={km.inertia:.1f}\")\n",
+    "print(f\"MiniBatchKMeans: {mb.n_clusters} clusters, inertia={mb.inertia:.1f}\")\n",
+    "print(f\"RegularSpace:    {rs.n_clusters} clusters (dmin=1.0)\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c0bc4780",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fig, axes = plt.subplots(1, 3, figsize=(16, 4.5), dpi=120)\n",
+    "\n",
+    "for ax, (name, r) in zip(axes, [(\"KMeans\", km), (\"MiniBatch\", mb), (\"RegularSpace\", rs)]):\n",
+    "    sc = ax.scatter(\n",
+    "        pca.projections[:, 0],\n",
+    "        pca.projections[:, 1],\n",
+    "        c=r.labels,\n",
+    "        cmap=\"tab10\",\n",
+    "        s=2,\n",
+    "        alpha=0.4,\n",
+    "        rasterized=True,\n",
+    "    )\n",
+    "    ax.scatter(\n",
+    "        r.cluster_centers[:, 0],\n",
+    "        r.cluster_centers[:, 1],\n",
+    "        c=\"black\",\n",
+    "        marker=\"x\",\n",
+    "        s=100,\n",
+    "        linewidths=2,\n",
+    "        zorder=5,\n",
+    "    )\n",
+    "    ax.set_xlabel(\"PC1\")\n",
+    "    ax.set_ylabel(\"PC2\")\n",
+    "    ax.set_title(f\"{name} ({r.n_clusters} clusters)\")\n",
+    "\n",
+    "fig.tight_layout()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "639a69e0",
+   "metadata": {},
+   "source": "## Save Medoid Structures\n\nWrite the medoid frame of up to the first 10 GROMOS clusters to PDB files in `cluster_medoids/`."
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "6b18c017",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "output_dir = Path(\"cluster_medoids\")\n",
+    "output_dir.mkdir(exist_ok=True)\n",
+    "\n",
+    "for i, frame_idx in enumerate(gromos.medoid_frames[:10]):\n",
+    "    out = output_dir / f\"cluster{i}_medoid.pdb\"\n",
+    "    traj[int(frame_idx)].save(str(out))\n",
+    "    count = int(np.sum(gromos.labels == i))\n",
+    "    print(f\"Cluster {i}: {count} frames, medoid frame {frame_idx} -> {out}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.12.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}