Acellera
diff --git a/doc/source/explanation/molecule-data-model.md b/doc/source/explanation/molecule-data-model.md
Lines changed: 1 addition & 1 deletion
diff --git a/doc/source/explanation/segments-chains-and-bonds.md b/doc/source/explanation/segments-chains-and-bonds.md
Lines changed: 39 additions & 14 deletions
diff --git a/doc/source/howto/assign-segments-and-chains.md b/doc/source/howto/assign-segments-and-chains.md
Lines changed: 20 additions & 5 deletions
diff --git a/doc/source/tutorials/system-prep/04-mutation-gap-closing-segmentation.md b/doc/source/tutorials/system-prep/04-mutation-gap-closing-segmentation.md
Lines changed: 2 additions & 2 deletions
diff --git a/moleculekit/share/atomselect/atomselect.json b/moleculekit/share/atomselect/atomselect.json
Lines changed: 2 additions & 1 deletion
diff --git a/moleculekit/tools/atomtyper.py b/moleculekit/tools/atomtyper.py
Lines changed: 10 additions & 8 deletions
@@ -133,7 +133,7 @@ Moleculekit uses a single distance unit throughout: **Ångström (Å)**. This ap
 - `mol.coords` — atomic positions.
 - `mol.box` — periodic-box lengths.
 - All readers and writers (regardless of the source format's native units — GROMACS' `.gro` / `.xtc` use nanometres on disk; moleculekit converts to Å on load and converts back on write).
-- All distance parameters in the library (`coldist`, `spatialgap`, `find_clashes` thresholds, `within X of` selections, etc.).
+- All distance parameters in the library (`coldist`, `autoSegment`'s `protein_cutoff`, `find_clashes` thresholds, `within X of` selections, etc.).
 
 Angles follow two conventions. Dihedrals returned by {py:meth}`~moleculekit.molecule.Molecule.getDihedral` and rotation angles passed to {py:meth}`~moleculekit.molecule.Molecule.setDihedral` are in **radians**, whereas `mol.boxangles` is in **degrees** (matching the PDB convention).
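
The unit conventions described in this hunk can be illustrated with a small standalone sketch. It uses only the standard library and does not call moleculekit; the helper names `nm_to_angstrom` and `angstrom_to_nm` are hypothetical and exist only for this illustration.

```python
import math

NM_TO_ANGSTROM = 10.0  # 1 nm = 10 Å


def nm_to_angstrom(value_nm):
    """Convert a length from nanometres (GROMACS on-disk units) to Å."""
    return value_nm * NM_TO_ANGSTROM


def angstrom_to_nm(value_angstrom):
    """Convert a length from Å back to nanometres, as a writer would."""
    return value_angstrom / NM_TO_ANGSTROM


# Hypothetical box lengths as stored in a .gro file (nm).
box_nm = [5.0, 5.0, 7.5]
box_angstrom = [nm_to_angstrom(x) for x in box_nm]

# A dihedral expressed in radians, converted to degrees for comparison
# with degree-based quantities such as box angles.
dihedral_rad = math.pi / 3
dihedral_deg = math.degrees(dihedral_rad)  # about 60 degrees
```

Keeping the conversion at the reader and writer boundary means every distance parameter inside the library can assume Å without checking the source format.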
 
 
@@ -68,9 +68,12 @@ If you load a PDB and then hand it to a parameterizer without populating
 
 ## autoSegment: populating segments automatically
 
-{py:func}`~moleculekit.tools.autosegment.autoSegment` walks the **residue-ID sequence** within a selection and assigns
-a new segment ID whenever it detects a gap in residue numbering. Each
-contiguous run of residues becomes one segment:
+{py:func}`~moleculekit.tools.autosegment.autoSegment` assigns segments by
+following the **physical backbone** of each polymer. It walks the residues of a
+selection in file order and starts a new segment only where the backbone is
+actually broken — judged from atomic coordinates, not from residue numbering —
+so it is robust to structures where residues were deleted from the sequence
+while the chain stayed continuous.
 
 ```python
 from moleculekit.tools.autosegment import autoSegment
@@ -80,17 +83,39 @@ import numpy as np
 print(np.unique(mol_seg.segid))   # e.g. ['P0', 'P1', 'P2']
 ```
 
-Each contiguous polypeptide run (and each contiguous run of water residues)
-gets the next `P{i}` segid. Water residues additionally receive `chain = "W"`
-to keep them visually distinct from polymer chains, but their **segid** stays
-on the same `P{i}` numbering sequence as everything else; there is no `W0`
-segid.
-
-The function accepts a `basename` argument to control naming: `basename="P"`
-produces `P0`, `P1`, `P2`, etc. The `fields` argument controls which field(s)
-are written: `("segid",)` (default), `("chain",)`, or `("segid", "chain")`.
-
-{py:func}`~moleculekit.tools.autosegment.autoSegment` detects gaps by checking for breaks in `resid` sequence and is the supported entry point.
+A new segment begins between two consecutive residues when **any** of these holds:
+
+- the backbone link distance exceeds the cutoff — for protein the `C(i)–N(i+1)`
+  peptide bond (default `protein_cutoff=2.0` Å, falling back to `CA–CA` when the
+  carbonyl/amide atoms are missing), for nucleic acids the `O3'(i)–P(i+1)`
+  phosphodiester bond (default `nucleic_cutoff=2.2` Å);
+- the `chain` or `segid` already recorded in the file changes;
+- the polymer type changes (protein vs nucleic).
+
+Because continuity is judged from coordinates, a gap in residue numbering with an
+intact backbone (for example residues mutated out of a sequence) stays a
+**single** segment, while a genuine spatial break — even one with continuous
+numbering — is split into two.
+
+Non-polymer atoms are grouped separately: all **water** collapses into one
+segment, all **ions** into another, and the remaining molecules ("other") are
+split into one segment per bonded molecule. Pass `single_other_segment=True` to
+place all of the "other" molecules into a single segment instead.
+
+The `basename` argument controls naming (`basename="P"` produces `P0`, `P1`,
+`P2`, ...). The `fields` argument controls which field(s) are written:
+`("segid",)` (default), `("chain",)`, or `("segid", "chain")`.
+
+Run {py:func}`~moleculekit.tools.autosegment.autoSegment` before
+{py:func}`~moleculekit.tools.preparation.systemPrepare` so the parameterizer
+receives a populated, consistent `segid`.
+
+```{note}
+The older `autoSegment2` — which segmented by the covalent bond graph — is
+deprecated. It now forwards to
+{py:func}`~moleculekit.tools.autosegment.autoSegment` and emits a
+`DeprecationWarning`; call `autoSegment` directly instead.
+```
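
As a rough illustration of the splitting rule described in this hunk, the sketch below labels a list of protein residues by checking the `C(i)–N(i+1)` distance, the recorded chain, and the polymer type. It is not moleculekit's implementation: `assign_segments` and the dictionary layout are assumptions made for this example, and the `CA–CA` fallback and the nucleic `O3'–P` check are left out.

```python
import math

PROTEIN_CUTOFF = 2.0  # Å, the default protein_cutoff quoted in the docs


def assign_segments(residues, cutoff=PROTEIN_CUTOFF, basename="P"):
    """Return one segment label per residue, in file order.

    Each residue is a dict with keys "chain", "polymer",
    "N" and "C" (xyz tuples in Å). Residue numbers are ignored on purpose.
    """
    labels = []
    index = 0
    for i, res in enumerate(residues):
        if i > 0:
            prev = residues[i - 1]
            broken = (
                math.dist(prev["C"], res["N"]) > cutoff  # backbone link too long
                or prev["chain"] != res["chain"]         # recorded chain changes
                or prev["polymer"] != res["polymer"]     # polymer type changes
            )
            if broken:
                index += 1
        labels.append(f"{basename}{index}")
    return labels


# Toy coordinates: residue 15 follows 11 with an intact bond (numbering gap
# only), while residue 16 sits about 10 Å away (a genuine spatial break).
residues = [
    {"resid": 10, "chain": "A", "polymer": "protein", "N": (0.0, 0.0, 0.0), "C": (2.4, 0.0, 0.0)},
    {"resid": 11, "chain": "A", "polymer": "protein", "N": (3.7, 0.0, 0.0), "C": (6.1, 0.0, 0.0)},
    {"resid": 15, "chain": "A", "polymer": "protein", "N": (7.4, 0.0, 0.0), "C": (9.8, 0.0, 0.0)},
    {"resid": 16, "chain": "A", "polymer": "protein", "N": (20.0, 0.0, 0.0), "C": (22.4, 0.0, 0.0)},
]
labels = assign_segments(residues)
```

In this toy input the numbering jump from 11 to 15 does not start a new segment, while the spatial jump before residue 16 does, which mirrors the behaviour the new text describes.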
 
 ## Bonds: the connectivity layer
 
 
@@ -2,7 +2,7 @@
 
 ## Goal
 
-Derive `segid` and/or `chain` fields for a structure that lacks them, using gap detection to split continuous segments automatically.
+Derive `segid` and/or `chain` fields for a structure that lacks them, splitting the system into segments by following each polymer's physical backbone.
 
 ## Minimal example
 
@@ -15,29 +15,44 @@ mol = autoSegment(mol)
 print(set(mol.segid))
 ```
 
+## How segments are decided
+
+A new segment starts between two consecutive residues when any of these holds: the backbone link distance exceeds the cutoff (protein `C(i)–N(i+1)`, nucleic `O3'(i)–P(i+1)`), the `chain` or `segid` already in the file changes, or the polymer type changes. Water collapses into one segment, ions into another, and the remaining ("other") molecules are split one segment per bonded molecule. Because continuity is read from coordinates, a gap in residue *numbering* with an intact backbone stays one segment, while a real spatial break is split.
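
The non-polymer grouping in the paragraph above can be sketched as a connected-components pass over the bond list. This is a minimal illustration under stated assumptions, not the library's code: `group_non_polymer`, the `kinds` labels, and the output names are all hypothetical.

```python
def group_non_polymer(kinds, bonds, single_other_segment=False):
    """Return one group label per atom.

    kinds: list of "water", "ion" or "other", one entry per atom.
    bonds: list of (i, j) atom-index pairs.
    """
    parent = list(range(len(kinds)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in bonds:
        parent[find(i)] = find(j)

    labels = []
    molecule_ids = {}
    for i, kind in enumerate(kinds):
        if kind in ("water", "ion"):
            labels.append(kind)        # all water together, all ions together
        elif single_other_segment:
            labels.append("other")     # everything else lumped into one group
        else:
            root = find(i)
            molecule_ids.setdefault(root, len(molecule_ids))
            labels.append(f"other{molecule_ids[root]}")  # one per bonded molecule
    return labels


# One water (3 atoms), one ion, a two-atom ligand and a lone cofactor atom.
kinds = ["water", "water", "water", "ion", "other", "other", "other"]
bonds = [(0, 1), (0, 2), (4, 5)]
per_molecule = group_non_polymer(kinds, bonds)
lumped = group_non_polymer(kinds, bonds, single_other_segment=True)
```

Splitting "other" by bonded molecule keeps each ligand addressable on its own, while the lumped variant corresponds to what `single_other_segment=True` is described as doing.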
+
 ## Parameters that matter
 
 | Parameter | Type | Default | What it does |
 |---|---|---|---|
 | `mol` | {py:class}`~moleculekit.molecule.Molecule` | required | Input molecule (a copy is returned; original is unchanged) |
-| `sel` | `str` | `"all"` | Restrict gap detection to this atom selection |
+| `sel` | `str` | `"all"` | Restrict segmentation to this atom selection; atoms outside keep their existing `chain`/`segid` |
 | `basename` | `str` | `"P"` | Prefix for generated segment names, e.g. `"P"` → `"P0"`, `"P1"`, … |
-| `spatial` | `bool` | `True` | Treat a residue-numbering gap as a real gap only if Cα distance > `spatialgap` Å |
-| `spatialgap` | `float` | `4.0` | Distance threshold in Å for spatial gap detection |
+| `fields` | `tuple` | `("segid",)` | Which field(s) to write: any combination of `"segid"` and `"chain"` |
+| `protein_cutoff` | `float` | `2.0` | Max `C(i)–N(i+1)` distance (Å) for two protein residues to be continuous |
+| `nucleic_cutoff` | `float` | `2.2` | Max `O3'(i)–P(i+1)` distance (Å) for two nucleic residues to be continuous |
+| `ca_fallback_cutoff` | `float` | `5.0` | Max `CA–CA` distance (Å) used when a protein residue lacks `C`/`N` |
+| `nucleic_fallback_cutoff` | `float` | `3.2` | Max `C3'–P` distance (Å) used when a nucleic residue lacks `O3'` |
+| `single_other_segment` | `bool` | `False` | Put all non-polymer, non-water, non-ion molecules into one segment instead of one per molecule |
 
 ## Common variations
 
 ```python
 # Assign segments to protein chains only
 mol = autoSegment(mol, sel="protein")
+
+# Write both chain and segid in one call
+mol = autoSegment(mol, fields=("chain", "segid"))
+
+# Lump every ligand/cofactor into a single "other" segment
+mol = autoSegment(mol, single_other_segment=True)
 ```
 
 ## Gotchas
 
 - {py:func}`~moleculekit.tools.autosegment.autoSegment` returns a new {py:class}`~moleculekit.molecule.Molecule`; it does not mutate the input.
+- Only coordinates and atom names are needed — explicit bonds are not required (they are guessed only for the "other" bucket).
 - `segid` can be up to 4 characters (MD force-field convention); `chain` is a single character (PDB convention).
-- Auto-assignment is topology-driven and can fail on structures with non-contiguous or missing residue numbers — inspect the result before use.
 - When writing to PDB, only the `chain` field is stored in the standard CHAIN column; `segid` goes into the SEGID column, which many programs ignore.
+- `autoSegment2` is deprecated and forwards to {py:func}`~moleculekit.tools.autosegment.autoSegment` with a `DeprecationWarning`; use `autoSegment` directly.
 
 ## See also
 
 
@@ -44,7 +44,7 @@ mol = autoSegment(mol, sel="protein", fields=("chain", "segid"))
 sorted(set(zip(mol.chain, mol.segid)))
 ```
 
-{py:func}`~moleculekit.tools.autosegment.autoSegment` detects that the backbone distance between GLY 140 and MET 154 (the flanking residues of the gap) exceeds the default 4 Å spatial threshold, and so it creates two independent segments: `P0` on chain A (residues 55–140) and `P1` on chain B (residues 154–209). Both the `chain` and `segid` fields are now consistent, which avoids warnings during {py:func}`~moleculekit.tools.preparation.systemPrepare`.
+{py:func}`~moleculekit.tools.autosegment.autoSegment` detects that the backbone is broken between GLY 140 and MET 154 (the flanking residues of the gap) — their `C–N` distance far exceeds the peptide-bond cutoff (`protein_cutoff`, 2 Å by default) — and so it creates two independent segments: `P0` on chain A (residues 55–140) and `P1` on chain B (residues 154–209). Both the `chain` and `segid` fields are now consistent, which avoids warnings during {py:func}`~moleculekit.tools.preparation.systemPrepare`.
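
One way to confirm that `chain` and `segid` ended up consistent is to check that the unique pairs form a one-to-one mapping. The sketch below works on plain lists rather than a `Molecule`; the helper names are hypothetical.

```python
def chain_segid_pairs(chain, segid):
    """Unique (chain, segid) pairs, sorted for stable output."""
    return sorted(set(zip(chain, segid)))


def is_one_to_one(pairs):
    """True when every chain maps to exactly one segid and vice versa."""
    chains = {c for c, _ in pairs}
    segids = {s for _, s in pairs}
    return len(chains) == len(pairs) and len(segids) == len(pairs)


# Per-atom fields as they might look after segmenting two fragments.
chain = ["A", "A", "B", "B"]
segid = ["P0", "P0", "P1", "P1"]
pairs = chain_segid_pairs(chain, segid)
consistent = is_one_to_one(pairs)

# A mismatched case: one chain letter spread over two segids.
mismatched = is_one_to_one(chain_segid_pairs(["A", "A", "A", "A"], segid))
```

With a real structure, the same helpers could be fed `mol.chain` and `mol.segid`, since both are per-atom arrays.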
 
 ## Step 2 — Mutate a residue with the "best" rotamer
 
@@ -113,7 +113,7 @@ The full pipeline — segment, mutate, prepare — is now complete.
 
 ## Recap
 
-- {py:func}`~moleculekit.tools.autosegment.autoSegment` detects backbone discontinuities and assigns a unique segid (and optionally chain letter) per topologically connected fragment; use `fields=("chain", "segid")` to keep both fields consistent.
+- {py:func}`~moleculekit.tools.autosegment.autoSegment` detects backbone discontinuities from atomic coordinates and assigns a unique segid (and optionally chain letter) per backbone-continuous segment; use `fields=("chain", "segid")` to keep both fields consistent.
 - {py:meth}`~moleculekit.molecule.Molecule.mutateResidue` with `sel` and `newres` swaps a residue's sidechain using Dunbrack rotamer selection: `rotamer_mode="best"` minimises VdW clashes against neighbours, `rotamer_mode="random"` samples by probability for speed. Add `minimize=True` to relax residual strain with OpenMM.
 - {py:func}`~moleculekit.tools.modelling.model_gaps` fills missing residues by sequence using the ProMod3 loop-modelling engine — but it requires the ProMod3 Singularity image; there is no fallback.
 
 
@@ -76,7 +76,8 @@
         "TIP2",
         "TIP3",
         "TIP4",
-        "SPC"
+        "SPC",
+        "DOD"
     ],
     "nucleic_backbone_names": [
         "P",
 
@@ -121,7 +121,7 @@ def prepareProteinForAtomtyping(
     mol : Molecule object
         The prepared Molecule
     """
-    from moleculekit.tools.autosegment import autoSegment2
+    from moleculekit.tools.autosegment import autoSegment
     from moleculekit.util import sequenceID
 
     mol = mol.copy()
@@ -157,7 +157,7 @@ def prepareProteinForAtomtyping(
         from moleculekit.tools.preparation import systemPrepare
 
         if np.all(protmol.segid == "") and np.all(protmol.chain == ""):
-            protmol = autoSegment2(
+            protmol = autoSegment(
                 protmol, fields=("segid", "chain"), basename="K", _logger=verbose
             )  # We need segments to prepare the protein
         protmol, _ = systemPrepare(
@@ -172,7 +172,7 @@ def prepareProteinForAtomtyping(
         # TODO: Should we remove bonds between metals and protein?
 
     if segment:
-        protmol = autoSegment2(
+        protmol = autoSegment(
             protmol, fields=("segid", "chain"), _logger=verbose
         )  # Reassign segments after preparation
 
@@ -241,20 +241,22 @@ def atomtypingValidityChecks(mol):
 
     if np.all(mol.segid == "") or np.all(mol.chain == ""):
         raise RuntimeError(
-            "Please assign segments to the segid and chain fields of the molecule using autoSegment2"
+            "Please assign segments to the segid and chain fields of the molecule using autoSegment"
         )
 
-    from moleculekit.tools.autosegment import autoSegment2
+    from moleculekit.tools.autosegment import autoSegment
 
     mm = mol.copy()
-    mm.segid[:] = ""  # Set segid and chain to '' to avoid name clashes in autoSegment2
+    mm.segid[:] = ""  # Set segid and chain to '' to avoid name clashes in autoSegment
     mm.chain[:] = ""
-    refmol = autoSegment2(mm, fields=("chain", "segid"), _logger=False)
+    refmol = autoSegment(
+        mm, sel="protein or resname ACE NME", fields=("chain", "segid"), _logger=False
+    )
     numsegsref = len(np.unique(refmol.segid))
     numsegs = len(np.unique(mol.segid))
     if numsegs != numsegsref:
         raise RuntimeError(
-            "The molecule contains {} segments while we predict {}. Make sure you used autoSegment2 on the protein".format(
+            "The molecule contains {} segments while we predict {}. Make sure you used autoSegment on the protein".format(
                 numsegs, numsegsref
             )
         )
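
The rename in these hunks relies on the deprecated name forwarding to the new one. A minimal sketch of that forwarding pattern follows, using stand-in functions (`auto_segment`, `auto_segment2`) rather than moleculekit's actual implementation, whose details are not shown in this diff.

```python
import warnings


def auto_segment(mol, **kwargs):
    """Stand-in for the new entry point; returns its input unchanged here."""
    return mol


def auto_segment2(mol, **kwargs):
    """Deprecated alias: warn, then forward all arguments to auto_segment."""
    warnings.warn(
        "autoSegment2 is deprecated; call autoSegment instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller's line
    )
    return auto_segment(mol, **kwargs)


# Record the warning so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = auto_segment2("molecule-placeholder")
```

Callers that still use the old name keep working but see a `DeprecationWarning`, which is why the call sites in `atomtyper.py` are switched to the new name directly.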