Skip to content

Commit 0fca0a1

Browse files
committed
updated autosegmentation
1 parent e78e1aa commit 0fca0a1

28 files changed

Lines changed: 127310 additions & 395 deletions

doc/source/explanation/molecule-data-model.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -133,7 +133,7 @@ Moleculekit uses a single distance unit throughout: **Ångström (Å)**. This ap
133133
- `mol.coords` — atomic positions.
134134
- `mol.box` — periodic-box lengths.
135135
- All readers and writers (regardless of the source format's native units — GROMACS' `.gro` / `.xtc` use nanometres on disk; moleculekit converts to Å on load and converts back on write).
136-
- All distance parameters in the library (`coldist`, `spatialgap`, `find_clashes` thresholds, `within X of` selections, etc.).
136+
- All distance parameters in the library (`coldist`, `autoSegment`'s `protein_cutoff`, `find_clashes` thresholds, `within X of` selections, etc.).
137137

138138
Angles — `mol.boxangles`, dihedrals returned by {py:meth}`~moleculekit.molecule.Molecule.getDihedral`, and rotation angles passed to {py:meth}`~moleculekit.molecule.Molecule.setDihedral` — are in **radians** for the function APIs, except `mol.boxangles` which is in degrees (matching the PDB convention).
139139

doc/source/explanation/segments-chains-and-bonds.md

Lines changed: 39 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -68,9 +68,12 @@ If you load a PDB and then hand it to a parameterizer without populating
6868

6969
## autoSegment: populating segments automatically
7070

71-
{py:func}`~moleculekit.tools.autosegment.autoSegment` walks the **residue-ID sequence** within a selection and assigns
72-
a new segment ID whenever it detects a gap in residue numbering. Each
73-
contiguous run of residues becomes one segment:
71+
{py:func}`~moleculekit.tools.autosegment.autoSegment` assigns segments by
72+
following the **physical backbone** of each polymer. It walks the residues of a
73+
selection in file order and starts a new segment only where the backbone is
74+
actually broken — judged from atomic coordinates, not from residue numbering —
75+
so it is robust to structures where residues were deleted from the sequence
76+
while the chain stayed continuous.
7477

7578
```python
7679
from moleculekit.tools.autosegment import autoSegment
@@ -80,17 +83,39 @@ import numpy as np
8083
print(np.unique(mol_seg.segid)) # e.g. ['P0', 'P1', 'P2']
8184
```
8285

83-
Each contiguous polypeptide run (and each contiguous run of water residues)
84-
gets the next `P{i}` segid. Water residues additionally receive `chain = "W"`
85-
to keep them visually distinct from polymer chains, but their **segid** stays
86-
on the same `P{i}` numbering sequence as everything else; there is no `W0`
87-
segid.
88-
89-
The function accepts a `basename` argument to control naming: `basename="P"`
90-
produces `P0`, `P1`, `P2`, etc. The `fields` argument controls which field(s)
91-
are written: `("segid",)` (default), `("chain",)`, or `("segid", "chain")`.
92-
93-
{py:func}`~moleculekit.tools.autosegment.autoSegment` detects gaps by checking for breaks in `resid` sequence and is the supported entry point.
86+
A new segment begins between two consecutive residues when **any** of these holds:
87+
88+
- the backbone link distance exceeds the cutoff — for protein the `C(i)–N(i+1)`
89+
peptide bond (default `protein_cutoff=2.0` Å, falling back to `CA–CA` when the
90+
carbonyl/amide atoms are missing), for nucleic acids the `O3'(i)–P(i+1)`
91+
phosphodiester bond (default `nucleic_cutoff=2.2` Å);
92+
- the `chain` or `segid` already recorded in the file changes;
93+
- the polymer type changes (protein vs nucleic).
94+
95+
Because continuity is judged from coordinates, a gap in residue numbering with an
96+
intact backbone (for example residues mutated out of a sequence) stays a
97+
**single** segment, while a genuine spatial break — even one with continuous
98+
numbering — is split into two.
99+
100+
Non-polymer atoms are grouped separately: all **water** collapses into one
101+
segment, all **ions** into another, and the remaining molecules ("other") are
102+
split into one segment per bonded molecule. Pass `single_other_segment=True` to
103+
place all of the "other" molecules into a single segment instead.
104+
105+
The `basename` argument controls naming (`basename="P"` produces `P0`, `P1`,
106+
`P2`, ...). The `fields` argument controls which field(s) are written:
107+
`("segid",)` (default), `("chain",)`, or `("segid", "chain")`.
108+
109+
Run {py:func}`~moleculekit.tools.autosegment.autoSegment` before
110+
{py:func}`~moleculekit.tools.preparation.systemPrepare` so the parameterizer
111+
receives a populated, consistent `segid`.
112+
113+
```{note}
114+
The older `autoSegment2` — which segmented by the covalent bond graph — is
115+
deprecated. It now forwards to
116+
{py:func}`~moleculekit.tools.autosegment.autoSegment` and emits a
117+
`DeprecationWarning`; call `autoSegment` directly instead.
118+
```
94119

95120
## Bonds: the connectivity layer
96121

doc/source/howto/assign-segments-and-chains.md

Lines changed: 20 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
## Goal
44

5-
Derive `segid` and/or `chain` fields for a structure that lacks them, using gap detection to split continuous segments automatically.
5+
Derive `segid` and/or `chain` fields for a structure that lacks them, splitting the system into segments by following each polymer's physical backbone.
66

77
## Minimal example
88

@@ -15,29 +15,44 @@ mol = autoSegment(mol)
1515
print(set(mol.segid))
1616
```
1717

18+
## How segments are decided
19+
20+
A new segment starts between two consecutive residues when any of these holds: the backbone link distance exceeds the cutoff (protein `C(i)–N(i+1)`, nucleic `O3'(i)–P(i+1)`), the `chain` or `segid` already in the file changes, or the polymer type changes. Water collapses into one segment, ions into another, and the remaining ("other") molecules are split one segment per bonded molecule. Because continuity is read from coordinates, a gap in residue *numbering* with an intact backbone stays one segment, while a real spatial break is split.
21+
1822
## Parameters that matter
1923

2024
| Parameter | Type | Default | What it does |
2125
|---|---|---|---|
2226
| `mol` | {py:class}`~moleculekit.molecule.Molecule` | required | Input molecule (a copy is returned; original is unchanged) |
23-
| `sel` | `str` | `"all"` | Restrict gap detection to this atom selection |
27+
| `sel` | `str` | `"all"` | Restrict segmentation to this atom selection; atoms outside keep their existing `chain`/`segid` |
2428
| `basename` | `str` | `"P"` | Prefix for generated segment names, e.g. `"P"``"P0"`, `"P1"`, … |
25-
| `spatial` | `bool` | `True` | Treat a residue-numbering gap as a real gap only if Cα distance > `spatialgap` Å |
26-
| `spatialgap` | `float` | `4.0` | Distance threshold in Å for spatial gap detection |
29+
| `fields` | `tuple` | `("segid",)` | Which field(s) to write: any combination of `"segid"` and `"chain"` |
30+
| `protein_cutoff` | `float` | `2.0` | Max `C(i)–N(i+1)` distance (Å) for two protein residues to be continuous |
31+
| `nucleic_cutoff` | `float` | `2.2` | Max `O3'(i)–P(i+1)` distance (Å) for two nucleic residues to be continuous |
32+
| `ca_fallback_cutoff` | `float` | `5.0` | Max `CA–CA` distance (Å) used when a protein residue lacks `C`/`N` |
33+
| `nucleic_fallback_cutoff` | `float` | `3.2` | Max `C3'–P` distance (Å) used when a nucleic residue lacks `O3'` |
34+
| `single_other_segment` | `bool` | `False` | Put all non-polymer, non-water, non-ion molecules into one segment instead of one per molecule |
2735

2836
## Common variations
2937

3038
```python
3139
# Assign segments to protein chains only
3240
mol = autoSegment(mol, sel="protein")
41+
42+
# Write both chain and segid in one call
43+
mol = autoSegment(mol, fields=("chain", "segid"))
44+
45+
# Lump every ligand/cofactor into a single "other" segment
46+
mol = autoSegment(mol, single_other_segment=True)
3347
```
3448

3549
## Gotchas
3650

3751
- {py:func}`~moleculekit.tools.autosegment.autoSegment` returns a new {py:class}`~moleculekit.molecule.Molecule`; it does not mutate the input.
52+
- Only coordinates and atom names are needed — explicit bonds are not required (they are guessed only for the "other" bucket).
3853
- `segid` can be up to 4 characters (MD force-field convention); `chain` is a single character (PDB convention).
39-
- Auto-assignment is topology-driven and can fail on structures with non-contiguous or missing residue numbers — inspect the result before use.
4054
- When writing to PDB, only the `chain` field is stored in the standard CHAIN column; `segid` goes into the SEGID column, which many programs ignore.
55+
- `autoSegment2` is deprecated and forwards to {py:func}`~moleculekit.tools.autosegment.autoSegment` with a `DeprecationWarning`; use `autoSegment` directly.
4156

4257
## See also
4358

doc/source/tutorials/system-prep/04-mutation-gap-closing-segmentation.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -44,7 +44,7 @@ mol = autoSegment(mol, sel="protein", fields=("chain", "segid"))
4444
sorted(set(zip(mol.chain, mol.segid)))
4545
```
4646

47-
{py:func}`~moleculekit.tools.autosegment.autoSegment` detects that the backbone distance between GLY 140 and MET 154 (the flanking residues of the gap) exceeds the default 4 Å spatial threshold, and so it creates two independent segments: `P0` on chain A (residues 55–140) and `P1` on chain B (residues 154–209). Both the `chain` and `segid` fields are now consistent, which avoids warnings during {py:func}`~moleculekit.tools.preparation.systemPrepare`.
47+
{py:func}`~moleculekit.tools.autosegment.autoSegment` detects that the backbone is broken between GLY 140 and MET 154 (the flanking residues of the gap) — their `C–N` distance far exceeds the peptide-bond cutoff (`protein_cutoff`, 2 Å by default) — and so it creates two independent segments: `P0` on chain A (residues 55–140) and `P1` on chain B (residues 154–209). Both the `chain` and `segid` fields are now consistent, which avoids warnings during {py:func}`~moleculekit.tools.preparation.systemPrepare`.
4848

4949
## Step 2 — Mutate a residue with the "best" rotamer
5050

@@ -113,7 +113,7 @@ The full pipeline — segment, mutate, prepare — is now complete.
113113

114114
## Recap
115115

116-
- {py:func}`~moleculekit.tools.autosegment.autoSegment` detects backbone discontinuities and assigns a unique segid (and optionally chain letter) per topologically connected fragment; use `fields=("chain", "segid")` to keep both fields consistent.
116+
- {py:func}`~moleculekit.tools.autosegment.autoSegment` detects backbone discontinuities from atomic coordinates and assigns a unique segid (and optionally chain letter) per backbone-continuous segment; use `fields=("chain", "segid")` to keep both fields consistent.
117117
- {py:meth}`~moleculekit.molecule.Molecule.mutateResidue` with `sel` and `newres` swaps a residue's sidechain using Dunbrack rotamer selection: `rotamer_mode="best"` minimises VdW clashes against neighbours, `rotamer_mode="random"` samples by probability for speed. Add `minimize=True` to relax residual strain with OpenMM.
118118
- {py:func}`~moleculekit.tools.modelling.model_gaps` fills missing residues by sequence using the ProMod3 loop-modelling engine — but it requires the ProMod3 Singularity image; there is no fallback.
119119

moleculekit/share/atomselect/atomselect.json

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -76,7 +76,8 @@
7676
"TIP2",
7777
"TIP3",
7878
"TIP4",
79-
"SPC"
79+
"SPC",
80+
"DOD"
8081
],
8182
"nucleic_backbone_names": [
8283
"P",

moleculekit/tools/atomtyper.py

Lines changed: 10 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -121,7 +121,7 @@ def prepareProteinForAtomtyping(
121121
mol : Molecule object
122122
The prepared Molecule
123123
"""
124-
from moleculekit.tools.autosegment import autoSegment2
124+
from moleculekit.tools.autosegment import autoSegment
125125
from moleculekit.util import sequenceID
126126

127127
mol = mol.copy()
@@ -157,7 +157,7 @@ def prepareProteinForAtomtyping(
157157
from moleculekit.tools.preparation import systemPrepare
158158

159159
if np.all(protmol.segid == "") and np.all(protmol.chain == ""):
160-
protmol = autoSegment2(
160+
protmol = autoSegment(
161161
protmol, fields=("segid", "chain"), basename="K", _logger=verbose
162162
) # We need segments to prepare the protein
163163
protmol, _ = systemPrepare(
@@ -172,7 +172,7 @@ def prepareProteinForAtomtyping(
172172
# TODO: Should we remove bonds between metals and protein?
173173

174174
if segment:
175-
protmol = autoSegment2(
175+
protmol = autoSegment(
176176
protmol, fields=("segid", "chain"), _logger=verbose
177177
) # Reassign segments after preparation
178178

@@ -241,20 +241,22 @@ def atomtypingValidityChecks(mol):
241241

242242
if np.all(mol.segid == "") or np.all(mol.chain == ""):
243243
raise RuntimeError(
244-
"Please assign segments to the segid and chain fields of the molecule using autoSegment2"
244+
"Please assign segments to the segid and chain fields of the molecule using autoSegment"
245245
)
246246

247-
from moleculekit.tools.autosegment import autoSegment2
247+
from moleculekit.tools.autosegment import autoSegment
248248

249249
mm = mol.copy()
250-
mm.segid[:] = "" # Set segid and chain to '' to avoid name clashes in autoSegment2
250+
mm.segid[:] = "" # Set segid and chain to '' to avoid name clashes in autoSegment
251251
mm.chain[:] = ""
252-
refmol = autoSegment2(mm, fields=("chain", "segid"), _logger=False)
252+
refmol = autoSegment(
253+
mm, sel="protein or resname ACE NME", fields=("chain", "segid"), _logger=False
254+
)
253255
numsegsref = len(np.unique(refmol.segid))
254256
numsegs = len(np.unique(mol.segid))
255257
if numsegs != numsegsref:
256258
raise RuntimeError(
257-
"The molecule contains {} segments while we predict {}. Make sure you used autoSegment2 on the protein".format(
259+
"The molecule contains {} segments while we predict {}. Make sure you used autoSegment on the protein".format(
258260
numsegs, numsegsref
259261
)
260262
)

0 commit comments

Comments
 (0)