Structural superposition
A weekly process generates superposed UniProt segments for the whole PDB archive. These superpositions are made available on the aggregated views of proteins, where clustered, superposed PDB chains can be displayed using the interactive 3D molecular viewer, Mol*.
This page describes the process of superposing and clustering the PDB chains and provides examples of data visualisation.
A segment is a UniProt sequence region that maps to one or more overlapping PDB structural chains. Each segment has ‘start’ and ‘end’ values based on its UniProt sequence numbering. A segment can be discontinuous, as a PDB chain can map to discontinuous regions of a UniProt sequence. To illustrate visually, the image below highlights three different segments of a protein.
Segments are derived by computing a pairwise dissimilarity matrix of overlaps between PDB chains. The score for each pair of PDB chains is 1 minus the Jaccard score, where the Jaccard score is the number of residues common to both chains divided by the number of residues in the union of the two chains. PDB chains are then clustered into a hierarchical tree and collated into segments based on a cutoff score.
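The dissimilarity described above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the function names, the chain labels and the residue ranges are invented for the example, and treating two empty chains as identical is an assumption. The subsequent hierarchical clustering and segment cutoff are not shown, because their parameters are not given here.

```python
from itertools import combinations


def jaccard_dissimilarity(residues_a, residues_b):
    """Return 1 - Jaccard score for two collections of UniProt residue numbers."""
    a, b = set(residues_a), set(residues_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty chains are treated as identical
    return 1.0 - len(a & b) / len(union)


def dissimilarity_matrix(chains):
    """Build a symmetric matrix (dict of dicts) of pairwise dissimilarities.

    `chains` maps a hypothetical chain label to the UniProt residues it covers.
    """
    labels = sorted(chains)
    matrix = {a: {b: 0.0 for b in labels} for a in labels}
    for a, b in combinations(labels, 2):
        d = jaccard_dissimilarity(chains[a], chains[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


# Toy data: labels and residue ranges are invented for illustration only.
chains = {
    "1abc_A": range(1, 101),    # UniProt residues 1-100
    "2xyz_A": range(51, 151),   # UniProt residues 51-150
    "3pqr_B": range(300, 401),  # UniProt residues 300-400
}
matrix = dissimilarity_matrix(chains)
```

In this toy case the first two chains share 50 of 150 residues in their union, giving a dissimilarity of 2/3, while the third chain overlaps neither and scores 1.0 against both, so it would fall into a separate segment under any cutoff below 1.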
A cluster is a collection of individual peptide chains within each segment that, upon structural alignment, fits within an acceptable range of a similarity score. In other words, clusters are groups of protein chains in a similar conformational state, based on model coordinates.
Similarity scores are computed by calculating the CA distance between every pair of residues within each chain in the segment, creating one distance matrix per chain. For example, if a segment has 5 chains, each with exactly 150 residues, 5 CA distance matrices (CDMs) of dimension 150*150 are produced. Chain-chain comparisons are made by computing the absolute difference between corresponding elements of the two chains' CDMs, creating a distance-difference matrix (DDM). Finally, we compute a score by summing the upper triangle of each unique DDM. To account for possible gaps in the DDMs, caused by a lack of residue coordinates, these scores are multiplied by a scalar between 0 and 1, where 1 represents the absence of any gaps. In summary, low scores represent chains with high structural similarity.
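As a rough sketch of the CDM and DDM steps, the following Python computes the summed upper-triangle difference for two chains. The function names and coordinates are hypothetical, and unobserved residues are represented as `None`. The exact definition of the gap scalar is not given here, so the sketch only reports the fraction of comparable elements alongside the raw sum rather than guessing how the two are combined.

```python
import math


def ca_distance_matrix(coords):
    """Return the CA-CA distance matrix (CDM) for one chain.

    `coords` holds one (x, y, z) tuple per residue position in the segment,
    or None where the residue has no coordinates.
    """
    n = len(coords)
    cdm = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if coords[i] is not None and coords[j] is not None:
                cdm[i][j] = math.dist(coords[i], coords[j])
    return cdm


def ddm_score(cdm_a, cdm_b):
    """Sum |CDM_a - CDM_b| over the upper triangle, skipping gaps.

    Returns (raw_sum, coverage), where coverage is the fraction of
    upper-triangle elements present in both CDMs (1.0 means no gaps).
    """
    n = len(cdm_a)
    total, compared, possible = 0.0, 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            possible += 1
            if cdm_a[i][j] is None or cdm_b[i][j] is None:
                continue  # gap: at least one residue is unobserved
            total += abs(cdm_a[i][j] - cdm_b[i][j])
            compared += 1
    coverage = compared / possible if possible else 0.0
    return total, coverage


# Toy chains with three residues each; coordinates are invented.
chain_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]   # straight
chain_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]   # bent
chain_c = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), None]              # last residue unobserved

score_ab, coverage_ab = ddm_score(ca_distance_matrix(chain_a), ca_distance_matrix(chain_b))
score_ac, coverage_ac = ddm_score(ca_distance_matrix(chain_a), ca_distance_matrix(chain_c))
```

Because the score is built from intra-chain distances only, it does not depend on how the two chains are oriented relative to each other, which is consistent with superposition not being required for clustering.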
All peptide chains within a segment are structurally aligned in 3D space using GESAMT (Krissinel & Henrick, 2004). Since 19th September 2022, GESAMT has been used only for superposing chains and is no longer required for clustering. The average-linkage clustering algorithm (UPGMA) is used to cluster the scores for each pair of chains in the segment. A fixed cutoff of 70% of the parent node's score is applied to separate the chains into clusters (e.g. the cutoff for a segment with its parent node at 2100 Angstroms is 1470 Angstroms).
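The clustering step can be illustrated with a small pure-Python UPGMA. This is a sketch under stated assumptions rather than the production code: the function name and input format are hypothetical, the "parent node" is assumed to be the root of the tree (the last merge), and chains joined at a height equal to the cutoff are assumed to stay together.

```python
from itertools import combinations


def upgma_clusters(labels, dist, cutoff_fraction=0.7):
    """Average-linkage (UPGMA) clustering, cut at a fraction of the root height.

    `dist` maps frozenset({a, b}) to the pairwise score of chains a and b.
    Returns (clusters, cutoff), with clusters as a sorted list of label lists.
    """
    active = [(label,) for label in labels]  # each cluster is a tuple of members
    merges = []                              # (height, left, right) in merge order

    def average_distance(c1, c2):
        # UPGMA: mean of all pairwise scores between members of the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(active) > 1:
        c1, c2 = min(combinations(active, 2), key=lambda p: average_distance(*p))
        merges.append((average_distance(c1, c2), c1, c2))
        active.remove(c1)
        active.remove(c2)
        active.append(c1 + c2)

    if not merges:
        return [list(labels)], 0.0

    # Assumption: the cutoff is taken relative to the root (last) merge height.
    cutoff = cutoff_fraction * merges[-1][0]

    # Replay only the merges at or below the cutoff to obtain flat clusters.
    groups = {label: {label} for label in labels}
    for height, c1, c2 in merges:
        if height <= cutoff:
            merged = groups[c1[0]] | groups[c2[0]]
            for label in merged:
                groups[label] = merged
    unique = {frozenset(g) for g in groups.values()}
    return sorted(sorted(g) for g in unique), cutoff


# Toy scores: two tight pairs that are far apart from each other.
labels = ["A", "B", "C", "D"]
dist = {frozenset(p): 2100.0 for p in combinations(labels, 2)}
dist[frozenset(("A", "B"))] = 100.0
dist[frozenset(("C", "D"))] = 200.0
clusters, cutoff = upgma_clusters(labels, dist)
```

With the root at 2100, the cutoff is 1470, matching the worked example in the text, and the four toy chains separate into the clusters A, B and C, D.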
A good example to illustrate structural clusters is P21980, which has a single segment mapped to 19 different PDB chains.
Above: Some of the structures mapped to P21980 contain unobserved regions (the grey gaps between the blue regions). Due to the assignment of overlapping regions into a segment, all of the chains for this protein fall into one segment.
The PDB chains were structurally aligned using GESAMT and the pairwise CA distance-based similarity was used to cluster the chains, illustrated by this dendrogram:
At the time of writing, this example shows that the segment can be clustered into two conformations, one containing four chains (green in Image 3), and another containing 15 (red in Image 3). This indicates that the two clusters are sufficiently divergent, based on the CA distance-based similarity scores between the chains they contain.
Each week, the clustering results are updated if newly deposited structures (or revised models) are published to the PDB. This can sometimes change the clustering results if the new structure(s) are considerably different from the existing models, potentially revealing new states in the protein's conformational landscape.
The 3D viewer (Mol*) on the PDBe-KB protein page shows the superposition per cluster. Each cluster has one representative chain, chosen based on the model quality, resolution, observed residues ratio and UniProt sequence coverage.
Above: Illustration of how cluster annotations from the dendrogram map to the labels in the Mol* viewer. Orange and green boxes have been added to highlight the equivalence between annotations.
All chains within a given segment can be viewed in Mol*. The segment definition can be changed by clicking on the residue range in the top left of the window (see "Select Segment"). By default, only one representative chain is displayed per cluster but this can be toggled using the drop-down menus on the left-hand side of the window. The appearance of the protein can be modified by clicking the "Components" tab on the right-hand side of the window and adjusting Mol*'s built-in appearance options.
Above: Clusters as seen in Mol*. On first page load, the 3D viewer shows the superposition of the representatives of each cluster.
Since October 2022, the AlphaFold structure for the viewed UniProt sequence can be structurally aligned to the experimentally determined chains. On the viewer's right-hand side, click the "Load AlphaFold structure" button. The AlphaFold structure will be automatically aligned to the cluster representatives, with the RMSD displayed under the "AlphaFold Superposition" tab. The structure prediction confidence score (pLDDT) is displayed on the chain as marine blue (high confidence) to red (low confidence), with sliders on the right-hand side to adjust the pLDDT cutoff and opacity. The predicted alignment error (PAE) matrix is also displayed.
Above: AlphaFold structure superposed to the representatives from each cluster. Representative chains have been hidden for clarity.
The source code for AlphaFold structural alignment tools is located in the PDBe Mol* repository.
- Improved management of UniProt accessions with numerous corresponding PDB entries, which were previously causing memory or runtime issues. Matrix objects are now serialised (saved to disk) and only references to them are stored in memory. This reduced memory usage for clustering to below 1 GB.
- Implemented parallel computation for resource-intensive linear algebra operations, utilising ten threads in the production environment.
- Introduced a selective recomputation process for only new and updated PDB entries, significantly reducing runtime.
- Refactored the codebase to separate dendrogram rendering from the clustering process to tackle memory leak issues.
- Devised strategies to handle the limitations of GESAMT in superposing certain protein chains. If GESAMT fails, superposition is retried with SSM; the process continues to cluster chains into conformational states regardless of superposition success.
- Corrected mislabelled segment definitions and integrated shorter segments into larger ones when the latter entirely encompass the former.
- Migrated the clustering repository from GitLab to GitHub, making the clustering code open source under the Apache-2.0 license.
- Introduced an instructional notebook in the repository to aid users, with plans to make the clustering package available on PyPI.
- Upgraded the clustering process from Python 3.7 to Python 3.10 and enhanced logs for more insightful error reporting.
- Shifted from the Q-score method to a new C-alpha distance-based score to determine conformational states, improving separation of structurally diverse chains between clusters. This is the GLOCON score in the new clustering process, with further details to be shared in an upcoming publication.
- Modularised the pipeline by separating the clustering process from superposition and optimised matrix storage.
- Began using updated mmCIFs for clustering and superposition, providing per-residue UniProt sequence mappings for valid chain-chain comparisons.
- Enabled the option to include UniProt's AlphaFold structure in clustering for conformational state recognition, although it is turned off by default in production.
- Developed a curated dataset of monomeric protein conformational states for benchmarking clustering improvements. This dataset is available on Kaggle and our FTP server.
- Last update before the major revision to the clustering process.
PDBe-KB 2024

