paper/paper.md
13 additions & 13 deletions
@@ -20,20 +20,20 @@ bibliography: paper.bibtex
# Summary
-
The analysis of biomolecular structures is a crucial task for a wide range of applications ranging from drug design to protein engineering. The Protein Data Bank (PDB) file format [@pdb] is the most popular format to describe biomolecular structures such as proteins and nucleic acids. In this text-based format, each line represents a given atom and entails its main properties such as atom name and identifier, residue name and identifier, chain identifier, coordinates, etc. Several solutions have been developed to parse PDB files in dedicated objects that facilitate the analysis and manipulation of biomolecular structures. This is, for example, the case of the ``BioPython`` parser [@biopython,@biopdb] that loads PDB files in a nested dictionary whose structure mimics the hierarchical nature of the biomolecular structure. Selecting a given sub-part of the biomolecule can then be done by going through the dictionary and selecting the required atoms. Other packages, such as ``ProDy`` [@prody], ``BioJava`` [@biojava], ``MMTK`` [@mmtk] and ``MDAnalysis`` [@mdanalysis] to cite a few, also offer solutions to parse PDB files. However, these parsers are embedded in large codebases that are sometimes difficult to integrate with new applications and are often geared toward the analysis of molecular dynamics simulations. Light-weight applications such as ``pdb-tools`` [@pdbtools] lack the capabilities to manipulate coordinates.
+
The analysis of biomolecular structures is a crucial task for a wide range of applications, from drug design to protein engineering. The Protein Data Bank (PDB) file format [@pdb] is the most popular format to describe biomolecular structures such as proteins and nucleic acids. In this text-based format, each line represents a given atom and lists its main properties such as atom name and identifier, residue name and identifier, chain identifier, coordinates, etc. Several solutions have been developed to parse PDB files into dedicated objects that facilitate the analysis and manipulation of biomolecular structures. This is, for example, the case for the ``BioPython`` parser [@biopython; @biopdb], which loads PDB files into a nested dictionary, the structure of which mimics the hierarchical nature of the biomolecular structure. Selecting a given sub-part of the biomolecule can then be done by going through the dictionary and selecting the required atoms. Other packages, such as ``ProDy`` [@prody], ``BioJava`` [@biojava], ``MMTK`` [@mmtk] and ``MDAnalysis`` [@mdanalysis], to name a few, also offer solutions to parse PDB files. However, these parsers are embedded in large codebases that are sometimes difficult to integrate with new applications and are often geared toward the analysis of molecular dynamics simulations. Lightweight applications such as ``pdb-tools`` [@pdbtools] lack the capabilities to manipulate coordinates.
-
We present here the Python package ``pdb2sql``, which loads individual PDB files in a relational database. Among different solutions the Structured Query Language (SQL) is a very popular solution to query a given database. However SQL queries are complex and domain scientists such as bioinformaticians are usually not familiar with them. This represents an important barrier for the adoption of SQL technology in bioinformatics. ``pdb2sql`` exposes complex SQL queries through simple Python methods that are intuitive for end users. As such, our package leverages the power of SQL queries and remove the barrier that SQL complexity represents. In addition, several advanced modules have also been built, for example to rotate or translate biomolecular structures, to characterize interface contacts, and to measure structure similarity between two protein complexes. Additional modules can easily be developed following the same scheme. As a consequence, ``pdb2sql`` is a light-weight and versatile PDB tool that is easy to extend and to integrate with new applications.
+
We present here the Python package ``pdb2sql``, which loads individual PDB files into a relational database. The Structured Query Language (SQL) is a very popular way to query such a database. However, SQL queries are complex, and domain scientists such as bioinformaticians are usually not familiar with them. This represents an important barrier to the adoption of SQL technology in bioinformatics. ``pdb2sql`` exposes complex SQL queries through simple Python methods that are intuitive for end users. As such, our package leverages the power of SQL queries and removes the barrier that SQL complexity represents. In addition, several advanced modules have also been built, for example, to rotate or translate biomolecular structures, to characterize interface contacts, and to measure structure similarity between two protein complexes. Additional modules can easily be developed following the same scheme. As a consequence, ``pdb2sql`` is a lightweight and versatile PDB tool that is easy to extend and to integrate with new applications.
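To make the idea of loading a PDB file into a relational database concrete, here is a minimal sketch that uses only Python's standard ``sqlite3`` module. It is not the ``pdb2sql`` implementation: the helper names ``atom_line`` and ``parse_atom``, the table layout, and the three toy atoms are assumptions made for illustration. Only the fixed-column positions of the ``ATOM`` record follow the PDB format.

```python
import sqlite3


def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Format a minimal fixed-column PDB ATOM record (columns 1-54 only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Slice the fixed columns of an ATOM record into a tuple of values."""
    return (int(line[6:11]), line[12:16].strip(), line[17:20].strip(),
            line[21], int(line[22:26]),
            float(line[30:38]), float(line[38:46]), float(line[46:54]))


# Toy records standing in for the content of a PDB file (hypothetical data).
lines = [
    atom_line(1, "N", "VAL", "A", 1, 1.0, 2.0, 3.0),
    atom_line(2, "CA", "VAL", "A", 1, 2.5, 2.0, 3.0),
    atom_line(3, "CA", "GLY", "B", 7, 9.0, 9.0, 9.0),
]

# One row per atom in an in-memory table (column names are assumptions).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ATOM (serial INTEGER, name TEXT, resName TEXT, "
           "chainID TEXT, resSeq INTEGER, x REAL, y REAL, z REAL)")
db.executemany("INSERT INTO ATOM VALUES (?,?,?,?,?,?,?,?)",
               [parse_atom(l) for l in lines if l.startswith("ATOM")])

# A plain SQL query then selects a sub-part of the structure.
ca_chain_a = db.execute(
    "SELECT x, y, z FROM ATOM WHERE name = ? AND chainID = ?", ("CA", "A")
).fetchall()
print(ca_chain_a)
```

Once the atoms are rows in a table, any selection becomes a ``WHERE`` clause, and this is the kind of query a Python wrapper can hide from the user.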
# Capabilities of ``pdb2sql``
-
``pdb2sql`` allows to query, manipulate and process PDB files through a series of dedicated classes. We give an overview of these features and illustrate them with snippets of code. More examples can be found in the documentation (https://pdb2sql.readthedocs.io).
+
``pdb2sql`` allows a user to query, manipulate, and process PDB files through a series of dedicated classes. We give an overview of these features and illustrate them with snippets of code. More examples can be found in the documentation (https://pdb2sql.readthedocs.io).
## Extracting data from PDB files
-
``pdb2sql`` allows to simply query the database using the ``get(attr, **kwargs)`` method. The attribute ``attr``is here a list of or a single column name of the ``SQL`` database, see Table 1 for available attributes. The keyword argument ``kwargs`` can then be used to specify a sub-selection of atoms.
+
``pdb2sql`` allows a user to query the database simply by calling the ``get(attr, **kwargs)`` method. The argument ``attr`` is either a single column name or a list of column names of the ``SQL`` database; see Table 1 for available attributes. The keyword arguments ``kwargs`` can then be used to specify a sub-selection of atoms.
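As a rough illustration of how a ``get``-style call could map onto SQL, the sketch below turns a column string and keyword filters into a parameterised ``SELECT`` statement. The function ``build_select``, the ``COLUMNS`` whitelist, and the table name ``ATOM`` are hypothetical and are not taken from the ``pdb2sql`` source.

```python
# Columns accepted by this sketch (an assumed subset, not the full Table 1).
COLUMNS = {"serial", "name", "resName", "chainID", "resSeq", "x", "y", "z"}


def build_select(attr, **kwargs):
    """Translate get-style arguments into (sql, params) for a SELECT query."""
    # Accept either 'x,y,z' or ['x', 'y', 'z'].
    cols = [c.strip() for c in attr.split(",")] if isinstance(attr, str) else list(attr)
    unknown = (set(cols) | set(kwargs)) - COLUMNS
    if unknown:
        raise ValueError(f"unknown column(s): {sorted(unknown)}")

    clauses, params = [], []
    for key, value in kwargs.items():
        # A scalar is treated as a one-element list, so every filter is an IN test.
        values = value if isinstance(value, (list, tuple)) else [value]
        clauses.append(f"{key} IN ({','.join('?' * len(values))})")
        params.extend(values)

    sql = f"SELECT {', '.join(cols)} FROM ATOM"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)  # all conditions must hold
    return sql, params


sql, params = build_select("x,y,z", name=["C", "H"],
                           resName=["VAL", "LEU"], chainID="A")
print(sql)
print(params)
```

Values are passed as query parameters rather than pasted into the string, and column names are checked against a whitelist, because column names cannot be bound as parameters in SQL.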
Table 1. Atom attributes and associated definitions in ``pdb2sql``
@@ -55,7 +55,7 @@ Table 1. Atom attributes and associated definitions in ``pdb2sql``
| model | Model serial number |
-
Every attribute name can be used to select specific atoms and multiple conditions can be easily combined. For example, let's consider the following example:
+
Every attribute name can be used to select specific atoms, and multiple conditions can easily be combined. Consider the following example:
```python
from pdb2sql import pdb2sql
@@ -66,7 +66,7 @@ atoms = pdb.get('x,y,z',
chainID='A')
```
-
This snippet extracts the coordinates of the carbon and hydrogen atoms that belong to all the valine and leucine residues of the chain labelled `A` in the PDB file. Atoms can also be excluded from the selection by appending the prefix ``no_`` to the attribute name. This is the case in the following example:
+
This snippet extracts the coordinates of the carbon and hydrogen atoms that belong to all the valine and leucine residues of the chain labelled `A` in the PDB file. Atoms can also be excluded from the selection by prepending the prefix ``no_`` to the attribute name. This is the case in the following example:
```python
from pdb2sql import pdb2sql
```
@@ -78,7 +78,7 @@ This snippet extracts the atom and residue names of all atoms except those belon
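The exclusion mechanism can be sketched independently of the package. In the hypothetical helper below, a key that starts with ``no_`` is translated into a ``NOT IN`` condition instead of ``IN``. The function name ``where_clause``, the two-column table, and the toy rows are assumptions for illustration only.

```python
import sqlite3


def where_clause(key, values):
    """Build one SQL condition; a `no_` prefix on the key negates it."""
    negate = key.startswith("no_")
    column = key[3:] if negate else key          # strip the prefix to get the column
    marks = ",".join("?" * len(values))
    operator = "NOT IN" if negate else "IN"
    # Note: a real implementation should validate `column` against known columns.
    return f"{column} {operator} ({marks})", list(values)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ATOM (name TEXT, resName TEXT)")
db.executemany("INSERT INTO ATOM VALUES (?, ?)",
               [("CA", "VAL"), ("CA", "GLY"), ("N", "LEU"), ("O", "ALA")])

# Atom and residue names of every atom that is NOT in a valine or leucine.
clause, params = where_clause("no_resName", ["VAL", "LEU"])
rows = db.execute(
    f"SELECT name, resName FROM ATOM WHERE {clause} ORDER BY rowid", params
).fetchall()
print(rows)
```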
## Manipulating PDB files
-
The data contained in the SQL database can also be modified using the ``update(attr, vals, **kwargs)`` method. The attributes and keyword arguments are identical to those in the ``get`` method. The ``vals`` argument should contain a `numpy` array whose dimension should match the selection criteria. For example:
+
The data contained in the SQL database can also be modified using the ``update(attr, vals, **kwargs)`` method. The attributes and keyword arguments are identical to those in the ``get`` method. The ``vals`` argument should contain a `numpy` array whose dimensions match the selection, i.e. one row per selected atom and one column per attribute. For example:
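Separately from the package's own example, the select-then-overwrite pattern can be sketched with ``sqlite3`` alone. The helper ``update_xyz`` below is hypothetical, and plain Python tuples stand in for the `numpy` array. It checks that the number of new values matches the number of selected atoms before writing, which mirrors the dimension requirement stated above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE ATOM (chainID TEXT, x REAL, y REAL, z REAL)")
db.executemany("INSERT INTO ATOM VALUES (?,?,?,?)",
               [("A", 0.0, 0.0, 0.0), ("A", 1.0, 0.0, 0.0), ("B", 5.0, 5.0, 5.0)])


def update_xyz(db, new_xyz, chain):
    """Overwrite the coordinates of every atom of `chain`, in row order."""
    rowids = [r[0] for r in db.execute(
        "SELECT rowid FROM ATOM WHERE chainID = ? ORDER BY rowid", (chain,))]
    if len(rowids) != len(new_xyz):
        raise ValueError(
            f"{len(new_xyz)} coordinate rows given for {len(rowids)} selected atoms")
    db.executemany("UPDATE ATOM SET x = ?, y = ?, z = ? WHERE rowid = ?",
                   [(x, y, z, rid) for (x, y, z), rid in zip(new_xyz, rowids)])


# Read chain A, translate it by 1 along x, and write it back.
old = db.execute(
    "SELECT x, y, z FROM ATOM WHERE chainID = 'A' ORDER BY rowid").fetchall()
shifted = [(x + 1.0, y, z) for x, y, z in old]
update_xyz(db, shifted, "A")

all_x = [r[0] for r in db.execute("SELECT x FROM ATOM ORDER BY rowid")]
print(all_x)
```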
The ``interface`` class is derived from the ``pdb2sql`` class and offers functionalities to identify contact atoms or residues between two different chains with a given contact distance. It is useful for extracting and analysing the interface of e.g. protein-protein complexes. The following example snippet returns all the atoms and all the residues of the interface of '1AK4.pdb' defined by a contact distance of 6 Å.
+
The ``interface`` class is derived from the ``pdb2sql`` class and offers functionality to identify contact atoms or residues between two different chains within a given contact distance. It is useful for extracting and analysing the interface of, e.g., protein-protein complexes. The following example snippet returns all the atoms and all the residues of the interface of '1AK4.pdb' defined by a contact distance of 6 Å.
```python
from pdb2sql import interface
@@ -138,7 +138,7 @@ res = pdbitf.get_contact_residues(cutoff=6.0)
```
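What a distance-based contact search computes can be sketched in plain Python. This is a brute-force illustration under assumed conventions, not the ``interface`` class itself: the function ``contact_residues``, the tuple layout of the toy atoms, and the use of an inclusive cutoff (``<=``) are all assumptions.

```python
from math import dist

# (chainID, residue number, atom name, (x, y, z)) for a toy two-chain complex.
atoms = [
    ("A", 10, "CA", (0.0, 0.0, 0.0)),
    ("A", 11, "CA", (3.8, 0.0, 0.0)),
    ("A", 50, "CA", (30.0, 0.0, 0.0)),
    ("B", 5, "CA", (0.0, 5.0, 0.0)),
    ("B", 6, "CA", (0.0, 20.0, 0.0)),
]


def contact_residues(atoms, chain1, chain2, cutoff=6.0):
    """Residues of each chain with an atom within `cutoff` of the other chain."""
    side1 = [a for a in atoms if a[0] == chain1]
    side2 = [a for a in atoms if a[0] == chain2]
    contacts = {chain1: set(), chain2: set()}
    for _, res1, _, xyz1 in side1:
        for _, res2, _, xyz2 in side2:
            if dist(xyz1, xyz2) <= cutoff:   # inclusive cutoff (assumption)
                contacts[chain1].add(res1)
                contacts[chain2].add(res2)
    return {chain: sorted(res) for chain, res in contacts.items()}


print(contact_residues(atoms, "A", "B", cutoff=6.0))
print(contact_residues(atoms, "A", "B", cutoff=7.0))
```

The double loop scales with the product of the two chain sizes, which is acceptable for a sketch. For large complexes a spatial index or a vectorised distance matrix would be the usual choice.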
## Computing Structure Similarity
-
The ``StructureSimilarity`` class allows to compute similarity measures between two protein-protein complexes. Several popular measures used to classify qualities of protein complex structures in the CAPRI (Critical Assessment of PRedicted Interactions) challenges [@capri] have been implemented: interface rmsd, ligand rmsd, fraction of native contacts and DockQ[@dockq]. The approach implemented to compute the interface rmsd and ligand rmsd is identical to the well-known package ``ProFit``[@profit]. All the methods required to superimpose structures have been implemented in the ``transform`` class and therefore relies on no external dependencies. The following snippet shows how to compute these measures:
+
The ``StructureSimilarity`` class allows a user to compute similarity measures between two protein-protein complexes. Several popular measures used to classify the quality of protein complex structures in the CAPRI (Critical Assessment of PRedicted Interactions) challenges [@capri] have been implemented: interface rmsd, ligand rmsd, fraction of native contacts and DockQ [@dockq]. The approach implemented to compute the interface rmsd and ligand rmsd is identical to that of the well-known package ``ProFit`` [@profit]. All the methods required to superimpose structures have been implemented in the ``transform`` class, so the calculation relies on no external dependencies. The following snippet shows how to compute these measures:
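The package's snippet aside, the way these measures relate to one another can be sketched in plain Python. The functions below are illustrative and are not the ``StructureSimilarity`` API: ``fnat`` computes the fraction of native contacts from two sets of residue pairs, and ``dockq`` combines it with the two rmsd values using the published DockQ definition (scaling distances of 1.5 Å for the interface rmsd and 8.5 Å for the ligand rmsd). The contact sets are made-up data.

```python
def fnat(native_contacts, model_contacts):
    """Fraction of the native residue-residue contacts found in the model."""
    native = set(native_contacts)
    if not native:
        raise ValueError("the reference structure has no contacts")
    return len(native & set(model_contacts)) / len(native)


def scaled_rms(rms, d):
    """Map an rmsd in [0, inf) onto (0, 1], with 1 meaning a perfect match."""
    return 1.0 / (1.0 + (rms / d) ** 2)


def dockq(fnat_value, irmsd, lrmsd):
    """DockQ score: mean of Fnat and the scaled interface and ligand rmsd."""
    return (fnat_value + scaled_rms(irmsd, 1.5) + scaled_rms(lrmsd, 8.5)) / 3.0


# Contacts as pairs of (chain, residue number); hypothetical values.
native = {(("A", 10), ("B", 5)), (("A", 11), ("B", 5)),
          (("A", 12), ("B", 6)), (("A", 13), ("B", 7))}
model = {(("A", 10), ("B", 5)), (("A", 11), ("B", 5)), (("A", 99), ("B", 1))}

f = fnat(native, model)
print(f, dockq(f, irmsd=1.5, lrmsd=8.5))
```

The rmsd inputs are assumed to have been computed after superposition, which this sketch does not perform.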
``psb2sql`` has been used at the Netherlands eScience center for bioinformatics projects. This is, for example, the case of ``iScore``[@iscore] that uses graph kernels and support vector machines to rank protein-protein interface. We illustrate here the use of the package by computing the interface rmsd and ligand rmsd of a series of structural models using the experimental structure as a reference. This is a common task for protein-protein docking where a large number of docked conformations are generated and have then to be compared to ground truth to identify the best-generated poses. This calculation is usually done using the ProFit software and we, therefore, compare our results with those obtained with ProFit. The code does compute the similarity measure for different decoys is simple:
+
``pdb2sql`` has been used at the Netherlands eScience Center for bioinformatics projects. This is, for example, the case of ``iScore`` [@iscore], which uses graph kernels and support vector machines to rank protein-protein interfaces. We illustrate the use of the package here by computing the interface rmsd and ligand rmsd of a series of structural models using the experimental structure as a reference. This is a common task in protein-protein docking, where a large number of docked conformations are generated and must then be compared to the ground truth to identify the best generated poses. This calculation is usually done using the ProFit software, and we therefore compare our results with those obtained with ProFit. The code to compute the similarity measure for different decoys is simple:
Note that the method will compute the i-zone, i.e. the zone of the proteins that form the interface in a similar way than ProFit. This is done for the first calculations and the i-zone is then reused for the subsequent calculations. The comparison of our interface rmsd values to those given by ProFit is shown in Fig 1.
+
Note that the method computes the i-zone, i.e., the zone of the proteins that forms the interface, in a way similar to ProFit. This is done during the first calculation, and the i-zone is then reused for the subsequent calculations. The comparison of our interface rmsd values to those given by ProFit is shown in Fig 1.

Figure 1. Left - Superimposed model (green) and reference (cyan) structures. Right - comparison of interface rmsd values given by `pdb2sql` and by `ProFit`.
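The reuse of the i-zone across decoys described above can be sketched as follows. This is a simplified illustration, not the package's algorithm: the zone is derived once from the reference, the 10 Å cutoff and the one-point-per-residue representation are assumptions, and the rmsd is computed without the superposition step that a real interface rmsd requires.

```python
from math import dist, sqrt

# One representative point per residue, keyed by (chainID, residue number).
reference = {("A", 10): (0.0, 0.0, 0.0), ("A", 50): (30.0, 0.0, 0.0),
             ("B", 5): (0.0, 5.0, 0.0), ("B", 60): (0.0, 40.0, 0.0)}


def interface_zone(coords, cutoff=10.0):
    """Residues lying within `cutoff` of any residue of another chain."""
    zone = set()
    for (chain1, res1), p1 in coords.items():
        for (chain2, _), p2 in coords.items():
            if chain1 != chain2 and dist(p1, p2) <= cutoff:
                zone.add((chain1, res1))
    return zone


def rmsd(ref, model, keys):
    """Root-mean-square deviation over the residues in `keys` (no fitting)."""
    keys = sorted(keys)
    return sqrt(sum(dist(ref[k], model[k]) ** 2 for k in keys) / len(keys))


# The zone is computed once from the reference and reused for every decoy.
zone = interface_zone(reference)

decoys = {
    "decoy_1": dict(reference),                              # identical copy
    "decoy_2": {**reference, ("B", 5): (0.0, 5.0, 2.0)},     # one residue moved
}
scores = {name: rmsd(reference, coords, zone) for name, coords in decoys.items()}
print(sorted(zone), scores)
```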
# Acknowledgements
-
We acknowledge contributions from Li Xue, Sonja Georgievska and Lars Ridder.
+
We acknowledge contributions from Li Xue, Sonja Georgievska, and Lars Ridder.