Refactor PDB training docs and add mirrors guide (clean up README)
Moved detailed instructions for training on the PDB from README.md to a new docs/mirrors.rst file for better organization. Updated docs/index.rst to include the new mirrors section. Cleaned up and clarified contributor_guide.rst, removing visual aids and improving formatting. Minor fix in installation.rst for consistency.
README.md: 0 additions, 172 deletions
@@ -97,8 +97,6 @@ For more advanced setup options (including how to run workflows via apptainers)

 This section contains information for how to get atomworks set up and a quick guide for using some of the features of atomworks.io to parse PDB files. To learn more about the features in atomworks.io and atomworks.ml, see the [external documentation](https://rosettacommons.github.io/atomworks/latest/).

-### 1. Quick Start
-
 To parse a pdb file (parse = load, clean, annotate relevant metadata such as entities, molecules, etc) you can use the `parse` function:

 > Note: To run the code in this section you will need to download the 3nez.cif.gz file yourself. See the [examples](https://rosettacommons.github.io/atomworks/latest/auto_examples/index.html) for how to download files from the PDB within a Python script.
@@ -138,176 +136,6 @@ from biotite.structure import AtomArray
 atom_array: AtomArray = load_any("3nez.cif.gz", model=1) # model=1 means that we want to load the model 1 (i.e. the first model) rather than a stack of all models in the file
 ```

-### 2. Training on the PDB
-
-> ⚠️ **Disclaimer:** Documentation for this section is currently under construction. Please check back soon for updates!
-
-**Step 1: Mirror the PDB (mmCIFs)**
-To train on the PDB, you first need to make sure you have access to the samples from the PDB. We use `mmCIF` files as the highly recommended format for training.
-For convenience, we provide a command to mirror the PDB:
-
-```bash
-# Full mirror (~100 GB)
-atomworks pdb sync /path/to/pdb_mirror # This will create a carbon-copy of the PDB, dated today, in the specified directory. It will download the .mmcif files in the same sharding pattern as the original PDB and keep them gzipped for efficiency.
-
-# # If, for some reason, you only want to download specific IDs, the CLI also supports this:
-# atomworks pdb sync /path/to/pdb_mirror --pdb-id 1A0I --pdb-id 7XYZ # This will only download the specified PDB IDs.
-# # or
-# atomworks pdb sync /path/to/pdb_mirror --pdb-ids-file /path/to/ids.txt # This will download the PDB IDs listed in the file, one per line. Each line should be a PDB ID (e.g. '6lyz') and separated by a newline.
-```
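As an illustration of the IDs file format and the sharding pattern mentioned in the comments above, here is a minimal sketch. It assumes the usual PDB "divided" layout, in which each entry sits in a directory named after the middle two characters of its four-character ID. The helpers `read_pdb_ids` and `shard_path` are hypothetical and not part of the atomworks CLI, and the exact file names written by `atomworks pdb sync` are an assumption.

```python
from pathlib import PurePosixPath


def read_pdb_ids(text: str) -> list[str]:
    """Parse one PDB ID per line, skipping blank lines and lowercasing each ID."""
    ids = []
    for line in text.splitlines():
        pdb_id = line.strip().lower()
        if not pdb_id:
            continue
        # Assumption: classic four-character alphanumeric PDB IDs only.
        if len(pdb_id) != 4 or not pdb_id.isalnum():
            raise ValueError(f"not a 4-character PDB ID: {line!r}")
        ids.append(pdb_id)
    return ids


def shard_path(mirror_root: str, pdb_id: str) -> PurePosixPath:
    """Assumed location of an entry: <root>/<middle two characters>/<id>.cif.gz."""
    pdb_id = pdb_id.lower()
    return PurePosixPath(mirror_root) / pdb_id[1:3] / f"{pdb_id}.cif.gz"


ids = read_pdb_ids("6lyz\n1A0I\n\n7XYZ\n")
paths = [str(shard_path("/path/to/pdb_mirror", pdb_id)) for pdb_id in ids]
print(paths)
```

Lowercasing the IDs up front keeps the lookup independent of how the IDs were typed in the file.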
-
-Once the mirror is created, set the environment variable:
-
-```bash
-export PDB_MIRROR_PATH=/path/to/pdb_mirror
-```
-
-To make this setting persistent, you can add it to a `.env` file in your home directory. Here is an [example of a `.env`](.env.sample) file structure that you can copy, rename to `.env` and edit with your own paths.
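For reference, a script could pick up this variable along the following lines. This is a minimal sketch using only the standard library. The helper `get_pdb_mirror_path` is hypothetical, and how atomworks itself reads `PDB_MIRROR_PATH` or the `.env` file is not shown here.

```python
import os
from pathlib import Path


def get_pdb_mirror_path(env=None) -> Path:
    """Return the mirror directory from PDB_MIRROR_PATH, failing loudly if unset."""
    env = os.environ if env is None else env
    value = env.get("PDB_MIRROR_PATH")
    if not value:
        raise RuntimeError(
            "PDB_MIRROR_PATH is not set; export it or add it to your .env file"
        )
    return Path(value).expanduser()


# Passing a dict instead of os.environ makes the lookup easy to try out.
mirror = get_pdb_mirror_path({"PDB_MIRROR_PATH": "/path/to/pdb_mirror"})
print(mirror)
```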
-
-**Step 2: Get PDB metadata (PN units and interfaces)**
-To calculate sampling probabilities and filter examples for splits, we pre-process the PDB with metadata for each PDB entry.
-To save you the work, we provide pre-computed metadata (dated July 15, 2025) for downloading:
-
-```bash
-atomworks setup metadata /path/to/metadata # This will download the metadata (as .tar.gz) and extract it to the specified directory.
-```
-
-This produces parquet files at:
-
-- `/path/to/metadata/pn_units_df.parquet`: Contains metadata for each *PN unit* in the PDB. The term *pn unit* is shorthand for `polymer XOR non-polymer unit` and behaves for almost all purposes like the `chain` in a PDB file. The only difference is that a ligand composed of multiple covalently bonded ligands is considered a single PN unit (whilst it would be multiple chains in a PDB file). Effectively this `.parquet` is a large table of all individual chains, ligands, etc (to be precise, it has one entry per pn unit) in the PDB that includes helpful metadata for filtering and sampling.
-- `/path/to/metadata/interfaces_df.parquet`: Contains metadata for each interface in the PDB. This `.parquet` is a large table of all binary interfaces in the PDB. It lists each interface as (pn_unit_1, pn_unit_2) pairs and includes helpful metadata for filtering and sampling.
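To make the *PN unit* idea concrete, here is a toy sketch of the grouping rule described above: polymer chains stay one unit each, while non-polymer chains joined by covalent bonds collapse into a single unit. It is not the atomworks implementation. The function `group_pn_units`, the chain IDs, and the bond list are made up for illustration.

```python
def group_pn_units(
    chains: dict[str, bool], covalent_bonds: list[tuple[str, str]]
) -> list[tuple[str, ...]]:
    """Group chains into PN units.

    chains maps chain_id -> is_polymer. Non-polymer chains linked by covalent
    bonds are merged with a small union-find; polymers are never merged.
    """
    parent = {chain_id: chain_id for chain_id in chains}

    def find(chain_id: str) -> str:
        # Follow parent links to the root, compressing the path as we go.
        while parent[chain_id] != chain_id:
            parent[chain_id] = parent[parent[chain_id]]
            chain_id = parent[chain_id]
        return chain_id

    for a, b in covalent_bonds:
        if not chains[a] and not chains[b]:  # both ends are non-polymer
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for chain_id in chains:
        groups.setdefault(find(chain_id), []).append(chain_id)
    return sorted(tuple(sorted(members)) for members in groups.values())


# A is a protein chain; B and C are two covalently bonded ligand chains; D is a lone ligand.
chains = {"A": True, "B": False, "C": False, "D": False}
units = group_pn_units(chains, [("B", "C"), ("A", "B")])
print(units)  # [('A',), ('B', 'C'), ('D',)]
```

The bond between A and B is ignored because A is a polymer, so the ligand pair B and C forms one unit while A remains its own unit.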
-
-Alternatively, you can generate fresher metadata yourself (scripts will be uploaded in the coming weeks).
-
-**Step 3: Configure an AF3-style dataset (example: train only on D-polypeptides)**
-Next we need to use the metadata to configure a dataset that we would like to sample from. This includes e.g. training cut-off, filters, transforms to apply, etc.
-Here's a simple example that:
-
-- Filters to D-polypeptide and L-polypeptide chains only (`POLYPEPTIDE_D` and `POLYPEPTIDE_L`). To include additional chain types, replace the lists with the appropriate IDs (see the [mapping](./src/atomworks/enums.py#L31-L45) in the comments).
-- Excludes ligands in the AF3 list of excluded ligands, available at [`atomworks.io.constants.AF3_EXCLUDED_LIGANDS_REGEX`](./src/atomworks/constants.py#L350).
-
-```yaml
-# NOTE: The below is a hydra config and the _target_ fields are the hydra syntax for instantiating a class.
-# You can use this without hydra, but will then instead need to provide the corresponding arguments for the
-# _target_ objects directly.
-
-# Chain type ids used below (from atomworks.enums.ChainType):
-- "~(pn_unit_1_non_polymer_res_names.notnull() and pn_unit_1_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
-- "~(pn_unit_2_non_polymer_res_names.notnull() and pn_unit_2_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
-columns_to_load: null
-cif_parser_args:
-cache_dir: null
-save_failed_examples_to_dir: null
-```
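The two ligand filters in the config keep an interface only if neither side has non-polymer residue names matching the excluded-ligands pattern. The same logic can be sketched in plain Python as follows. The pattern below is a hypothetical stand-in, not the real `AF3_EXCLUDED_LIGANDS_REGEX`, and storing the residue names as a single string per row is an assumption.

```python
import re

# Hypothetical stand-in for the excluded-ligands pattern.
excluded_ligands_regex = re.compile(r"\b(?:SO4|GOL|EDO)\b")

RES_NAME_COLUMNS = (
    "pn_unit_1_non_polymer_res_names",
    "pn_unit_2_non_polymer_res_names",
)


def keep_interface(row: dict) -> bool:
    """Mirror `~(col.notnull() and col.str.contains(regex))` for both PN units."""
    for column in RES_NAME_COLUMNS:
        names = row.get(column)
        if names is not None and excluded_ligands_regex.search(names):
            return False
    return True


rows = [
    {"pn_unit_1_non_polymer_res_names": None, "pn_unit_2_non_polymer_res_names": None},
    {"pn_unit_1_non_polymer_res_names": None, "pn_unit_2_non_polymer_res_names": "GOL"},
    {"pn_unit_1_non_polymer_res_names": "HEM", "pn_unit_2_non_polymer_res_names": None},
]
kept = [keep_interface(row) for row in rows]
print(kept)  # [True, False, True]
```

Rows with no non-polymer residue names (polymer-polymer interfaces) always pass, which is what the `notnull()` guard in the config expresses.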
-
-**Step 4: MSAs (optional)**
-We are working on a way to make MSAs accessible to the public, but due to the large storage requirements (multiple TB) we are still working on this. If your organization has interest & capacity to host the MSAs, please contact us. In the meantime, if you have MSAs (e.g., from OpenProteinSet) you can configure the pipeline to use them like so:
-You now have a full-fledged dataset that you can use to train models on! If you want to just try this out without having to download the whole PDB and the metadata, you can instead run our tests which have a mini-mockup of the pipeline with real pdb files, metadata, distillation data, templates and MSAs for the example of AF3. You can download all this relevant metadata via the atomworks CLI:
-
-> Note: Make sure you are in the AtomWorks root directory when you run the following command, otherwise a new tests/data folder will be created in your current working directory.
-
-```bash
-atomworks setup tests # This will download the test pack to `tests/data` and unpack it there (~500 MB).
-```
-
-You will now have a mini PDB at `tests/data/pdb` and a mini custom CCD at `tests/data/ccd`. MSA and template data are in `tests/data/shared` and the distillation and metadata are in `data/ml/af2_distillation`, `data/ml/pdb_pn_units` and `data/ml/pdb_interfaces`. A dataset that uses all of these is [for example here](./tests/ml/conftest.py#L300).
-
-To run the tests for the various datasets, you can run the following command:
-
-```bash
-# Make sure you have the correct environment activated, and set your paths correctly in the .env file / shell environment variables (see points above)
docs/contributor_guide.rst: 8 additions, 16 deletions
@@ -11,10 +11,14 @@ As you code
 -------------

 1. **Reduce cognitive overhead:**
+
 a. Pick meaningful, descriptive variable names.
+
 b. Write docstrings (leverage AI!) and comments. To be used in the API documentation the docstring should
-follow the Google style guide: `Google Python Style Guide <https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings>`_.
+follow the Google style guide: `Google Python Style Guide <https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings>`_.
+
 c. Follow the `Python Zen <https://peps.python.org/pep-0020/>`_ – explicit is better than implicit, etc.
+
 2. **Write tests.**

 As you commit
@@ -62,20 +66,8 @@ To build the documentation, navigate to the ``docs`` directory and run:
 If you are new to Sphinx, please refer to the `Sphinx documentation <https://www.sphinx-doc.org/en/master/>`_ for guidance on writing and formatting documentation.
 All of the documentation is written in reStructuredText (reST) format. For more information on reST, see the `reStructuredText Primer <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`_.
 - `Best Practices for Code Review | SmartBear <https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/>`_
@@ -90,7 +82,7 @@ PR Hygiene
 When contributing to this repository, please follow these steps:

 1. Clone the repository
-2. Create the development environment (see the :ref:`Local Conda Environment<local-conda_environment>` section in the Installation Guide).
+2. Create the development environment (see the *Local Conda Environment* section in the Installation Guide).
 3. Create a new branch for your changes.
 - Use the following convention to name your branch: ``<category>/<description>``. Categories: ``feat``, ``fix``, ``hotfix``, ``refactor``, ``docs``, ``perf``.
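The branch naming convention above could be checked with a small script such as this sketch. The category list comes from the guide. The characters allowed in `<description>` are an assumption, and `is_valid_branch` is a hypothetical helper, not something the repository ships.

```python
import re

CATEGORIES = ("feat", "fix", "hotfix", "refactor", "docs", "perf")

# Assumption: descriptions use letters, digits, dots, underscores and hyphens.
BRANCH_PATTERN = re.compile("(?:" + "|".join(CATEGORIES) + r")/[A-Za-z0-9._-]+")


def is_valid_branch(name: str) -> bool:
    """Return True if the name looks like <category>/<description>."""
    return BRANCH_PATTERN.fullmatch(name) is not None


print(is_valid_branch("docs/mirrors-guide"))  # True
print(is_valid_branch("feature/mirrors"))     # False: unknown category
```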