Commit 56a1a24

rclune authored and nscorley committed
Refactor PDB training docs and add mirrors guide (clean up README)
Moved detailed instructions for training on the PDB from README.md to a new docs/mirrors.rst file for better organization. Updated docs/index.rst to include the new mirrors section. Cleaned up and clarified contributor_guide.rst, removing visual aids and improving formatting. Minor fix in installation.rst for consistency.
1 parent 0e32354 commit 56a1a24

6 files changed

Lines changed: 180 additions & 191 deletions

README.md

Lines changed: 0 additions & 172 deletions
@@ -97,8 +97,6 @@ For more advanced setup options (including how to run workflows via apptainers)
9797

9898
This section contains information for how to get atomworks set up and a quick guide for using some of the features of atomworks.io to parse PDB files. To learn more about the features in atomworks.io and atomworks.ml, see the [external documentation](https://rosettacommons.github.io/atomworks/latest/).
9999

100-
### 1. Quick Start
101-
102100
To parse a PDB file (parse = load, clean, and annotate relevant metadata such as entities, molecules, etc.), you can use the `parse` function:
103101

104102
> Note: To run the code in this section you will need to download the 3nez.cif.gz file yourself. See the [examples](https://rosettacommons.github.io/atomworks/latest/auto_examples/index.html) for how to download files from the PDB within a Python script.
@@ -138,176 +136,6 @@ from biotite.structure import AtomArray
138136
atom_array: AtomArray = load_any("3nez.cif.gz", model=1) # model=1 means that we want to load the model 1 (i.e. the first model) rather than a stack of all models in the file
139137
```
140138

141-
### 2. Training on the PDB
142-
143-
> ⚠️ **Disclaimer:** Documentation for this section is currently under construction. Please check back soon for updates!
144-
145-
**Step 1 — Mirror the PDB (mmCIFs)**
146-
To train on the PDB, you first need access to the samples from the PDB. We strongly recommend `mmCIF` files as the training format.
147-
For convenience, we provide a command to mirror the PDB:
148-
149-
```bash
150-
# Full mirror (~100 GB)
151-
atomworks pdb sync /path/to/pdb_mirror # This will create a carbon-copy of the PDB, dated today, in the specified directory. It will download the .mmcif files in the same sharding pattern as the original PDB and keep them gzipped for efficiency.
152-
153-
# # If, for some reason you only want to download specific IDs, the CLI also supports this:
154-
# atomworks pdb sync /path/to/pdb_mirror --pdb-id 1A0I --pdb-id 7XYZ # This will only download the specified PDB IDs.
155-
# # or
156-
# atomworks pdb sync /path/to/pdb_mirror --pdb-ids-file /path/to/ids.txt # This will download the PDB IDs listed in the file, one per line. Each line should be a PDB ID (e.g. '6lyz') and separated by a newline.
157-
```
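The `--pdb-ids-file` option above expects one PDB ID per line. As a minimal sketch (not part of atomworks), such a file could be produced from a Python list as shown below. The helper name `write_pdb_ids_file`, the lowercasing, and the restriction to classic 4-character IDs are assumptions made for illustration only.

```python
from pathlib import Path


def write_pdb_ids_file(pdb_ids, path):
    """Write PDB IDs to `path`, one per line, dropping blanks and duplicates.

    Hypothetical helper: lowercasing and the 4-character check are
    assumptions, not requirements stated by the atomworks CLI.
    """
    unique_ids = []
    for raw in pdb_ids:
        pdb_id = raw.strip().lower()
        if not pdb_id:
            continue  # skip empty entries
        if len(pdb_id) != 4 or not pdb_id.isalnum():
            raise ValueError(f"not a 4-character PDB ID: {raw!r}")
        if pdb_id not in unique_ids:
            unique_ids.append(pdb_id)  # keep first-seen order
    Path(path).write_text("\n".join(unique_ids) + "\n")
    return unique_ids
```

The resulting file can then be passed to `atomworks pdb sync /path/to/pdb_mirror --pdb-ids-file /path/to/ids.txt`.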
158-
159-
Once the mirror is created, set the environment variable:
160-
161-
```bash
162-
export PDB_MIRROR_PATH=/path/to/pdb_mirror
163-
```
164-
165-
To make this setting persistent, add it to a `.env` file in your home directory. Here is an [example `.env` file](.env.sample) that you can copy, rename to `.env`, and edit with your own paths.
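As a rough sketch of how a script might locate one entry inside the mirror, the helper below reads `PDB_MIRROR_PATH` and builds a file path. The function name `mirror_path` is hypothetical, and the layout (a shard directory named after the two middle characters of the ID, with files named `<id>.cif.gz`) is an assumption based on the usual PDB archive sharding; check your own mirror before relying on it.

```python
import os
from pathlib import Path


def mirror_path(pdb_id, mirror_root=None):
    """Return the assumed location of `pdb_id` inside a local PDB mirror.

    Hypothetical helper. Falls back to the PDB_MIRROR_PATH environment
    variable when no root directory is given.
    """
    root = Path(mirror_root or os.environ["PDB_MIRROR_PATH"])
    pdb_id = pdb_id.lower()
    # Assumed sharding: middle two characters of the ID, e.g. 6lyz -> ly/
    return root / pdb_id[1:3] / f"{pdb_id}.cif.gz"
```

For example, `mirror_path("6lyz", "/path/to/pdb_mirror")` points at `/path/to/pdb_mirror/ly/6lyz.cif.gz` under these assumptions.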
166-
167-
**Step 2 — Get PDB metadata (PN units and interfaces)**
168-
To calculate sampling probabilities and filter examples for splits, we pre-process the PDB with metadata for each PDB entry.
169-
To save you the work, we provide pre-computed metadata (dated July 15, 2025) for download:
170-
171-
```bash
172-
atomworks setup metadata /path/to/metadata # This will download the metadata (as .tar.gz) and extract it to the specified directory.
173-
```
174-
175-
This produces parquet files at:
176-
177-
- `/path/to/metadata/pn_units_df.parquet` — Contains metadata for each *PN unit* in the PDB. *PN unit* is shorthand for `polymer XOR non-polymer unit`; for almost all purposes it behaves like a `chain` in a PDB file. The only difference is that a ligand composed of multiple covalently bonded ligands is considered a single PN unit, whereas it would be multiple chains in a PDB file. Effectively, this `.parquet` is a large table with one entry per PN unit (individual chains, ligands, etc.) in the PDB, including helpful metadata for filtering and sampling.
178-
- `/path/to/metadata/interfaces_df.parquet` — Contains metadata for each interface in the PDB. This `.parquet` is a large table of all binary interfaces in the PDB. It lists each interface as (pn_unit_1, pn_unit_2) pairs and includes helpful metadata for filtering and sampling.
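To get a feel for how these tables support filtering, here is a minimal pandas sketch. It uses a toy DataFrame with invented rows in place of the real file (which could be loaded with `pd.read_parquet`), and applies two of the filter expressions from the dataset config later in this section via `DataFrame.query`. That the dataset classes evaluate their filters this way is an assumption; `apply_filters` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for pn_units_df.parquet; the rows are invented for illustration.
# With the real metadata: pd.read_parquet("/path/to/metadata/pn_units_df.parquet")
pn_units = pd.DataFrame(
    {
        "example_id": ["a", "b", "c", "d"],
        "deposition_date": ["2019-05-01", "2023-02-10", "2020-07-30", "2018-01-15"],
        "resolution": [2.1, 1.8, 6.5, 3.0],
        "method": [
            "X-RAY_DIFFRACTION",
            "X-RAY_DIFFRACTION",
            "ELECTRON_MICROSCOPY",
            "SOLUTION_NMR",
        ],
    }
)

filters = [
    "deposition_date < '2022-01-01'",
    "resolution < 5.0 and ~method.str.contains('NMR')",
]


def apply_filters(df, filters):
    """Apply each query string in turn, keeping rows that pass all of them."""
    for expr in filters:
        # engine="python" is needed for .str accessor calls inside query strings
        df = df.query(expr, engine="python")
    return df


kept = apply_filters(pn_units, filters)
print(kept["example_id"].tolist())
```

Only rows that satisfy every expression survive, so each additional filter can only shrink the set of training examples.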
179-
180-
Alternatively, you can generate fresher metadata yourself (scripts will be uploaded in the coming weeks).
181-
182-
**Step 3 — Configure an AF3-style dataset (example: train only on D- and L-polypeptides)**
183-
Next, we use the metadata to configure the dataset we want to sample from. This includes, for example, the training cut-off date, filters, and the transforms to apply.
184-
Here's a simple example that:
185-
186-
- Filters to D-polypeptide and L-polypeptide chains only (`POLYPEPTIDE_D` and `POLYPEPTIDE_L`). To include additional chain types, replace the lists with the appropriate IDs (see the [mapping](./src/atomworks/enums.py#L31-L45) and the comments in the config).
187-
- Excludes ligands in the AF3 list of excluded ligands, available at [`atomworks.io.constants.AF3_EXCLUDED_LIGANDS_REGEX`](./src/atomworks/constants.py#L350).
188-
189-
```yaml
190-
# NOTE: The below is a hydra config and the _target_ fields are the hydra syntax for instantiating a class.
191-
# You can use this without Hydra, but you will then need to provide the corresponding arguments for the
192-
# _target_ objects directly.
193-
194-
# Chain type ids used below (from atomworks.enums.ChainType):
195-
# 0=CyclicPseudoPeptide, 1=OtherPolymer, 2=PeptideNucleicAcid,
196-
# 3=DNA, 4=DNA_RNA_HYBRID, 5=POLYPEPTIDE_D, 6=POLYPEPTIDE_L, 7=RNA,
197-
# 8=NON_POLYMER, 9=WATER, 10=BRANCHED, 11=MACROLIDE
198-
199-
af3_pdb_dataset:
200-
_target_: atomworks.ml.datasets.datasets.ConcatDatasetWithID
201-
datasets:
202-
# Single PN units
203-
- _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
204-
dataset_parser:
205-
_target_: atomworks.ml.datasets.parsers.PNUnitsDFParser
206-
transform:
207-
_target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
208-
is_inference: false
209-
n_recycles: 5 # This means that we will subsample 5 random sets from the MSA for each example.
210-
crop_size: 256
211-
crop_contiguous_probability: 0.3333333333333333
212-
crop_spatial_probability: 0.6666666666666666
213-
diffusion_batch_size: 32
214-
# Optional templates (if available)
215-
template_lookup_path: ${paths.shared}/template_lookup.csv
216-
template_base_dir: ${paths.shared}/template
217-
# Optional MSAs (see Step 4)
218-
# protein_msa_dirs:
219-
# - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
220-
# rna_msa_dirs:
221-
# - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
222-
dataset:
223-
_target_: atomworks.ml.datasets.datasets.PandasDataset
224-
name: pn_units
225-
id_column: example_id
226-
data: /path/to/metadata/pn_units_df.parquet
227-
filters:
228-
- "deposition_date < '2022-01-01'"
229-
- "resolution < 5.0 and ~method.str.contains('NMR')"
230-
- "num_polymer_pn_units <= 20"
231-
- "cluster.notnull()"
232-
- "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
233-
          # Train only on D- and L-polypeptides:
234-
- "q_pn_unit_type in [5, 6]" # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
235-
# Exclude ligands from AF3 excluded set:
236-
- "~(q_pn_unit_non_polymer_res_names.notnull() and q_pn_unit_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
237-
columns_to_load: null
238-
save_failed_examples_to_dir: null
239-
240-
# Binary interfaces
241-
- _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
242-
dataset_parser:
243-
_target_: atomworks.ml.datasets.parsers.InterfacesDFParser
244-
transform:
245-
_target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
246-
is_inference: false
247-
n_recycles: 5
248-
crop_size: 256
249-
crop_spatial_probability: 1.0
250-
crop_contiguous_probability: 0.0
251-
diffusion_batch_size: 32
252-
template_lookup_path: ${paths.shared}/template_lookup.csv
253-
template_base_dir: ${paths.shared}/template
254-
# Optional MSAs (see Step 4)
255-
# protein_msa_dirs:
256-
# - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
257-
# rna_msa_dirs:
258-
# - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
259-
dataset:
260-
_target_: atomworks.ml.datasets.datasets.PandasDataset
261-
name: interfaces
262-
id_column: example_id
263-
data: /path/to/metadata/interfaces_df.parquet
264-
filters:
265-
- "deposition_date < '2022-01-01'"
266-
- "resolution < 5.0 and ~method.str.contains('NMR')"
267-
- "num_polymer_pn_units <= 20"
268-
- "cluster.notnull()"
269-
- "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
270-
          # Train only on interfaces between D- or L-polypeptides:
271-
- "pn_unit_1_type in [5, 6]" # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
272-
- "pn_unit_2_type in [5, 6]" # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
273-
- "~(pn_unit_1_non_polymer_res_names.notnull() and pn_unit_1_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
274-
- "~(pn_unit_2_non_polymer_res_names.notnull() and pn_unit_2_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
275-
columns_to_load: null
276-
cif_parser_args:
277-
cache_dir: null
278-
save_failed_examples_to_dir: null
279-
```
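The chain-type filters in the config above use bare integer IDs. As a sketch of how those filter strings could be generated from named constants instead, the snippet below defines a local `IntEnum` that mirrors the ID table in the config comment. `ChainTypeId` and `chain_type_filter` are hypothetical names; the member spellings are assumptions and may differ from the real `atomworks.enums.ChainType`.

```python
from enum import IntEnum


class ChainTypeId(IntEnum):
    """Local stand-in mirroring the IDs listed in the config comment."""

    CYCLIC_PSEUDO_PEPTIDE = 0
    OTHER_POLYMER = 1
    PEPTIDE_NUCLEIC_ACID = 2
    DNA = 3
    DNA_RNA_HYBRID = 4
    POLYPEPTIDE_D = 5
    POLYPEPTIDE_L = 6
    RNA = 7
    NON_POLYMER = 8
    WATER = 9
    BRANCHED = 10
    MACROLIDE = 11


def chain_type_filter(column, chain_types):
    """Build a query string such as 'q_pn_unit_type in [5, 6]'."""
    ids = sorted(int(chain_type) for chain_type in chain_types)
    return f"{column} in {ids}"


print(
    chain_type_filter(
        "q_pn_unit_type",
        [ChainTypeId.POLYPEPTIDE_D, ChainTypeId.POLYPEPTIDE_L],
    )
)
# prints: q_pn_unit_type in [5, 6]
```

Swapping in other members, for example `ChainTypeId.DNA` and `ChainTypeId.RNA`, yields the corresponding filter for nucleic-acid chains.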
280-
281-
**Step 4 — MSAs (optional)**
282-
We are working on making MSAs publicly accessible, but their large storage requirements (multiple TB) mean this is not yet available. If your organization has the interest and capacity to host the MSAs, please contact us. In the meantime, if you have MSAs (e.g., from OpenProteinSet), you can configure the pipeline to use them like so:
283-
284-
```yaml
285-
protein_msa_dirs:
286-
- { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
287-
rna_msa_dirs:
288-
- { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
289-
```
290-
291-
Alternatively, you can train without MSAs.
292-
293-
**Step 5 — Train a model**
294-
You now have a full-fledged dataset that you can use to train models! If you just want to try this out without downloading the whole PDB and the metadata, you can instead run our tests, which include a mini mock-up of the pipeline with real PDB files, metadata, distillation data, templates, and MSAs for the AF3 example. You can download all of this test data via the atomworks CLI:
295-
296-
> Note: Make sure you are in the AtomWorks root directory when you run the following command, otherwise a new tests/data folder will be created in your current working directory.
297-
298-
```bash
299-
atomworks setup tests # This will download the test pack to `tests/data` and unpack it there (~500 MB).
300-
```
301-
302-
You will now have a mini PDB at `tests/data/pdb` and a mini custom CCD at `tests/data/ccd`. MSA and template data are in `tests/data/shared`, and the distillation data and metadata are in `data/ml/af2_distillation`, `data/ml/pdb_pn_units` and `data/ml/pdb_interfaces`. For an example of a dataset that uses all of these, see [`tests/ml/conftest.py`](./tests/ml/conftest.py#L300).
303-
304-
To run the tests for the various datasets, you can run the following command:
305-
306-
```bash
307-
# Make sure you have the correct environment activated, and set your paths correctly in the .env file / shell environment variables (see points above)
308-
pytest tests/ml/pipelines/test_data_loading_pipelines.py
309-
```
310-
311139
---
312140

313141
## Contribution

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@
4747
]
4848

4949
templates_path = ["_templates"]
50-
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
50+
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "examples/GALLERY_HEADER.rst"]
5151

5252
# -- Options for HTML output -------------------------------------------------
5353
# https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output

docs/contributor_guide.rst

Lines changed: 8 additions & 16 deletions
@@ -11,10 +11,14 @@ As you code
1111
-------------
1212

1313
1. **Reduce cognitive overhead:**
14+
1415
a. Pick meaningful, descriptive variable names.
16+
1517
b. Write docstrings (leverage AI!) and comments. To be used in the API documentation the docstring should
16-
follow the Google style guide: `Google Python Style Guide <https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings>`_.
18+
follow the Google style guide: `Google Python Style Guide <https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings>`_.
19+
1720
c. Follow the `Python Zen <https://peps.python.org/pep-0020/>`_ – explicit is better than implicit, etc.
21+
1822
2. **Write tests.**
1923

2024
As you commit
@@ -62,20 +66,8 @@ To build the documentation, navigate to the ``docs`` directory and run:
6266
If you are new to Sphinx, please refer to the `Sphinx documentation <https://www.sphinx-doc.org/en/master/>`_ for guidance on writing and formatting documentation.
6367
All of the documentation is written in reStructuredText (reST) format. For more information on reST, see the `reStructuredText Primer <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`_.
6468

65-
66-
Visual Aids
67-
-----------
68-
69-
.. image:: _static/best_practices_cognitive_load.png
70-
:alt: Good vs Bad Cognitive Load
71-
:width: 400px
72-
73-
.. image:: _static/best_practices_mental_models.png
74-
:alt: Internalized Mental Models
75-
:width: 300px
76-
77-
More detail
78-
-----------
69+
Other Resources
70+
---------------
7971

8072
- `Best Practices for Code Review | SmartBear <https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/>`_
8173

@@ -90,7 +82,7 @@ PR Hygiene
9082
When contributing to this repository, please follow these steps:
9183

9284
1. Clone the repository
93-
2. Create the development environment (see the :ref:`Local Conda Environment<local-conda_environment>` section in the Installation Guide).
85+
2. Create the development environment (see the *Local Conda Environment* section in the Installation Guide).
9486
3. Create a new branch for your changes.
9587
- Use the following convention to name your branch: ``<category>/<description>``. Categories: ``feat``, ``fix``, ``hotfix``, ``refactor``, ``docs``, ``perf``.
9688
- Example: ``feat/support-rdkit-small-molecule``

docs/index.rst

Lines changed: 2 additions & 1 deletion
@@ -22,4 +22,5 @@ Welcome to **atomworks** — a toolkit for converting, parsing, and manipulating
2222
glossary
2323
api_reference
2424
auto_examples/index
25-
contributor_guide
25+
contributor_guide
26+
mirrors

docs/installation.rst

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ This is the easiest way to get started with atomworks.
1212
pip install atomworks # base installation version without torch (for only atomworks.io)
1313
pip install "atomworks[ml]" # with torch and ML dependencies (for atomworks.io plus atomworks.ml)
1414
pip install "atomworks[dev]" # with development dependencies
15-
pip install "atomworks[ml,dev]" # with all dependencies
15+
pip install "atomworks[ml,dev]" # with all dependencies"
1616
1717
You can also install AtomWorks with `Open Babel <https://openbabel.org/>`_, an alternative to RDKit:
1818
