Refactor PDB training docs and add mirrors guide (clean up README)

rclune · nscorley · commit 56a1a243cb3c · 2025-11-29T09:53:47.000-08:00
Moved detailed instructions for training on the PDB from README.md to a new docs/mirrors.rst file for better organization. Updated docs/index.rst to include the new mirrors section. Cleaned up and clarified contributor_guide.rst, removing visual aids and improving formatting. Minor fix in installation.rst for consistency.
diff --git a/README.md b/README.md
@@ -97,8 +97,6 @@ For more advanced setup options (including how to run workflows via apptainers)
 
 This section contains information for how to get atomworks set up and a quick guide for using some of the features of atomworks.io to parse PDB files. To learn more about the features in atomworks.io and atomworks.ml, see the [external documentation](https://rosettacommons.github.io/atomworks/latest/). 
 
-### 1. Quick Start
-
 To parse a pdb file (parse = load, clean, annotate relevant metadata such as entities, molecules, etc) you can use the `parse` function:
 
 > Note: To run the code in this section you will need to download the 3nez.cif.gz file yourself. See the [examples](https://rosettacommons.github.io/atomworks/latest/auto_examples/index.html) for how to download files from the PDB within a Python script. 
@@ -138,176 +136,6 @@ from biotite.structure import AtomArray
 atom_array: AtomArray = load_any("3nez.cif.gz", model=1)  # model=1 means that we want to load the model 1 (i.e. the first model) rather than a stack of all models in the file
 ```
 
-### 2. Training on the PDB
-
-> ⚠️ **Disclaimer:** Documentation for this section is currently under construction. Please check back soon for updates!
-
-**Step 1 — Mirror the PDB (mmCIFs)**
-  To train on the PDB, you first need access to the samples from the PDB. We strongly recommend `mmCIF` files as the format for training.
-  For convenience, we provide a command to mirror the PDB:
-
-  ```bash
-  # Full mirror (~100 GB)
-  atomworks pdb sync /path/to/pdb_mirror  # This will create a carbon-copy of the PDB, dated today, in the specified directory. It will download the .mmcif files in the same sharding pattern as the original PDB and keep them gzipped for efficiency.
-
-#   # If, for some reason you only want to download specific IDs, the CLI also supports this:
-#   atomworks pdb sync /path/to/pdb_mirror --pdb-id 1A0I --pdb-id 7XYZ  # This will only download the specified PDB IDs.
-#   # or
-#   atomworks pdb sync /path/to/pdb_mirror --pdb-ids-file /path/to/ids.txt  # This will download the PDB IDs listed in the file, one per line. Each line should be a PDB ID (e.g. '6lyz') and separated by a newline.
-  ```
-
-  Once the mirror is created, set the environment variable:
-
-  ```bash
-  export PDB_MIRROR_PATH=/path/to/pdb_mirror
-  ```
-
-  To make this setting persistent, you can add it to a `.env` file in your home directory. Here is an [example of a `.env`](.env.sample) file structure that you can copy, rename to `.env`, and edit with your own paths.
-
-**Step 2 — Get PDB metadata (PN units and interfaces)**
-    To calculate sampling probabilities and filter examples for splits, we pre-process the PDB with metadata for each PDB entry. 
-    To save you the work, we provide pre-computed metadata (dated July 15, 2025) for download:
-
-  ```bash
-  atomworks setup metadata /path/to/metadata  # This will download the metadata (as .tar.gz) and extract it to the specified directory.
-  ```
-
-  This produces parquet files at:
-
-- `/path/to/metadata/pn_units_df.parquet` — Contains metadata for each *PN unit* in the PDB. The term *pn unit* is shorthand for `polymer XOR non-polymer unit` and behaves for almost all purposes like the `chain` in a PDB file. The only difference is that a ligand composed of multiple covalently bonded ligands is considered a single PN unit (whilst it would be multiple chains in a PDB file). Effectively this `.parquet` is a large table of all individual chains, ligands, etc (to be precise, it has one entry per  pn unit) in the PDB that includes helpful metadata for filtering and sampling.
-- `/path/to/metadata/interfaces_df.parquet` — Contains metadata for each interface in the PDB. This `.parquet` is a large table of all binary interfaces in the PDB. It lists each interface as (pn_unit_1, pn_unit_2) pairs and includes helpful metadata for filtering and sampling.
-
-  Alternatively, you can generate fresher metadata yourself (scripts will be uploaded in the coming weeks).
-
-**Step 3 — Configure an AF3-style dataset (example: train only on D-polypeptides)**
-Next we need to use the metadata to configure a dataset that we would like to sample from. This includes e.g. training cut-off, filters, transforms to apply, etc.
-Here's a simple example that:
-
-- Filters to D-polypeptide and L-polypeptide chains only (`POLYPEPTIDE_D` and `POLYPEPTIDE_L`). To include additional chain types, replace the lists with the appropriate IDs (see the [mapping](./src/atomworks/enums.py#L31-L45) and the comments in the config).
-- Excludes ligands in the AF3 list of excluded ligands, available at [`atomworks.io.constants.AF3_EXCLUDED_LIGANDS_REGEX`](./src/atomworks/constants.py#L350).
-
-```yaml
-# NOTE: The below is a hydra config and the _target_ fields are the hydra syntax for instantiating a class.
-#  You can use this without hydra, but will then instead need to provide the corresponding arguments for the
-#  _target_ objects directly.
-
-# Chain type ids used below (from atomworks.enums.ChainType):
-# 0=CyclicPseudoPeptide, 1=OtherPolymer, 2=PeptideNucleicAcid,
-# 3=DNA, 4=DNA_RNA_HYBRID, 5=POLYPEPTIDE_D, 6=POLYPEPTIDE_L, 7=RNA,
-# 8=NON_POLYMER, 9=WATER, 10=BRANCHED, 11=MACROLIDE
-
-af3_pdb_dataset:
-  _target_: atomworks.ml.datasets.datasets.ConcatDatasetWithID
-  datasets:
-    # Single PN units
-    - _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
-      dataset_parser:
-        _target_: atomworks.ml.datasets.parsers.PNUnitsDFParser
-      transform:
-        _target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
-        is_inference: false
-        n_recycles: 5  # This means that we will subsample 5 random sets from the MSA for each example.
-        crop_size: 256
-        crop_contiguous_probability: 0.3333333333333333
-        crop_spatial_probability: 0.6666666666666666
-        diffusion_batch_size: 32
-        # Optional templates (if available)
-        template_lookup_path: ${paths.shared}/template_lookup.csv
-        template_base_dir: ${paths.shared}/template
-        # Optional MSAs (see Step 4)
-        # protein_msa_dirs:
-        #   - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
-        # rna_msa_dirs:
-        #   - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
-      dataset:
-        _target_: atomworks.ml.datasets.datasets.PandasDataset
-        name: pn_units
-        id_column: example_id
-        data: /path/to/metadata/pn_units_df.parquet
-        filters:
-          - "deposition_date < '2022-01-01'"
-          - "resolution < 5.0 and ~method.str.contains('NMR')"
-          - "num_polymer_pn_units <= 20"
-          - "cluster.notnull()"
-          - "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
-          # Train only on D- and L-polypeptides:
-          - "q_pn_unit_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
-          # Exclude ligands from AF3 excluded set:
-          - "~(q_pn_unit_non_polymer_res_names.notnull() and q_pn_unit_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
-        columns_to_load: null
-      save_failed_examples_to_dir: null
-
-    # Binary interfaces
-    - _target_: atomworks.ml.datasets.datasets.StructuralDatasetWrapper
-      dataset_parser:
-        _target_: atomworks.ml.datasets.parsers.InterfacesDFParser
-      transform:
-        _target_: atomworks.ml.pipelines.af3.build_af3_transform_pipeline
-        is_inference: false
-        n_recycles: 5
-        crop_size: 256
-        crop_spatial_probability: 1.0
-        crop_contiguous_probability: 0.0
-        diffusion_batch_size: 32
-        template_lookup_path: ${paths.shared}/template_lookup.csv
-        template_base_dir: ${paths.shared}/template
-        # Optional MSAs (see Step 4)
-        # protein_msa_dirs:
-        #   - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
-        # rna_msa_dirs:
-        #   - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
-      dataset:
-        _target_: atomworks.ml.datasets.datasets.PandasDataset
-        name: interfaces
-        id_column: example_id
-        data: /path/to/metadata/interfaces_df.parquet
-        filters:
-          - "deposition_date < '2022-01-01'"
-          - "resolution < 5.0 and ~method.str.contains('NMR')"
-          - "num_polymer_pn_units <= 20"
-          - "cluster.notnull()"
-          - "method in ['X-RAY_DIFFRACTION', 'ELECTRON_MICROSCOPY']"
-          # Train only on interfaces between D- and L-polypeptides:
-          - "pn_unit_1_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
-          - "pn_unit_2_type in [5, 6]"  # 5 = POLYPEPTIDE_D, 6 = POLYPEPTIDE_L
-          - "~(pn_unit_1_non_polymer_res_names.notnull() and pn_unit_1_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
-          - "~(pn_unit_2_non_polymer_res_names.notnull() and pn_unit_2_non_polymer_res_names.str.contains('${af3_excluded_ligands_regex}', regex=True))"
-        columns_to_load: null
-      cif_parser_args:
-        cache_dir: null
-      save_failed_examples_to_dir: null
-```
-
-**Step 4 — MSAs (optional)**
-We are working on making MSAs publicly accessible, but this is not yet available because of the large storage requirements (multiple TB). If your organization has the interest and capacity to host the MSAs, please contact us. In the meantime, if you have MSAs (e.g., from OpenProteinSet), you can configure the pipeline to use them like so:
-
-```yaml
-    protein_msa_dirs:
-      - { dir: /path/to/msa, extension: .a3m.gz, directory_depth: 2 }
-    rna_msa_dirs:
-      - { dir: /path/to/msa, extension: .afa, directory_depth: 0 }
-```
-
-Alternatively, you can train without MSAs.
-
-**Step 5 — Train a model**
-You now have a full-fledged dataset that you can use to train models! If you just want to try this out without downloading the whole PDB and the metadata, you can instead run our tests, which include a mini mock-up of the pipeline with real PDB files, metadata, distillation data, templates, and MSAs for the AF3 example. You can download all of this data via the atomworks CLI:
-
-> Note: Make sure you are in the AtomWorks root directory when you run the following command, otherwise a new tests/data folder will be created in your current working directory.
-
-```bash
-atomworks setup tests  # This will download the test pack to `tests/data` and unpack it there (~500 MB). 
-```
-
-You will now have a mini PDB at `tests/data/pdb` and a mini custom CCD at `tests/data/ccd`. MSA and template data is in `tests/data/shared` and the distillation and metadata are in `data/ml/af2_distillation`, `data/ml/pdb_pn_units` and `data/ml/pdb_interfaces`. A dataset that uses all of these is [for example here](./tests/ml/conftest.py#L300).
-
-To run the tests for the various datasets, you can run the following command:
-
-```bash
-# Make sure you have the correct environment activated, and set your paths correctly in the .env file / shell environment variables (see points above)
-pytest tests/ml/pipelines/test_data_loading_pipelines.py
-```
-
 ---
 
 ## Contribution
diff --git a/docs/conf.py b/docs/conf.py
@@ -47,7 +47,7 @@
 ]
 
 templates_path = ["_templates"]
-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"]
+exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "examples/GALLERY_HEADER.rst"]
 
 # -- Options for HTML output -------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#options-for-html-output
diff --git a/docs/contributor_guide.rst b/docs/contributor_guide.rst
@@ -11,10 +11,14 @@ As you code
 -------------
 
 1. **Reduce cognitive overhead:**
+   
    a. Pick meaningful, descriptive variable names.
+   
    b. Write docstrings (leverage AI!) and comments. To be used in the API documentation the docstring should 
-   follow the Google style guide: `Google Python Style Guide <https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings>`_.
+      follow the Google style guide: `Google Python Style Guide <https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings>`_.
+
    c. Follow the `Python Zen <https://peps.python.org/pep-0020/>`_ – explicit is better than implicit, etc.
+
 2. **Write tests.**
 
 As you commit
@@ -62,20 +66,8 @@ To build the documentation, navigate to the ``docs`` directory and run:
 If you are new to Sphinx, please refer to the `Sphinx documentation <https://www.sphinx-doc.org/en/master/>`_ for guidance on writing and formatting documentation.
 All of the documentation is written in reStructuredText (reST) format. For more information on reST, see the `reStructuredText Primer <https://docutils.sourceforge.io/docs/user/rst/quickstart.html>`_.
 
-
-Visual Aids
------------
-
-.. image:: _static/best_practices_cognitive_load.png
-   :alt: Good vs Bad Cognitive Load
-   :width: 400px
-
-.. image:: _static/best_practices_mental_models.png
-   :alt: Internalized Mental Models
-   :width: 300px
-
-More detail
------------
+Other Resources
+---------------
 
 - `Best Practices for Code Review | SmartBear <https://smartbear.com/learn/code-review/best-practices-for-peer-code-review/>`_
 
@@ -90,7 +82,7 @@ PR Hygiene
 When contributing to this repository, please follow these steps:
 
 1. Clone the repository
-2. Create the development environment (see the :ref:`Local Conda Environment<local-conda_environment>` section in the Installation Guide).
+2. Create the development environment (see the *Local Conda Environment* section in the Installation Guide).
 3. Create a new branch for your changes. 
    - Use the following convention to name your branch: ``<category>/<description>``. Categories: ``feat``, ``fix``, ``hotfix``, ``refactor``, ``docs``, ``perf``.
    - Example: ``feat/support-rdkit-small-molecule``
diff --git a/docs/index.rst b/docs/index.rst
@@ -22,4 +22,5 @@ Welcome to **atomworks** — a toolkit for converting, parsing, and manipulating
    glossary
    api_reference
    auto_examples/index
-   contributor_guide
+   contributor_guide
+   mirrors
diff --git a/docs/installation.rst b/docs/installation.rst
@@ -12,7 +12,7 @@ This is the easiest way to get started with atomworks.
    pip install atomworks # base installation version without torch (for only atomworks.io)
    pip install "atomworks[ml]" # with torch and ML dependencies (for atomworks.io plus atomworks.ml)
    pip install "atomworks[dev]" # with development dependencies
-   pip install "atomworks[ml,dev]" # with all dependencies
+   pip install "atomworks[ml,dev]" # with all dependencies"
 
 You can also install AtomWorks with `Open Babel <https://openbabel.org/>`_, an alternative to RDKit:
 
diff --git a/docs/mirrors.rst b/docs/mirrors.rst
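
Note on the chain-type filters in the dataset config that this commit moves out of README.md: the config selects chain types by integer ID (`q_pn_unit_type in [5, 6]`) and documents the ID table only in a YAML comment. The sketch below shows one way to build such a filter string from named members instead of bare integers. It is an illustration under assumptions, not the library's API: `ChainTypeSketch` and `chain_type_filter` are hypothetical names, and the IDs are copied from the comment in the config rather than from `atomworks.enums.ChainType` itself.

```python
from enum import IntEnum


class ChainTypeSketch(IntEnum):
    """Hypothetical stand-in; IDs copied from the comment in the removed config."""

    CYCLIC_PSEUDO_PEPTIDE = 0
    OTHER_POLYMER = 1
    PEPTIDE_NUCLEIC_ACID = 2
    DNA = 3
    DNA_RNA_HYBRID = 4
    POLYPEPTIDE_D = 5
    POLYPEPTIDE_L = 6
    RNA = 7
    NON_POLYMER = 8
    WATER = 9
    BRANCHED = 10
    MACROLIDE = 11


def chain_type_filter(column: str, *chain_types: ChainTypeSketch) -> str:
    """Build a filter string of the form '<column> in [<ids>]'."""
    ids = sorted(int(ct) for ct in chain_types)
    return f"{column} in {ids}"


# Same filter as in the config: D- and L-polypeptides.
both = chain_type_filter(
    "q_pn_unit_type",
    ChainTypeSketch.POLYPEPTIDE_D,
    ChainTypeSketch.POLYPEPTIDE_L,
)

# Restricting to D-polypeptides alone needs a single ID.
d_only = chain_type_filter("q_pn_unit_type", ChainTypeSketch.POLYPEPTIDE_D)

print(both)
print(d_only)
```

With named members, a filter that admits both D- and L-polypeptides is visibly different from one that admits D-polypeptides only, which is harder to see when the config carries only the integers.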
