Commit 253e5c8

author
Nam Do
committed
Cifs direct template structure
1 parent 59175f0 commit 253e5c8

11 files changed

Lines changed: 711 additions & 47 deletions

File tree

docs/source/Inference.md

Lines changed: 61 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -24,12 +24,12 @@ Supported:
2424
- Template-based prediction
2525
- using ColabFold template alignments
2626
- using pre-computed template alignments
27+
- using direct CIF template files (no alignments required)
2728
- Non-canonical residues
2829

2930
Coming soon:
3031

3132
- Covalently modified residues and other cross-chain covalent bonds
32-
- User-specified template structures (as opposed to top 4)
3333

3434
### 1.2 DNA
3535

@@ -301,6 +301,61 @@ model_update:
301301

302302
---
303303

304+
(inference-cif-direct-templates)=
305+
#### 🧬 CIF Direct Template Mode
306+
307+
OpenFold3 supports providing template structures directly as CIF files without requiring pre-computed template alignments. In this mode, the system automatically:
308+
1. Parses each provided CIF file
309+
2. Extracts all chains and their sequences
310+
3. Aligns each chain to your query sequence
311+
4. Selects the best matching chain based on sequence identity × coverage score
312+
313+
This is particularly useful for stateless inference environments or when you have specific template structures but no alignment files.
314+
315+
**Usage:**
316+
317+
In your query JSON, specify `template_cif_paths` instead of `template_alignment_file_path`:
318+
319+
```json
320+
{
321+
"queries": {
322+
"my_query": {
323+
"chains": [
324+
{
325+
"molecule_type": "protein",
326+
"chain_ids": ["A"],
327+
"sequence": "MKLLVVDDAGQKFT...",
328+
"template_cif_paths": [
329+
"path/to/template1.cif",
330+
"path/to/template2.cif",
331+
"path/to/template3.cif"
332+
],
333+
"template_cif_chain_ids": ["A", null, "B"]
334+
}
335+
]
336+
}
337+
}
338+
}
339+
```
340+
341+
Optionally, use `template_cif_chain_ids` to specify which chain to use from each CIF file. Use `null` to let the system automatically select the best-matching chain.
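As an illustration, a query file like the one above could be generated programmatically. The following is a minimal Python sketch, not part of OpenFold3: the helper name `build_query`, the placeholder sequence, and the output handling are assumptions made for the example. It relies only on the standard `json` module, which serializes Python `None` as JSON `null`.

```python
import json


def build_query(name, sequence, cif_paths, cif_chain_ids=None):
    """Build a single-chain query dict using CIF direct templates.

    Hypothetical helper for illustration; field names follow the docs above.
    """
    chain = {
        "molecule_type": "protein",
        "chain_ids": ["A"],
        "sequence": sequence,
        "template_cif_paths": list(cif_paths),
    }
    # Only emit the optional field when the caller provides it.
    if cif_chain_ids is not None:
        chain["template_cif_chain_ids"] = list(cif_chain_ids)
    return {"queries": {name: {"chains": [chain]}}}


query = build_query(
    "my_query",
    "ACDEFGHIKL",  # placeholder sequence, not a real target
    ["path/to/template1.cif", "path/to/template2.cif"],
    ["A", None],  # None becomes null: auto-select the chain for template2
)
query_json = json.dumps(query, indent=2)
print(query_json)
```

Omitting the fourth argument leaves `template_cif_chain_ids` out of the query entirely, which corresponds to automatic chain selection for every file.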
342+
343+
**Configuration:**
344+
345+
You can adjust the minimum score threshold for chain selection in your `runner.yml`:
346+
347+
```yaml
348+
template_preprocessor_settings:
349+
cif_direct_min_score: 0.1 # Default: 0.1 (seq_identity × coverage)
350+
```
351+
352+
**Notes:**
353+
- For multi-chain CIF files, only the best matching chain per file is used as a template
354+
- The `template_cif_paths` field cannot be used together with `template_alignment_file_path`
355+
- This mode is currently supported for protein chains only
356+
357+
---
358+
304359
### 3.4 Customized ColabFold MSA Server Settings Using `runner.yml`
305360

306361
All settings for the ColabFold server and outputs can be set under [`msa_computation_settings`](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/tools/colabfold_msa_server.py#L904)
@@ -478,9 +533,13 @@ This file representing the full input query in a validated internal format defin
478533

479534
- `template_alignment_file_path`: Path to the preprocessed template cache entry `.npz` file used for template featurization. By default, template cache entries are automatically created in a short preprocessing step using the raw template alignment files provided under this same field and the template structures identified in the alignment.
480535

536+
- `template_cif_paths`: List of paths to CIF template files when using {ref}`CIF direct template mode <inference-cif-direct-templates>`. This field is mutually exclusive with `template_alignment_file_path`.
537+
538+
- `template_cif_chain_ids`: List of chain IDs to use from each corresponding CIF file in `template_cif_paths`. Use `null` for entries where automatic chain selection is desired. Must have the same length as `template_cif_paths` if provided.
539+
481540
- `template_entry_chain_ids`: List of template chains, identified by their entry (typically PDB) IDs and chain IDs, used for featurization. By default, up to the first 4 of these chains are used.
482541

483-
Note: Refer to the {doc}`Template How-To Documentation <template_how_to>` for how to specify these fields if you want to use precomputed template alignments instead of Colabfold alignments for template inputs.
542+
Note: Refer to the {doc}`Template How-To Documentation <template_how_to>` for how to specify these fields if you want to use precomputed template alignments instead of ColabFold alignments for template inputs, or see {ref}`CIF Direct Template Mode <inference-cif-direct-templates>` for using template structures directly without alignments.
484543

485544
Note: If MSA and template files are persisted between runs, the same `inference_query_set.json` file can be used to resubmit the query without needing to rerun the template and MSA pipelines. To do so:
486545

docs/source/input_format.md

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -64,6 +64,8 @@ All chains must define a unique ```chain_ids``` field and appropriate sequence o
6464
"paired_msa_file_paths": "/absolute/path/to/paired_msas",
6565
"template_alignment_file_path": "/absolute/path/to/template_msa",
6666
"template_entry_chain_ids": ["entry1_A", "entry2_B", "entry3_A"],
67+
"template_cif_paths": ["/path/to/template1.cif", "/path/to/template2.cif"],
68+
"template_cif_chain_ids": ["A", null],
6769
}
6870
```
6971

@@ -119,6 +121,19 @@ All chains must define a unique ```chain_ids``` field and appropriate sequence o
119121
- Use this field only when running inference with **precomputed alignments**. See the {doc}`Running with Templates Documentation <template_how_to>` for details.
120122
- If using the ColabFold MSA server, this field is automatically populated and will **override any user-provided path**.
121123

124+
- `template_cif_paths` *(list[str], optional, default = null)*
125+
- List of paths to CIF files to use as templates for this chain.
126+
- Enables **CIF-direct template mode**, which parses templates directly from CIF files without requiring pre-computed alignments.
127+
- Alignments are computed on-the-fly using Kalign.
128+
- This is useful when you have known template structures but no pre-computed MSA/template alignments.
129+
- Example: `["/path/to/template1.cif", "/path/to/template2.cif"]`
130+
131+
- `template_cif_chain_ids` *(list[str | null], optional, default = null)*
132+
- List of chain IDs to use from each corresponding CIF file in `template_cif_paths`.
133+
- Must have the same length as `template_cif_paths` if provided.
134+
- Use `null` for a specific entry to let the parser automatically select the best-matching chain.
135+
- Example: `["A", null, "B"]` - uses chain A from the first CIF, auto-selects from the second, and uses chain B from the third.
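The constraints on these two fields can be checked before submitting a query. The sketch below is a hypothetical pre-flight check written for this guide, not the validation OpenFold3 itself performs; the function name `validate_cif_template_fields` and the error messages are assumptions.

```python
def validate_cif_template_fields(chain: dict) -> None:
    """Raise ValueError if one chain's CIF-direct template fields are inconsistent."""
    paths = chain.get("template_cif_paths")
    chain_ids = chain.get("template_cif_chain_ids")

    # Documented rule: the two template modes are mutually exclusive.
    if paths is not None and chain.get("template_alignment_file_path") is not None:
        raise ValueError(
            "template_cif_paths cannot be combined with template_alignment_file_path"
        )

    if chain_ids is None:
        return  # nothing more to check: every file uses automatic selection

    # Assumption: chain IDs are meaningless without the paths they refer to.
    if paths is None:
        raise ValueError("template_cif_chain_ids requires template_cif_paths")

    # Documented rule: one chain ID (or null/None) per CIF path.
    if len(chain_ids) != len(paths):
        raise ValueError(
            f"expected {len(paths)} chain IDs, got {len(chain_ids)}"
        )


validate_cif_template_fields(
    {
        "template_cif_paths": ["/path/to/template1.cif", "/path/to/template2.cif"],
        "template_cif_chain_ids": ["A", None],
    }
)
```

The call at the end passes silently; a list of chain IDs with a different length than `template_cif_paths` raises `ValueError`.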
136+
122137
### 3.2. RNA Chains
123138

124139
```

docs/source/template_how_to.md

Lines changed: 94 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,13 @@
11
# Running OpenFold3 Inference with Templates
22

3-
This document contains instructions on how to use template information for OF3 predictions. Here, we assume that you already generated all of your template alignments or intend to fetch them from Colabfold on-the-fly. If you do not have any precomputed template alignments and do not want to use Colabfold, refer to our {doc}`MSA Generation Guide <precomputed_msa_generation_how_to>` before consulting this document. If you need further clarifications on how some of the template components of our inference pipeline work, refer to {doc}`this explanatory document <template_explanation>`.
3+
This document contains instructions on how to use template information for OF3 predictions. OpenFold3 supports two template modes:
4+
5+
1. **Alignment-based templates** (traditional): Requires template alignments and template structures
6+
2. **CIF direct templates** (simplified): Requires only template CIF files, no alignments needed
7+
8+
For alignment-based templates, we assume you have already generated all of your template alignments or intend to fetch them from ColabFold on the fly. If you do not have any precomputed template alignments and do not want to use ColabFold, refer to our {doc}`MSA Generation Guide <precomputed_msa_generation_how_to>` before consulting this document.
9+
10+
If you need further clarifications on how some of the template components of our inference pipeline work, refer to {doc}`this explanatory document <template_explanation>`.
411

512
The template pipeline currently supports monomeric templates and has been tested for protein chains only.
613

@@ -12,10 +19,18 @@ The main steps detailed in this guide are:
1219
(1-template-files)=
1320
## 1. Template Files
1421

15-
Template featurization requires query-to-template **alignments** and template **structures**.
22+
OpenFold3 supports two modes for providing template information:
23+
24+
### Alignment-Based Mode (Traditional)
25+
Requires query-to-template **alignments** and template **structures**. Sections 1.1 and 1.2 below describe the required file formats.
26+
27+
### CIF Direct Mode (Simplified)
28+
Requires only template **CIF files**. The system automatically aligns template chains to your query sequence and selects the best matching chain. See {ref}`Section 2.3 <23-cif-direct-templates>` for usage details.
29+
30+
---
1631

1732
(11-template-aligment-file-format)=
18-
### 1.1. Template Aligment File Format
33+
### 1.1. Template Alignment File Format (Alignment-Based Mode)
1934

2035
Template alignments can be provided in either `sto`, `a3m` or `m8` format. Template alignments from the Colabfold server are in `m8` format.
2136

@@ -73,16 +88,18 @@ query_A template_C 71.4 14 4 0 5 18 75 88 2e-03 22.3
7388

7489
Note that since `m8` files do not provide actual alignments, we only use them to identify which structure files to get templates from, retrieve sequences from these structure files and always realign them to the query sequence using Kalign. More on this in the [template processing explanatory document](template_explanation.md).
7590

76-
### 1.2. Template Structure File Format
91+
### 1.2. Template Structure File Format (Alignment-Based Mode)
92+
93+
For alignment-based templates, template structures currently can only be provided in `cif` format. An upcoming release will add support for parsing templates from `pdb` files.
7794

78-
Template structures currently can only be provided in `cif` format. An upcoming release will add support for parsing templates from `pdb` files.
95+
**Note:** For {ref}`CIF direct mode <23-cif-direct-templates>`, template CIF files are specified directly in the query JSON without separate structure directories.
7996

8097
(2-specifying-template-information-in-the-inference-query-file)=
8198
## 2. Specifying Template Information in the Inference Query File
8299

83-
### 2.1. Specifying Alignments
100+
### 2.1. Specifying Alignments (Alignment-Based Mode)
84101

85-
The data pipeline needs to know which template alignment to use for which chain. This information is provided by specifying the {ref}`paths to the alignments <31-protein-chains>` for each chain's `template_alignment_file_path` field in the inference query json file.
102+
For alignment-based templates, the data pipeline needs to know which template alignment to use for which chain. This information is provided by specifying the {ref}`paths to the alignments <31-protein-chains>` for each chain's `template_alignment_file_path` field in the inference query JSON file.
86103

87104
Note that when fetching alignments from the Colabfold server, `template_alignment_file_path` fields are automatically populated.
88105

@@ -118,9 +135,9 @@ Note that when fetching alignments from the Colabfold server, `template_alignmen
118135
</code></pre>
119136
</details>
120137

121-
### 2.2. Using Specific Templates
138+
### 2.2. Using Specific Templates (Alignment-Based Mode)
122139

123-
By default, the template pipeline automatically populates the `template_entry_chain_ids` field with [n templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L1535) from the alignment, which is then further subset to the [top k templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L116) during featurization for inference.
140+
By default, for alignment-based templates, the template pipeline automatically populates the `template_entry_chain_ids` field with [n templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/core/data/pipelines/preprocessing/template.py#L1535) from the alignment, which is then further subset to the [top k templates](https://github.com/aqlaboratory/openfold-3/blob/main/openfold3/projects/of3_all_atom/config/dataset_config_components.py#L116) during featurization for inference.
124141

125142
In an **upcoming release**, we will add support for specifying *specific templates* for the data pipeline to use for featurization. This will be possible through the `template_entry_chain_ids` field:
126143

@@ -156,10 +173,77 @@ entry3_A MK----DDARGQGKFT
156173
//
157174
```
158175

176+
(23-cif-direct-templates)=
177+
### 2.3. CIF Direct Templates (No Alignments Required)
178+
179+
OpenFold3 supports providing template structures directly as CIF files without requiring pre-computed template alignments. This is particularly useful for:
180+
- Stateless inference environments (e.g., NVIDIA Inference Microservices)
181+
- Quick predictions when you have specific template structures
182+
- Simplified workflows without external alignment tools
183+
184+
#### How It Works
185+
186+
In CIF direct mode, the system automatically:
187+
1. Parses each provided CIF file to extract all chains and their sequences
188+
2. Aligns each chain sequence to your query sequence on the fly using Kalign
189+
3. Scores each chain by `sequence_identity × coverage`
190+
4. Selects the best matching chain as the template (if score ≥ minimum threshold)
191+
192+
For multi-chain CIF files, only the best matching chain per file is used.
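The scoring and selection steps above can be sketched in Python as follows. This is an illustrative approximation, not the OpenFold3 implementation: the names `ChainScore`, `score_alignment`, and `select_best_chain` are hypothetical, the pipeline aligns with Kalign whereas this sketch takes already-aligned gapped strings as input, and the exact definitions of identity (matches over aligned columns) and coverage (aligned columns over query length) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ChainScore:
    chain_id: str
    identity: float
    coverage: float

    @property
    def score(self) -> float:
        # Score used for ranking: sequence identity times coverage.
        return self.identity * self.coverage


def score_alignment(chain_id: str, query_aln: str, template_aln: str) -> ChainScore:
    """Score one pairwise alignment given as two equal-length gapped strings."""
    if len(query_aln) != len(template_aln):
        raise ValueError("aligned strings must have equal length")
    query_len = sum(1 for q in query_aln if q != "-")
    # Columns where both the query and the template have a residue.
    aligned = [(q, t) for q, t in zip(query_aln, template_aln) if q != "-" and t != "-"]
    if query_len == 0 or not aligned:
        return ChainScore(chain_id, 0.0, 0.0)
    matches = sum(1 for q, t in aligned if q == t)
    identity = matches / len(aligned)    # assumed definition
    coverage = len(aligned) / query_len  # assumed definition
    return ChainScore(chain_id, identity, coverage)


def select_best_chain(scores, min_score=0.1):
    """Return the highest-scoring chain, or None if it scores below min_score."""
    best = max(scores, key=lambda s: s.score, default=None)
    if best is None or best.score < min_score:
        return None
    return best


# Two chains from one hypothetical CIF file, aligned to the same query.
candidates = [
    score_alignment("A", "MKLLVVDD", "MKLLVVDD"),
    score_alignment("B", "MKLLVVDD", "MK--AVDE"),
]
best = select_best_chain(candidates)
print(best.chain_id if best else "no chain passed the threshold")
```

Multiplying identity by coverage penalizes both short, high-identity fragments and long, low-identity alignments, so a chain has to match well over most of the query to rank first.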
193+
194+
#### Usage Example
195+
196+
Specify `template_cif_paths` instead of `template_alignment_file_path` in your query JSON:
197+
198+
```json
199+
{
200+
"queries": {
201+
"my_protein": {
202+
"chains": [
203+
{
204+
"molecule_type": "protein",
205+
"chain_ids": ["A", "B"],
206+
"sequence": "XRMKQLEDKVEELLSKNYHLENEVARLKKLVGER",
207+
"template_cif_paths": [
208+
"templates/1dgc.cif",
209+
"templates/1ysa.cif",
210+
"templates/1zta.cif"
211+
]
212+
}
213+
]
214+
}
215+
}
216+
}
217+
```
218+
219+
**Example query files:**
220+
- [Homomer with direct CIF templates](https://github.com/aqlaboratory/openfold-3/blob/main/examples/example_inference_inputs/query_homomer_with_direct_cif_templates.json)
221+
- [Multimer with direct CIF templates](https://github.com/aqlaboratory/openfold-3/blob/main/examples/example_inference_inputs/query_multimer_with_direct_cif_templates.json)
222+
223+
#### Configuration
224+
225+
Adjust the minimum score threshold for chain selection in your `runner.yml`:
226+
227+
```yaml
228+
template_preprocessor_settings:
229+
cif_direct_min_score: 0.1 # Default: 0.1 (seq_identity × coverage)
230+
```
231+
232+
Only chains with a score (sequence identity × coverage) at or above this threshold will be considered as valid templates.
233+
234+
#### Important Notes
235+
236+
- The `template_cif_paths` field is **mutually exclusive** with `template_alignment_file_path` - you must use one or the other, not both
237+
- Template structures must be in CIF format
238+
- Currently supported for protein chains only
239+
- For multi-chain CIF files, the system automatically selects the best matching chain per file
240+
159241
(3-optimizations-for-high-throughput-workflows)=
160242
## 3. Optimizations for High-Throughput Workflows
161243

162-
For high-throughput use cases, where a large number of structures are to be predicted, template processing can take a significant amount of time even with the built-in {doc}`deduplication utility <template_explanation>` we have for template alignment and structure processing. To avoid having to spend GPU compute on data transformations, we provide separate template preprocessing scripts to generate the necessary inputs from which template featurization can run efficiently in a subsequent job without being a bottleneck to the model forward pass.
244+
**Note:** The optimizations described in this section apply to **alignment-based templates**. If you're using {ref}`CIF direct templates <23-cif-direct-templates>`, the workflow is already simplified and these preprocessing steps are not necessary.
245+
246+
For high-throughput use cases with alignment-based templates, where a large number of structures are to be predicted, template processing can take a significant amount of time even with the built-in {doc}`deduplication utility <template_explanation>` we have for template alignment and structure processing. To avoid having to spend GPU compute on data transformations, we provide separate template preprocessing scripts to generate the necessary inputs from which template featurization can run efficiently in a subsequent job without being a bottleneck to the model forward pass.
163247

164248
### 3.1. Template Alignment Preprocessing
165249

openfold3/core/data/framework/data_module.py

Lines changed: 11 additions & 0 deletions
@@ -41,6 +41,7 @@

 import dataclasses
 import enum
+import logging
 import random
 import warnings
 from typing import Any
@@ -74,6 +75,7 @@
 from openfold3.core.utils.tensor_utils import dict_multimap

 _NUMPY_AVAILABLE = RequirementCache("numpy")
+logger = logging.getLogger(__name__)


 class DatasetMode(enum.Enum):
@@ -516,8 +518,15 @@ def __init__(
         self.inference_config = _configs.configs[0]

     def prepare_data(self) -> None:
+        logger.info("=" * 60)
+        logger.info(
+            f"Prepare data: use_msa_server={self.use_msa_server}, use_templates={self.use_templates}"
+        )
+        logger.info("=" * 60)
+
         # Colabfold msa preparation
         if self.use_msa_server:
+            logger.info("Running ColabFold MSA server...")
             self.inference_config.query_set = preprocess_colabfold_msas(
                 inference_query_set=self.inference_config.query_set,
                 compute_settings=self.msa_computation_settings,
@@ -529,11 +538,13 @@ def prepare_data(self) -> None:
             )

         if self.use_templates:
+            logger.info("Running template preprocessing...")
             template_preprocessor = TemplatePreprocessor(
                 input_set=self.inference_config.query_set,
                 config=self.inference_config.template_preprocessor_settings,
             )
             template_preprocessor()
+            logger.info("Template preprocessing complete!")

     def setup(self, stage=None):
         """Broadcast updated query set to all ranks if multiple GPUs are used."""
