Benchmark for Biomolecular Structure Prediction Models

📄 From Dataset Curation to Unified Evaluation: Revisiting Structure Prediction Benchmarks with PXMeter

Usage
- 1. Prepare the Dataset
- 2. Run Evaluation on a Dataset
How Aggregation Works
Output Structure
File Descriptions
Extending the Benchmark with New Models
Metrics and Rankers

Usage

This section explains how to prepare the benchmark datasets and how to run the complete evaluation pipeline, from data setup to result aggregation.

1. Prepare the Dataset

There are two supported ways to obtain the dataset:

- Generate your own dataset following the full data pipeline, or
- Download the pre-built dataset (licensed under CC0).

Option A — Build your own dataset using the Data Pipeline

If you want to use a different time window or periodically rebuild the benchmark as new PDB entries are released, you can construct the dataset yourself using the provided pipeline.

Follow the instructions in Dataset Pipeline Overview to run all steps of the dataset construction process. (If you do not modify any part of the pipeline, the entire dataset can be generated with a single command.)

For a detailed explanation of every generated file, refer to Dataset Output Files Reference.

Option B — Download the pre-built dataset (CC0 License)

If you prefer not to run the pipeline, you can directly download one of the pre-built datasets (CC0 licensed). Each dataset can be downloaded and extracted with:

wget <DOWNLOAD_URL>
tar xzvf <FILENAME>.tar.gz -C [your_path]

For a full list of available datasets and their descriptions, see Available Datasets.

2. Run Evaluation on a Dataset

2.1 Set the Dataset Path

First, set the PXM_EVAL_DATA_ROOT_PATH environment variable to the root directory where you placed the benchmark source data (mmCIF files, metadata CSVs, etc.):

export PXM_EVAL_DATA_ROOT_PATH="your/path/to/dataset"

This root is used internally by the evaluation scripts to locate:

- reference mmCIF files,
- subset/metadata CSVs (e.g., RecentPDB low-homology subsets, AF3-AB metadata),
- PoseBusters ligand information and validity annotations.
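
Because every later step depends on this variable, it can help to check it before launching a long run. The following is a minimal sketch of such a check; `resolve_data_root` is a hypothetical helper written for illustration and is not part of the benchmark package.

```python
import os
from pathlib import Path


def resolve_data_root(env_var: str = "PXM_EVAL_DATA_ROOT_PATH") -> Path:
    """Return the dataset root named by the environment, failing early if unusable.

    Hypothetical helper: the benchmark scripts may resolve the path differently.
    """
    raw = os.environ.get(env_var)
    if not raw:
        raise RuntimeError(f"{env_var} is not set; export it before running the evaluation.")
    root = Path(raw).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"{env_var} points to {root}, which is not a directory.")
    return root
```

Calling `resolve_data_root()` at the top of a wrapper script turns a missing or mistyped path into an immediate error rather than a failure partway through evaluation.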

2.2 Run Evaluation

Run the evaluation script as follows:

python -m benchmark.run_eval \
    -i [infer_dir] \
    -o [output_dir] \
    -r [ref_assembly_id] \
    -l [lig_info_csv] \
    -m [model] \
    -n [num_cpu] \
    -c [chunk_str]

- infer_dir: Directory containing inference results (mmCIF files and model-specific confidence files).
- output_dir: Directory where evaluation results (per-entry JSONs) will be saved.
- ref_assembly_id: Reference assembly ID to use for evaluation. Default is None (asymmetric unit). If you are using the RecentPDB dataset generated by the data pipeline, set this value to "1".
- lig_info_csv: Path to the CSV file containing ligand information for the dataset. Default is None. To evaluate pocket-aligned RMSD and PoseBusters ligand validity checks for small molecules in the dataset, you must provide a CSV file containing the columns "entry_id" and "label_asym_id". For datasets generated by the PXMeter data pipeline, this file is located at [PXM_EVAL_DATA_ROOT_PATH]/supported_data/RecentPDB_low_homology_lig_info.csv. If this argument is not provided, these two ligand-specific evaluations are skipped.
- model: Name of the model whose outputs you are evaluating. This string selects the corresponding evaluator and file layout. It must match one of the module names in benchmark/evaluators/ (e.g., af3, protenix, chai, boltz).
- num_cpu: Number of CPU cores to use. Default is 1. A positive value caps the number of cores; -1 requests all available cores, depending on the implementation.
- chunk_str: Optional chunk specifier such as "1of8", "2of8", … used to split the list of PDB entries into roughly equal parts when distributing evaluation across multiple jobs or machines. If omitted, all entries are evaluated in a single run.
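
To make the chunk_str convention concrete, here is a rough sketch of how a "kofN" specifier could map to a slice of the entry list. `select_chunk` is a hypothetical helper, and both the sorting and the way the remainder is distributed are assumptions; the actual splitting rule in benchmark.run_eval may differ.

```python
import re
from typing import List, Optional


def select_chunk(entries: List[str], chunk_str: Optional[str]) -> List[str]:
    """Return the part of `entries` named by a spec such as "2of8".

    Hypothetical helper: illustrates the idea, not the benchmark's own code.
    """
    if chunk_str is None:
        return list(entries)  # no spec: evaluate everything in one run
    match = re.fullmatch(r"(\d+)of(\d+)", chunk_str)
    if match is None:
        raise ValueError(f"Expected '<k>of<n>', got {chunk_str!r}")
    k, n = int(match.group(1)), int(match.group(2))
    if not 1 <= k <= n:
        raise ValueError(f"Chunk index {k} is outside 1..{n}")
    ordered = sorted(entries)  # a fixed order keeps chunks disjoint across jobs
    # Spread the remainder over the first chunks so sizes differ by at most one.
    base, extra = divmod(len(ordered), n)
    start = (k - 1) * base + min(k - 1, extra)
    stop = start + base + (1 if k <= extra else 0)
    return ordered[start:stop]


entry_ids = [f"entry{i:02d}" for i in range(10)]
for k in (1, 2, 3):
    print(f"{k}of3", select_chunk(entry_ids, f"{k}of3"))
```

Whatever rule is used, the property that matters is that the chunks are disjoint and together cover every entry, so that N jobs launched with "1ofN" through "NofN" evaluate each entry exactly once.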

Each supported model has its own expected directory layout under infer_dir; the evaluator selected by -m knows how to:

- locate the predicted mmCIF for each (entry_id, seed, sample),
- locate and parse model-specific confidence scores,
- write one JSON metrics file per evaluated sample (see next section for how these JSONs are aggregated).

2.3 Create a JSON file specifying evaluation result paths

Prepare a JSON file that describes the model, its seeds, and the locations of evaluation outputs for each dataset:

{
  "name": {
    "model": "model name (protenix/af2m/chai/boltz...)",
    "seeds": [101, 102],
    "dataset_path": {
      "RecentPDB": "path/to/eval_results/RecentPDB",
      "PoseBusters": "path/to/eval_results/PoseBusters",
      "AF3-AB": "path/to/eval_results/AF3_AB"
    }
  }
}

The top-level key ("name" in the example) is a logical identifier for this model configuration. You can define several configurations (e.g., "protenix_v1", "protenix_baseline", "px_finetuned") in the same JSON file.
- model: The same model string passed to benchmark.run_eval via -m, corresponding to one of the module names in benchmark/evaluators/.
- seeds: List of integer seeds to include for this configuration. Metrics are computed only on the PDB IDs and chain/interface pairs that are present for every listed seed.
- dataset_path: A mapping from evaluation dataset name to the directory that contains the evaluation outputs for that dataset (typically the output_dir from Step 2.2).

Allowed keys for dataset_path are: "RecentPDB", "PoseBusters", "dsDNA-Protein", "RNA-Protein", and "AF3-AB". You may include only the datasets you evaluated.

Note: If you are using the dataset created by the dataset pipeline, the correct key to use is "RecentPDB".
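
If you maintain several configurations, generating this file from a script can catch typos in dataset keys early. The sketch below is one way to do that; `make_entry` and `ALLOWED_DATASETS` are hypothetical names, and the allowed set is simply copied from the list above.

```python
import json

# Dataset keys listed as allowed in the text above.
ALLOWED_DATASETS = {"RecentPDB", "PoseBusters", "dsDNA-Protein", "RNA-Protein", "AF3-AB"}


def make_entry(model, seeds, dataset_path):
    """Build one configuration entry after basic sanity checks (hypothetical helper)."""
    unknown = sorted(set(dataset_path) - ALLOWED_DATASETS)
    if unknown:
        raise ValueError(f"Unsupported dataset keys: {unknown}")
    if not seeds or not all(isinstance(s, int) for s in seeds):
        raise ValueError("seeds must be a non-empty list of integers")
    return {"model": model, "seeds": list(seeds), "dataset_path": dict(dataset_path)}


config = {
    "protenix_v1": make_entry(
        "protenix",
        [101, 102],
        {
            "RecentPDB": "path/to/eval_results/RecentPDB",
            "PoseBusters": "path/to/eval_results/PoseBusters",
        },
    ),
}

# Save this text to the file that is later passed to the aggregation script via -i.
config_text = json.dumps(config, indent=2)
print(config_text)
```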

2.4 Aggregate and Display Evaluation Results

Run the aggregation script:

export PXM_EVAL_DATA_ROOT_PATH="your/path/to/dataset"
python -m benchmark.show_intersection_results \
    -i [input_json] \
    -o [output_path] \
    -n [num_cpu] \
    -c [subset_csv] \
    -p [pdb_id_list_file] \
    --overwrite_agg \
    --out_file_name [Summary_table_basename]

Recommended Minimal Usage

In typical workflows, only three arguments are needed:

- -i: Path to the evaluation-results JSON file (prepared in Step 2.3).
- -o: Directory where aggregated CSV files will be written.
- -n: Number of CPU cores to use for aggregation.

All other flags are optional and are only needed when applying advanced restrictions (e.g., limiting to a specific PDB list, applying subset filtering, overriding cached parquet files, or customizing output names).

If no such customization is required, you may simply run:

export PXM_EVAL_DATA_ROOT_PATH="your/path/to/dataset"
python -m benchmark.show_intersection_results \
    -i pxm_eval_paths.json \
    -o pxm_results \
    -n 8

This command runs the full aggregation pipeline with default settings across all datasets declared in the input JSON.

Parameters

- input_json: Path to the JSON file describing evaluation results.
- output_path (optional): Directory where aggregated CSV files will be written. Default: ./pxm_results.
- num_cpu: Number of CPU cores used for aggregation.
- pdb_id_list_file (optional): Path to a text file listing PDB IDs to retain during aggregation.
- subset_csv (optional): Defines a subset of (entry_id, chain_id_1, chain_id_2) pairs to include.
- overwrite_agg (optional): Whether to overwrite cached <dataset>_metrics.parquet files.
- out_file_name (optional): Base name for exported summary tables.

Extending the Benchmark with New Models

The benchmark system is modular and supports easy integration of new prediction models. To add support for a new model (e.g., my_model), follow these steps:

Create a model directory:

mkdir benchmark/evaluators/my_model
touch benchmark/evaluators/my_model/__init__.py

Define Ranker Keys in benchmark/evaluators/my_model/config.py:
```
RANKER_KEYS = {
    "complex": [("ranking_score", False)], # (score_key, ascending)
    "chain": [],
    "interface": [],
}
```
The RANKER_KEYS dictionary defines which confidence scores should be extracted from your model's JSON output and how they should be interpreted:
- Each entry is a tuple: (json_key, ascending).
- json_key: The key used in your model's original confidence JSON file.
- ascending: Boolean. If True, a lower score is better (e.g., has_clash). If False, a higher score is better (e.g., pLDDT).
Levels of Confidence:
- complex: Scores representing the entire structure (e.g., ranking_score, pTM, ipTM). The system expects a single numerical value for these keys.
- chain: Scores for individual chains (e.g., chain_pLDDT). The system expects a list or dictionary of values corresponding to the chains in the structure.
- interface: Scores for interactions between chain pairs (e.g., chain_pair_ipTM). The system expects a matrix (2D array) or a flattened list that represents scores for all possible chain pairs.
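
To illustrate what the ascending flag means in practice, the sketch below picks the sample that each complex-level score would rank first. The confidence values and the `top_sample` helper are made up for illustration; the benchmark's own ranker code is not shown here and may work differently.

```python
# Example configuration using the two score names mentioned above.
RANKER_KEYS = {
    "complex": [("ranking_score", False), ("has_clash", True)],  # (score_key, ascending)
    "chain": [],
    "interface": [],
}

# Hypothetical complex-level confidence values for three samples of one entry.
samples = [
    {"sample": "0", "ranking_score": 0.71, "has_clash": 0.0},
    {"sample": "1", "ranking_score": 0.88, "has_clash": 1.0},
    {"sample": "2", "ranking_score": 0.64, "has_clash": 0.0},
]


def top_sample(candidates, score_key, ascending):
    """Return the candidate the model itself would rank first for one score."""
    # ascending=True means lower is better, so the minimum wins; otherwise the maximum.
    chooser = min if ascending else max
    return chooser(candidates, key=lambda s: s[score_key])


for score_key, ascending in RANKER_KEYS["complex"]:
    chosen = top_sample(samples, score_key, ascending)
    print(f"{score_key} (ascending={ascending}) -> sample {chosen['sample']}")
```

Getting the flag wrong silently inverts the ranker, so a model would appear to select its worst sample; it is worth double-checking for every key you add.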

Implement the Evaluator in benchmark/evaluators/my_model/evaluator.py: Inherit from BaseEvaluator and override _get_info_from_each_pdb_dir. This method is responsible for scanning a specific PDB's prediction directory and collecting all available samples (seeds/models).

💡 Non-Developers: If you're not familiar with Python, check out Generate Evaluator with AI for a guide on using LLMs to create your evaluator.

Typical directory structure:

pred_dir/
├── 7rss/                <-- pdb_dir
│   ├── seed_1/
│   │   ├── sample_1.cif
│   │   └── sample_1.json
│   └── seed_2/
│       ├── sample_1.cif
│       └── sample_1.json
└── 8abc/                <-- pdb_dir
    └── ...

Here is an implementation example for the typical directory structure mentioned above:

from pathlib import Path
from benchmark.evaluators.base import BaseEvaluator
from benchmark.evaluators.my_model.config import RANKER_KEYS

class MyModelEvaluator(BaseEvaluator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.ranker = RANKER_KEYS

    def _get_info_from_each_pdb_dir(self, pdb_dir: Path) -> list:
        """
        Args:
            pdb_dir: Path to the directory containing predictions for a specific PDB ID.
        Returns:
            A list of tuples, each representing one prediction sample.
        """
        name = pdb_dir.name # e.g., "7rss"
        pdb_id = name

        sub_data = []
        # Iterate through seed directories: e.g., seed_1, seed_2
        for seed_dir in pdb_dir.iterdir():
            if not seed_dir.is_dir():
                continue
            seed = seed_dir.name

            # Find all .cif files and their corresponding .json confidences
            for cif_path in seed_dir.glob("*.cif"):
                # Assume filename is "sample_1.cif"
                sample = cif_path.stem.split("_")[-1]
                conf_path = seed_dir / f"sample_{sample}.json"

                if conf_path.exists():
                    sub_data.append((
                        name,
                        pdb_id,
                        seed,
                        sample,
                        cif_path,
                        conf_path,
                        None  # model_chain_id_to_lig_mol
                    ))
        return sub_data

The system will automatically discover your new model. You can then use it by passing -m my_model to benchmark.run_eval.

Verify your implementation: Use the provided verification script to ensure your evaluator is correctly implemented and can process samples end-to-end:
```
python -m benchmark.scripts.verify_evaluator \
    -m my_model \
    -i /path/to/your/predictions \
    -t /path/to/reference/mmcifs \
    --limit 5
```
This script randomly samples up to --limit tasks and runs a trial evaluation in a temporary directory.

How Aggregation Works

The aggregation script internally executes four major stages:

Per-dataset JSON → Parquet conversion

For each evaluation dataset referenced in dataset_path, the script:
- Scans the per-sample JSON evaluation outputs (one JSON per (entry_id, seed, sample)).
- Normalizes them into a single tabular structure containing one row per:
  - complex (type == "complex"),
  - chain (type == "chain"),
  - interface (type == "interface").
- Writes a *_metrics.parquet file (and, for PoseBusters validity checks, an additional *_pb_valid.parquet file) in the same directory as the evaluation outputs.
This Parquet file becomes the canonical source for all downstream metrics. It includes:
- identifiers: entry_id, seed, sample,
- chain/interface identifiers: chain_id_1, chain_id_2, entity_id_1, entity_id_2, entity_type_1, entity_type_2,
- metric columns (e.g., LDDT, DockQ, RMSD),
- model-specific ranking scores (used by rankers).

Intersection and subset selection

When aggregating across multiple model configurations and datasets, the script:
- computes a match_key = (entry_id, chain_id_1, chain_id_2) for every row,
- intersects all datasets at this key, ensuring that only samples present in all selected datasets are retained,
- optionally filters by:
  - PDB ID list (pdb_id_list_file),
  - seeds configured for each model in input_json,
  - subset definitions in subset_csv (including optional cluster_id if provided).
For the RecentPDB dataset, cluster annotations (e.g., low-homology clusters) are added on-the-fly using the cluster metadata CSV shipped with the benchmark. For Custom datasets, if subset_csv contains a cluster_id column, that cluster ID is attached to each matching row.
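
The intersection step can be sketched in plain Python as follows, assuming each table is a list of row dictionaries carrying the identifier columns named above. `intersect_on_match_key` is a hypothetical helper; the real script works on DataFrames and also handles seeds, subsets, and intra-chain rows, none of which is reproduced here.

```python
def match_key(row):
    """Build the (entry_id, chain_id_1, chain_id_2) key for one metrics row."""
    return (row["entry_id"], row["chain_id_1"], row["chain_id_2"])


def intersect_on_match_key(tables):
    """Keep only rows whose match_key occurs in every table.

    `tables` maps a label (e.g. a model configuration name) to a list of rows.
    """
    if not tables:
        return {}
    key_sets = [{match_key(r) for r in rows} for rows in tables.values()]
    shared = set.intersection(*key_sets)
    return {
        label: [r for r in rows if match_key(r) in shared]
        for label, rows in tables.items()
    }


def row(entry_id, chain_1, chain_2, lddt):
    return {"entry_id": entry_id, "chain_id_1": chain_1, "chain_id_2": chain_2, "lddt": lddt}


# Made-up rows for two configurations; only two keys are common to both.
tables = {
    "config_a": [row("7rss", "A", "B", 0.81), row("7rss", "A", "C", 0.64), row("8abc", "A", "B", 0.72)],
    "config_b": [row("7rss", "A", "B", 0.77), row("8abc", "A", "B", 0.70), row("8xyz", "A", "B", 0.55)],
}
common = intersect_on_match_key(tables)
print({label: [match_key(r) for r in rows] for label, rows in common.items()})
```

Restricting every table to the shared keys is what makes the reported numbers comparable: no configuration is scored on an interface that another one failed to produce.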

Per-dataset metric computation

Once the intersection DataFrames are prepared, metrics are computed by dataset type:

RecentPDB
- Uses both DockQ (for protein-protein interfaces) and LDDT (for intra-chain and interface-level evaluation).
- Computes ligand and pocket RMSD for protein–ligand chains when the underlying metrics table contains lig_rmsd / pocket_rmsd, and aggregates them over low-homology pocket clusters in the same two-stage way as other metrics.
- Computes CDR-H3 backbone RMSD for antibody heavy chains when the underlying metrics table contains cdr_h3_bb_rmsd. Unlike other metrics, this is averaged directly over all chains without clustering.
- Subsets by antibody vs non-antibody, monomer, peptide, cyclic peptide, etc., using the subset labels in the metadata CSV.
AF3-AB
- Selects AF3 antibody-antigen interfaces using the AF3-AB metadata.
- Computes DockQ and LDDT metrics for these antibody-protein interfaces only.
PoseBusters
- Combines RMSD metrics (ligand and pocket) with PoseBusters validity checks from the *_pb_valid table.
- Computes success rates (e.g., RMSD below threshold) and provides penalized variants of rankers that discount chemically invalid structures.
dsDNA-Protein
- Uses DNA-Protein LDDT, averaging LDDT across both DNA chains for each complex to obtain a single dsDNA-Protein score per case.
RNA-Protein
- Uses RNA-protein LDDT on interfaces tagged as "RNA-Protein".
Custom
- Uses LDDT for whatever chain/interface types are present, without built-in subset semantics.
- If the per-sample metrics table contains a ligand RMSD column named lig_rmsd, RMSD-based aggregate metrics are also computed, and the corresponding RMSD_results.csv / RMSD_details.csv files are populated for this Custom dataset.

Result consolidation
- Per-dataset summary and detail DataFrames are concatenated across model configurations.
- Each result row is annotated with:
  - name (model configuration key from input_json),
  - eval_dataset (e.g., RecentPDB, PoseBusters, AF3-AB),
  - eval_type (e.g., Intra-Protein, Protein-Protein, RNA-Protein, dsDNA-Protein),
  - subset (e.g., All, Antibody=True, Monomer, Peptide),
  - ranker (how the final sample per case was selected).

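Because per-sample metrics are cached as Parquet files, repeated aggregations reuse the cached tables (unless --overwrite_agg is passed), which keeps evaluation efficient and reproducible while supporting incremental updates.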

Output Structure

After aggregation, the evaluation pipeline produces two levels of outputs:

- Global summary tables under the directory specified by output_path (see below for the file layout).
- Per-dataset result and detail CSVs written into each evaluation directory listed in dataset_path in the input JSON.

In addition, for each evaluation dataset, a cached <eval_dataset>_metrics.parquet file is created next to the raw JSON outputs. This Parquet file stores all per-sample metrics and is reused as the canonical data source for subsequent aggregations.

A typical output_path directory (default: ./pxm_results) looks like:

pxm_results
├── DockQ_details.csv
├── DockQ_results.csv
├── LDDT_details.csv
├── LDDT_results.csv
├── RMSD_details.csv
├── RMSD_results.csv
├── Summary_table_ranked.csv
├── Summary_table.csv
└── Summary_table.txt

File Descriptions

Summary_table.csv / Summary_table.txt: Provide a high-level summary of the benchmark performance, including metrics such as DockQ success rate, PoseBusters success rate, and LDDT. The Summary_table files use abbreviations for evaluation types; for example, protein-protein interfaces are shown as prot_prot (other interface types follow similar abbreviation patterns).
Summary_table_ranked.csv: A ranked version of Summary_table.csv, with evaluation types shown in full (no abbreviations). Each row corresponds to a single (name, eval_dataset, eval_type, subset, ranker) combination, and the table is sorted by the main metric so that the best-performing configuration appears first.
*_results.csv: Contain aggregated per-sample metrics derived from the *_metrics.parquet file. These represent the final evaluation metrics after intersecting all selected datasets and applying the chosen ranker.
*_details.csv: Contain sample-level details, including the metrics and identifiers used by rankers to select the final prediction per case. Typical columns include:
- id, name, eval_dataset, eval_type, subset, ranker
- entry_id, seed, sample
- chain_id_1, chain_id_2, entity_id_1, entity_id_2, entity_type_1, entity_type_2
- cluster_id, and all relevant metric columns (DockQ, LDDT, RMSD, PoseBusters checks, model scores, etc.)
These files are useful for in-depth debugging and analysis of individual examples.

Metrics and Rankers

The aggregation code distinguishes between summary metrics (one row per dataset / eval type / subset / ranker) and detail rows (one row per evaluated chain or interface). The most important metrics are:

DockQ-based metrics (for protein-protein interfaces on RecentPDB and AF3-AB):
- Computed from per-interface DockQ scores aggregated at the level of structure clusters.
- avg_dockq_avg_sr: mean DockQ value within each cluster, then converted to a success rate across clusters using a DockQ threshold (0.23 by default).
- avg_dockq_sr_avg_sr: average success rate when thresholding DockQ within each cluster first.
- Confidence intervals are reported as ci_avg_dockq_avg_sr (exact binomial interval over clusters) and ci_avg_dockq_sr_avg_sr (bootstrap interval).
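
The difference between the two DockQ success rates is easiest to see on a toy example. The sketch below encodes one reading of the definitions above: average within each cluster and then threshold, versus threshold each interface and then average. The data are invented, `dockq_success_rates` is a hypothetical helper, and treating a score equal to the threshold as a success is an assumption.

```python
from collections import defaultdict
from statistics import mean

DOCKQ_THRESHOLD = 0.23  # default threshold quoted above


def dockq_success_rates(rows, threshold=DOCKQ_THRESHOLD):
    """Compute both cluster-level DockQ success rates from (cluster_id, dockq) pairs."""
    clusters = defaultdict(list)
    for cluster_id, dockq in rows:
        clusters[cluster_id].append(dockq)

    # Average DockQ inside each cluster first, then threshold the cluster mean.
    avg_then_sr = mean(1.0 if mean(v) >= threshold else 0.0 for v in clusters.values())
    # Threshold every interface first, then average the per-cluster success rates.
    sr_then_avg = mean(
        mean(1.0 if x >= threshold else 0.0 for x in v) for v in clusters.values()
    )
    return {"avg_dockq_avg_sr": avg_then_sr, "avg_dockq_sr_avg_sr": sr_then_avg}


# Made-up interfaces: (cluster_id, DockQ)
rows = [("c1", 0.10), ("c1", 0.50), ("c2", 0.10), ("c2", 0.20), ("c3", 0.80)]
print(dockq_success_rates(rows))
```

In this toy data, cluster c1 counts as a full success under the first definition (its mean is above the threshold) but only as a half success under the second, which is why the two columns can diverge.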
LDDT-based metrics (chain-level intra metrics and interface metrics):
- For Intra-* eval types, LDDT is computed per chain; for other eval types it is computed per interface.
- LDDT-PLI is computed for ligands, measuring the local distance agreement of the ligand and its binding pocket.
- For each cluster, LDDT values are averaged across all entries in the cluster.
- The summary column lddt stores the mean of these per-cluster averages; ci_lddt stores a bootstrap confidence interval.
- Note: LDDT-PLI scores are aggregated per chain (entry_id + chain_id), not averaged by cluster_id.
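
As a rough illustration of the cluster-averaged lddt column and its bootstrap interval, consider the sketch below. It assumes a percentile bootstrap that resamples clusters with replacement; the resampling unit, the number of resamples, and the helper names are assumptions, not a description of the benchmark's implementation.

```python
import random
from collections import defaultdict
from statistics import mean


def cluster_means(rows):
    """Average LDDT over all entries of each cluster; rows are (cluster_id, lddt) pairs."""
    clusters = defaultdict(list)
    for cluster_id, lddt in rows:
        clusters[cluster_id].append(lddt)
    return [mean(values) for values in clusters.values()]


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of `values` (assumed scheme)."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Made-up chain-level rows: (cluster_id, LDDT)
rows = [("c1", 0.90), ("c1", 0.70), ("c2", 0.50), ("c3", 0.60)]
per_cluster = cluster_means(rows)
lddt = mean(per_cluster)           # mean of per-cluster averages
ci_lddt = bootstrap_ci(per_cluster)
print(f"lddt={lddt:.3f}, ci_lddt=({ci_lddt[0]:.3f}, {ci_lddt[1]:.3f})")
```

Averaging within clusters first prevents a large family of near-identical entries from dominating the summary value.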
RMSD-based metrics (ligand quality):
- lig_avg_rmsd: mean ligand pocket-aligned RMSD across all ligand chains.
- lig_rmsd_sr: success rate under the ligand RMSD threshold (2.0 Å by default).
- Ligand SR: success rate where Ligand RMSD < 2.0 Å AND LDDT-PLI > 0.8.
- PoseBusters validity checks (e.g., steric-clash violations, tetrahedral-chirality correctness) are included as additional columns; when present, they are also used to construct penalized rankers that downweight chemically invalid poses.
- pb_all_valid_sr: success rate of passing all PoseBusters checks.
- pb_all_valid_and_good_rmsd_sr: success rate of simultaneously passing all PoseBusters checks and the ligand RMSD threshold (2.0 Å by default).
These RMSD-based metrics are used for:
- the PoseBusters dataset (where PB validity checks are always present),
- the RecentPDB dataset when ligand RMSD columns (lig_rmsd, pocket_rmsd) are available in the per-sample metrics table, and
- any Custom dataset that provides a lig_rmsd column in its per-sample metrics table.
Note: When using the RecentPDB dataset, all the metrics above are computed in a two-stage manner: a within-cluster average over pocket-chain cluster_id, followed by an across-cluster average.
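
The relationship between the RMSD and PoseBusters success rates can be sketched as below. This is a flat calculation over invented ligand rows, without the two-stage cluster averaging used for RecentPDB; the row layout, the check names, and `ligand_success_rates` are hypothetical.

```python
def ligand_success_rates(rows, rmsd_threshold=2.0):
    """Summarize ligand quality for rows holding `lig_rmsd` and a dict of PoseBusters checks."""
    n = len(rows)
    good_rmsd = [r["lig_rmsd"] < rmsd_threshold for r in rows]
    pb_valid = [all(r["pb_checks"].values()) for r in rows]  # every check must pass
    return {
        "lig_avg_rmsd": sum(r["lig_rmsd"] for r in rows) / n,
        "lig_rmsd_sr": sum(good_rmsd) / n,
        "pb_all_valid_sr": sum(pb_valid) / n,
        "pb_all_valid_and_good_rmsd_sr": sum(g and v for g, v in zip(good_rmsd, pb_valid)) / n,
    }


# Made-up ligand rows; the check names are placeholders for PoseBusters columns.
rows = [
    {"lig_rmsd": 1.0, "pb_checks": {"no_clash": True, "chirality_ok": True}},
    {"lig_rmsd": 1.5, "pb_checks": {"no_clash": False, "chirality_ok": True}},
    {"lig_rmsd": 3.0, "pb_checks": {"no_clash": True, "chirality_ok": True}},
    {"lig_rmsd": 2.5, "pb_checks": {"no_clash": True, "chirality_ok": False}},
]
print(ligand_success_rates(rows))
```

The combined rate can never exceed either of its components, so a large gap between lig_rmsd_sr and pb_all_valid_and_good_rmsd_sr points to poses that are geometrically close but chemically invalid.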
Framework-aligned CDR-H3 loop backbone RMSD metrics (antibody CDR-H3 quality):
- cdr_h3_bb_avg_rmsd: mean CDR-H3 backbone RMSD across all antibody heavy chains.
- cdr_h3_bb_rmsd_sr: success rate under the CDR-H3 RMSD threshold (1.0 Å by default).
Note: Unlike other metrics in RecentPDB, this metric is aggregated by {entry_id}_{entity_id_1} instead of sequence clusters. The averages and success rates are computed first within each {entry_id}_{entity_id_1} group, and then averaged across all groups.

This metric is computed when calc_cdr_h3_bb_rmsd is enabled in the configuration.

For each metric type, multiple rankers are reported:

- Generic rankers such as best, worst, rand, and median select a single sample per case using the target metric.
- Model-specific rankers (e.g., best.<score_name> or best.<score_name>.penalized) use internal model scores as selectors, allowing you to compare how well a model's own ranking correlates with the external evaluation metrics.
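
The sketch below shows how the generic rankers and a model-specific ranker could pick different samples for the same case. It assumes that higher is better for both the target metric and the model score and that median takes the middle element of the sorted samples; `select_sample` is a hypothetical helper, and the penalized variants are not covered.

```python
import random


def select_sample(samples, ranker, metric="lddt", seed=0):
    """Pick one sample per case according to a ranker name (illustrative only)."""
    ordered = sorted(samples, key=lambda s: s[metric], reverse=True)  # best metric first
    if ranker == "best":
        return ordered[0]
    if ranker == "worst":
        return ordered[-1]
    if ranker == "median":
        return ordered[len(ordered) // 2]
    if ranker == "rand":
        return random.Random(seed).choice(samples)
    if ranker.startswith("best."):
        score_key = ranker.split(".", 1)[1]
        # Model-specific: choose by the model's own score, then report the metric.
        return max(samples, key=lambda s: s[score_key])
    raise ValueError(f"Unknown ranker: {ranker}")


# Made-up samples of one case: target metric plus the model's own ranking score.
samples = [
    {"sample": "0", "lddt": 0.60, "ranking_score": 0.80},
    {"sample": "1", "lddt": 0.90, "ranking_score": 0.50},
    {"sample": "2", "lddt": 0.70, "ranking_score": 0.90},
]
for ranker in ("best", "worst", "median", "best.ranking_score"):
    picked = select_sample(samples, ranker)
    print(f"{ranker}: sample {picked['sample']} (lddt={picked['lddt']})")
```

The gap between best (an oracle that sees the metric) and best.ranking_score (what the model would actually pick) indicates how much accuracy is lost to imperfect self-ranking.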

The *_details.csv files expose the per-case fields (seed, sample, entry_id, chain_id_1, chain_id_2, cluster_id, etc.), so that you can reconstruct or further analyze any specific example reported in the summary tables.
