- Usage
- How Aggregation Works
- Output Structure
- File Descriptions
- Extending the Benchmark with New Models
- Metrics and Rankers
This section explains how to prepare the benchmark datasets and how to run the complete evaluation pipeline, from data setup to result aggregation.
There are two supported ways to obtain the dataset:
- Generate your own dataset following the full data pipeline, or
- Download the pre-built dataset (licensed under CC0).
Option A — Build your own dataset using the Data Pipeline
If you want to use a different time window or periodically rebuild the benchmark as new PDB entries are released, you can construct the dataset yourself using the provided pipeline.
Follow the instructions in Dataset Pipeline Overview
to run all steps of the dataset construction process.
(If you do not modify any part of the pipeline, the entire dataset can be generated with a single command.)
For a detailed explanation of every generated file, refer to
Dataset Output Files Reference.
Option B — Download the pre-built dataset (CC0 License)
If you prefer not to run the pipeline, you can directly download one of the pre-built datasets (CC0 licensed). Each dataset can be downloaded and extracted with:
wget <DOWNLOAD_URL>
tar xzvf <FILENAME>.tar.gz -C [your_path]
For a full list of available datasets and their descriptions, see Available Datasets.
First, set the PXM_EVAL_DATA_ROOT_PATH environment variable to the root directory where you placed the benchmark source data (mmCIF files, metadata CSVs, etc.):
export PXM_EVAL_DATA_ROOT_PATH="your/path/to/dataset"
This root is used internally by the evaluation scripts to locate:
- reference mmCIF files,
- subset/metadata CSVs (e.g., RecentPDB low-homology subsets, AF3-AB metadata),
- PoseBusters ligand information and validity annotations.
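A quick pre-flight check can catch a missing or mistyped root before a long evaluation run. The sketch below is not part of the benchmark code: `resolve_data_root` is a hypothetical helper that only reads the environment variable and confirms that it points at an existing directory.

```python
import os
from pathlib import Path


def resolve_data_root(env_var: str = "PXM_EVAL_DATA_ROOT_PATH") -> Path:
    """Return the dataset root from the environment, or raise a clear error."""
    value = os.environ.get(env_var)
    if not value:
        raise RuntimeError(f"{env_var} is not set; export it before running the evaluation")
    root = Path(value).expanduser()
    if not root.is_dir():
        raise RuntimeError(f"{env_var} points to {root}, which is not a directory")
    return root
```

Calling `resolve_data_root()` at the top of a wrapper script fails fast with a readable message instead of a later file-not-found error.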
Run the evaluation script as follows:
python -m benchmark.run_eval \
-i [infer_dir] \
-o [output_dir] \
-r [ref_assembly_id] \
-l [lig_info_csv] \
-m [model] \
-n [num_cpu] \
-c [chunk_str]
- infer_dir: Directory containing inference results (mmCIF files and model-specific confidence files).
- output_dir: Directory where evaluation results (per-entry JSONs) will be saved.
- ref_assembly_id: Reference assembly ID to use for evaluation. Default is None (asymmetric unit). If you are using the RecentPDB dataset generated by the data pipeline, this value should be set to "1".
- lig_info_csv: Path to the CSV file containing ligand information for the dataset. Default is None. If you intend to evaluate pocket-aligned RMSD and PoseBusters ligand validity checks for small molecules in the dataset, you must provide a CSV file containing the columns "entry_id" and "label_asym_id". For datasets generated by the PXMeter data pipeline, this file is located at [PXM_EVAL_DATA_ROOT_PATH]/supported_data/RecentPDB_low_homology_lig_info.csv. If this argument is not provided, these two ligand-specific evaluations will be skipped.
- model: Name of the model whose outputs you are evaluating. This string selects the corresponding evaluator and file layout. It must match one of the module names in benchmark/evaluators/ (e.g., af3, protenix, chai, boltz).
- num_cpu: Number of CPU cores to use. Default is 1 (use a positive value to limit cores, or -1 to use all available cores, depending on implementation).
- chunk_str: Optional chunk specifier such as "1of8", "2of8", … used to split the list of PDB entries into roughly equal parts when distributing evaluation across multiple jobs or machines. If omitted, all entries are evaluated in a single run.
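The exact splitting rule applied to `chunk_str` is not documented here, so the following is only one plausible reading of a `"KofN"` specifier: the sorted entry list is cut into N contiguous, near-equal slices and the K-th slice is returned. `parse_chunk_str` and `select_chunk` are hypothetical names used for illustration.

```python
import re


def parse_chunk_str(chunk_str: str) -> tuple[int, int]:
    """Parse "KofN" into (k, n), requiring 1 <= k <= n."""
    match = re.fullmatch(r"(\d+)of(\d+)", chunk_str)
    if match is None:
        raise ValueError(f"expected a value like '1of8', got {chunk_str!r}")
    k, n = int(match.group(1)), int(match.group(2))
    if not 1 <= k <= n:
        raise ValueError(f"chunk index {k} is outside 1..{n}")
    return k, n


def select_chunk(entries: list[str], chunk_str: str) -> list[str]:
    """Return the k-th of n contiguous, near-equal slices of the sorted entries."""
    k, n = parse_chunk_str(chunk_str)
    ordered = sorted(entries)
    base, extra = divmod(len(ordered), n)
    # The first `extra` chunks each receive one additional entry.
    start = (k - 1) * base + min(k - 1, extra)
    stop = start + base + (1 if k <= extra else 0)
    return ordered[start:stop]
```

Sorting before slicing keeps the assignment of entries to chunks stable across machines, which matters when each job only sees its own chunk.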
Each supported model has its own expected directory layout under infer_dir; the evaluator selected by -m knows how to:
- locate the predicted mmCIF for each (entry_id, seed, sample),
- locate and parse model-specific confidence scores,
- write one JSON metrics file per evaluated sample (see next section for how these JSONs are aggregated).
Prepare a JSON file that describes the model, its seeds, and the locations of evaluation outputs for each dataset:
{
"name": {
"model": "model name (protenix/af2m/chai/boltz...)",
"seeds": [101, 102],
"dataset_path": {
"RecentPDB": "path/to/eval_results/RecentPDB",
"PoseBusters": "path/to/eval_results/PoseBusters",
"AF3-AB": "path/to/eval_results/AF3_AB"
}
}
}
- The top-level key ("name" in the example) is a logical identifier for this model configuration. You can create multiple entries (e.g., "protenix_v1", "protenix_baseline", "px_finetuned") in the same JSON file.
- model: The same model string used in run_eval.py, corresponding to one of the module names in benchmark/evaluators/.
- seeds: List of integer seeds to include for this configuration. All metrics will be computed by intersecting the PDB IDs and chain/interface pairs that are present across all provided seeds for a given configuration.
- dataset_path: A mapping from evaluation dataset name to the directory that contains evaluation outputs for that dataset.
The dataset_path field maps dataset names to their corresponding evaluation output directories.
Allowed keys for dataset_path are:
"RecentPDB", "PoseBusters", "dsDNA-Protein", "RNA-Protein", and "AF3-AB".
You may include only the datasets you evaluated.
Note:
If you are using the dataset created by the dataset pipeline, the correct key to use is "RecentPDB".
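Because a typo in this JSON file only surfaces once aggregation starts, it can help to check the file up front. The sketch below is an illustration rather than part of the benchmark: `validate_eval_config` is a hypothetical helper, and `ALLOWED_DATASETS` simply mirrors the five keys listed above.

```python
ALLOWED_DATASETS = {"RecentPDB", "PoseBusters", "dsDNA-Protein", "RNA-Protein", "AF3-AB"}


def validate_eval_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the config looks fine."""
    problems = []
    for name, entry in config.items():
        for field in ("model", "seeds", "dataset_path"):
            if field not in entry:
                problems.append(f"{name}: missing field '{field}'")
        seeds = entry.get("seeds", [])
        if not all(isinstance(seed, int) for seed in seeds):
            problems.append(f"{name}: seeds must be integers")
        # Report dataset keys that are not in the documented list.
        unknown = set(entry.get("dataset_path", {})) - ALLOWED_DATASETS
        for key in sorted(unknown):
            problems.append(f"{name}: unrecognized dataset key '{key}'")
    return problems
```

Load the file with `json.load` and pass the resulting dictionary to the helper; printing the returned list before launching aggregation is usually enough.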
Run the aggregation script:
export PXM_EVAL_DATA_ROOT_PATH="your/path/to/dataset"
python -m benchmark.show_intersection_results \
-i [input_json] \
-o [output_path] \
-n [num_cpu] \
-c [subset_csv] \
-p [pdb_id_list_file] \
--overwrite_agg \
--out_file_name [Summary_table_basename]
In typical workflows, only three arguments are needed:
- -i: Path to the evaluation-results JSON file (prepared in Step 2.3).
- -o: Directory where aggregated CSV files will be written.
- -n: Number of CPU cores to use for aggregation.
All other flags are optional and are only needed when applying advanced restrictions (e.g., limiting to a specific PDB list, applying subset filtering, overriding cached parquet files, or customizing output names).
If no such customization is required, you may simply run:
export PXM_EVAL_DATA_ROOT_PATH="your/path/to/dataset"
python -m benchmark.show_intersection_results \
-i pxm_eval_paths.json \
-o pxm_results \
-n 8
This command runs the full aggregation pipeline with default settings across all datasets declared in the input JSON.
- input_json: Path to the JSON file describing evaluation results.
- output_path (optional): Directory where aggregated CSV files will be written. Default: ./pxm_results.
- num_cpu: Number of CPU cores used for aggregation.
- pdb_id_list_file (optional): Path to a text file listing PDB IDs to retain during aggregation.
- subset_csv (optional): Defines a subset of (entry_id, chain_id_1, chain_id_2) pairs to include.
- overwrite_agg (optional): Whether to overwrite cached <dataset>_metrics.parquet files.
- out_file_name (optional): Base name for exported summary tables.
The benchmark system is modular and supports easy integration of new prediction models. To add support for a new model (e.g., my_model), follow these steps:
- Create a model directory:
mkdir benchmark/evaluators/my_model
touch benchmark/evaluators/my_model/__init__.py
- Define Ranker Keys in benchmark/evaluators/my_model/config.py:
RANKER_KEYS = {
    "complex": [("ranking_score", False)],  # (score_key, ascending)
    "chain": [],
    "interface": [],
}
The RANKER_KEYS dictionary defines which confidence scores should be extracted from your model's JSON output and how they should be interpreted:
  - Each entry is a tuple: (json_key, ascending).
  - json_key: The key used in your model's original confidence JSON file.
  - ascending: Boolean. If True, a lower score is better (e.g., has_clash). If False, a higher score is better (e.g., pLDDT).
Levels of Confidence:
  - complex: Scores representing the entire structure (e.g., ranking_score, pTM, ipTM). The system expects a single numerical value for these keys.
  - chain: Scores for individual chains (e.g., chain_pLDDT). The system expects a list or dictionary of values corresponding to the chains in the structure.
  - interface: Scores for interactions between chain pairs (e.g., chain_pair_ipTM). The system expects a matrix (2D array) or a flattened list that represents scores for all possible chain pairs.
- Implement the Evaluator in benchmark/evaluators/my_model/evaluator.py: Inherit from BaseEvaluator and override _get_info_from_each_pdb_dir. This method is responsible for scanning a specific PDB's prediction directory and collecting all available samples (seeds/models).
💡 Non-Developers: If you're not familiar with Python, check out Generate Evaluator with AI for a guide on using LLMs to create your evaluator.
Typical directory structure:
pred_dir/
├── 7rss/                <-- pdb_dir
│   ├── seed_1/
│   │   ├── sample_1.cif
│   │   └── sample_1.json
│   └── seed_2/
│       ├── sample_1.cif
│       └── sample_1.json
└── 8abc/                <-- pdb_dir
    └── ...
Here is an implementation example for the typical directory structure mentioned above:
from pathlib import Path

from benchmark.evaluators.base import BaseEvaluator
from benchmark.evaluators.my_model.config import RANKER_KEYS


class MyModelEvaluator(BaseEvaluator):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.ranker = RANKER_KEYS

    def _get_info_from_each_pdb_dir(self, pdb_dir: Path) -> list:
        """
        Args:
            pdb_dir: Path to the directory containing predictions for a specific PDB ID.

        Returns:
            A list of tuples, each representing one prediction sample.
        """
        name = pdb_dir.name  # e.g., "7rss"
        pdb_id = name
        sub_data = []

        # Iterate through seed directories: e.g., seed_1, seed_2
        for seed_dir in pdb_dir.iterdir():
            if not seed_dir.is_dir():
                continue
            seed = seed_dir.name

            # Find all .cif files and their corresponding .json confidences
            for cif_path in seed_dir.glob("*.cif"):
                # Assume filename is "sample_1.cif"
                sample = cif_path.stem.split("_")[-1]
                conf_path = seed_dir / f"sample_{sample}.json"

                if conf_path.exists():
                    sub_data.append((
                        name,
                        pdb_id,
                        seed,
                        sample,
                        cif_path,
                        conf_path,
                        None,  # model_chain_id_to_lig_mol
                    ))
        return sub_data
The system will automatically discover your new model. You can then use it by passing -m my_model to benchmark.run_eval.
- Verify your implementation: Use the provided verification script to ensure your evaluator is correctly implemented and can process samples end-to-end:
python -m benchmark.scripts.verify_evaluator \
    -m my_model \
    -i /path/to/your/predictions \
    -t /path/to/reference/mmcifs \
    --limit 5
This script will randomly sample up to limit tasks and run a trial evaluation in a temporary directory.
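To make the meaning of the `ascending` flag in RANKER_KEYS concrete, here is a minimal sketch of how a complex-level score could be used to choose one sample. It is not the benchmark's ranking code: `pick_top_sample` is a hypothetical helper, and the confidence dictionaries are made-up inputs.

```python
RANKER_KEYS = {
    "complex": [("ranking_score", False)],  # (score_key, ascending)
    "chain": [],
    "interface": [],
}


def pick_top_sample(confidences: dict[str, dict], score_key: str, ascending: bool) -> str:
    """Return the sample id whose score is best under the given direction."""
    # ascending=True means a lower score is better, so the minimum wins.
    chooser = min if ascending else max
    return chooser(confidences, key=lambda sample: confidences[sample][score_key])


# Hypothetical per-sample confidence values keyed by sample id.
confidences = {
    "1": {"ranking_score": 0.71, "has_clash": 0.0},
    "2": {"ranking_score": 0.88, "has_clash": 1.0},
}
score_key, ascending = RANKER_KEYS["complex"][0]
top_by_score = pick_top_sample(confidences, score_key, ascending)
top_by_clash = pick_top_sample(confidences, "has_clash", True)
```

With these inputs, a higher-is-better key and a lower-is-better key pick different samples, which is why each key carries its own direction.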
The aggregation script internally executes four major stages:
- Per-dataset JSON → Parquet conversion
For each evaluation dataset referenced in dataset_path, the script:
  - Scans the per-sample JSON evaluation outputs (one JSON per (entry_id, seed, sample)).
  - Normalizes them into a single tabular structure containing one row per:
    - complex (type == "complex"),
    - chain (type == "chain"),
    - interface (type == "interface").
  - Writes a *_metrics.parquet file (and, for PoseBusters validity checks, an additional *_pb_valid.parquet file) in the same directory as the evaluation outputs.
This Parquet file becomes the canonical source for all downstream metrics. It includes:
  - identifiers: entry_id, seed, sample,
  - chain/interface identifiers: chain_id_1, chain_id_2, entity_id_1, entity_id_2, entity_type_1, entity_type_2,
  - metric columns (e.g., LDDT, DockQ, RMSD),
  - model-specific ranking scores (used by rankers).
- Intersection and subset selection
When aggregating across multiple model configurations and datasets, the script:
  - computes a match_key = (entry_id, chain_id_1, chain_id_2) for every row,
  - intersects all datasets at this key, ensuring that only samples present in all selected datasets are retained,
  - optionally filters by:
    - PDB ID list (pdb_id_list_file),
    - seeds configured for each model in input_json,
    - subset definitions in subset_csv (including optional cluster_id if provided).
For the RecentPDB dataset, cluster annotations (e.g., low-homology clusters) are added on-the-fly using the cluster metadata CSV shipped with the benchmark. For Custom datasets, if subset_csv contains a cluster_id column, that cluster ID is attached to each matching row.
- Per-dataset metric computation
Once the intersection DataFrames are prepared, metrics are computed by dataset type:
  - RecentPDB
    - Uses both DockQ (for protein-protein interfaces) and LDDT (for intra-chain and interface-level evaluation).
    - Computes ligand and pocket RMSD for protein–ligand chains when the underlying metrics table contains lig_rmsd / pocket_rmsd, and aggregates them over low-homology pocket clusters in the same two-stage way as other metrics.
    - Computes CDR-H3 backbone RMSD for antibody heavy chains when the underlying metrics table contains cdr_h3_bb_rmsd. Unlike other metrics, this is averaged directly over all chains without clustering.
    - Subsets by antibody vs non-antibody, monomer, peptide, cyclic peptide, etc., using the subset labels in the metadata CSV.
  - AF3-AB
    - Selects AF3 antibody-antigen interfaces using the AF3-AB metadata.
    - Computes DockQ and LDDT metrics for these antibody-protein interfaces only.
  - PoseBusters
    - Combines RMSD metrics (ligand and pocket) with PoseBusters validity checks from the *_pb_valid table.
    - Computes success rates (e.g., RMSD below threshold) and provides penalized variants of rankers that discount chemically invalid structures.
  - dsDNA-Protein
    - Uses DNA-Protein LDDT, averaging LDDT across both DNA chains for each complex to obtain a single dsDNA-Protein score per case.
  - RNA-Protein
    - Uses RNA-protein LDDT on interfaces tagged as "RNA-Protein".
  - Custom
    - Uses LDDT for whatever chain/interface types are present, without built-in subset semantics.
    - If the per-sample metrics table contains a ligand RMSD column named lig_rmsd, RMSD-based aggregate metrics are also computed, and the corresponding RMSD_results.csv / RMSD_details.csv files are populated for this Custom dataset.
- Result consolidation
  - Per-dataset summary and detail DataFrames are concatenated across model configurations.
  - Each result row is annotated with:
    - name (model configuration key from input_json),
    - eval_dataset (e.g., RecentPDB, PoseBusters, AF3-AB),
    - eval_type (e.g., Intra-Protein, Protein-Protein, RNA-Protein, dsDNA-Protein),
    - subset (e.g., All, Antibody=True, Monomer, Peptide),
    - ranker (how the final sample per case was selected).
This design ensures efficient, reproducible evaluation while supporting incremental updates.
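The intersection step described above can be illustrated with plain Python sets. This is a simplified sketch and not the pipeline's implementation, which operates on the cached Parquet tables; it assumes the intersection is taken across the model configurations being compared, and `intersect_match_keys` is a hypothetical name.

```python
def intersect_match_keys(rows_by_config: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Keep only rows whose (entry_id, chain_id_1, chain_id_2) key occurs in every configuration."""

    def match_key(row: dict) -> tuple:
        return (row["entry_id"], row["chain_id_1"], row["chain_id_2"])

    key_sets = [{match_key(row) for row in rows} for rows in rows_by_config.values()]
    shared = set.intersection(*key_sets) if key_sets else set()
    return {
        name: [row for row in rows if match_key(row) in shared]
        for name, rows in rows_by_config.items()
    }
```

Restricting every configuration to the shared keys is what makes the summary numbers comparable: no configuration is scored on cases that another one failed to produce.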
After aggregation, the evaluation pipeline produces the following outputs:
- Global summary tables under the directory specified by output_path (see below for file layout).
- Per-dataset result and detail CSVs written into each evaluation directory listed in dataset_path in the input JSON.
- For each evaluation dataset, a cached <eval_dataset>_metrics.parquet file created next to the raw JSON outputs. This Parquet file stores all per-sample metrics and is reused as the canonical data source for subsequent aggregations.
A typical output_path directory (default: ./pxm_results) looks like:
pxm_results
├── DockQ_details.csv
├── DockQ_results.csv
├── LDDT_details.csv
├── LDDT_results.csv
├── RMSD_details.csv
├── RMSD_results.csv
├── Summary_table_ranked.csv
├── Summary_table.csv
└── Summary_table.txt
- Summary_table.csv / Summary_table.txt: Provide a high-level summary of the benchmark performance, including metrics such as DockQ success rate, PoseBusters success rate, and LDDT. The Summary_table files use abbreviations for evaluation types; for example, protein-protein interfaces are shown as prot_prot (other interface types follow similar abbreviation patterns).
- Summary_table_ranked.csv: A ranked version of Summary_table.csv, with evaluation types shown in full (no abbreviations). Each row corresponds to a single (name, eval_dataset, eval_type, subset, ranker) combination, and the table is sorted by the main metric so that the best-performing configuration appears first.
- *_results.csv: Contain aggregated per-sample metrics derived from the *_metrics.parquet file. These represent the final evaluation metrics after intersecting all selected datasets and applying the chosen ranker.
- *_details.csv: Contain sample-level details, including the metrics and identifiers used by rankers to select the final prediction per case. Typical columns include:
  - id, name, eval_dataset, eval_type, subset, ranker
  - entry_id, seed, sample
  - chain_id_1, chain_id_2, entity_id_1, entity_id_2, entity_type_1, entity_type_2
  - cluster_id, and all relevant metric columns (DockQ, LDDT, RMSD, PoseBusters checks, model scores, etc.)
These files are useful for in-depth debugging and analysis of individual examples.
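For example, pulling every detail row for one entry takes only the standard library. The sketch assumes the file has an `entry_id` column, as listed above; `load_detail_rows` is a hypothetical helper and the file name in the usage note is just one of the outputs shown in the directory listing.

```python
import csv
from pathlib import Path


def load_detail_rows(details_csv: Path, entry_id: str) -> list[dict]:
    """Return all rows of a *_details.csv file that belong to one entry."""
    with details_csv.open(newline="") as handle:
        return [row for row in csv.DictReader(handle) if row.get("entry_id") == entry_id]
```

A call such as `load_detail_rows(Path("pxm_results/LDDT_details.csv"), "7rss")` returns dictionaries keyed by column name. All values are strings, so convert metric columns with `float` before comparing them.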
The aggregation code distinguishes between summary metrics (one row per dataset / eval type / subset / ranker) and detail rows (one row per evaluated chain or interface). The most important metrics are:
- DockQ-based metrics (for protein-protein interfaces on RecentPDB and AF3-AB):
  - Computed from per-interface DockQ scores aggregated at the level of structure clusters.
  - avg_dockq_avg_sr: mean DockQ value across clusters, then converted to a success rate using a DockQ threshold (0.23 by default).
  - avg_dockq_sr_avg_sr: average success rate when thresholding DockQ within each cluster first.
  - Confidence intervals are reported as ci_avg_dockq_avg_sr (exact binomial interval over clusters) and ci_avg_dockq_sr_avg_sr (bootstrap interval).
- LDDT-based metrics (chain-level intra metrics and interface metrics):
  - For Intra-* eval types, LDDT is computed per chain; for other eval types it is computed per interface.
  - LDDT-PLI is computed for ligands, measuring the local distance agreement of the ligand and its binding pocket.
  - For each cluster, LDDT values are averaged across all entries in the cluster.
  - The summary column lddt stores the mean of these per-cluster averages; ci_lddt stores a bootstrap confidence interval.
  - Note: LDDT-PLI scores are aggregated per chain (entry_id + chain_id), not averaged by cluster_id.
- RMSD-based metrics (ligand quality)
  - lig_avg_rmsd: mean ligand pocket-aligned RMSD across all ligand chains.
  - lig_rmsd_sr: success rate under the ligand RMSD threshold (2.0 Å by default).
  - Ligand SR: success rate where Ligand RMSD < 2.0 Å AND LDDT-PLI > 0.8.
  - PoseBusters validity checks (e.g., steric-clash violations, tetrahedral-chirality correctness) are included as additional columns; when present, they are also used to construct penalized rankers that downweight chemically invalid poses.
  - pb_all_valid_sr: success rate of passing all PoseBusters checks.
  - pb_all_valid_and_good_rmsd_sr: success rate of simultaneously passing all PoseBusters checks and the ligand RMSD threshold (2.0 Å by default).
These RMSD-based metrics are used for:
  - the PoseBusters dataset (where PB validity checks are always present),
  - the RecentPDB dataset when ligand RMSD columns (lig_rmsd, pocket_rmsd) are available in the per-sample metrics table, and
  - any Custom dataset that provides a lig_rmsd column in its per-sample metrics table.
Note: When using the RecentPDB dataset, all the metrics above are computed in a two-stage manner: a within-cluster average over pocket-chain cluster_id, followed by an across-cluster average.
- Framework-aligned CDR-H3 loop backbone RMSD metrics (antibody CDR-H3 quality)
  - cdr_h3_bb_avg_rmsd: mean CDR-H3 backbone RMSD across all antibody heavy chains.
  - cdr_h3_bb_rmsd_sr: success rate under the CDR-H3 RMSD threshold (1.0 Å by default).
Note: Unlike other metrics in RecentPDB, this metric is aggregated by {entry_id}_{entity_id_1} instead of sequence clusters. The averages and success rates are computed first within each {entry_id}_{entity_id_1} group, and then averaged across all groups.
This metric is computed when calc_cdr_h3_bb_rmsd is enabled in the configuration.
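The two-stage aggregation and the two DockQ success-rate variants can be sketched as follows. This reflects one reading of the descriptions above, not the benchmark's code: the function names are hypothetical, success is assumed to mean a value greater than or equal to the threshold, and `avg_then_threshold` assumes the DockQ mean is taken within each cluster before thresholding.

```python
from collections import defaultdict
from statistics import mean


def group_values(rows: list[dict], metric: str, group_key: str = "cluster_id") -> dict:
    """Collect metric values per group (for example, per cluster_id)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[metric])
    return groups


def two_stage_mean(rows: list[dict], metric: str, group_key: str = "cluster_id") -> float:
    """Average within each group first, then average the group means."""
    return mean(mean(values) for values in group_values(rows, metric, group_key).values())


def avg_then_threshold(rows: list[dict], metric: str, threshold: float) -> float:
    """Fraction of clusters whose mean metric reaches the threshold."""
    groups = group_values(rows, metric)
    return mean(1.0 if mean(values) >= threshold else 0.0 for values in groups.values())


def threshold_then_avg(rows: list[dict], metric: str, threshold: float) -> float:
    """Mean over clusters of the per-cluster fraction of values reaching the threshold."""
    groups = group_values(rows, metric)
    return mean(
        mean(1.0 if value >= threshold else 0.0 for value in values)
        for values in groups.values()
    )
```

Both stages weight every cluster equally, so a large family of near-identical structures cannot dominate the summary number. For metrics where lower is better, such as RMSD, the comparison in the two success-rate helpers would need to be reversed.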
For each metric type, multiple rankers are reported:
- Generic rankers such as best, worst, rand, and median select a single sample per case using the target metric.
- Model-specific rankers (e.g., best.<score_name> or best.<score_name>.penalized) use internal model scores as selectors, allowing you to compare how well a model's own ranking correlates with the external evaluation metrics.
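The difference between an oracle ranker and a model-specific ranker can be made concrete for a single case. The sketch below is illustrative only: `ranker_gap` is a hypothetical helper, the sample dictionaries are invented, and it assumes that higher is better for both the metric and the model score.

```python
def ranker_gap(samples: list[dict], metric: str, score_key: str) -> dict:
    """Compare the metric of the oracle-best sample with that of the model-selected sample."""
    oracle = max(samples, key=lambda sample: sample[metric])   # what "best" would pick
    chosen = max(samples, key=lambda sample: sample[score_key])  # what the model score picks
    return {
        "best": oracle[metric],
        f"best.{score_key}": chosen[metric],
        "gap": oracle[metric] - chosen[metric],
    }


# Hypothetical samples for one case.
samples = [
    {"sample": "1", "lddt": 0.80, "ranking_score": 0.9},
    {"sample": "2", "lddt": 0.85, "ranking_score": 0.7},
]
gap = ranker_gap(samples, "lddt", "ranking_score")
```

A gap of zero means the model's own score selected the best available sample; a consistently large gap across cases suggests the confidence score is a weak selector for that metric.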
The *_details.csv files expose the per-case fields (seed, sample, entry_id, chain_id_1, chain_id_2, cluster_id, etc.), so that you can reconstruct or further analyze any specific example reported in the summary tables.