This repository provides evaluation codes for assessing models using our curated evaluation sets:
| Dataset | Description | Metrics |
|---|---|---|
| RecentPDB | Evaluates RecentPDB low homology subset. Antibody-antigen and monomer subsets reported separately. | DockQ success rates (>0.23), LDDT |
| AF3-AB | Analyses antibody-antigen subset of AlphaFold3. | DockQ success rates (>0.23), LDDT |
| dsDNA-Protein | Focuses on intra-DNA chains and DNA-Protein interfaces, aggregating LDDT metrics. | LDDT |
| RNA-Protein | Evaluates intra-DNA chains and RNA-protein interfaces with LDDT aggregation. | LDDT |
| PoseBusters | Assesses pocket-aligned RMSD of small molecules (PoseBusters V2). | RMSD success rates (< 2 Å) |
The benchmark data is licensed under CC0.
Before running the code for the first time, download the necessary dataset files:
wget https://pxmeter.tos-cn-beijing.volces.com/evaluation_supported_data.tar.gz
tar xzvf evaluation_supported_data.tar.gz -C [your_path]To download the full dataset used in our article, including model input files, inference output structures, evaluation output JSON files, and summaries:
wget https://pxmeter.tos-cn-beijing.volces.com/evaluation_full_data.tar.gz
tar xzvf evaluation_full_data.tar.gz -C [your_path]You can place the extracted evaluation folder in the directory from which you run the benchmark code, or specify its location via an environment variable:
export PXM_EVAL_DATA_ROOT_PATH="your/path/to/evaluation"Run the evaluation script as follows:
python benchmark/run_eval.py -i [infer_dir] -o [output_dir] -d [dataset] -c [chunk_str] -m [model] -n [num_cpu]infer_dir: Directory containing inference results (CIF files and confidence files).output_dir: Directory where evaluation results will be saved.dataset: Dataset to evaluate (options: "RecentPDB", "PoseBusters", "dsDNA-Protein", "RNA-Protein", "AF3-AB").chunk_str: Chunk string for distributed evaluation (e.g., '1of5'), used when running evaluations across multiple machines. The default is None.model: Model name should be one of the supported options: "protenix", "boltz", "chai" or "af2m", as defined in "benchmark.run_eval.run_batch_eval", which calls corresponding Evaluators.num_cpu: Number of CPU cores to use. Default is 1.
Example JSON structure:
{
"name": {
"model": "model name (protenix, af2m, chai, boltz)",
"seeds": [101, 102, "..."],
"dataset_path": {
"RecentPDB": "path/to/eval_results/RecentPDB",
"PoseBusters": "path/to/eval_results/PoseBusters",
"AF3-AB": "path/to/eval_results/AF3_AB"
}
}
}Allowed keys for dataset_path are:
"RecentPDB", "PoseBusters", "dsDNA-Protein", "RNA-Protein", and "AF3-AB".
Each dataset can include only a subset of these keys.
Run the aggregation script: python benchmark/show_intersection_results.py -i [input_json] -o [output_path] -d [dataset_names] -n [num_cpu] -c [subset_csv] --overwrite_agg
Parameters:
input_json: Path to JSON file with evaluation results (created in step 2).output_path(optional): Directory for final aggregated CSV output. Defaults to "./pxm_results".dataset_names(optional): Comma-separated dataset names to compare using intersections of "chain" and "interface". Defaults to all names in JSON.num_cpu(optional): Number of CPU cores for aggregation. Defaults to all available cores.overwrite_agg(optional): Overwrite existing aggregated CSV files. Defaults to False.subset_csv(optional): CSV file with columns ["type", "entry_id", "chain_id_1", "chain_id_2"]. Used to subset results. "type" can be "chain" or "interface".
The script outputs summary CSV files in the directories specified by dataset_path in the JSON and creates a consolidated CSV with aggregated metrics in output_path:
pxm_results
├── DockQ_details.csv
├── DockQ_results.csv
├── LDDT_details.csv
├── LDDT_results.csv
├── RMSD_details.csv
├── RMSD_results.csv
├── Summary_table.csv
└── Summary_table.txtSummary_table.csvandSummary_table.txtprovide a concise overview of key metrics such as DockQ success rate, PoseBusters success rate, and LDDT (with protein-protein interfaces indicated as prot_prot).*_results.csvfiles provide a full view of aggregated evaluation metrics.*_details.csvfiles allow exploration of sample-specific metrics selected by each ranker.
This section outlines steps to construct the RecentPDB Low Homology dataset. File paths required for scripts are specified in benchmark.configs.data_config. It’s recommended to specify new output paths in scripts to prevent overwriting default files.
Run this command to filter RecentPDB entries:
python benchmark/scripts/filter_for_recentpdb.py -c [mmcif_dir] -o [chain_interface_csv] -m [meta_csv] -n [num_cpu]mmcif_dir: Directory with MMCIF files downloaded from RCSB PDB.chain_interface_csv: Output CSV file recording filtered chain and interface information for RecentPDB dataset.meta_csv: Output CSV file containing meta information of filtered entries.num_cpu: Number of CPU cores for processing. Recommended CPU-to-memory (GB) ratio is 1:4.
Run this command to filter to the Low Homology subset:
python benchmark/scripts/filter_to_lowh.py -c [chain_interface_csv] -t [test_to_train_json] -o [output_protein_lowh_csv] -n [output_nuc_lowh_csv]chain_interface_csv: Input CSV with filtered chain and interface info from RecentPDB.test_to_train_json: A JSON file containing a dictionary where keys are entities from the test set in the format '{PDB ID}_{Entity ID}', and values are lists of similar entities (high homology) from the train set in the same format.output_protein_lowh_csv: Output CSV with filtered chain and interface info for RecentPDB Low Homology Protein subset.output_nuc_lowh_csv: Output CSV with filtered chain and interface info for RecentPDB Low Homology Nucleic Acids subset.