Commit 2426288

timodonnell and claude committed
exp4: refresh baseline with post-fix numbers

After shipping the TemplateEmbedder (72f10e6) + MSA-subsample (61e94f5) fixes, exp4's "baseline" numbers were out of date — the pre-fix Helico results (ab-ag 6.8%, p-protein 14.5%, etc.) no longer reflect current behavior. Re-ran the same 679-target, 25-sample, 2024-01+ protocol on main post-fix and updated exp4's data/plots/README.

Post-fix vs published Protenix (FoldBench 2024-01+):

| category               | pre-fix | post-fix | Protenix | ratio |
|------------------------|---------|----------|----------|-------|
| ab-ag                  | 6.8%    | 30.4%    | 38.4%    | 79%   |
| p-dna                  | 33.7%   | 46.7%    | 67.6%    | 69%   |
| p-ligand               | 16.4%   | 33.2%    | 53.3%    | 62%   |
| p-peptide              | 20.0%   | 42.9%    | —        | —     |
| p-protein              | 14.5%   | 33.6%    | 64.8%    | 52%   |
| p-rna                  | 17.9%   | 31.8%    | 56.4%    | 56%   |
| monomer_dna (LDDT)     | 0.46    | 0.52     | 0.44     | 118%  |
| monomer_rna (LDDT)     | 0.52    | 0.60     | 0.59     | 102%  |
| monomer_protein (LDDT) | 0.79    | 0.83     | —        | —     |

Monomer LDDT now at or above Protenix's published numbers; interface categories at 52-79% of Protenix. Biggest remaining gap: interface_protein_protein at 52%.

README narrative updated with the full before/after table + discussion of remaining-gap hypotheses. Also updated the GitHub issue #4 with a comment announcing the refresh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

1 parent 1d147d5 commit 2426288

10 files changed

Lines changed: 1049 additions & 1153 deletions

experiments/exp4_baseline_protenix_v1/README.md

Lines changed: 50 additions & 19 deletions
@@ -25,17 +25,32 @@ What is our performance on FoldBench when loading weights from a Protenix v1 che
 
 ## Hypothesis
 
-We expect meaningful but worse performance than upstream Protenix v1, due to
-numerical-precision differences in parts of the trunk and potentially
-bugs/differences in featurization. This notebook is the baseline: every
-future experiment compares against these numbers.
+With the TemplateEmbedder + MSA-subsample fixes now on main, we expect to
+be within a small constant factor of upstream Protenix's published
+FoldBench numbers. Any substantial remaining delta points at something
+specific we'd want to keep hunting (MSA differences, sampling variance,
+or another featurization bug).
 
 ## Background
 
-Ad hoc runs over the past weeks (`bench_results_v1_*` dirs in the repo root)
-have shown worse performance than upstream Protenix on several categories.
-This notebook carefully documents the baseline so improvements can be
-measured.
+The initial exp4 baseline in this experiment — pre-fix Helico — fell
+dramatically short of Protenix on interface categories (6.8% vs 38.4%
+ab-ag, 14.5% vs 64.8% p-protein, etc.). Two bugs were found and fixed
+via Helico↔Protenix pipeline diffing (see commits `72f10e6` and `61e94f5`):
+
+- **`TemplateEmbedder.forward` was stubbed to return `0`.** Protenix
+v1.0.0's weights were trained to run the embedder every recycle on
+a 4-slot dummy-template pad (`aatype=[31, 0, 0, 0]`, everything else
+zero) even when `use_template=False` — we were silently skipping
+that contribution to the pair tensor. Responsible for ~75% of the
+trunk-side divergence.
+- **MSA rows weren't randomly subsampled per cycle.** Protenix draws
+a fresh `randint(1, N_msa)` subset every cycle (AF3 SI §3.5). We
+were using all rows every cycle. Mattered most on large multi-chain
+ab-ag targets.
+
+This notebook re-runs the benchmark with both fixes on main and
+compares to Protenix's published FoldBench numbers.
 
 ## Setup
 
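The per-cycle MSA subsampling described in the second Background bullet can be sketched as follows. This is a minimal illustration, not the code from commit `61e94f5`: the function name `subsample_msa_rows`, the list-of-strings MSA representation, and the choice to treat the upper bound of `randint(1, N_msa)` as inclusive are all assumptions, and a real pipeline may additionally pin the query row.

```python
import random


def subsample_msa_rows(msa_rows, rng):
    """Return a random subset of MSA rows for one recycle.

    Hypothetical helper: draws k uniformly from 1..N_msa (the inclusive
    upper bound is an assumption) and keeps k distinct rows in their
    original order.
    """
    n_msa = len(msa_rows)
    k = rng.randint(1, n_msa)  # random.Random.randint includes both endpoints
    keep = sorted(rng.sample(range(n_msa), k))  # k distinct row indices
    return [msa_rows[i] for i in keep]


# Toy alignment: 6 rows, 8 columns. Real MSAs are far deeper.
msa = ["MKTAYIAK", "MKTAYLAK", "MRTAYIAR", "MKSAYIAK", "MKTA-IAK", "MKTAYI-K"]
rng = random.Random(0)

# A fresh subset is drawn on every cycle instead of reusing all rows.
per_cycle = [subsample_msa_rows(msa, rng) for _ in range(10)]
for cycle, rows in enumerate(per_cycle):
    print(f"cycle {cycle}: kept {len(rows)} of {len(msa)} rows")
```

Passing the generator in explicitly keeps the draw reproducible under a fixed seed, which helps when diffing two pipelines cycle by cycle.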
@@ -317,16 +332,32 @@ plt.show()
 
 ## Conclusion
 
-The numbers:
+Post-fix numbers (same 679-target, 25-sample protocol, 2024-01+ cutoff):
 
-- **monomer_protein** LDDT **0.790** is the headline — a solid reimplementation result. Upstream Protenix monomer_protein LDDT isn't published as a numeric table in FoldBench, so no direct delta.
-- **interface_protein_dna** tracks upstream Protenix closely (33.7% success vs upstream's 67.6% in the 2024-01+ regime). The ~2× gap is notable and mostly attributable to our 5-sample vs upstream's 25-sample regime plus possibly featurization gaps.
-- **interface_antibody_antigen** is our weakest interface (5.4% vs upstream's 38.4%) — a 7× gap much larger than the sampling difference can explain. Strong signal that MSA handling / featurization in this category has a bug or missing piece.
-- **interface_protein_rna** underperforms across all models in published tables; our 12.8% vs upstream 56.4% still points at specific pipeline issues beyond the category being hard.
+| Category | Helico (this run) | Protenix (published, 2024-01+) | Helico/Protenix |
+|--------------------------|-------------------|--------------------------------|-----------------|
+| interface_antibody_antigen | 30.4% | 38.4% | 79% |
+| interface_protein_dna | 46.7% | 67.6% | 69% |
+| interface_protein_ligand | 33.2% | 53.3% | 62% |
+| interface_protein_peptide | 42.9% |||
+| interface_protein_protein | 33.6% | 64.8% | 52% |
+| interface_protein_rna | 31.8% | 56.4% | 56% |
+| monomer_dna (mean LDDT) | 0.52 | 0.44 | 118% |
+| monomer_rna (mean LDDT) | 0.60 | 0.59 | 102% |
+| monomer_protein (mean LDDT)| 0.83 | — (not published numerically) ||
 
-Baseline fixed. Every subsequent experiment compares against these numbers
-(see `experiments/*/data/summary.csv` → cross-experiment rollup at
-[docs/benchmarks.md](https://open-athena.github.io/helico/benchmarks/)).
-A followup experiment (see issue referenced in `helico_experiment.baselines`)
-will run upstream Protenix v1 on our exact 679-target subset to remove the
-sampling/cutoff/token-limit caveats from the paper comparison.
+Headlines:
+
+- **Monomer categories match or beat Protenix's published numbers** (monomer_dna 0.52 ≥ 0.44, monomer_rna 0.60 ≥ 0.59, monomer_protein 0.83). With a very small N for DNA/RNA these aren't statistically significant but confirm nothing is horribly wrong.
+- **Interface categories are ~50–80% of Protenix's published success rates.** All of these came up substantially from the pre-fix baseline (ab-ag 6.8% → 30.4%, p-protein 14.5% → 33.6%, etc.) — the template + MSA-subsample fixes are doing the expected work.
+- **interface_protein_protein remains the widest relative gap** (52% of Protenix). Worth prioritizing in followup.
+
+The remaining gap isn't template-shaped — FoldBench doesn't ship templates
+and Protenix's published numbers also use the dummy-template path. Most
+likely candidates for the remaining delta:
+
+1. **MSA differences.** Published Protenix runs with `--use_msa_server` (its own MSA server); we use the FoldBench-bundled MSAs. Different MSA depth/pairing → different predictions.
+2. **bf16 numerical accumulation.** Same weights, slightly different op ordering/precision across implementations compounds over 10 recycles × 200 diffusion steps × 25 samples.
+
+The MSA hypothesis is the most testable — see issue #TBD for the
+follow-up experiment.
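How op ordering alone can change a bf16 result, as hypothesis 2 suggests, can be illustrated with a toy sum. The sketch below emulates bfloat16 rounding in pure Python by keeping the top 16 bits of a float32 with round-to-nearest-even. `to_bf16` and `bf16_add` are hypothetical helpers, and rounding after every addition is an assumption (real kernels often accumulate in fp32), so this shows the mechanism rather than Helico's or Protenix's actual numerics.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even).

    Illustrative emulation only: NaN and infinity are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even bias
    bits &= 0xFFFF0000  # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_add(a: float, b: float) -> float:
    """Add two values and round the result back to bfloat16."""
    return to_bf16(to_bf16(a) + to_bf16(b))


# Same three operands, two association orders.
left_to_right = bf16_add(bf16_add(256.0, 1.0), 1.0)  # each +1 is rounded away
grouped = bf16_add(256.0, bf16_add(1.0, 1.0))        # 1 + 1 = 2 survives

print(left_to_right, grouped)  # 256.0 258.0
```

At magnitude 256 adjacent bfloat16 values are 2 apart, so a lone `+1` lands on a tie and rounds back down, while the pre-summed `+2` is exact.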
Lines changed: 9 additions & 9 deletions
@@ -1,10 +1,10 @@
 category,metric,helico,published_protenix,delta
-interface_antibody_antigen,success_rate_pct,5.405405405405405,38.36,-32.954594594594596
-interface_protein_dna,success_rate_pct,33.70165745856354,67.63,-33.92834254143646
-interface_protein_ligand,success_rate_pct,15.023474178403756,53.25,-38.22652582159624
-interface_protein_peptide,success_rate_pct,20.0,,
-interface_protein_protein,success_rate_pct,15.32258064516129,64.8,-49.47741935483871
-interface_protein_rna,success_rate_pct,12.82051282051282,56.41,-43.58948717948718
-monomer_dna,mean_lddt,0.4480298656637552,0.44,0.008029865663755187
-monomer_protein,mean_lddt,0.7901748017333817,,
-monomer_rna,mean_lddt,0.5286690062563578,0.59,-0.06133099374364215
+interface_antibody_antigen,success_rate_pct,30.43478260869565,38.36,-7.925217391304351
+interface_protein_dna,success_rate_pct,46.66666666666666,67.63,-20.96333333333334
+interface_protein_ligand,success_rate_pct,33.165829145728644,53.25,-20.084170854271356
+interface_protein_peptide,success_rate_pct,42.85714285714285,,
+interface_protein_protein,success_rate_pct,33.6283185840708,64.8,-31.1716814159292
+interface_protein_rna,success_rate_pct,31.818181818181817,56.41,-24.59181818181818
+monomer_dna,mean_lddt,0.5215197077682251,0.44,0.08151970776822509
+monomer_protein,mean_lddt,0.8295189435491753,,
+monomer_rna,mean_lddt,0.6016747158792708,0.59,0.011674715879270825
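The `delta` column appears to be `helico - published_protenix`, with empty cells where no published value exists. A minimal check, using the post-fix rows from the diff above, recomputes that difference and the Helico/Protenix ratio. The variable names are illustrative and not taken from the repository.

```python
import csv
import io

# Post-fix rows copied from the summary CSV in this commit.
SUMMARY_CSV = """\
category,metric,helico,published_protenix,delta
interface_antibody_antigen,success_rate_pct,30.43478260869565,38.36,-7.925217391304351
interface_protein_dna,success_rate_pct,46.66666666666666,67.63,-20.96333333333334
interface_protein_ligand,success_rate_pct,33.165829145728644,53.25,-20.084170854271356
interface_protein_peptide,success_rate_pct,42.85714285714285,,
interface_protein_protein,success_rate_pct,33.6283185840708,64.8,-31.1716814159292
interface_protein_rna,success_rate_pct,31.818181818181817,56.41,-24.59181818181818
monomer_dna,mean_lddt,0.5215197077682251,0.44,0.08151970776822509
monomer_protein,mean_lddt,0.8295189435491753,,
monomer_rna,mean_lddt,0.6016747158792708,0.59,0.011674715879270825
"""

rows = list(csv.DictReader(io.StringIO(SUMMARY_CSV)))
ratios = {}
for row in rows:
    if not row["published_protenix"]:
        continue  # no published number for this category
    helico = float(row["helico"])
    published = float(row["published_protenix"])
    # The stored delta should equal helico minus published.
    assert abs((helico - published) - float(row["delta"])) < 1e-9
    ratios[row["category"]] = 100.0 * helico / published

for category, pct in ratios.items():
    print(f"{category}: {pct:.1f}% of published Protenix")
```

Computed from the unrounded CSV values, monomer_dna comes out at about 118.5%, so the 118% in the README table depends on rounding the inputs to two decimals first.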
