Feat/add folmsbee conformer benchmark by lwehrhan · Pull Request #429 · ddmms/ml-peg

lwehrhan · 2026-03-16T16:03:38Z

Pre-review checklist for PR author

PR author must check the checkboxes below when creating the PR.

I've confirmed the contribution guidelines.

Summary

This PR adds the Folmsbee dataset of low-energy conformers of drug-like molecules. Compared with the Wiggle500 dataset, the energy differences are smaller and the number of molecules is larger. The highest available level of theory for the ground-truth energies is DLPNO-CCSD(T). This is a test for moving the mlip-audit benchmarks into this repository. I have included an analysis script for this benchmark, but would kindly ask for assistance with building and harmonizing the Dash layout.

Linked issue

Resolves #427

Progress

Calculations
Analysis
Application
Documentation

Testing

New decorators/callbacks

joehart2001 · 2026-03-18T14:22:51Z

Hi @lwehrhan, thank you for your PR, it's looking great overall! A few things:

would you be able to share the data file so I can upload it to our S3 bucket and test that the calc and analysis are running as expected?
I have pushed the app and also the metrics.yml, would you be able to check over this metrics file to make sure it's correct?

Once we've got the data file uploaded, I think we can make a few changes to the calc script for consistency with similar benchmarks, but I think the changes will be minor.

Just a note, make sure to fetch any changes I've made before working locally, otherwise your next push may overwrite my changes.

thanks!

joehart2001 · 2026-04-13T14:17:09Z

Hey @lwehrhan, we've just merged a PR so that you can tag your benchmark with mlip-audit, add your logo, and have your own dedicated tab (PR #434). Please see the framework credit tags docs.

lwalew · 2026-06-02T16:03:14Z

+            i = int(conf_str)
+            molecule = result_by_name[mol_name]
+
+            results[model_name].append(float(molecule.predicted_energy_profile[i]))


This might be None if a molecule had an unsupported element.

This would be an actual problem in this case. I think we will have to drop the molecules with unsupported elements or skip the benchmark entirely in that case (as in mlip audit). We cannot test for supported elements because the calculators do not expose the supported elements.
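
A minimal sketch of the "drop molecules with unsupported elements" option discussed above. It assumes a per-molecule result object with a `predicted_energy_profile` attribute that is `None` when the model could not handle the molecule; the `MoleculeResult` class and `collect_energies` helper are hypothetical names for illustration, not the code in this PR.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class MoleculeResult:
    """Hypothetical stand-in for the per-molecule result object."""

    name: str
    predicted_energy_profile: Optional[Sequence[float]]


def collect_energies(labels, result_by_name):
    """Collect predicted energies, skipping molecules without a profile.

    `labels` is an iterable of (mol_name, conf_str) pairs.
    Returns the list of energies and the set of skipped molecule names.
    """
    energies = []
    skipped = set()
    for mol_name, conf_str in labels:
        profile = result_by_name[mol_name].predicted_energy_profile
        if profile is None:
            # Assumed meaning: the model failed on this molecule
            # (e.g. unsupported element), so record it and move on.
            skipped.add(mol_name)
            continue
        energies.append(float(profile[int(conf_str)]))
    return energies, skipped
```

Returning the skipped names alongside the energies keeps the information needed to decide later whether the model should be scored at all.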

joehart2001 · 2026-06-20T14:11:28Z

Hey @lwehrhan what’s the status on the unresolved comments? I’ll tag @ElliottKasoar so he can take a look over the benchmark

lwehrhan · 2026-06-24T07:26:32Z

@joehart2001 I have addressed the comments now. The only open issue (this will also apply to some of the other benchmarks) is that we cannot test the supported elements of the models prior to running the benchmark like we do in mlip audit, which sometimes may lead to bugs. I saw this is handled "on-the-fly" e.g. in physicality/diatomics. Here, not supporting molecules is not penalized. Is that the expected behavior for these benchmarks here?

ElliottKasoar · 2026-06-24T10:52:44Z

> @joehart2001 I have addressed the comments now. The only open issue (this will also apply to some of the other benchmarks) is that we cannot test the supported elements of the models prior to running the benchmark like we do in mlip audit, which sometimes may lead to bugs. I saw this is handled "on-the-fly" e.g. in physicality/diatomics. Here, not supporting molecules is not penalized. Is that the expected behavior for these benchmarks here?

Hi @leonwehrhan, can you please take a look at the (very new) guidance for element filtering: https://ddmms.github.io/ml-peg/developer_guide/filter.html.

Essentially, in an ideal world, we'd catch any errors that occur during calculations, so you shouldn't need to know what calculators support, but have an awareness that they failed for certain molecules, which could be due to element support or otherwise.

Then in the analysis, if it fails for any of the molecules, the model can't do the benchmark, so it would get a score of NaN/None. The current implementation of filtering allows us to exclude this benchmark if any matches are found to elements required in the entire benchmark, but the aim is to allow specific molecules to be excluded - see #625 for an example.

If it wouldn't be possible to set things up to do this partial filtering (e.g. if you can't catch errors and continue for other molecules), it's also fine to only implement the current 'binary' filter, but we would still rather say a model can't do a benchmark than have the scores represent different things for different models.

The diatomics example is probably not the best guide, since it came before we added filtering, and so the approach is not quite as consistent.
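
The approach described in this comment (catch errors per molecule, then report NaN if anything failed) could look roughly like the sketch below. The function names, the dictionary-based inputs, and the use of MAE as the metric are assumptions for illustration; this is not the ML-PEG filtering API.

```python
import math


def run_all(molecules, compute_energy):
    """Run `compute_energy` on every molecule, recording failures.

    `molecules` maps a name to whatever `compute_energy` accepts.
    Returns (energies, failures), both keyed by molecule name.
    """
    energies, failures = {}, {}
    for name, mol in molecules.items():
        try:
            energies[name] = compute_energy(mol)
        except Exception as err:  # deliberately broad: the cause is unknown
            failures[name] = repr(err)
    return energies, failures


def benchmark_mae(energies, references, failures):
    """Return the MAE over all molecules, or NaN if any calculation failed."""
    if failures:
        # A partial result would not be comparable between models.
        return math.nan
    errors = [abs(energies[name] - references[name]) for name in references]
    return sum(errors) / len(errors)
```

The calculation step never needs to know which elements a calculator supports; it only records that a molecule failed, and the analysis step decides what that means for the score.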

ElliottKasoar · 2026-06-24T15:41:17Z

+    good: 0.0
+    bad: 20.0


How much consideration has gone into these/are you familiar with the principles we have in mind for setting them?

If you're happy with them, great! We're happy to clarify/discuss more though

Hi @ElliottKasoar. These were indeed placeholders copied from the DipCONFS benchmark, I think a "bad" threshold of 2.0 kcal/mol is more sensible. For the final scoring, we use different scoring functions in MLIP Audit that differ per benchmark, e.g. here working with a per-molecule threshold. I have pushed an update to the metrics.yaml and analyse_folmsbee.py files that would read the MLIP Audit scores directly from the benchmark analysis and use this for the final score computation. This would allow easier maintainability from our side and also align the benchmark scores of MLIP audit and ML-PEG. MAE and other metrics are still presented in the metrics table. Is this acceptable for the ML-PEG implementation of the MLIP audit benchmarks?
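
To illustrate why the choice of `good` and `bad` matters, here is a sketch that assumes a simple linear, clamped mapping between the two thresholds. This mapping is an assumption made for the example; the normalisation ML-PEG actually applies, and the per-molecule scoring in MLIP Audit, may differ.

```python
def normalise(value, good=0.0, bad=2.0):
    """Map a metric to [0, 1]: 1.0 at `good`, 0.0 at `bad`, linear in between.

    Values beyond either threshold are clamped. Assumed scheme, for
    illustration only.
    """
    if good == bad:
        raise ValueError("good and bad thresholds must differ")
    score = (bad - value) / (bad - good)
    return min(1.0, max(0.0, score))


# The same MAE of 1.0 kcal/mol under the two thresholds discussed above.
score_with_placeholder = normalise(1.0, good=0.0, bad=20.0)
score_with_proposed = normalise(1.0, good=0.0, bad=2.0)
```

Under this assumed mapping, a `bad` threshold of 20.0 leaves almost no room to distinguish models whose errors are all around 1 kcal/mol, while 2.0 spreads them across the score range.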

ElliottKasoar · 2026-06-24T16:17:20Z

Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowded otherwise.

ElliottKasoar · 2026-06-30T16:26:58Z

Also note there is now a small conflict due to changes we've made to frameworks.yml. Please could this be resolved?

lwehrhan · 2026-07-01T17:30:53Z

> Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowded otherwise.

my understanding is that this would mean dropping the per-conformer structure viewer?

joehart2001 · 2026-07-01T19:58:37Z

> > Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowded otherwise.
>
> my understanding is that this would mean dropping the per-conformer structure viewer?

we can still visualise structures with the density plots (check e.g. the elasticity benchmark), but we could also use the plot_from_table_cell decorator to do a single scatter plot per model instead of all the models' scatters being combined into one. opinions @ElliottKasoar ?
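
For context on what a density scatter does with a large number of points, here is a rough standard-library sketch that bins (reference, predicted) pairs onto a grid and counts points per cell. It is not the ML-PEG density scatter implementation; `bin_points` and its parameters are hypothetical.

```python
from collections import Counter


def bin_points(ref, pred, n_bins=50):
    """Count (reference, predicted) pairs per cell of an n_bins x n_bins grid.

    Both axes share the same range so the diagonal stays meaningful.
    Returns a Counter keyed by (ref_bin, pred_bin).
    """
    lo = min(min(ref), min(pred))
    hi = max(max(ref), max(pred))
    # Fall back to a unit width if all values are identical.
    width = (hi - lo) / n_bins or 1.0

    def index(value):
        # Clamp so the maximum value lands in the last bin.
        return min(int((value - lo) / width), n_bins - 1)

    return Counter((index(r), index(p)) for r, p in zip(ref, pred))
```

Each occupied cell can then be drawn once and coloured by its count, so the number of plotted markers is bounded by the grid size rather than by the number of conformers.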

leonwehrhan added 3 commits March 10, 2026 18:36

feat: calc folmsbee

b542fd3

fix: units

e82bc26

feat: add analysis script

cd4e9da

alinelena requested review from ElliottKasoar and joehart2001 March 16, 2026 17:14

joehart2001 reviewed Mar 18, 2026

Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py Outdated

joehart2001 reviewed Mar 18, 2026

Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py

add flomsbee app and metrics.yml

e03bf30

joehart2001 added the "new benchmark" label (Proposals and suggestions for new benchmarks) Mar 19, 2026

leonwehrhan and others added 5 commits May 12, 2026 16:28

feat: calc folmsbee

e2ab058

fix: units

398fd9b

feat: add analysis script

d682fb0

add flomsbee app and metrics.yml

894c1ef

s3 download for calc, calc analysis and app fixes

efb1ea1

joehart2001 force-pushed the feat/add-folmsbee-conformer-benchmark branch from e03bf30 to efb1ea1 Compare May 12, 2026 15:29

joehart2001 and others added 3 commits May 12, 2026 16:38

add framework details

3fad5b7

Merge branch 'feat/add-folmsbee-conformer-benchmark' of https://github.com/lwehrhan/ml-peg into feat/add-folmsbee-conformer-benchmark

c484172

feat: use mlip audit benchmark classes

caf5786

lwalew reviewed May 19, 2026

Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py Outdated

lwalew reviewed May 19, 2026

Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py

lwalew approved these changes Jun 2, 2026

lwalew reviewed Jun 2, 2026

Comment thread pyproject.toml

leonwehrhan added 5 commits June 23, 2026 16:55

docs: add folmsbee docs

e0a3fd2

chore: address comments

7865726

feat: update MAE per molecule calculation

c04d09b

feat: add logo

be2a8af

docs: update computational cost

d5dbfba

chore: revert data loading

a52d70d

joehart2001 mentioned this pull request Jun 24, 2026

feat: add md stability benchmark #644

Open

5 tasks

ElliottKasoar reviewed Jun 24, 2026

leonwehrhan added 2 commits July 1, 2026 17:59

Merge remote-tracking branch 'origin/main' into feat/add-folmsbee-conformer-benchmark

45af22f

fix: udate metrics to include mlip audit conformer score

355dbd6

fix: input filename and calculator precision

bddec04


5 participants
