Skip to content

Feat/add folmsbee conformer benchmark#429

Open
lwehrhan wants to merge 21 commits into
ddmms:mainfrom
lwehrhan:feat/add-folmsbee-conformer-benchmark
Open

Feat/add folmsbee conformer benchmark#429
lwehrhan wants to merge 21 commits into
ddmms:mainfrom
lwehrhan:feat/add-folmsbee-conformer-benchmark

Conversation

@lwehrhan

@lwehrhan lwehrhan commented Mar 16, 2026

Copy link
Copy Markdown

Pre-review checklist for PR author

PR author must check the checkboxes below when creating the PR.

Summary

The Folmsbee dataset of low-energy conformers of drug-like molecules. The differences in energy are smaller compared to the Wiggle500 dataset and it features a greater number of molecules. The highest available level of theory for energy evaluations to be used as ground-truth is DLPNO-CCSD(T). This is a test for moving the benchmarks of mlip-audit into this repository. I have included an analysis script for this benchmark, however would like to kindly ask for assistance with building and harmonizing the Dash layout.

Linked issue

Resolves #427

Progress

  • Calculations
  • Analysis
  • Application
  • Documentation

Testing

New decorators/callbacks

@joehart2001

joehart2001 commented Mar 18, 2026

Copy link
Copy Markdown
Collaborator

Hi @lwehrhan, thank you for your PR and its looking great overall! A few things:

  1. would you be able to share the data file so i can uplaod it to our s3 bucket so i can test the calc and analysis is running as expected?
  2. i have pushed the app and also the metrics.yml, would you be able to check over this metrics file to make sure its correct?

Once we've got the data file uploaded, i think we can make a few changes to the calc script for consistency with similar benchmarks, but i think the changes will be minor.

Just a note, make sure you to fetch any changes i've made before working locally, otherwise your next push may overwrite my changes.

thanks!

Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py Outdated
Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py
@joehart2001 joehart2001 added the new benchmark Proposals and suggestions for new benchmarks label Mar 19, 2026
@joehart2001

Copy link
Copy Markdown
Collaborator

Hey @lwehrhan, we've just merged a PR so that you can tag your benchmark with the mlip-auditl, add your logo and have your own dedicated tab (PR #434). Please see the framework credit tags docs

@joehart2001 joehart2001 force-pushed the feat/add-folmsbee-conformer-benchmark branch from e03bf30 to efb1ea1 Compare May 12, 2026 15:29
Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py Outdated
Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py
Comment thread ml_peg/analysis/conformers/Folmsbee/analyse_Folmsbee.py
i = int(conf_str)
molecule = result_by_name[mol_name]

results[model_name].append(float(molecule.predicted_energy_profile[i]))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be None if a molecule had an unsupported element.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be an actual problem in this case. I think we will have to drop the molecules with unsupported elements or skip the benchmark entirely in that case (as in mlip audit). We cannot test for supported elements because the calculators do not expose the supported elements.

Comment thread pyproject.toml
@joehart2001

Copy link
Copy Markdown
Collaborator

Hey @lwehrhan what’s the status on the unresolved comments? I’ll tag @ElliottKasoar so he can take a look over the benchmark

@lwehrhan

Copy link
Copy Markdown
Author

@joehart2001 I have addressed the comments now. The only open issue (this will also apply to some of the other benchmarks) is that we cannot test the supported elements of the models prior to running the benchmark like we do in mlip audit, which sometimes may lead to bugs. I saw this is handled "on-the-fly" e.g. in physicality/diatomics. Here, not supporting molecules is not penalized. Is that the expected behavior for these benchmarks here?

@ElliottKasoar

ElliottKasoar commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

@joehart2001 I have addressed the comments now. The only open issue (this will also apply to some of the other benchmarks) is that we cannot test the supported elements of the models prior to running the benchmark like we do in mlip audit, which sometimes may lead to bugs. I saw this is handled "on-the-fly" e.g. in physicality/diatomics. Here, not supporting molecules is not penalized. Is that the expected behavior for these benchmarks here?

Hi @leonwehrhan, can you please take a look at the (very new) guidance for element filtering: https://ddmms.github.io/ml-peg/developer_guide/filter.html.

Essentially, in an ideal world, we'd catch any errors that occur during calculations, so you shouldn't need to know what calculators support, but have an awareness that they failed for certain molecules, which could be due to element support or otherwise.

Then in the analysis, if it fails for any of the molecules, the model can't do the benchmark, so it would get a score of NaN/None. The current implementation of filtering allows us to exclude this benchmark if any matches are found to elements required in the entire benchmark, but the aim is to allow specific molecules to be excluded - see #625 for an example.

If it wouldn't be possible to set things up to do this partial filtering (e.g. if you can't catch errors and continue for other molecules), it's also fine to only implement the current 'binary' filter, but we would still rather say a model can't do a benchmark than have the scores represent different things for different models.

The diatomics example is probably not the best guide, since it came before we added filtering, and so the approach is not quite as consistent.

Comment thread ml_peg/calcs/conformers/Folmsbee/calc_Folmsbee.py
Comment thread ml_peg/analysis/conformers/Folmsbee/analyse_Folmsbee.py Outdated
Comment on lines +3 to +4
good: 0.0
bad: 20.0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much consideration has gone into these/are you familiar with the principles we have in mind for setting them?

If you're happy with them, great! We're happy to clarify/discuss more though

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ElliottKasoar. These were indeed placeholders copied from the DipCONFS benchmark, I think a "bad" threshold of 2.0 kcal/mol is more sensible. For the final scoring, we use different scoring functions in MLIP Audit that differ per benchmark, e.g. here working with a per-molecule threshold. I have pushed an update to the metrics.yaml and analyse_folmsbee.py files that would read the MLIP Audit scores directly from the benchmark analysis and use this for the final score computation. This would allow easier maintainability from our side and also align the benchmark scores of MLIP audit and ML-PEG. MAE and other metrics are still presented in the metrics table. Is this acceptable for the ML-PEG implementation of the MLIP audit benchmarks?

@ElliottKasoar

Copy link
Copy Markdown
Collaborator

Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowed otherwise?

@ElliottKasoar

Copy link
Copy Markdown
Collaborator

Also note there is now a small conflict due to changes we've made to frameworks.yml. Please could this be resolved?

@lwehrhan

lwehrhan commented Jul 1, 2026

Copy link
Copy Markdown
Author

Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowed otherwise?

my understanding is that this would mean dropping the per-conformer structure viewer?

@joehart2001

Copy link
Copy Markdown
Collaborator

Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowed otherwise?

my understanding is that this would mean dropping the per-conformer structure viewer?

we can still visualise structures with the denity plots (check e.g. elasticity benchmark), but we could also use the plot_from_table_cell decorator to do a single scatter plot per model instead of all the models' scatters being combined into one. opinions @ElliottKasoar ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new benchmark Proposals and suggestions for new benchmarks

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: new benchmark with folmsbee conformer dataset

5 participants