Feat/add folmsbee conformer benchmark#429
Conversation
|
Hi @lwehrhan, thank you for your PR and its looking great overall! A few things:
Once we've got the data file uploaded, i think we can make a few changes to the calc script for consistency with similar benchmarks, but i think the changes will be minor. Just a note, make sure you to fetch any changes i've made before working locally, otherwise your next push may overwrite my changes. thanks! |
|
Hey @lwehrhan, we've just merged a PR so that you can tag your benchmark with the mlip-auditl, add your logo and have your own dedicated tab (PR #434). Please see the framework credit tags docs |
e03bf30 to
efb1ea1
Compare
…b.com/lwehrhan/ml-peg into feat/add-folmsbee-conformer-benchmark
| i = int(conf_str) | ||
| molecule = result_by_name[mol_name] | ||
|
|
||
| results[model_name].append(float(molecule.predicted_energy_profile[i])) |
There was a problem hiding this comment.
This might be None if a molecule had an unsupported element.
There was a problem hiding this comment.
This would be an actual problem in this case. I think we will have to drop the molecules with unsupported elements or skip the benchmark entirely in that case (as in mlip audit). We cannot test for supported elements because the calculators do not expose the supported elements.
|
Hey @lwehrhan what’s the status on the unresolved comments? I’ll tag @ElliottKasoar so he can take a look over the benchmark |
|
@joehart2001 I have addressed the comments now. The only open issue (this will also apply to some of the other benchmarks) is that we cannot test the supported elements of the models prior to running the benchmark like we do in mlip audit, which sometimes may lead to bugs. I saw this is handled "on-the-fly" e.g. in |
Hi @leonwehrhan, can you please take a look at the (very new) guidance for element filtering: https://ddmms.github.io/ml-peg/developer_guide/filter.html. Essentially, in an ideal world, we'd catch any errors that occur during calculations, so you shouldn't need to know what calculators support, but have an awareness that they failed for certain molecules, which could be due to element support or otherwise. Then in the analysis, if it fails for any of the molecules, the model can't do the benchmark, so it would get a score of If it wouldn't be possible to set things up to do this partial filtering (e.g. if you can't catch errors and continue for other molecules), it's also fine to only implement the current 'binary' filter, but we would still rather say a model can't do a benchmark than have the scores represent different things for different models. The diatomics example is probably not the best guide, since it came before we added filtering, and so the approach is not quite as consistent. |
| good: 0.0 | ||
| bad: 20.0 |
There was a problem hiding this comment.
How much consideration has gone into these/are you familiar with the principles we have in mind for setting them?
If you're happy with them, great! We're happy to clarify/discuss more though
There was a problem hiding this comment.
Hi @ElliottKasoar. These were indeed placeholders copied from the DipCONFS benchmark, I think a "bad" threshold of 2.0 kcal/mol is more sensible. For the final scoring, we use different scoring functions in MLIP Audit that differ per benchmark, e.g. here working with a per-molecule threshold. I have pushed an update to the metrics.yaml and analyse_folmsbee.py files that would read the MLIP Audit scores directly from the benchmark analysis and use this for the final score computation. This would allow easier maintainability from our side and also align the benchmark scores of MLIP audit and ML-PEG. MAE and other metrics are still presented in the metrics table. Is this acceptable for the ML-PEG implementation of the MLIP audit benchmarks?
|
Given how many points are in the scatter plot, I also wonder if we may want to use the density scatter instead? I think beyond a handful of models the plots will get very crowed otherwise? |
|
Also note there is now a small conflict due to changes we've made to frameworks.yml. Please could this be resolved? |
my understanding is that this would mean dropping the per-conformer structure viewer? |
we can still visualise structures with the denity plots (check e.g. elasticity benchmark), but we could also use the |
Pre-review checklist for PR author
PR author must check the checkboxes below when creating the PR.
Summary
The Folmsbee dataset of low-energy conformers of drug-like molecules. The differences in energy are smaller compared to the Wiggle500 dataset and it features a greater number of molecules. The highest available level of theory for energy evaluations to be used as ground-truth is DLPNO-CCSD(T). This is a test for moving the benchmarks of mlip-audit into this repository. I have included an analysis script for this benchmark, however would like to kindly ask for assistance with building and harmonizing the Dash layout.
Linked issue
Resolves #427
Progress
Testing
New decorators/callbacks