Add tautomer filter#760
Conversation
for more information, see https://pre-commit.ci
Codecov Report❌ Patch coverage is 🚀 New features to boost your workflow:
|
mattwthompson
left a comment
There was a problem hiding this comment.
I'm still working through this, but I'm leaving a partial review in case my browser is hungry enough to eat these when I come back to it later
| categories_to_exclude=["BETA_DIKETONE"], | ||
| ) | ||
|
|
||
| with pytest.raises(ValidationError): |
There was a problem hiding this comment.
Please add match=... to each of these, since Pydantic raises the same error for each failure case
mattwthompson
left a comment
There was a problem hiding this comment.
Approving (comments are non-blocking, re-review only if you wish) leaning somewhat on this being functional in your live use cases. This is maybe 90% of the way towards a thorough review in ~half the time, a tradeoff I'm happy with given that this codebase is only somewhat public-facing and the key stakeholder appears happy with the behavior.
I didn't go through every SMARTS pattern in the rule definitions in the source code - I can if we want to - but I did look fairly closely at the tests and the behavior. This took me a few passes through to understand, including refreshing myself on organic chemistry that's been lost to memory. (I'm not suggesting process improvements, the changes are highly interconnected it's not clear how this could be broken up; most of the work is setting up the logic, so separate PRs for adding in tautomerization chemistries bit-by-bit would be ... perhaps tedious.)
| try: | ||
| tautomers.append(Molecule.from_smiles(smi, allow_undefined_stereo=True)) | ||
| except Exception: | ||
| pass |
There was a problem hiding this comment.
What failures are lost in this? The there cases in which Chem.MolToSmiles returns something but Molecule.from_smiles(smi, allow_undefined_stereo=True) chokes are
- Multi-component (hopefully not triggered by now!)
- Radicals (ditto)
- Kekulization failures ... ?
There was a problem hiding this comment.
I think this was some preemptive caution since I have seen RDKit molecules fail to round trip before due to some issue with the bond table. We could probably undo the try/except -- I tested the public ThermoML I had cached and it didn't seem to run into any issues in the end.
There was a problem hiding this comment.
Either approach is fine by me - in principle undoing it would be better but it may not be worth the uncertainty
Co-authored-by: Matt Thompson <matt.thompson@openforcefield.org>
Description
Adds functions for classifying, and filtering out, molecules that may be tautomers. The rules and examples are plotted below. There are a lot of considerations as to which and how much a form is dominant, so I don't recommend whitelisting some categories. Others, of course, are very common, like keto/enol tautomerism.
I initially considered simply removing any molecule that had possible tautomers generaetd by RDKit's EnumerateTautomers function, but that would flag very simple molecules like acetone -- hence the need to overcomplicate and classify.
As mentioned very quickly in the FF fitting meeting today, this only would really apply to a small subset of ThermoML (~400 unique molecules), and the classified molecules largley did not raise red flags, save perhaps the ANNULAR_AZOLE category. This unfortunately was me throwing my hands up at the multitude of considerations of ring nitrogen tautomers and dumping them all into the one category.