Skip to content

Add tautomer filter#760

Merged
lilyminium merged 4 commits into
mainfrom
add-tautomer-filter
May 27, 2026
Merged

Add tautomer filter#760
lilyminium merged 4 commits into
mainfrom
add-tautomer-filter

Conversation

@lilyminium
Copy link
Copy Markdown
Contributor

Description

Adds functions for classifying, and filtering out, molecules that may be tautomers. The rules and examples are plotted below. There are a lot of considerations as to which and how much a form is dominant, so I don't recommend whitelisting some categories. Others, of course, are very common, like keto/enol tautomerism.

I initially considered simply removing any molecule that had possible tautomers generaetd by RDKit's EnumerateTautomers function, but that would flag very simple molecules like acetone -- hence the need to overcomplicate and classify.

As mentioned very quickly in the FF fitting meeting today, this only would really apply to a small subset of ThermoML (~400 unique molecules), and the classified molecules largley did not raise red flags, save perhaps the ANNULAR_AZOLE category. This unfortunately was me throwing my hands up at the multitude of considerations of ring nitrogen tautomers and dumping them all into the one category.

tautomer_rules tautomers-summary

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 1, 2026

Codecov Report

❌ Patch coverage is 95.83333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.44%. Comparing base (adbf9e1) to head (d8c1053).
⚠️ Report is 5 commits behind head on main.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lilyminium lilyminium marked this pull request as ready for review April 1, 2026 11:20
Copy link
Copy Markdown
Member

@mattwthompson mattwthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still working through this, but I'm leaving a partial review in case my browser is hungry enough to eat these when I come back to it later

Comment thread openff/evaluator/datasets/curation/components/filtering.py Outdated
categories_to_exclude=["BETA_DIKETONE"],
)

with pytest.raises(ValidationError):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add match=... to each of these, since Pydantic raises the same error for each failure case

Copy link
Copy Markdown
Member

@mattwthompson mattwthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving (comments are non-blocking, re-review only if you wish) leaning somewhat on this being functional in your live use cases. This is maybe 90% of the way towards a thorough review in ~half the time, a tradeoff I'm happy with given that this codebase is only somewhat public-facing and the key stakeholder appears happy with the behavior.

I didn't go through every SMARTS pattern in the rule definitions in the source code - I can if we want to - but I did look fairly closely at the tests and the behavior. This took me a few passes through to understand, including refreshing myself on organic chemistry that's been lost to memory. (I'm not suggesting process improvements, the changes are highly interconnected it's not clear how this could be broken up; most of the work is setting up the logic, so separate PRs for adding in tautomerization chemistries bit-by-bit would be ... perhaps tedious.)

Comment on lines +189 to +192
try:
tautomers.append(Molecule.from_smiles(smi, allow_undefined_stereo=True))
except Exception:
pass
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What failures are lost in this? The there cases in which Chem.MolToSmiles returns something but Molecule.from_smiles(smi, allow_undefined_stereo=True) chokes are

  • Multi-component (hopefully not triggered by now!)
  • Radicals (ditto)
  • Kekulization failures ... ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was some preemptive caution since I have seen RDKit molecules fail to round trip before due to some issue with the bond table. We could probably undo the try/except -- I tested the public ThermoML I had cached and it didn't seem to run into any issues in the end.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either approach is fine by me - in principle undoing it would be better but it may not be worth the uncertainty

lilyminium and others added 2 commits April 16, 2026 12:16
Co-authored-by: Matt Thompson <matt.thompson@openforcefield.org>
@lilyminium lilyminium merged commit 561513d into main May 27, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants