Summary
When SAFEConverter.encoder() is called on a scaffold containing attachment points [*], each attachment point is replaced with a ring-closure digit in the encoded string, and decoding that string yields a complete closed molecule. This breaks scaffold decoration because there's no attachment site left to decorate.
Environment
- SAFE version: (from finetune_safe fork based on datamol-io/safe)
- Python version: 3.10
- Usage context: RL fine-tuning with scaffold-constrained generation
Reproduction
from safe import SAFEConverter
import safe as sf
encoder = SAFEConverter()
# Purine scaffold with attachment point
scaffold_smiles = 'O=c1[nH]cnc2nc([*])ccc12'
# Encode with fragmentation disabled (as done in scaffold_decoration)
with sf.utils.attr_as(encoder, 'slicer', None):
    encoded = encoder.encoder(scaffold_smiles, allow_empty=True)
print(f'Original: {scaffold_smiles}') # O=c1[nH]cnc2nc([*])ccc12
print(f'Encoded: {encoded}') # O=c1[nH]cnc2nc3ccc12 ← [*] became 3!
print(f'Decoded: {encoder.decoder(encoded)}') # O=c1[nH]cnc2ncccc12
Output:
Original: O=c1[nH]cnc2nc([*])ccc12 ← Has [*] attachment point
Encoded: O=c1[nH]cnc2nc3ccc12 ← [*] replaced with ring closure 3
Decoded: O=c1[nH]cnc2ncccc12 ← Complete closed molecule
Additional Test Cases
Simple benzene:
scaffold = 'c1ccc([*])cc1'
# Encoded: c1ccc2cc1 ← [*] removed
# Decoded: c1ccccc1 ← Closed ring
Multiple attachment points:
scaffold = 'c1cc([*])ccc1[*]' # 2 attachment points
# Encoded: c1cc2ccc13 ← Both [*] became ring closures
# Decoded: c1ccccc1 ← All attachment points lost
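In all three encoded strings above, the digit that took the place of [*] appears only once, so it is a ring-closure label that is opened but never closed. The sketch below lists such unpaired labels using plain string handling; it needs neither SAFE nor RDKit. `unpaired_ring_closures` is a hypothetical helper written for this report, and it assumes labels are single digits or the %NN form.

```python
import re

# A ring-bond label is a single digit or %NN. Bracket atoms are matched first
# so that isotopes, charges and atom maps inside [...] are not read as labels.
_TOKEN = re.compile(r"\[[^\]]*\]|%\d{2}|\d")


def unpaired_ring_closures(smiles: str) -> list[int]:
    """Return ring-closure labels that are opened but never closed."""
    open_labels = set()
    for token in _TOKEN.findall(smiles):
        if token.startswith("["):
            continue  # bracket atom such as [nH] or [*:1]
        label = int(token.lstrip("%"))
        # A label toggles: the first occurrence opens it, the next closes it.
        if label in open_labels:
            open_labels.remove(label)
        else:
            open_labels.add(label)
    return sorted(open_labels)


for encoded in ["O=c1[nH]cnc2nc3ccc12", "c1ccc2cc1", "c1cc2ccc13"]:
    print(encoded, unpaired_ring_closures(encoded))
# O=c1[nH]cnc2nc3ccc12 [3]
# c1ccc2cc1 [2]
# c1cc2ccc13 [2, 3]
```

If the unpaired label is how the encoder marks the open site, the attachment information may still be present in the encoded string and only disappear on decoding.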
Impact
This issue affects the scaffold_decoration() method in safe/sample.py:
- scaffold_decoration() calls _completion() (line 662)
- _completion() encodes the scaffold using the same parameters (lines 325-332):
with sf.utils.attr_as(self.safe_encoder, "slicer", None):
    encoded_fragment = self.safe_encoder.encoder(
        fragment,
        canonical=False,
        randomize=True,
        constraints=None,
        allow_empty=True,  # ← Same parameter
        seed=new_seed,
    )
- The encoded scaffold (with attachment points removed) is used as a generation prefix
- The model generates from a complete closed molecule instead of decorating attachment points
Result: When using scaffold decoration for RL fine-tuning, the model generates minimal molecules (e.g., methane) instead of scaffold-containing structures because the scaffold is already complete.
Questions
- Is this the intended behavior for scaffold_decoration()?
- How should scaffolds with [*] be encoded for use as generation prefixes while preserving attachment points?
- Is there a way to preserve attachment points during SAFE encoding?
- Should we use a different approach for scaffold-constrained generation in RL settings?
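To make the third question concrete, here is a string-level sketch of one possible workaround, not a claim about how SAFE is meant to be used. It assumes that every unpaired ring-closure label in the encoded string marks a position where [*] used to be, and it rewrites each one as a ([*]) branch. `restore_attachment_points` is a hypothetical helper, not part of the SAFE API, and its output has not been validated with RDKit or as a generation prefix.

```python
import re

# Token kinds: bracket atom | ring bond (optional bond symbol + label) | other.
_TOKEN = re.compile(r"(\[[^\]]*\])|([-=#$:/\\]?)(%\d{2}|\d)|(.)")


def restore_attachment_points(encoded: str) -> str:
    """Rewrite every unpaired ring-closure label as a ([*]) branch."""
    tokens = list(_TOKEN.finditer(encoded))

    # Pass 1: find ring bonds whose label is opened but never closed.
    open_at = {}
    for i, m in enumerate(tokens):
        if m.group(3) is None:
            continue  # not a ring bond
        label = int(m.group(3).lstrip("%"))
        if label in open_at:
            del open_at[label]  # second occurrence closes the ring
        else:
            open_at[label] = i
    dangling = set(open_at.values())

    # Pass 2: rebuild the string. Branches are held back until the atom's
    # remaining ring bonds have been written, so they follow the ring bonds.
    out, pending = [], []
    for i, m in enumerate(tokens):
        if i in dangling:
            pending.append(f"({m.group(2)}[*])")  # keep any bond symbol
            continue
        if m.group(3) is None and pending:
            out.extend(pending)
            pending.clear()
        out.append(m.group(0))
    out.extend(pending)
    return "".join(out)


print(restore_attachment_points("O=c1[nH]cnc2nc3ccc12"))  # O=c1[nH]cnc2nc([*])ccc12
print(restore_attachment_points("c1cc2ccc13"))            # c1cc([*])ccc1([*])
```

For the first scaffold this reproduces the original input string exactly. For the dual-attachment case it produces a differently written string with the same two attachment positions.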
Use Case
We're using SAFE for RL-based molecular optimization with scaffold constraints:
- User provides scaffold SMILES with [*] attachment points
- Scaffold should be SAFE-encoded for use as training prefix
- Model should learn to generate molecules that start with the scaffold and add decorations at attachment points
- Current behavior: Model generates from closed scaffolds, produces minimal molecules
Is scaffold decoration designed to work differently than we expect, or is this a bug in the encoding logic?
Test Script
I've created a comprehensive test script that demonstrates the issue:
#!/usr/bin/env python
"""Test SAFE scaffold encoding with attachment points."""
from safe import SAFEConverter
import safe as sf
encoder = SAFEConverter()
test_cases = [
    'O=c1[nH]cnc2nc([*])ccc12',  # Purine
    'c1ccc([*])cc1',  # Benzene
    'c1cc([*])ccc1[*]',  # Dual attachment
]
for scaffold in test_cases:
    # Disable fragmentation, as scaffold_decoration does
    with sf.utils.attr_as(encoder, 'slicer', None):
        encoded = encoder.encoder(scaffold, allow_empty=True)
    decoded = encoder.decoder(encoded)
    print(f"Original: {scaffold}")
    print(f"Encoded: {encoded}")
    print(f"Decoded: {decoded}")
    print(f"[*] preserved: {('[*]' in decoded)}")
    print()
All test cases lose their attachment points during encoding.
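The `'[*]' in decoded` check in the script only matches the literal string [*]. A SMILES writer may emit a dummy atom in other forms (RDKit, for example, typically writes a plain dummy atom as a bare *), so a slightly more tolerant check is sketched below. `count_attachment_points` and `attachment_points_preserved` are hypothetical helpers for this report. They work on strings only and assume dummy atoms are written as *, [*], [*:n] or [n*].

```python
import re

# A dummy atom is either a bracket atom whose symbol is * (optionally with an
# isotope before it and a map number or other properties after it) or a bare *.
_DUMMY = re.compile(r"\[\d*\*[^\]]*\]|\*")


def count_attachment_points(smiles: str) -> int:
    """Count dummy atoms in a SMILES string."""
    return len(_DUMMY.findall(smiles))


def attachment_points_preserved(original: str, decoded: str) -> bool:
    """True if the decoded string has as many dummy atoms as the original."""
    return count_attachment_points(original) == count_attachment_points(decoded)


# Inputs and decoded outputs reported above.
print(attachment_points_preserved("c1ccc([*])cc1", "c1ccccc1"))     # False
print(attachment_points_preserved("c1cc([*])ccc1[*]", "c1ccccc1"))  # False
# A bare * in the decoded string would still count as preserved.
print(attachment_points_preserved("c1ccc([*])cc1", "*c1ccccc1"))    # True
```

With the decoded strings reported above, this check still returns False for every test case.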