
Scaffold encoding removes [*] attachment points, breaking scaffold decoration #67

@naglemi

Summary

When SAFEConverter.encoder() is called on a scaffold containing attachment points [*], each [*] is replaced with a ring-closure digit, and decoding the result yields a complete, closed molecule. This breaks scaffold decoration because no attachment site is left to decorate.

Environment

  • SAFE version: (from finetune_safe fork based on datamol-io/safe)
  • Python version: 3.10
  • Usage context: RL fine-tuning with scaffold-constrained generation

Reproduction

from safe import SAFEConverter
import safe as sf

encoder = SAFEConverter()

# Purine scaffold with attachment point
scaffold_smiles = 'O=c1[nH]cnc2nc([*])ccc12'

# Encode with fragmentation disabled (as done in scaffold_decoration)
with sf.utils.attr_as(encoder, 'slicer', None):
    encoded = encoder.encoder(scaffold_smiles, allow_empty=True)

print(f'Original:  {scaffold_smiles}')  # O=c1[nH]cnc2nc([*])ccc12
print(f'Encoded:   {encoded}')           # O=c1[nH]cnc2nc3ccc12  ← [*] became 3!
print(f'Decoded:   {encoder.decoder(encoded)}')  # O=c1[nH]cnc2ncccc12

Output:

Original:  O=c1[nH]cnc2nc([*])ccc12  ← Has [*] attachment point
Encoded:   O=c1[nH]cnc2nc3ccc12      ← [*] replaced with ring closure 3
Decoded:   O=c1[nH]cnc2ncccc12       ← Complete closed molecule
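
One way to inspect what happened to the attachment point is to look for ring-closure labels in the encoded string that are opened but never closed. The sketch below is plain Python with no SAFE or RDKit dependency. `unpaired_ring_closures` is a hypothetical helper name, and the parsing is deliberately simplified: it assumes any digit outside a bracket atom is a ring-closure label and handles only single-digit and `%NN` labels.

```python
import re


def unpaired_ring_closures(smiles: str) -> list[str]:
    """Return ring-closure labels that are opened but never closed.

    Simplified, illustrative parser (an assumption, not the SAFE API).
    """
    # Drop bracket atoms first so digits in e.g. [nH] or [13C] are not counted.
    stripped = re.sub(r"\[[^\]]*\]", "", smiles)
    open_labels: list[str] = []
    for label in re.findall(r"%\d{2}|\d", stripped):
        if label in open_labels:
            open_labels.remove(label)  # second occurrence closes the ring
        else:
            open_labels.append(label)  # first occurrence opens it
    return open_labels


print(unpaired_ring_closures("O=c1[nH]cnc2nc([*])ccc12"))  # original: []
print(unpaired_ring_closures("O=c1[nH]cnc2nc3ccc12"))      # encoded:  ['3']
```

On the strings reported above, the original scaffold has no dangling label, while the encoded string carries an unpaired `3` at the position where the [*] used to be.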

Additional Test Cases

Simple benzene:

scaffold = 'c1ccc([*])cc1'
# Encoded:  c1ccc2cc1    ← [*] removed
# Decoded:  c1ccccc1     ← Closed ring

Multiple attachment points:

scaffold = 'c1cc([*])ccc1[*]'  # 2 attachment points
# Encoded:  c1cc2ccc13         ← Both [*] became ring closures
# Decoded:  c1ccccc1           ← All attachment points lost
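
The three reported cases can be summarised by counting bracketed dummy atoms before and after the round trip. This is a small sketch in plain Python. `count_attachment_points` and its regex are illustrative assumptions, not part of the SAFE API, and the strings are copied from the outputs reported in this issue rather than recomputed.

```python
import re

# Matches bracketed dummy atoms such as [*], [1*] or [*:1].
ATTACHMENT_RE = re.compile(r"\[\d*\*(?::\d+)?\]")


def count_attachment_points(smiles: str) -> int:
    """Count bracketed dummy atoms in a SMILES string."""
    return len(ATTACHMENT_RE.findall(smiles))


# (original scaffold, decoded string) pairs as reported in this issue.
reported = [
    ("O=c1[nH]cnc2nc([*])ccc12", "O=c1[nH]cnc2ncccc12"),
    ("c1ccc([*])cc1", "c1ccccc1"),
    ("c1cc([*])ccc1[*]", "c1ccccc1"),
]

lost = [
    count_attachment_points(original) - count_attachment_points(decoded)
    for original, decoded in reported
]

for (original, decoded), n_lost in zip(reported, lost):
    print(f"{original} -> {decoded}: {n_lost} attachment point(s) lost")
```

A check like this could run before an encoded scaffold is used as a generation prefix, so that a lost attachment point is flagged early instead of surfacing later as degenerate samples.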

Impact

This issue affects the scaffold_decoration() method in safe/sample.py:

  1. scaffold_decoration() calls _completion() (line 662)
  2. _completion() encodes the scaffold using the same parameters (lines 325-332):
    with sf.utils.attr_as(self.safe_encoder, "slicer", None):
        encoded_fragment = self.safe_encoder.encoder(
            fragment,
            canonical=False,
            randomize=True,
            constraints=None,
            allow_empty=True,  # ← Same parameter
            seed=new_seed,
        )
  3. The encoded scaffold (with attachment points removed) is used as a generation prefix
  4. The model generates from a complete closed molecule instead of decorating attachment points

Result: When using scaffold decoration for RL fine-tuning, the model generates minimal molecules (e.g., methane) instead of scaffold-containing structures because the scaffold is already complete.

Questions

  1. Is this the intended behavior for scaffold_decoration()?
  2. How should scaffolds with [*] be encoded for use as generation prefixes while preserving attachment points?
  3. Is there a way to preserve attachment points during SAFE encoding?
  4. Should we use a different approach for scaffold-constrained generation in RL settings?

Use Case

We're using SAFE for RL-based molecular optimization with scaffold constraints:

  1. User provides scaffold SMILES with [*] attachment points
  2. Scaffold should be SAFE-encoded for use as training prefix
  3. Model should learn to generate molecules that start with the scaffold and add decorations at attachment points
  4. Current behavior: the model generates from closed scaffolds and produces minimal molecules

Is scaffold decoration designed to work differently than we expect, or is this a bug in the encoding logic?

Test Script

I've created a short test script that demonstrates the issue on the three scaffolds above:

#!/usr/bin/env python
"""Test SAFE scaffold encoding with attachment points."""

from safe import SAFEConverter
import safe as sf

encoder = SAFEConverter()

test_cases = [
    'O=c1[nH]cnc2nc([*])ccc12',  # Purine
    'c1ccc([*])cc1',              # Benzene
    'c1cc([*])ccc1[*]',           # Dual attachment
]

for scaffold in test_cases:
    with sf.utils.attr_as(encoder, 'slicer', None):
        encoded = encoder.encoder(scaffold, allow_empty=True)
    decoded = encoder.decoder(encoded)
    
    print(f"Original: {scaffold}")
    print(f"Encoded:  {encoded}")
    print(f"Decoded:  {decoded}")
    print(f"[*] preserved: {'[*]' in decoded}")
    print()

All test cases lose their attachment points during encoding.
