Commit 4fcfca4

svc-bionemo and claude committed
Fix ESM2 tokenizer export: patch tokenizer_config.json to use PreTrainedTokenizerFast

In transformers 5.x, AutoTokenizer serializes the class name as "TokenizersBackend" which is not resolvable by AutoTokenizer.from_pretrained(). Patch the saved tokenizer_config.json after save_pretrained() to force tokenizer_class="PreTrainedTokenizerFast" and remove non-standard fields.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>

1 parent 748ec4c commit 4fcfca4
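
The patch described in the commit message can be sketched as a standalone function. This is only an illustration: the helper name `patch_tokenizer_config` is hypothetical (the commit inlines the same steps in both `export.py` files), and the sample config values are made up to stand in for a file written by `save_pretrained()`. It uses only the standard library, so it does not demonstrate the `AutoTokenizer` behavior itself.

```python
import json
import tempfile
from pathlib import Path


def patch_tokenizer_config(export_dir: Path) -> dict:
    """Hypothetical helper (not part of the commit) mirroring the inlined patch."""
    config_path = export_dir / "tokenizer_config.json"
    with open(config_path, "r") as f:
        config = json.load(f)

    # Force a class name that AutoTokenizer.from_pretrained() can resolve.
    config["tokenizer_class"] = "PreTrainedTokenizerFast"
    # Drop the fields the commit treats as non-standard; pop() tolerates absence.
    config.pop("backend", None)
    config.pop("is_local", None)

    with open(config_path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return config


# Stand-in for the output of save_pretrained(); these values are assumptions.
with tempfile.TemporaryDirectory() as tmp:
    export_dir = Path(tmp)
    sample = {
        "tokenizer_class": "TokenizersBackend",
        "backend": "tokenizers",
        "is_local": True,
        "model_max_length": 1024,
    }
    (export_dir / "tokenizer_config.json").write_text(json.dumps(sample))

    patched = patch_tokenizer_config(export_dir)
    on_disk = json.loads((export_dir / "tokenizer_config.json").read_text())
```

Because `pop()` is called with a default of `None`, the function is safe to run on a config that never had the `backend` or `is_local` keys, and running it twice leaves the file unchanged.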

2 files changed

Lines changed: 20 additions & 0 deletions

bionemo-recipes/models/esm2/export.py

Lines changed: 10 additions & 0 deletions
@@ -81,6 +81,16 @@ def export_hf_checkpoint(tag: str, export_path: Path):
     tokenizer = AutoTokenizer.from_pretrained("esm_fast_tokenizer") # Use our PreTrainedTokenizerFast implementation.
     tokenizer.save_pretrained(export_path / tag)
 
+    # Patch tokenizer_config.json: transformers 5.x saves "TokenizersBackend" which AutoTokenizer cannot resolve.
+    tokenizer_config_path = export_path / tag / "tokenizer_config.json"
+    with open(tokenizer_config_path, "r") as f:
+        tokenizer_config = json.load(f)
+    tokenizer_config["tokenizer_class"] = "PreTrainedTokenizerFast"
+    tokenizer_config.pop("backend", None)
+    tokenizer_config.pop("is_local", None)
+    with open(tokenizer_config_path, "w") as f:
+        json.dump(tokenizer_config, f, indent=2, sort_keys=True)
+
     # Patch the config
     with open(export_path / tag / "config.json", "r") as f:
         config = json.load(f)

bionemo-recipes/recipes/vllm_inference/esm2/export.py

Lines changed: 10 additions & 0 deletions
@@ -87,6 +87,16 @@ def export_hf_checkpoint(tag: str, export_path: Path):
     tokenizer = AutoTokenizer.from_pretrained("esm_fast_tokenizer") # Use our PreTrainedTokenizerFast implementation.
     tokenizer.save_pretrained(export_path / tag)
 
+    # Patch tokenizer_config.json: transformers 5.x saves "TokenizersBackend" which AutoTokenizer cannot resolve.
+    tokenizer_config_path = export_path / tag / "tokenizer_config.json"
+    with open(tokenizer_config_path, "r") as f:
+        tokenizer_config = json.load(f)
+    tokenizer_config["tokenizer_class"] = "PreTrainedTokenizerFast"
+    tokenizer_config.pop("backend", None)
+    tokenizer_config.pop("is_local", None)
+    with open(tokenizer_config_path, "w") as f:
+        json.dump(tokenizer_config, f, indent=2, sort_keys=True)
+
     # Patch the config
     with open(export_path / tag / "config.json", "r") as f:
         config = json.load(f)
