Transformer: fall back to AutoTokenizer when AutoProcessor cannot recognize the model by omarnj-lab · Pull Request #3762 · huggingface/sentence-transformers

omarnj-lab · 2026-05-10T08:52:31Z

Fall back to `AutoTokenizer` when `AutoProcessor` cannot recognize the model

Summary

Restores SentenceTransformer(...) support for text-only models whose tokenizer is registered via auto_map -> AutoTokenizer (typically with trust_remote_code=True) but that don't ship a processor_config.json / preprocessor_config.json / image preprocessor entry.

Before this PR, such models fail at load time with:

ValueError: Unrecognized processing class in <repo>.
Can't instantiate a processor, a tokenizer, an image processor,
a video processor or a feature extractor for this model.
Make sure the repository contains the files of at least one of those processing classes.

even though the same model loads fine via transformers.AutoTokenizer.from_pretrained(..., trust_remote_code=True).

What changed

sentence_transformers/base/modules/transformer.py — when AutoProcessor.from_pretrained raises ValueError("Unrecognized processing class …") (or the equivalent "does not contain …" form), fall back to AutoTokenizer.from_pretrained(...) and assign the result to self.processor. The existing tokenizer property already handles the case where self.processor is itself a PreTrainedTokenizerBase, so no further plumbing is required.

Backward-compatible: if AutoProcessor succeeds, behavior is unchanged. Other ValueErrors (e.g. real config issues) are re-raised as before.
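The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the helper name `load_processor_with_fallback`, the `FALLBACK_MARKERS` tuple, and the substring-matching rule are assumptions made for this example, and the two loaders are passed in as plain callables so the sketch runs without downloading a model.

```python
from typing import Any, Callable

# Substrings of the two ValueError messages the PR description mentions.
# How the actual patch matches them is not shown, so this check is an assumption.
FALLBACK_MARKERS = ("Unrecognized processing class", "does not contain")

Loader = Callable[..., Any]


def load_processor_with_fallback(
    load_processor: Loader,
    load_tokenizer: Loader,
    model_name_or_path: str,
    **kwargs: Any,
) -> Any:
    """Try the processor loader first; fall back to the tokenizer loader
    only for the 'unrecognized processing class' family of ValueErrors."""
    try:
        return load_processor(model_name_or_path, **kwargs)
    except ValueError as exc:
        if not any(marker in str(exc) for marker in FALLBACK_MARKERS):
            raise  # unrelated ValueError (e.g. a real config issue)
        return load_tokenizer(model_name_or_path, **kwargs)
```

In the real module the two callables would presumably be `AutoProcessor.from_pretrained` and `AutoTokenizer.from_pretrained`, with the result assigned to `self.processor`.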

Why this matters

The failure mode is common for community models that:

ship a custom Python tokenizer (auto_map -> AutoTokenizer) for things like morphological analysis or domain-specific preprocessing,
never define a multimodal processor.

Some examples in the wild:

U4RASD/NeoAraBERT_MSA and the NeoAraBERT family (Arabic, ACL 2026) — diacritics-aware morphological tokenizer.
Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (the test model used here).
Several other custom-tokenizer encoders (multiple Arabic, Persian, and code-domain models).

These models currently must be loaded manually via AutoModel + AutoTokenizer and a hand-rolled mean-pooling routine, which:

Excludes them from the entire SentenceTransformer ecosystem (MTEB CLI, EmbeddingSimilarityEvaluator, RAG frameworks like LangChain / LlamaIndex / Haystack that wrap SentenceTransformer).
Forces every downstream user to re-implement basic encoding correctly.

This 20-line fallback unlocks all of them.
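For reference, the hand-rolled mean-pooling step that the manual workaround requires looks roughly like the sketch below. It uses plain Python lists so it runs without any dependencies; a real workaround would operate on the model's output tensors instead, and the function name `mean_pool` is chosen only for this illustration.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
```

Forgetting the attention mask and averaging over padding tokens is an easy mistake here, which is part of why pushing this work onto every downstream user is undesirable.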

Test

Added tests/test_transformer_autotokenizer_fallback.py, an integration test that:

Loads Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (a public model whose custom tokenizer triggers the failure mode).
Asserts the resulting Transformer.tokenizer is a real PreTrainedTokenizerBase.
Encodes a Muradif-style synonym triplet and asserts cosine(anchor, synonym) > cosine(anchor, irrelevant).

The test is marked @pytest.mark.slow because it downloads weights. I suggest adding it to the slow-CI lane.

Reproduction (before this PR)

from sentence_transformers import SentenceTransformer
SentenceTransformer(
    "Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1",
    trust_remote_code=True,
)
# -> ValueError: Unrecognized processing class in <repo>. Can't instantiate a processor, a tokenizer, an image processor, a video processor or a feature extractor for this model.

After this PR

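from sentence_transformers import SentenceTransformer

m = SentenceTransformer(
    "Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1",
    trust_remote_code=True,
)
# Order: anchor ("Friday prayer in the mosque"), synonym ("prayer in the mosque"),
# irrelevant ("swimming in the sea").
e = m.encode(["صلاة الجمعة في المسجد", "الصلاة في الجامع", "السباحة في البحر"], normalize_embeddings=True)
# e.shape == (3, 768)
# cos(anchor, synonym)    = 0.872
# cos(anchor, irrelevant) = 0.330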
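Because the embeddings are L2-normalized, each cosine value reported above is simply a dot product between two rows. The sketch below shows that computation and the ordering check the test describes. The three vectors are made-up stand-ins, not outputs of the model, and the helper names are chosen for this example only.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


# Stand-in vectors (assumed for illustration, NOT real model embeddings).
anchor = l2_normalize([1.0, 0.2, 0.0])
synonym = l2_normalize([0.9, 0.4, 0.1])
irrelevant = l2_normalize([0.1, 0.2, 1.0])

# The same ordering the integration test asserts on the real embeddings.
assert dot(anchor, synonym) > dot(anchor, irrelevant)
```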

Risk

Minimal. The fallback is triggered only when AutoProcessor already raises one of the specific ValueErrors described above, so models that load today are not affected. The tokenizer property already handles the case where the processor is an instance of PreTrainedTokenizerBase, so no changes are required elsewhere.

…ognize the model

Restores SentenceTransformer support for text-only models whose tokenizer is registered via auto_map -> AutoTokenizer (typically with trust_remote_code=True) but that don't ship a processor_config.json / preprocessor_config.json / image preprocessor entry.

Before this change, such models fail at load time with:

ValueError: Unrecognized processing class in <repo>.
Can't instantiate a processor, a tokenizer, an image processor,
a video processor or a feature extractor for this model.
Make sure the repository contains the files of at least one of those processing classes.

even though the same model loads fine via ``transformers.AutoTokenizer.from_pretrained(..., trust_remote_code=True)``.

The Transformer module's downstream code already handles the case where self.processor IS itself a PreTrainedTokenizerBase (the ``tokenizer`` property explicitly checks for this). We therefore safely fall back to AutoTokenizer when AutoProcessor rejects the repo, preserving full backward compatibility.

This unlocks the entire SentenceTransformer ecosystem (MTEB CLI, EmbeddingSimilarityEvaluator, RAG frameworks like LangChain / LlamaIndex / Haystack) for community models that ship a custom Python tokenizer (e.g. the NeoAraBERT family for Arabic, accepted at ACL 2026, and several other multilingual / domain-specific encoders).

Tested with Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (custom Arabic morphological tokenizer registered via auto_map). Loading via SentenceTransformer(..., trust_remote_code=True) now succeeds and produces correct embeddings (anchor-vs-synonym cosine 0.872, anchor-vs-irrelevant cosine 0.330 on a Muradif-style triplet).

tomaarsen · 2026-05-11T08:12:22Z

Hello!

I do want to get this fixed, but I think perhaps this is best suited in transformers instead. For context, transformers does try AutoTokenizer after finding that there's no actual 'processor', and it passes along trust_remote_code etc., but it seems that an exception gets thrown for AutoTokenizer.from_pretrained(...), which is then swallowed and we instead get the generic error that you're seeing. Here's the section:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/auto/processing_auto.py#L443-L451

My guess is that the error is that xformers isn't installed. Perhaps a neat solution is expanding that except Exception: continue to re-raise if the "Dependency is not installed" error is thrown?

Tom Aarsen
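The swallowed-exception behavior and the proposed re-raise can be illustrated with a generic sketch. This is not the transformers source: `MissingDependencyError`, `first_loadable`, and the `reraise_missing_dependency` flag are hypothetical names invented for this example, and the loaders are stand-in callables.

```python
from typing import Any, Callable, Iterable


class MissingDependencyError(ImportError):
    """Hypothetical stand-in for a 'dependency is not installed' error."""


def first_loadable(
    loaders: Iterable[Callable[[str], Any]],
    name: str,
    reraise_missing_dependency: bool = False,
) -> Any:
    """Return the result of the first loader that succeeds.

    By default every failure is swallowed and only a generic error surfaces.
    With reraise_missing_dependency=True, a missing-dependency failure
    propagates so the user sees the real cause.
    """
    for load in loaders:
        try:
            return load(name)
        except MissingDependencyError:
            if reraise_missing_dependency:
                raise
            continue
        except Exception:
            continue  # swallowed: the original cause is lost
    raise ValueError(f"Unrecognized processing class in {name}.")
```

With the flag off, a tokenizer loader that fails on a missing package produces only the generic "Unrecognized processing class" message; with it on, the missing-dependency error reaches the caller directly.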

Omartificial-Intelligence-Space and others added 2 commits May 10, 2026 11:38

Merge branch 'main' into fix/autoprocessor-autotokenizer-fallback

f60f51b

omarnj-lab wants to merge 2 commits into huggingface:main from omarnj-lab:fix/autoprocessor-autotokenizer-fallback
