Transformer: fall back to AutoTokenizer when AutoProcessor cannot recognize the model #3762
Open
omarnj-lab wants to merge 2 commits into
Restores SentenceTransformer support for text-only models whose tokenizer is
registered via auto_map -> AutoTokenizer (typically with trust_remote_code=True)
but that don't ship a processor_config.json / preprocessor_config.json / image
preprocessor entry.
Before this change, such models fail at load time with:
    ValueError: Unrecognized processing class in <repo>.
    Can't instantiate a processor, a tokenizer, an image processor, a video
    processor or a feature extractor for this model. Make sure the repository
    contains the files of at least one of those processing classes.
even though the same model loads fine via
``transformers.AutoTokenizer.from_pretrained(..., trust_remote_code=True)``.
The Transformer module's downstream code already handles the case where
self.processor IS itself a PreTrainedTokenizerBase (the ``tokenizer`` property
explicitly checks for this). We therefore safely fall back to AutoTokenizer
when AutoProcessor rejects the repo, preserving full backward compatibility.
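The fallback described above can be sketched roughly as follows. This is not the code in the PR: the helper name `load_processor_with_fallback`, the `FALLBACK_MARKERS` tuple, and the idea of passing the two loaders in as callables are assumptions made so the sketch runs without downloading a model. The message fragments are the ones quoted in this PR and may differ between transformers versions.

```python
from typing import Any, Callable

# Message fragments quoted in this PR; the exact wording raised by
# transformers may differ between versions (assumption).
FALLBACK_MARKERS = ("Unrecognized processing class", "does not contain")

Loader = Callable[..., Any]


def load_processor_with_fallback(
    model_name: str,
    load_processor: Loader,
    load_tokenizer: Loader,
    **kwargs: Any,
) -> Any:
    """Try the processor loader first; use the tokenizer loader only when
    the ValueError looks like "no processing class in this repo"."""
    try:
        return load_processor(model_name, **kwargs)
    except ValueError as exc:
        if not any(marker in str(exc) for marker in FALLBACK_MARKERS):
            raise  # unrelated ValueError, e.g. a real config problem
        return load_tokenizer(model_name, **kwargs)
```

With transformers installed, the two loaders would presumably be `AutoProcessor.from_pretrained` and `AutoTokenizer.from_pretrained`, with `trust_remote_code=True` forwarded through `kwargs`.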
This unlocks the entire SentenceTransformer ecosystem (MTEB CLI,
EmbeddingSimilarityEvaluator, RAG frameworks like LangChain / LlamaIndex /
Haystack) for community models that ship a custom Python tokenizer (e.g.
the NeoAraBERT family for Arabic, accepted at ACL 2026, and several other
multilingual / domain-specific encoders).
Tested with Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1
(custom Arabic morphological tokenizer registered via auto_map). Loading via
SentenceTransformer(..., trust_remote_code=True) now succeeds and produces
correct embeddings (anchor-vs-synonym cosine 0.872, anchor-vs-irrelevant
cosine 0.330 on a Muradif-style triplet).
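A triplet check of this kind can be sketched as below. The function names and the toy vectors are illustrative assumptions, not the embeddings or the script behind the numbers above; in a real run the three vectors would come from `model.encode(...)`.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def triplet_is_ordered(
    anchor: Sequence[float],
    synonym: Sequence[float],
    irrelevant: Sequence[float],
    margin: float = 0.0,
) -> bool:
    """True when the anchor is closer to the synonym than to the
    irrelevant sentence by more than `margin`."""
    gap = cosine_similarity(anchor, synonym) - cosine_similarity(anchor, irrelevant)
    return gap > margin


# Toy 2-d vectors standing in for real sentence embeddings.
ok = triplet_is_ordered([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
```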
Member
Hello! I do want to get this fixed, but I think perhaps this is best suited in My guess is that the error is that
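Fall back to AutoTokenizer when AutoProcessor cannot recognize the model
Summary
Restores SentenceTransformer(...) support for text-only models whose tokenizer is registered via auto_map -> AutoTokenizer (typically with trust_remote_code=True) but that don't ship a processor_config.json / preprocessor_config.json / image preprocessor entry.
Before this PR, such models fail at load time with:
    ValueError: Unrecognized processing class in <repo>.
    Can't instantiate a processor, a tokenizer, an image processor, a video
    processor or a feature extractor for this model. Make sure the repository
    contains the files of at least one of those processing classes.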
even though the same model loads fine via transformers.AutoTokenizer.from_pretrained(..., trust_remote_code=True).
What changed
sentence_transformers/base/modules/transformer.py: when AutoProcessor.from_pretrained raises ValueError("Unrecognized processing class …") (or the equivalent "does not contain …" form), fall back to AutoTokenizer.from_pretrained(...) and assign the result to self.processor. The existing tokenizer property already handles the case where self.processor is itself a PreTrainedTokenizerBase, so no further plumbing is required.
Backward-compatible: if AutoProcessor succeeds, behavior is unchanged. Other ValueErrors (e.g. real config issues) are re-raised as before.
Why this matters
The failure mode is common for community models that:
- … auto_map -> AutoTokenizer) for things like morphological analysis or domain-specific preprocessing,
Some examples in the wild:
- U4RASD/NeoAraBERT_MSA and the NeoAraBERT family (Arabic, ACL 2026): diacritics-aware morphological tokenizer.
- Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (the test model used here).
These models currently must be loaded manually via AutoModel + AutoTokenizer and a hand-rolled mean-pooling routine, which:
- … EmbeddingSimilarityEvaluator, RAG frameworks like LangChain / LlamaIndex / Haystack that wrap SentenceTransformer).
This 20-line fallback unlocks all of them.
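For context, the hand-rolled mean-pooling step mentioned above usually amounts to averaging the token embeddings that the attention mask marks as real tokens. A minimal sketch on plain Python lists for a single sentence follows; the function name `mean_pool` is an assumption, and real code would typically do the same thing on the model's output tensors.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the embeddings of tokens whose mask entry is non-zero,
    so padding positions do not dilute the sentence vector."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token embedding")
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Two real tokens and one padding token (mask 0) that must be ignored.
sentence_vector = mean_pool(
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [1, 1, 0],
)
```

Every user of such a model has to carry a routine like this around, which is the duplication the fallback is meant to remove.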
Test
Added tests/test_transformer_autotokenizer_fallback.py, an integration test that:
- … Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (a public model whose custom tokenizer triggers the failure mode).
- … Transformer.tokenizer is a real PreTrainedTokenizerBase.
Marked @pytest.mark.slow since it downloads weights. Suggesting it be added to the slow-CI lane.
Reproduction (before this PR)
After this PR
Risk
Minimal. The fallback is only triggered when AutoProcessor already raises a specific ValueError, so models that load today are not affected. The tokenizer property already handles the case where the processor is an instance of PreTrainedTokenizerBase, so no changes are required elsewhere.
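To illustrate why no further changes are needed, here is a rough sketch of a tokenizer property that accepts either a bare tokenizer or a processor wrapping one. It is not the library's actual code: `FakeTokenizerBase`, `FakeProcessor`, and `TransformerSketch` are hypothetical stand-ins, and the assumption that a processor exposes its tokenizer as a `.tokenizer` attribute is mine.

```python
class FakeTokenizerBase:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""


class FakeProcessor:
    """Hypothetical stand-in for a processor that wraps a tokenizer."""

    def __init__(self, tokenizer: FakeTokenizerBase) -> None:
        self.tokenizer = tokenizer


class TransformerSketch:
    """Hypothetical module that stores whatever the loader returned."""

    def __init__(self, processor: object) -> None:
        self.processor = processor

    @property
    def tokenizer(self):
        # Case 1: the fallback stored a bare tokenizer as the processor.
        if isinstance(self.processor, FakeTokenizerBase):
            return self.processor
        # Case 2: a real processor; assume it exposes `.tokenizer`.
        return getattr(self.processor, "tokenizer", None)


bare = FakeTokenizerBase()
from_fallback = TransformerSketch(bare).tokenizer
from_processor = TransformerSketch(FakeProcessor(bare)).tokenizer
```

Both paths hand back the same tokenizer object, so downstream code that only touches the property would not need to know which loader succeeded.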