Transformer: fall back to AutoTokenizer when AutoProcessor cannot recognize the model #3762

Open
omarnj-lab wants to merge 2 commits into huggingface:main from omarnj-lab:fix/autoprocessor-autotokenizer-fallback

@omarnj-lab

Fall back to AutoTokenizer when AutoProcessor cannot recognize the model

Summary

Restores SentenceTransformer(...) support for text-only models whose tokenizer is registered via auto_map -> AutoTokenizer (typically with trust_remote_code=True) but that don't ship a processor_config.json / preprocessor_config.json / image preprocessor entry.

Before this PR, such models fail at load time with:

ValueError: Unrecognized processing class in <repo>.
Can't instantiate a processor, a tokenizer, an image processor,
a video processor or a feature extractor for this model.
Make sure the repository contains the files of at least one of those processing classes.

even though the same model loads fine via transformers.AutoTokenizer.from_pretrained(..., trust_remote_code=True).

What changed

sentence_transformers/base/modules/transformer.py — when AutoProcessor.from_pretrained raises ValueError("Unrecognized processing class …") (or the equivalent "does not contain …" form), fall back to AutoTokenizer.from_pretrained(...) and assign the result to self.processor. The existing tokenizer property already handles the case where self.processor is itself a PreTrainedTokenizerBase, so no further plumbing is required.

Backward-compatible: if AutoProcessor succeeds, behavior is unchanged. Other ValueErrors (e.g. real config issues) are re-raised as before.
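The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `load_processor_with_fallback`, its parameters, and the `_FALLBACK_MARKERS` tuple are hypothetical names, and the loaders are passed in as callables so the snippet runs without downloading anything.

```python
from typing import Any, Callable

# Message fragments treated as "AutoProcessor could not recognize the repo".
# Assumption: the exact strings matched by the PR may differ from these.
_FALLBACK_MARKERS = ("Unrecognized processing class", "does not contain")


def load_processor_with_fallback(
    load_processor: Callable[..., Any],
    load_tokenizer: Callable[..., Any],
    model_name: str,
    **kwargs: Any,
) -> Any:
    """Try the processor loader first; fall back to the tokenizer loader
    only for the 'unrecognized processing class' family of ValueErrors."""
    try:
        return load_processor(model_name, **kwargs)
    except ValueError as err:
        if not any(marker in str(err) for marker in _FALLBACK_MARKERS):
            raise  # unrelated ValueError (e.g. a real config issue)
        return load_tokenizer(model_name, **kwargs)
```

In real use one would presumably pass `AutoProcessor.from_pretrained` and `AutoTokenizer.from_pretrained` as the two loaders and forward `trust_remote_code=True` through `**kwargs`.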

Why this matters

The failure mode is common for community models that:

  • ship a custom Python tokenizer (auto_map -> AutoTokenizer) for things like morphological analysis or domain-specific preprocessing,
  • never define a multimodal processor.

Some examples in the wild:

  • U4RASD/NeoAraBERT_MSA and the NeoAraBERT family (Arabic, ACL 2026) — diacritics-aware morphological tokenizer.
  • Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (the test model used here).
  • Several other custom-tokenizer encoders (multiple Arabic, Persian, and code-domain models).

These models currently must be loaded manually via AutoModel + AutoTokenizer and a hand-rolled mean-pooling routine, which:

  1. Excludes them from the entire SentenceTransformer ecosystem (MTEB CLI, EmbeddingSimilarityEvaluator, RAG frameworks like LangChain / LlamaIndex / Haystack that wrap SentenceTransformer).
  2. Forces every downstream user to re-implement basic encoding correctly.

This 20-line fallback unlocks all of them.
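For context, the hand-rolled mean-pooling routine mentioned above usually amounts to a mask-weighted average over token embeddings. The sketch below uses plain Python lists and a hypothetical `mean_pool` helper purely to show the arithmetic; real code would typically operate on tensors from `AutoModel` outputs.

```python
from typing import List, Sequence


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy input: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)  # padding row is ignored
```

Forgetting the mask (and averaging padding tokens in) is the kind of subtle mistake that re-implementing this downstream tends to invite.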

Test

Added tests/test_transformer_autotokenizer_fallback.py, an integration test that:

  1. Loads Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1 (a public model whose custom tokenizer triggers the failure mode).
  2. Asserts the resulting Transformer.tokenizer is a real PreTrainedTokenizerBase.
  3. Encodes a Muradif-style synonym triplet and asserts cosine(anchor, synonym) > cosine(anchor, irrelevant).

Marked @pytest.mark.slow since it downloads weights. I suggest adding it to the slow-CI lane.
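The ordering check in step 3 boils down to comparing two cosine similarities. A minimal sketch, using toy vectors rather than real model embeddings; `cosine` and `synonym_is_closer` are hypothetical helpers, not names from the test file:

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def synonym_is_closer(
    anchor: Sequence[float],
    synonym: Sequence[float],
    irrelevant: Sequence[float],
) -> bool:
    """True when the anchor is more similar to the synonym than to the distractor."""
    return cosine(anchor, synonym) > cosine(anchor, irrelevant)


# Toy stand-ins for the three sentence embeddings.
anchor = [1.0, 0.0]
synonym = [0.9, 0.1]
irrelevant = [0.0, 1.0]
result = synonym_is_closer(anchor, synonym, irrelevant)
```

A strict inequality is a deliberately loose check: it confirms the tokenizer and model are wired together correctly without pinning the test to exact similarity values.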

Reproduction (before this PR)

from sentence_transformers import SentenceTransformer
SentenceTransformer(
    "Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1",
    trust_remote_code=True,
)
# -> ValueError: Unrecognized processing class in <repo>. Can't instantiate a processor, a tokenizer, an image processor, a video processor or a feature extractor for this model.

After this PR

m = SentenceTransformer(
    "Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1",
    trust_remote_code=True,
)
e = m.encode(["صلاة الجمعة في المسجد", "الصلاة في الجامع", "السباحة في البحر"], normalize_embeddings=True)
# (3, 768)
# cos(anchor, synonym)    = 0.872
# cos(anchor, irrelevant) = 0.330

Risk

Minimal. The fallback is only triggered when AutoProcessor already raises a specific ValueError, so models that load today are not affected. The tokenizer property already handles the case where isinstance(self.processor, PreTrainedTokenizerBase) is true, so no changes are required elsewhere.

Omartificial-Intelligence-Space and others added 2 commits May 10, 2026 11:38
…ognize the model

Restores SentenceTransformer support for text-only models whose tokenizer is
registered via auto_map -> AutoTokenizer (typically with trust_remote_code=True)
but that don't ship a processor_config.json / preprocessor_config.json / image
preprocessor entry.

Before this change, such models fail at load time with:

    ValueError: Unrecognized processing class in <repo>.
    Can't instantiate a processor, a tokenizer, an image processor, a video
    processor or a feature extractor for this model. Make sure the repository
    contains the files of at least one of those processing classes.

even though the same model loads fine via
``transformers.AutoTokenizer.from_pretrained(..., trust_remote_code=True)``.

The Transformer module's downstream code already handles the case where
self.processor IS itself a PreTrainedTokenizerBase (the ``tokenizer`` property
explicitly checks for this). We therefore safely fall back to AutoTokenizer
when AutoProcessor rejects the repo, preserving full backward compatibility.

This unlocks the entire SentenceTransformer ecosystem (MTEB CLI,
EmbeddingSimilarityEvaluator, RAG frameworks like LangChain / LlamaIndex /
Haystack) for community models that ship a custom Python tokenizer (e.g.
the NeoAraBERT family for Arabic, accepted at ACL 2026, and several other
multilingual / domain-specific encoders).

Tested with Omartificial-Intelligence-Space/NeoAraBERT-MSA-Synonym-Matryoshka-V1
(custom Arabic morphological tokenizer registered via auto_map). Loading via
SentenceTransformer(..., trust_remote_code=True) now succeeds and produces
correct embeddings (anchor-vs-synonym cosine 0.872, anchor-vs-irrelevant
cosine 0.330 on a Muradif-style triplet).
@tomaarsen
Member

Hello!

I do want to get this fixed, but I think perhaps this is best suited in transformers instead. For context, transformers does try AutoTokenizer after finding that there's no actual 'processor', and it passes along trust_remote_code etc., but it seems that an exception gets thrown for AutoTokenizer.from_pretrained(...), which is then swallowed and we instead get the generic error that you're seeing. Here's the section:

https://github.com/huggingface/transformers/blob/main/src/transformers/models/auto/processing_auto.py#L443-L451

My guess is that the error is that xformers isn't installed. Perhaps a neat solution is expanding that except Exception: continue to re-raise if the "Dependency is not installed" error is thrown?

  • Tom Aarsen
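One way to picture the suggestion above is a loader loop that still skips generic failures but lets a missing-dependency error surface. This is only a sketch under stated assumptions: `first_loadable` is a hypothetical function, it does not reproduce the linked transformers code, and it assumes the dependency error is raised as an `ImportError`.

```python
from typing import Any, Callable, Iterable


def first_loadable(loaders: Iterable[Callable[[], Any]]) -> Any:
    """Return the result of the first loader that succeeds.

    Generic failures are skipped, but ImportError is re-raised so that a
    missing dependency is reported instead of a generic message.
    """
    for load in loaders:
        try:
            return load()
        except ImportError:
            raise  # assumption: missing dependencies surface as ImportError
        except Exception:
            continue  # any other failure: try the next candidate class
    raise ValueError("Unrecognized processing class")
```

With this shape, a tokenizer that fails because an optional package is absent would report that package by name rather than the "Unrecognized processing class" message.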
