Background
tests/conftest.py currently monkey-patches transformers.PreTrainedTokenizerBase._patch_mistral_regex to a no-op as a workaround for a confirmed transformers bug: on every tokenizer load with vocab_size > 100000 (e.g. our default intfloat/multilingual-e5-*), the method calls huggingface_hub.model_info() unconditionally. On CI, these calls exhaust the 1000-req/5-min HF rate limit, and subsequent requests fail with HTTP 429.
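For readers unfamiliar with the workaround, here is a minimal sketch of the idea, not the actual contents of tests/conftest.py. To stay runnable without transformers installed, it patches a hypothetical stand-in class (StandInTokenizerBase) instead of transformers.PreTrainedTokenizerBase. The helper name disable_mistral_regex_patch, the method signature, and the assumption that the method returns the tokenizer unchanged are all illustrative.

```python
from typing import Any


class StandInTokenizerBase:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""

    hub_calls = 0  # counts simulated huggingface_hub.model_info() requests

    @classmethod
    def _patch_mistral_regex(cls, tokenizer: Any, *args: Any, **kwargs: Any) -> Any:
        # Simulates the buggy behaviour: one Hub request per tokenizer load.
        cls.hub_calls += 1
        return tokenizer


def disable_mistral_regex_patch(tokenizer_base: type) -> None:
    """Replace _patch_mistral_regex with a pass-through that does no I/O."""

    def _noop(cls: type, tokenizer: Any, *args: Any, **kwargs: Any) -> Any:
        # Assumption: callers expect the tokenizer object back unchanged.
        return tokenizer

    tokenizer_base._patch_mistral_regex = classmethod(_noop)


# After patching, a "load" no longer triggers the simulated Hub request.
disable_mistral_regex_patch(StandInTokenizerBase)
result = StandInTokenizerBase._patch_mistral_regex("tok", "some/model-id")
```

In a real test suite the same replacement would typically be applied once per session, for example from a pytest fixture, so that it is undone or isolated from production code paths.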
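Upstream issues / PRs documenting the bug:
- huggingface/transformers#45545: … local_files_only=True
- huggingface/transformers#45444: … fix] Always early return for non-Mistral models in _patch_mistral_regex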
The #45444 merge commit message explicitly says:
faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly... use 5.0.0 as 5.0.0rc0 introduced the fix... Exclusively check for mistral BEFORE v5
So:
- transformers 4.57.x has the bug — we're on 4.57.6, confirmed buggy
- transformers 5.0.0+ has the fix — latest is 5.10.2 (released yesterday)
- The fix was not backported to 4.x — it's "v5-only" by design
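The version reasoning above can be expressed as a small predicate, for instance to guard the workaround during a transition period. This is a sketch under the assumption, taken from the list above, that every 5.x release (including 5.0.0rc0) carries the fix and no 4.x release does. The function names major_version and has_upstream_fix are hypothetical.

```python
import re


def major_version(version: str) -> int:
    """Extract the leading major number from a version string like '5.0.0rc0'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1))


def has_upstream_fix(version: str) -> bool:
    """True if this transformers version is assumed to contain the v5-only fix."""
    return major_version(version) >= 5
```

In practice the installed version string could be obtained with importlib.metadata.version("transformers") and passed to has_upstream_fix to decide whether the monkey-patch is still needed.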
Task
Upgrade transformers to >=5.0 in pyproject.toml, then delete _disable_transformers_mistral_regex_patch() from tests/conftest.py.
Risk
transformers 5.0 is a major version bump (Mistral 4 support, removed deprecated APIs, etc.). It will likely surface breaking changes elsewhere in the codebase, which would need to be addressed separately.