Upgrade transformers to 5.x (drops conftest mistral monkey-patch) #295

Description

@voorhs

Background

tests/conftest.py currently monkey-patches transformers.PreTrainedTokenizerBase._patch_mistral_regex to a no-op. This works around a confirmed transformers bug: on every tokenizer load with vocab_size > 100000 (e.g. our default intfloat/multilingual-e5-*), the method calls huggingface_hub.model_info() unconditionally. On CI this quickly exhausts the Hugging Face rate limit of 1000 requests per 5 minutes, after which requests fail with HTTP 429.
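For readers unfamiliar with the workaround, here is a minimal sketch of the monkey-patch technique, not the actual contents of tests/conftest.py. FakeTokenizerBase is a hypothetical stand-in for transformers.PreTrainedTokenizerBase so the example runs without transformers installed. The sketch also assumes the real method is a classmethod that returns the tokenizer it receives, which should be checked against the installed version.

```python
from typing import Any


class FakeTokenizerBase:
    """Hypothetical stand-in for transformers.PreTrainedTokenizerBase."""

    network_calls = 0

    @classmethod
    def _patch_mistral_regex(cls, tokenizer: Any, *args: Any, **kwargs: Any) -> Any:
        # Simulates the unconditional Hub lookup described above.
        cls.network_calls += 1
        return tokenizer


def disable_mistral_regex_patch(base: type) -> bool:
    """Swap base._patch_mistral_regex for a pass-through; return True if patched."""
    if not hasattr(base, "_patch_mistral_regex"):
        return False  # nothing to patch, e.g. the method no longer exists

    def _noop(cls: type, tokenizer: Any, *args: Any, **kwargs: Any) -> Any:
        # Assumption: the original returns the tokenizer it was given.
        return tokenizer

    setattr(base, "_patch_mistral_regex", classmethod(_noop))
    return True
```

After disable_mistral_regex_patch(FakeTokenizerBase), calling the method returns the tokenizer unchanged and the simulated network counter no longer increases. The hasattr guard keeps the helper harmless if the method is absent.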

Upstream issues / PRs documenting the bug:

The #45444 merge commit message explicitly says:

faulty mistral tokenizers saved with version 4.57.{3,4,5,6} will load incorrectly... use 5.0.0 as 5.0.0rc0 introduced the fix... Exclusively check for mistral BEFORE v5

So:

  • transformers 4.57.x has the bug — we're on 4.57.6, confirmed buggy
  • transformers 5.0.0+ has the fix — latest is 5.10.2 (released yesterday)
  • The fix was not backported to 4.x — it's "v5-only" by design
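The version split above can be expressed as a small check, for example to gate the workaround during a transition period. This is only a sketch: has_mistral_regex_fix is a hypothetical helper name, and it assumes the major version number alone decides whether the fix is present, as the list above states.

```python
def has_mistral_regex_fix(transformers_version: str) -> bool:
    """Return True if the given transformers version string is 5.x or later."""
    # Only the major component matters here: per the issue, 4.57.x is buggy
    # and the fix first shipped in 5.0.0rc0 without a 4.x backport.
    major = int(transformers_version.split(".", 1)[0])
    return major >= 5
```

In a test suite, the version string could come from importlib.metadata.version("transformers"), so the no-op patch is applied only while the installed release is still 4.x.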

Task

Upgrade transformers to >=5.0 in pyproject.toml, then delete _disable_transformers_mistral_regex_patch() from tests/conftest.py.

Risk

transformers 5.0 is a major version bump (Mistral 4 support, removed deprecated APIs, etc.). It will likely surface breaking changes elsewhere in the codebase, which would need to be addressed separately.


Labels: enhancement
