AutoTokenizer resolves to GPT2Tokenizer in transformers v5, producing different token IDs for granite-4.0-micro #947

@kndtran

Summary

In transformers v5, AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro") resolves to GPT2Tokenizer, which constructs BPE from vocab.json/merges.txt. In v4, it resolved to GPT2TokenizerFast, which loaded the pre-built tokenizer.json. The two file sources produce different token IDs for strings containing numbers and punctuation.

This is a silent regression: the v5 GPT2Tokenizer inherits from TokenizersBackend (Rust-based), so tokenizer.is_fast returns True and the class appears to work correctly. But the token IDs differ from what the LoRA adapters were trained with, causing downstream accuracy regressions.

Ask: regression test in transformers

A tokenizer regression test should be added to transformers to catch future changes that alter token IDs for existing models. The test should:

  1. Assert AutoTokenizer.from_pretrained() returns a PreTrainedTokenizerFast instance
  2. Assert tokenizer.encode() produces known-good token IDs for strings with numbers and punctuation (the divergent cases)
  3. Run without GPU — tokenizer-only, no model loading

This would have caught the v5 regression immediately in CI instead of after a full eval cycle.
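
A minimal sketch of such a test (pytest-style; the expected IDs are taken from the v4.57.6 column of the table below, and the probe set and test name are illustrative):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Known-good token IDs produced by transformers v4.57.6 (tokenizer.json path).
# Probe strings are a subset of the divergent cases tabulated below.
KNOWN_GOOD = {
    "2023": [2366, 18],
    "d.o.o": [67, 14778, 14778],
    "FY2023": [82029, 2366, 18],
    "ref#2847": [1116, 2, 17058, 22],
    "Hello world": [9906, 1917],
}

def test_granite_4_0_micro_token_ids():
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
    # 1. AutoTokenizer must resolve to a fast tokenizer backed by tokenizer.json.
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    # 2. Numeric/punctuation-heavy strings must keep their known-good IDs.
    for text, expected in KNOWN_GOOD.items():
        assert tokenizer.encode(text) == expected, text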

Token ID divergence

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
| Input | v4.57.6 (tokenizer.json) | v5.4.0 (vocab.json/merges.txt) | v5.7.0 (unchanged) |
| --- | --- | --- | --- |
| "2023" | [2366, 18] | [508, 1419] | [508, 1419] |
| "650841823" | [13655, 25496, 23848] | [13655, 5833, 972, 1419] | [13655, 5833, 972, 1419] |
| "409473852" | [12378, 21505, 24571] | [12378, 2618, 1987, 4103] | [12378, 2618, 1987, 4103] |
| "914588298" | [24579, 20691, 17690] | [24, 9591, 2421, 17690] | [24, 9591, 2421, 17690] |
| "60-138-3818" | [1399, 12, 10350, 12, 19162, 23] | [1399, 12, 10350, 12, 1987, 972] | [1399, 12, 10350, 12, 1987, 972] |
| "65-005-6716" | [2397, 12, 8504, 12, 23403, 21] | [2397, 12, 8504, 12, 3080, 845] | [2397, 12, 8504, 12, 3080, 845] |
| "d.o.o" | [67, 14778, 14778] | [67, 13, 78, 13, 78] | [67, 13, 78, 13, 78] |
| "D&B Score" | [35, 49339, 18607] | [35, 5, 33, 18607] | [35, 5, 33, 18607] |
| "corp.net" | [81827, 5181] | [81827, 13, 4816] | [81827, 13, 4816] |
| "FY2020" | [82029, 2366, 15] | [82029, 508, 508] | [82029, 508, 508] |
| "FY2023" | [82029, 2366, 18] | [82029, 508, 1419] | [82029, 508, 1419] |
| "Q3 2024" | [48, 18, 220, 2366, 19] | [48, 18, 220, 508, 1187] | [48, 18, 220, 508, 1187] |
| "H1 2025" | [39, 16, 220, 2366, 20] | [39, 16, 220, 508, 914] | [39, 16, 220, 508, 914] |
| "Broadcom in 2023" | [69424, 884, 304, 220, 2366, 18] | [69424, 884, 304, 220, 508, 1419] | [69424, 884, 304, 220, 508, 1419] |
| "Maruti Enterprises in 2022" | [12331, 32973, 67056, 304, 220, 2366, 17] | [12331, 32973, 67056, 304, 220, 508, 1313] | [12331, 32973, 67056, 304, 220, 508, 1313] |
| "spend in 2023" | [2203, 408, 304, 220, 2366, 18] | [2203, 408, 304, 220, 508, 1419] | [2203, 408, 304, 220, 508, 1419] |
| "NAICS 541512" | [7476, 19645, 220, 22058, 8358] | [7476, 19645, 220, 4370, 868, 717] | [7476, 19645, 220, 4370, 868, 717] |
| "IMAGINE d.o.o" | [1829, 1929, 4069, 294, 14778, 14778] | [1829, 1929, 4069, 294, 13, 78, 13, 78] | [1829, 1929, 4069, 294, 13, 78, 13, 78] |
| "ISO 9001:2015" | [25141, 220, 7467, 16, 25, 679, 20] | [25141, 220, 24, 4119, 25, 679, 20] | [25141, 220, 24, 4119, 25, 679, 20] |
| "ref#2847" | [1116, 2, 17058, 22] | [1116, 2, 1591, 2618] | [1116, 2, 1591, 2618] |
| "Hello world" | [9906, 1917] | [9906, 1917] (match) | [9906, 1917] (match) |

The regression introduced in v5.4.0 persists in v5.7.0 — all 20 divergent cases remain unfixed.
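
Each column can be reproduced by running a small probe script under the corresponding transformers install and diffing the output; a minimal sketch (the probe list is abbreviated here, the full set is the Input column above):

import transformers
from transformers import AutoTokenizer

# Run once per install (4.57.6, 5.4.0, 5.7.0) and compare the printed IDs.
PROBES = ["2023", "d.o.o", "FY2023", "Q3 2024", "ref#2847", "Hello world"]

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
print(transformers.__version__, type(tokenizer).__name__)
for text in PROBES:
    print(repr(text), tokenizer.encode(text))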

Why this happens

Transformers v4.57.6: AutoTokenizer resolves to GPT2TokenizerFast, which loads tokenizer.json:

tokenization_gpt2_fast.py L24:

VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt", "tokenizer_file": "tokenizer.json"}

The tokenizer_file is passed to PreTrainedTokenizerFast.__init__(), which loads the complete tokenizer pipeline from tokenizer.json.

Transformers v5.4.0+ (including v5.7.0): tokenization_gpt2_fast.py is removed. AutoTokenizer resolves to GPT2Tokenizer, which constructs BPE from vocab.json/merges.txt:

tokenization_gpt2.py L25-28:

VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}

The BPE model is constructed inline at L54-63:

self._tokenizer = Tokenizer(
    BPE(
        vocab=self._vocab,
        merges=self._merges,
        dropout=None,
        continuing_subword_prefix="",
        end_of_word_suffix="",
        fuse_unk=False,
    )
)

No tokenizer_file/tokenizer.json reference exists in the v5 file. For the original openai-community/gpt2 model, vocab.json/merges.txt and tokenizer.json are consistent, so both paths produce the same IDs. For granite-4.0-micro, whose tokenizer.json encodes a different BPE configuration, they diverge.
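
Under a v5 install the two loading paths can be compared side by side; a short sketch (the expected outputs are the values from the table above):

from transformers import AutoTokenizer, PreTrainedTokenizerFast

repo = "ibm-granite/granite-4.0-micro"
auto_tok = AutoTokenizer.from_pretrained(repo)            # resolves to GPT2Tokenizer in v5
fast_tok = PreTrainedTokenizerFast.from_pretrained(repo)  # loads tokenizer.json

print(type(auto_tok).__name__, auto_tok.is_fast)  # GPT2Tokenizer True -- looks "fast"
print(auto_tok.encode("2023"))   # [508, 1419]  from the vocab.json/merges.txt BPE
print(fast_tok.encode("2023"))   # [2366, 18]   from tokenizer.json, matches v4.57.6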

Impact on downstream tasks

The LoRA adapters (ibm-granite/granitelib-rag-r1.0) were trained with tokenizer.json token IDs. When inference uses different IDs, the adapter's learned weights are applied to the wrong token positions. Three of six RAG intrinsics regressed:

| Intrinsic | Drop | Why |
| --- | --- | --- |
| Context Relevance | −24.1pp (92.1% → 68.0%) | Short prompts with diverse document vocabulary; many tokens hit divergent BPE merges. The model predicted "partially relevant" instead of "relevant" for inputs that should have been clear matches. |
| Query Rewrite | −8.4pp (86.9% → 78.5%) | The model must reproduce input text verbatim in the rewritten query. Divergent token IDs cause character-level corruption in the output (e.g. "2023" → "20 23", "d.o.o" → "d.oo"). |
| Hallucination Detection | −1.5pp (88.3% → 86.8%) | Long prompts where most tokens are common English words; fewer tokens are affected relative to total prompt length. |

The other three intrinsics (answerability, clarification, citations) were unaffected because their eval scripts already used PreTrainedTokenizerFast or their prompts did not exercise the divergent BPE merges.

Workaround

Use PreTrainedTokenizerFast directly, which loads tokenizer.json:

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("ibm-granite/granite-4.0-micro")
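
To keep eval scripts from silently drifting again, the encoding can be checked against a known-good value from the table above, e.g.:

# Sanity check: the workaround should reproduce the v4.57.6 token IDs.
assert tokenizer.encode("2023") == [2366, 18]
assert tokenizer.encode("d.o.o") == [67, 14778, 14778]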

Environment

  • Model: ibm-granite/granite-4.0-micro
  • Transformers v4 baseline: 4.57.6 (uses GPT2TokenizerFast + tokenizer.json)
  • Transformers v5 regression: 5.4.0 through 5.7.0 (uses GPT2Tokenizer + vocab.json/merges.txt)
  • Python: 3.12
