Summary
In transformers v5, AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro") resolves to GPT2Tokenizer, which constructs BPE from vocab.json/merges.txt. In v4, it resolved to GPT2TokenizerFast, which loaded the pre-built tokenizer.json. The two file sources produce different token IDs for strings containing numbers and punctuation.
This is a silent regression — the v5 GPT2Tokenizer inherits from TokenizersBackend (Rust-based), so tokenizer.is_fast returns True and the class appears to work correctly. But the token IDs differ from what LoRA adapters were trained with, causing downstream accuracy regressions.
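As an illustration of why the change is easy to miss, a naive fast-tokenizer check passes under both versions (illustrative sketch using the model ID from this report):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
# Passes in both v4 and v5, even though the two versions emit different
# token IDs for the inputs listed in the divergence table below.
assert tok.is_fast
```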
Ask: regression test in transformers
A tokenizer regression test should be added to transformers to catch future changes that alter token IDs for existing models. The test should:
- Assert AutoTokenizer.from_pretrained() returns a PreTrainedTokenizerFast instance
- Assert tokenizer.encode() produces known-good token IDs for strings with numbers and punctuation (the divergent cases)
- Run without GPU — tokenizer-only, no model loading
This would have caught the v5 regression immediately in CI instead of after a full eval cycle.
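A minimal sketch of such a test, assuming pytest and Hub (or cache) access; the file and test names are illustrative, and the expected IDs are copied from the v4.57.6 column of the table below (the real test should pin the full divergent set):

```python
# test_granite_tokenization.py -- illustrative sketch, not the existing transformers test suite
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Known-good IDs from the v4.57.6 (tokenizer.json) path.
EXPECTED_IDS = {
    "2023": [2366, 18],
    "d.o.o": [67, 14778, 14778],
    "ref#2847": [1116, 2, 17058, 22],
}


def test_granite_micro_tokenizer_ids():
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
    # AutoTokenizer must resolve to a fast tokenizer built from tokenizer.json.
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    for text, expected in EXPECTED_IDS.items():
        # add_special_tokens=False compares raw BPE output, matching the table below.
        assert tokenizer.encode(text, add_special_tokens=False) == expected
```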
Token ID divergence
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
```
| Input | v4.57.6 (tokenizer.json) | v5.4.0 (vocab.json/merges.txt) | v5.7.0 (unchanged) |
|---|---|---|---|
| "2023" | [2366, 18] | [508, 1419] | [508, 1419] |
| "650841823" | [13655, 25496, 23848] | [13655, 5833, 972, 1419] | [13655, 5833, 972, 1419] |
| "409473852" | [12378, 21505, 24571] | [12378, 2618, 1987, 4103] | [12378, 2618, 1987, 4103] |
| "914588298" | [24579, 20691, 17690] | [24, 9591, 2421, 17690] | [24, 9591, 2421, 17690] |
| "60-138-3818" | [1399, 12, 10350, 12, 19162, 23] | [1399, 12, 10350, 12, 1987, 972] | [1399, 12, 10350, 12, 1987, 972] |
| "65-005-6716" | [2397, 12, 8504, 12, 23403, 21] | [2397, 12, 8504, 12, 3080, 845] | [2397, 12, 8504, 12, 3080, 845] |
| "d.o.o" | [67, 14778, 14778] | [67, 13, 78, 13, 78] | [67, 13, 78, 13, 78] |
| "D&B Score" | [35, 49339, 18607] | [35, 5, 33, 18607] | [35, 5, 33, 18607] |
| "corp.net" | [81827, 5181] | [81827, 13, 4816] | [81827, 13, 4816] |
| "FY2020" | [82029, 2366, 15] | [82029, 508, 508] | [82029, 508, 508] |
| "FY2023" | [82029, 2366, 18] | [82029, 508, 1419] | [82029, 508, 1419] |
| "Q3 2024" | [48, 18, 220, 2366, 19] | [48, 18, 220, 508, 1187] | [48, 18, 220, 508, 1187] |
| "H1 2025" | [39, 16, 220, 2366, 20] | [39, 16, 220, 508, 914] | [39, 16, 220, 508, 914] |
| "Broadcom in 2023" | [69424, 884, 304, 220, 2366, 18] | [69424, 884, 304, 220, 508, 1419] | [69424, 884, 304, 220, 508, 1419] |
| "Maruti Enterprises in 2022" | [12331, 32973, 67056, 304, 220, 2366, 17] | [12331, 32973, 67056, 304, 220, 508, 1313] | [12331, 32973, 67056, 304, 220, 508, 1313] |
| "spend in 2023" | [2203, 408, 304, 220, 2366, 18] | [2203, 408, 304, 220, 508, 1419] | [2203, 408, 304, 220, 508, 1419] |
| "NAICS 541512" | [7476, 19645, 220, 22058, 8358] | [7476, 19645, 220, 4370, 868, 717] | [7476, 19645, 220, 4370, 868, 717] |
| "IMAGINE d.o.o" | [1829, 1929, 4069, 294, 14778, 14778] | [1829, 1929, 4069, 294, 13, 78, 13, 78] | [1829, 1929, 4069, 294, 13, 78, 13, 78] |
| "ISO 9001:2015" | [25141, 220, 7467, 16, 25, 679, 20] | [25141, 220, 24, 4119, 25, 679, 20] | [25141, 220, 24, 4119, 25, 679, 20] |
| "ref#2847" | [1116, 2, 17058, 22] | [1116, 2, 1591, 2618] | [1116, 2, 1591, 2618] |
| "Hello world" | [9906, 1917] | [9906, 1917] (match) | [9906, 1917] (match) |
The regression introduced in v5.4.0 persists in v5.7.0 — all 20 divergent cases remain unfixed.
Why this happens
Transformers v4.57.6 — AutoTokenizer resolves to GPT2TokenizerFast, which loads tokenizer.json:
tokenization_gpt2_fast.py L24:

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt", "tokenizer_file": "tokenizer.json"}
```
The tokenizer_file is passed to PreTrainedTokenizerFast.__init__(), which loads the complete tokenizer pipeline from tokenizer.json.
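Conceptually, the v4 fast path reduces to loading the serialized pipeline directly (simplified sketch, not the actual transformers internals; the file path stands in for the resolved Hub cache path):

```python
from tokenizers import Tokenizer

# tokenizer.json carries the complete pipeline: normalizer, pre-tokenizer,
# BPE model (vocab + ranked merges), and post-processor.
backend = Tokenizer.from_file("tokenizer.json")
print(backend.encode("FY2023").ids)
```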
Transformers v5.4.0+ (including v5.7.0) — tokenization_gpt2_fast.py is removed. AutoTokenizer resolves to GPT2Tokenizer, which constructs BPE from vocab.json/merges.txt:
tokenization_gpt2.py L25-28:

```python
VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}
```
The BPE model is constructed inline at L54-63:

```python
self._tokenizer = Tokenizer(
    BPE(
        vocab=self._vocab,
        merges=self._merges,
        dropout=None,
        continuing_subword_prefix="",
        end_of_word_suffix="",
        fuse_unk=False,
    )
)
```
No tokenizer_file/tokenizer.json reference exists in the v5 file. For the original openai-community/gpt2 model, vocab.json/merges.txt and tokenizer.json are consistent, so both paths produce the same IDs. For granite-4.0-micro, whose tokenizer.json encodes a different BPE configuration, they diverge.
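The divergence can be observed directly under v5 by loading the model both ways (illustrative check; assumes Hub or cache access):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

model_id = "ibm-granite/granite-4.0-micro"
auto_tok = AutoTokenizer.from_pretrained(model_id)            # v5: GPT2Tokenizer, vocab.json/merges.txt
fast_tok = PreTrainedTokenizerFast.from_pretrained(model_id)  # tokenizer.json

for text in ["2023", "d.o.o", "FY2023", "Hello world"]:
    a = auto_tok.encode(text, add_special_tokens=False)
    b = fast_tok.encode(text, add_special_tokens=False)
    print(f"{text!r}: auto={a} fast={b} {'MATCH' if a == b else 'DIVERGE'}")
```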
Impact on downstream tasks
The LoRA adapters (ibm-granite/granitelib-rag-r1.0) were trained with tokenizer.json token IDs. When inference uses different IDs, the adapter's learned weights are applied to the wrong token positions. Three of six RAG intrinsics regressed:
| Intrinsic | Drop | Why |
|---|---|---|
| Context Relevance | −24.1pp (92.1% → 68.0%) | Short prompts with diverse document vocabulary — many tokens hit divergent BPE merges. The model predicted "partially relevant" instead of "relevant" for inputs that should have been clear matches. |
| Query Rewrite | −8.4pp (86.9% → 78.5%) | The model must reproduce input text verbatim in the rewritten query. Divergent token IDs cause character-level corruption in the output (e.g. "2023" → "20 23", "d.o.o" → "d.oo"). |
| Hallucination Detection | −1.5pp (88.3% → 86.8%) | Long prompts where most tokens are common English words — fewer tokens affected relative to total prompt length. |
The other three intrinsics (answerability, clarification, citations) were unaffected because their eval scripts already used PreTrainedTokenizerFast or their prompts did not exercise the divergent BPE merges.
Workaround
Use PreTrainedTokenizerFast directly, which loads tokenizer.json:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("ibm-granite/granite-4.0-micro")
```
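Continuing from the snippet above, a quick spot-check against the v4.57.6 column of the divergence table confirms the IDs (illustrative; add_special_tokens=False matches how the table values appear to have been produced):

```python
assert tokenizer.encode("2023", add_special_tokens=False) == [2366, 18]
assert tokenizer.encode("d.o.o", add_special_tokens=False) == [67, 14778, 14778]
```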
Environment
- Model: ibm-granite/granite-4.0-micro
- Transformers v4 baseline: 4.57.6 (uses GPT2TokenizerFast + tokenizer.json)
- Transformers v5 regression: 5.4.0 through 5.7.0 (uses GPT2Tokenizer + vocab.json/merges.txt)
- Python: 3.12