Summary
In transformers v5, AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro") resolves to GPT2Tokenizer, which constructs BPE from vocab.json/merges.txt. In v4, it resolved to GPT2TokenizerFast, which loaded the pre-built tokenizer.json. The two file sources produce different token IDs for strings containing numbers and punctuation.
This is a silent regression — the v5 GPT2Tokenizer inherits from TokenizersBackend (Rust-based), so tokenizer.is_fast returns True and the class appears to work correctly. But the token IDs differ from what LoRA adapters were trained with, causing downstream accuracy regressions.
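As an illustration of why the change is easy to miss, a naive fast-tokenizer check passes under both versions (illustrative sketch using the model ID from this report):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
# Passes in both v4 and v5, even though the two versions emit different
# token IDs for the inputs listed in the divergence table below.
assert tok.is_fast
```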
Ask: regression test in transformers
A tokenizer regression test should be added to transformers to catch future changes that alter token IDs for existing models. The test should:
- Assert AutoTokenizer.from_pretrained() returns a PreTrainedTokenizerFast instance
- Assert tokenizer.encode() produces known-good token IDs for strings with numbers and punctuation (the divergent cases)
- Run without GPU — tokenizer-only, no model loading
This would have caught the v5 regression immediately in CI instead of after a full eval cycle.
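A minimal sketch of such a test, assuming pytest and Hub (or cache) access; the file and test names are illustrative, and the expected IDs are copied from the v4.57.6 column of the table below (the real test should pin the full divergent set):

```python
# test_granite_tokenization.py -- illustrative sketch, not the existing transformers test suite
from transformers import AutoTokenizer, PreTrainedTokenizerFast

# Known-good IDs from the v4.57.6 (tokenizer.json) path.
EXPECTED_IDS = {
    "2023": [2366, 18],
    "d.o.o": [67, 14778, 14778],
    "ref#2847": [1116, 2, 17058, 22],
}


def test_granite_micro_tokenizer_ids():
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
    # AutoTokenizer must resolve to a fast tokenizer built from tokenizer.json.
    assert isinstance(tokenizer, PreTrainedTokenizerFast)
    for text, expected in EXPECTED_IDS.items():
        # add_special_tokens=False compares raw BPE output, matching the table below.
        assert tokenizer.encode(text, add_special_tokens=False) == expected
```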
Token ID divergence
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-4.0-micro")
```
| Input | v4.57.6 (tokenizer.json) | v5.4.0 (vocab.json/merges.txt) | v5.7.0 (unchanged) |
|---|---|---|---|
| "2023" | [2366, 18] | [508, 1419] | [508, 1419] |
| "650841823" | [13655, 25496, 23848] | [13655, 5833, 972, 1419] | [13655, 5833, 972, 1419] |
| "409473852" | [12378, 21505, 24571] | [12378, 2618, 1987, 4103] | [12378, 2618, 1987, 4103] |
| "914588298" | [24579, 20691, 17690] | [24, 9591, 2421, 17690] | [24, 9591, 2421, 17690] |
| "60-138-3818" | [1399, 12, 10350, 12, 19162, 23] | [1399, 12, 10350, 12, 1987, 972] | [1399, 12, 10350, 12, 1987, 972] |
| "65-005-6716" | [2397, 12, 8504, 12, 23403, 21] | [2397, 12, 8504, 12, 3080, 845] | [2397, 12, 8504, 12, 3080, 845] |
| "d.o.o" | [67, 14778, 14778] | [67, 13, 78, 13, 78] | [67, 13, 78, 13, 78] |
| "D&B Score" | [35, 49339, 18607] | [35, 5, 33, 18607] | [35, 5, 33, 18607] |
| "corp.net" | [81827, 5181] | [81827, 13, 4816] | [81827, 13, 4816] |
| "FY2020" | [82029, 2366, 15] | [82029, 508, 508] | [82029, 508, 508] |
| "FY2023" | [82029, 2366, 18] | [82029, 508, 1419] | [82029, 508, 1419] |
| "Q3 2024" | [48, 18, 220, 2366, 19] | [48, 18, 220, 508, 1187] | [48, 18, 220, 508, 1187] |
| "H1 2025" | [39, 16, 220, 2366, 20] | [39, 16, 220, 508, 914] | [39, 16, 220, 508, 914] |
| "Broadcom in 2023" | [69424, 884, 304, 220, 2366, 18] | [69424, 884, 304, 220, 508, 1419] | [69424, 884, 304, 220, 508, 1419] |
| "Maruti Enterprises in 2022" | [12331, 32973, 67056, 304, 220, 2366, 17] | [12331, 32973, 67056, 304, 220, 508, 1313] | [12331, 32973, 67056, 304, 220, 508, 1313] |
| "spend in 2023" | [2203, 408, 304, 220, 2366, 18] | [2203, 408, 304, 220, 508, 1419] | [2203, 408, 304, 220, 508, 1419] |
| "NAICS 541512" | [7476, 19645, 220, 22058, 8358] | [7476, 19645, 220, 4370, 868, 717] | [7476, 19645, 220, 4370, 868, 717] |
| "IMAGINE d.o.o" | [1829, 1929, 4069, 294, 14778, 14778] | [1829, 1929, 4069, 294, 13, 78, 13, 78] | [1829, 1929, 4069, 294, 13, 78, 13, 78] |
| "ISO 9001:2015" | [25141, 220, 7467, 16, 25, 679, 20] | [25141, 220, 24, 4119, 25, 679, 20] | [25141, 220, 24, 4119, 25, 679, 20] |
| "ref#2847" | [1116, 2, 17058, 22] | [1116, 2, 1591, 2618] | [1116, 2, 1591, 2618] |
| "Hello world" | [9906, 1917] | [9906, 1917] (match) | [9906, 1917] (match) |
The regression introduced in v5.4.0 persists in v5.7.0 — all 20 divergent cases remain unfixed.
Why this happens
Transformers v4.57.6 — AutoTokenizer resolves to GPT2TokenizerFast, which loads tokenizer.json:
tokenization_gpt2_fast.py L24:

```python
VOCAB_FILES_NAMES = {"vocab_file": "vocab.json", "merges_file": "merges.txt", "tokenizer_file": "tokenizer.json"}
```
The tokenizer_file is passed to PreTrainedTokenizerFast.__init__(), which loads the complete tokenizer pipeline from tokenizer.json.
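Conceptually, the v4 fast path reduces to loading the serialized pipeline directly (simplified sketch, not the actual transformers internals; the file path stands in for the resolved Hub cache path):

```python
from tokenizers import Tokenizer

# tokenizer.json carries the complete pipeline: normalizer, pre-tokenizer,
# BPE model (vocab + ranked merges), and post-processor.
backend = Tokenizer.from_file("tokenizer.json")
print(backend.encode("FY2023").ids)
```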
Transformers v5.4.0+ (including v5.7.0) — tokenization_gpt2_fast.py is removed. AutoTokenizer resolves to GPT2Tokenizer, which constructs BPE from vocab.json/merges.txt:
tokenization_gpt2.py L25-28:

```python
VOCAB_FILES_NAMES = {
    "vocab_file": "vocab.json",
    "merges_file": "merges.txt",
}
```
The BPE model is constructed inline at L54-63:

```python
self._tokenizer = Tokenizer(
    BPE(
        vocab=self._vocab,
        merges=self._merges,
        dropout=None,
        continuing_subword_prefix="",
        end_of_word_suffix="",
        fuse_unk=False,
    )
)
```
No tokenizer_file/tokenizer.json reference exists in the v5 file. For the original openai-community/gpt2 model, vocab.json/merges.txt and tokenizer.json are consistent, so both paths produce the same IDs. For granite-4.0-micro, whose tokenizer.json encodes a different BPE configuration, they diverge.
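The divergence can be observed directly under v5 by loading the model both ways (illustrative check; assumes Hub or cache access):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

model_id = "ibm-granite/granite-4.0-micro"
auto_tok = AutoTokenizer.from_pretrained(model_id)            # v5: GPT2Tokenizer, vocab.json/merges.txt
fast_tok = PreTrainedTokenizerFast.from_pretrained(model_id)  # tokenizer.json

for text in ["2023", "d.o.o", "FY2023", "Hello world"]:
    a = auto_tok.encode(text, add_special_tokens=False)
    b = fast_tok.encode(text, add_special_tokens=False)
    print(f"{text!r}: auto={a} fast={b} {'MATCH' if a == b else 'DIVERGE'}")
```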
Impact on downstream tasks
The LoRA adapters (ibm-granite/granitelib-rag-r1.0) were trained with tokenizer.json token IDs. When inference uses different IDs, the adapter's learned weights are applied to the wrong token positions. Three of six RAG intrinsics regressed:
| Intrinsic | Drop | Why |
|---|---|---|
| Context Relevance | −24.1pp (92.1% → 68.0%) | Short prompts with diverse document vocabulary — many tokens hit divergent BPE merges. The model predicted "partially relevant" instead of "relevant" for inputs that should have been clear matches. |
| Query Rewrite | −8.4pp (86.9% → 78.5%) | The model must reproduce input text verbatim in the rewritten query. Divergent token IDs cause character-level corruption in the output (e.g. "2023" → "20 23", "d.o.o" → "d.oo"). |
| Hallucination Detection | −1.5pp (88.3% → 86.8%) | Long prompts where most tokens are common English words — fewer tokens affected relative to total prompt length. |
The other three intrinsics (answerability, clarification, citations) were unaffected because their eval scripts already used PreTrainedTokenizerFast or their prompts did not exercise the divergent BPE merges.
Workaround
Use PreTrainedTokenizerFast directly, which loads tokenizer.json:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("ibm-granite/granite-4.0-micro")
```
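Continuing from the snippet above, a quick spot-check against the v4.57.6 column of the divergence table confirms the IDs (illustrative; add_special_tokens=False matches how the table values appear to have been produced):

```python
assert tokenizer.encode("2023", add_special_tokens=False) == [2366, 18]
assert tokenizer.encode("d.o.o", add_special_tokens=False) == [67, 14778, 14778]
```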
Environment
- Model: ibm-granite/granite-4.0-micro
- Transformers v4 baseline: 4.57.6 (uses GPT2TokenizerFast + tokenizer.json)
- Transformers v5 regression: 5.4.0 through 5.7.0 (uses GPT2Tokenizer + vocab.json/merges.txt)
- Python: 3.12