
test: add tokenizer regression test for granite-4.0-micro #948

Draft
kndtran wants to merge 2 commits into generative-computing:main from kndtran:test/tokenizer-regression-947

Conversation


kndtran commented Apr 27, 2026

Misc PR

Type of PR

  • Bug Fix
  • New Feature
  • Documentation
  • Other

Description

  • Add regression test for granite-4.0-micro tokenizer stability
  • Asserts golden token IDs, tokenizer class, and roundtrip decode
  • Catches transformers upgrades that change AutoTokenizer resolution (e.g. v5 returning GPT2Tokenizer instead of GPT2TokenizerFast)
  • 26 tests, runs in ~13s, no GPU needed
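The assertions in the second bullet look roughly like this. A minimal sketch, assuming a hypothetical model id, snippet, and token IDs (the real golden data lives in the test file):

```python
import transformers

MODEL_ID = "ibm-granite/granite-4.0-micro"  # assumed model id
GOLDEN = {"FY2024 revenue": [1234, 5678]}   # placeholder golden token IDs

def test_tokenizer_stability():
    tok = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
    # Guard against AutoTokenizer resolving to a different class
    # (e.g. transformers v5 returning GPT2Tokenizer).
    assert type(tok).__name__ == "GPT2TokenizerFast"
    for text, ids in GOLDEN.items():
        encoded = tok.encode(text, add_special_tokens=False)
        assert encoded == ids               # golden token IDs
        assert tok.decode(encoded) == text  # roundtrip decode
```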

Testing

  • Tests added to the respective file if code was changed
  • New code has 100% coverage if code was added
  • Ensure existing tests and GitHub automation pass (a maintainer will kick off the GitHub automation when the rest of the PR is populated)

Additional

  • uv run pytest test/formatters/granite/test_tokenizer_regression.py -v — 26 passed
  • uv run ruff check and uv run ruff format --check — clean

Attribution

  • AI coding assistants used: Claude Code Opus 4.6

…-computing#947)

Assert golden token IDs and tokenizer class to catch transformers
upgrades that change AutoTokenizer resolution for granite-4.0-micro.
Motivated by transformers v5 returning GPT2Tokenizer (constructs BPE
from vocab.json/merges.txt) instead of GPT2TokenizerFast (loads
tokenizer.json), which produced different token IDs and broke 3/6 RAG
intrinsics.

Closes generative-computing#947

Assisted-by: Claude Code
Signed-off-by: Khoi-Nguyen Tran <kndtran@ibm.com>
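The slow/fast split described in this commit message can be checked directly. A rough sketch, assuming the model repo ships both the legacy vocab.json/merges.txt and the serialized tokenizer.json (MODEL_ID is an assumed identifier; the IDs only diverge under transformers v5):

```python
from transformers import GPT2Tokenizer, GPT2TokenizerFast

MODEL_ID = "ibm-granite/granite-4.0-micro"  # assumed model id

# Slow tokenizer: reconstructs BPE from vocab.json + merges.txt.
slow = GPT2Tokenizer.from_pretrained(MODEL_ID)
# Fast tokenizer: loads the serialized tokenizer.json.
fast = GPT2TokenizerFast.from_pretrained(MODEL_ID)

text = "FY2024"
print(type(slow).__name__, slow.encode(text))
print(type(fast).__name__, fast.encode(text))  # diverging IDs reproduce the issue
```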
kndtran requested a review from a team as a code owner April 27, 2026 21:30
@psschwei
Member

original body:

Summary

  • Add regression test for granite-4.0-micro tokenizer stability
  • Asserts golden token IDs, tokenizer class, and roundtrip decode
  • Catches transformers upgrades that change AutoTokenizer resolution (e.g. v5 returning GPT2Tokenizer instead of GPT2TokenizerFast)
  • 26 tests, runs in ~13s, no GPU needed

Closes #947

Test plan

  • uv run pytest test/formatters/granite/test_tokenizer_regression.py -v — 26 passed
  • uv run ruff check and uv run ruff format --check — clean

@github-actions
Contributor

The PR description has been updated. Please fill out the template for your PR to be reviewed.

kndtran marked this pull request as draft April 28, 2026 03:16
…enerative-computing#947)

Replace full-query golden entries and intrinsic output labels with 20
diverse corrupted snippets extracted from context_relevance and
query_rewrite eval data. Coverage now spans pure numbers, DUNS,
company suffixes, fiscal year compounds, industry codes, standards,
and mixed text+number phrases — all confirmed to diverge under tf5.

Assisted-by: Claude Code
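The golden-entry shape this commit describes, as a rough sketch (the snippet strings and token IDs are illustrative placeholders, not the actual eval data; the tokenizer fixture is the module-scoped one shown in the review thread below):

```python
import pytest

# Categories from the commit message; strings and IDs are placeholders.
GOLDEN_SNIPPETS = [
    ("123456789", [101, 102, 103]),  # pure number / DUNS-style identifier
    ("Acme Corp.", [110, 111]),      # company suffix
    ("FY2024", [120, 121]),          # fiscal year compound
    ("NAICS 541511", [130, 131]),    # industry code
    ("ISO 9001", [140, 141]),        # standard
]

@pytest.mark.parametrize("text,golden_ids", GOLDEN_SNIPPETS)
def test_golden_token_ids(tokenizer, text, golden_ids):
    assert tokenizer.encode(text, add_special_tokens=False) == golden_ids
```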
kndtran marked this pull request as ready for review April 28, 2026 18:28
@kndtran
Author

kndtran commented Apr 28, 2026

@psschwei I updated the PR template with the details.

@frreiss
Collaborator

frreiss commented Apr 28, 2026

@kndtran I recommend that we come up with an explanation for this problem that makes sense before checking a workaround into Mellea.

The current analysis from Claude produces more questions than answers. Specific important questions:

  • Why did the Hugging Face team design AutoTokenizer.from_pretrained(), the primary and recommended code path for initializing a tokenizer, such that it returns an incorrectly-configured tokenizer for Granite 4.0?
  • Why did the Hugging Face team design PreTrainedTokenizerFast.from_pretrained(), the primary and recommended code path for initializing a Rust tokenizer, such that it returns a different tokenizer than AutoTokenizer.from_pretrained()? The release notes say that there is exactly one type of tokenizer.
  • Does this problem affect other models? If so, why is there not a major blocker issue raised against Transformers?
  • If this problem does not affect other models, what is special about Granite that causes this problem to affect Granite? Is there a regression in the Granite code or configuration that needs to be fixed?

@kndtran
Author

kndtran commented Apr 28, 2026

@frreiss I got Claude to refine the details and evidence in the referenced issue #947. This PR is just a simple regression test for the expected input IDs, specifically for our intrinsics use case, so there are no workarounds yet (only suggested ones), and it's not a general test in Mellea.

Let me try to address those questions in the referenced issue. The release notes are light on the details of the tokenizer consolidation, so I (with Claude's help) had to dig through the diff changes in the HF code.
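For reference, a minimal way to reproduce the divergence between the two entry points (a sketch; MODEL_ID is an assumed identifier, and the classes only differ under transformers v5):

```python
from transformers import AutoTokenizer, PreTrainedTokenizerFast

MODEL_ID = "ibm-granite/granite-4.0-micro"  # assumed model id

auto = AutoTokenizer.from_pretrained(MODEL_ID)
fast = PreTrainedTokenizerFast.from_pretrained(MODEL_ID)

text = "FY2024"
print(type(auto).__name__, auto.encode(text))
print(type(fast).__name__, fast.encode(text))  # differing IDs confirm the report
```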


```python
@pytest.fixture(scope="module")
def tokenizer():
    return transformers.AutoTokenizer.from_pretrained(MODEL_ID)
```
Collaborator


This line makes no sense to me. Why are we testing the correctness of loading the tokenizer the wrong way? Per the comment above, this line should be calling PreTrainedTokenizerFast.from_pretrained(). Why do these tests not fail?

Author


Mellea is still currently on transformers < 5:

"transformers>=4.53.2,<5",

This test will fail on migration to transformers v5 and will need to be updated with the workaround, if that's the choice, or with an official fix in transformers.
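If gating is the chosen route, one option is a version-based skip. A sketch using packaging (not something already in this PR):

```python
import pytest
import transformers
from packaging.version import Version

# Skip golden-ID assertions once transformers v5 changes tokenizer resolution.
requires_tf4 = pytest.mark.skipif(
    Version(transformers.__version__) >= Version("5.0.0"),
    reason="golden token IDs were captured under transformers v4",
)
```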

Contributor


@kndtran, I haven't been following this thread, but we are trying to move to transformers v5. It looks like our most immediate blocker has been fixed but I don't have an ETA for you: #418

@kndtran
Author

kndtran commented Apr 28, 2026

Hmm OK, a workaround will have to be implemented if Mellea migrates to transformers v5, as this test will fail. Let me mark this as a draft for now and work on a deeper explanation of this problem in the referenced issue.

Contributor

planetf1 left a comment


Just added a note on test gating -- the rest will wait for a v5 upgrade



Development

Successfully merging this pull request may close these issues.

AutoTokenizer resolves to GPT2Tokenizer in transformers v5, producing different token IDs for granite-4.0-micro
