Add FORBIDDEN_TOKENS to Gemma4Tokenizer covering image + audio placeholders #649
Open
dparikh79 wants to merge 1 commit into
…olders `Gemma4Tokenizer` did not define `FORBIDDEN_TOKENS`, so it inherited the base class default of `()`. The sampler at `gemma/gm/text/_sampler.py:501` adds `self.tokenizer.FORBIDDEN_TOKENS` to the per-call forbidden set; for Gemma 4 that meant nothing was added, and text-only sampling could emit raw multimodal placeholder tokens (`<|image|>`, `<start_of_image>`, `<image|>`, `<|audio|>`, `<start_of_audio>`, `<audio|>`), producing corrupted text-only output. Mirror the existing `Gemma3Tokenizer` / `Gemma3nTokenizer` pattern on the new class, but include all six distinct ids because Gemma 4 assigns different token ids to each placeholder (Gemma 3 reuses `IMAGE_PLACEHOLDER == START_OF_IMAGE == 255999`, so listing both there is redundant; in Gemma 4 all six are distinct per `_Gemma4SpecialTokens` and all six must be forbidden). Regression test added in `_tokenizer_test.py` asserts that every one of the six multimodal ids appears in `Gemma4Tokenizer.FORBIDDEN_TOKENS`. Fixes google-deepmind#613.
Summary
`Gemma4Tokenizer` did not define `FORBIDDEN_TOKENS`, so it inherited the base class default of `()`. The sampler at `gemma/gm/text/_sampler.py:501` adds `self.tokenizer.FORBIDDEN_TOKENS` to the per-call forbidden set. For Gemma 4 this added nothing, leaving the sampler free to emit raw multimodal placeholder tokens (`<|image|>`, `<start_of_image>`, `<image|>`, `<|audio|>`, `<start_of_audio>`, `<audio|>`) during text-only inference. The result is corrupted text-only output that includes raw placeholder ids the user never asked for.

Fix
Mirror the existing `Gemma3Tokenizer` / `Gemma3nTokenizer` pattern on the new class, but include all six distinct multimodal ids. Gemma 3 reuses `IMAGE_PLACEHOLDER == START_OF_IMAGE == 255999`, so listing both there is redundant. In Gemma 4 all six ids are distinct per `_Gemma4SpecialTokens` (verified directly in this file), so all six need to be forbidden to keep the sampler from emitting any of them in text-only mode.
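For readers unfamiliar with the mechanism, the following is a minimal, self-contained sketch of how a class-level `FORBIDDEN_TOKENS` tuple can keep a sampler from picking placeholder ids. It is not the code in `_sampler.py` or `_tokenizer.py`: the class names, the attribute names on the special-token class, the token ids, and the `sample_greedy` helper are all hypothetical and chosen only for illustration.

```python
import math


class SketchSpecialTokens:
  """Hypothetical stand-in for `_Gemma4SpecialTokens`; ids are made up."""
  IMAGE_PLACEHOLDER = 3
  START_OF_IMAGE = 4
  END_OF_IMAGE = 5
  AUDIO_PLACEHOLDER = 6
  START_OF_AUDIO = 7
  END_OF_AUDIO = 8


class SketchBaseTokenizer:
  # Default: nothing is forbidden, which is what an undefined
  # FORBIDDEN_TOKENS on a subclass falls back to.
  FORBIDDEN_TOKENS: tuple = ()


class SketchGemma4Tokenizer(SketchBaseTokenizer):
  special_tokens = SketchSpecialTokens
  # All six ids are distinct in this sketch, so all six are listed.
  FORBIDDEN_TOKENS = (
      special_tokens.IMAGE_PLACEHOLDER,
      special_tokens.START_OF_IMAGE,
      special_tokens.END_OF_IMAGE,
      special_tokens.AUDIO_PLACEHOLDER,
      special_tokens.START_OF_AUDIO,
      special_tokens.END_OF_AUDIO,
  )


def sample_greedy(tokenizer, logits):
  """Masks the tokenizer's forbidden ids, then returns the argmax id."""
  masked = list(logits)
  for token_id in tokenizer.FORBIDDEN_TOKENS:
    masked[token_id] = -math.inf  # A masked id can never win the argmax.
  return max(range(len(masked)), key=masked.__getitem__)


# Toy logits over a 10-token vocabulary; the placeholder ids score highest.
LOGITS = [0.1, 0.2, 0.3, 5.0, 4.0, 3.0, 2.5, 2.0, 1.5, 1.0]

unprotected = sample_greedy(SketchBaseTokenizer, LOGITS)   # picks id 3
protected = sample_greedy(SketchGemma4Tokenizer, LOGITS)   # picks id 9
```

With the empty default, the highest-scoring placeholder id is returned. With the six ids listed, the sampler falls through to the best ordinary token.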
Test Plan
- Regression test `test_gemma4_tokenizer_forbids_multimodal_placeholder_tokens` in `gemma/gm/text/_tokenizer_test.py`. It iterates over all six expected multimodal ids and asserts each is present in `Gemma4Tokenizer.FORBIDDEN_TOKENS`. This is a class-attribute test, so it does not need to load the actual SentencePiece model.
- The `FORBIDDEN_TOKENS` tuple contains exactly the six expected `special_tokens.*` entries.
- `Gemma3Tokenizer` / `Gemma3nTokenizer` are not affected (their `FORBIDDEN_TOKENS` is unchanged).

Fixes #613.
Credit to @ac12644 for the diagnosis and proposed fix in the issue.
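The class-attribute check described in the Test Plan could look roughly like the sketch below. It is an assumption-laden illustration rather than the test in `_tokenizer_test.py`: the stand-in classes, attribute names, and ids are hypothetical, and only the test name is taken from the PR. Because it reads a class attribute, nothing here loads a vocabulary file.

```python
class SketchSpecialTokens:
  """Hypothetical special-token ids; the real values live in the repo."""
  IMAGE_PLACEHOLDER = 11
  START_OF_IMAGE = 12
  END_OF_IMAGE = 13
  AUDIO_PLACEHOLDER = 14
  START_OF_AUDIO = 15
  END_OF_AUDIO = 16


class SketchGemma4Tokenizer:
  """Hypothetical stand-in for the tokenizer class under test."""
  special_tokens = SketchSpecialTokens
  FORBIDDEN_TOKENS = (
      special_tokens.IMAGE_PLACEHOLDER,
      special_tokens.START_OF_IMAGE,
      special_tokens.END_OF_IMAGE,
      special_tokens.AUDIO_PLACEHOLDER,
      special_tokens.START_OF_AUDIO,
      special_tokens.END_OF_AUDIO,
  )


def test_gemma4_tokenizer_forbids_multimodal_placeholder_tokens():
  tokens = SketchGemma4Tokenizer.special_tokens
  expected = (
      tokens.IMAGE_PLACEHOLDER,
      tokens.START_OF_IMAGE,
      tokens.END_OF_IMAGE,
      tokens.AUDIO_PLACEHOLDER,
      tokens.START_OF_AUDIO,
      tokens.END_OF_AUDIO,
  )
  forbidden = SketchGemma4Tokenizer.FORBIDDEN_TOKENS
  # Every multimodal id must be forbidden.
  for token_id in expected:
    assert token_id in forbidden, f"id {token_id} is not forbidden"
  # The tuple holds exactly the six distinct ids and nothing else.
  assert len(forbidden) == 6
  assert set(forbidden) == set(expected)
```

A pytest-style runner would collect the function by its `test_` prefix; calling it directly also works, and it raises `AssertionError` if any id is missing.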