Add FORBIDDEN_TOKENS to Gemma4Tokenizer covering image + audio placeholders #649

Open
dparikh79 wants to merge 1 commit into google-deepmind:main from dparikh79:fix/613-gemma4-tokenizer-forbidden-tokens

@dparikh79
Summary

Gemma4Tokenizer did not define FORBIDDEN_TOKENS, so it inherited the base class default of (). The sampler at gemma/gm/text/_sampler.py:501 does:

forbidden_tokens = self._normalize_tokens(self.forbidden_tokens)
forbidden_tokens += self.tokenizer.FORBIDDEN_TOKENS

For Gemma 4 this added nothing, leaving the sampler free to emit raw multimodal placeholder tokens (<|image|>, <|image>, <image|>, <|audio|>, <|audio>, <audio|>) during text-only inference. The result is corrupted text-only output containing raw placeholder ids that the user never asked for.
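To make the failure mode concrete, here is a minimal sketch of the merge described above. The classes and the `merged_forbidden_tokens` helper are hypothetical stand-ins, not the real gemma API; only the two-step merge and the `()` base-class default are taken from this description.

```python
# Hypothetical stand-ins for illustration; these are NOT the real gemma classes.
class BaseTokenizerSketch:
    # Base-class default described in the PR: nothing is forbidden.
    FORBIDDEN_TOKENS = ()


class Gemma4TokenizerBefore(BaseTokenizerSketch):
    # No override, so the empty tuple is inherited.
    pass


class Gemma4TokenizerAfter(BaseTokenizerSketch):
    # The six multimodal ids listed in this PR.
    FORBIDDEN_TOKENS = (258880, 255999, 258882, 258881, 256000, 258883)


def merged_forbidden_tokens(user_forbidden, tokenizer_cls):
    """Mimics the two sampler lines: user-supplied ids plus tokenizer defaults."""
    forbidden = tuple(user_forbidden)
    forbidden += tuple(tokenizer_cls.FORBIDDEN_TOKENS)
    return forbidden


# Text-only call where the user passes no forbidden tokens of their own.
before = merged_forbidden_tokens((), Gemma4TokenizerBefore)
after = merged_forbidden_tokens((), Gemma4TokenizerAfter)

assert before == ()     # nothing is masked, so placeholders can be sampled
assert len(after) == 6  # all six multimodal ids are excluded
```

With the inherited default, the merged set is empty unless the caller happens to forbid the placeholder ids explicitly, which is why the gap only shows up as stray tokens in the output.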

Fix

Mirror the existing Gemma3Tokenizer / Gemma3nTokenizer pattern on the new class, but include all six distinct multimodal ids. Gemma 3 reuses IMAGE_PLACEHOLDER == START_OF_IMAGE == 255999, so listing both there is redundant; in Gemma 4 all six are distinct per _Gemma4SpecialTokens (verified directly in this file):

IMAGE_PLACEHOLDER = 258880  # <|image|>
START_OF_IMAGE = 255999     # <|image>
END_OF_IMAGE = 258882       # <image|>
AUDIO_PLACEHOLDER = 258881  # <|audio|>
START_OF_AUDIO = 256000     # <|audio> (BOA)
END_OF_AUDIO = 258883       # <audio|> (EOA)

so all six need to be forbidden to keep the sampler from emitting any of them in text-only mode.
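The shape of the change can be sketched as follows. This is an illustration under stated assumptions, not the actual diff: `Gemma4SpecialTokensSketch`, `Gemma3SpecialTokensSketch`, and `Gemma4TokenizerSketch` are hypothetical names, and modelling the special tokens as an `enum.IntEnum` is an assumption. The ids are the ones listed above.

```python
import enum


class Gemma4SpecialTokensSketch(enum.IntEnum):
    """Hypothetical stand-in for _Gemma4SpecialTokens; ids copied from the PR."""
    IMAGE_PLACEHOLDER = 258880
    START_OF_IMAGE = 255999
    END_OF_IMAGE = 258882
    AUDIO_PLACEHOLDER = 258881
    START_OF_AUDIO = 256000
    END_OF_AUDIO = 258883


class Gemma3SpecialTokensSketch(enum.IntEnum):
    """Hypothetical Gemma 3 analogue: both names share one id, per the PR."""
    IMAGE_PLACEHOLDER = 255999
    START_OF_IMAGE = 255999  # same value, so IntEnum treats this as an alias


class Gemma4TokenizerSketch:
    """Hypothetical tokenizer showing only the class attribute the PR adds."""
    special_tokens = Gemma4SpecialTokensSketch

    FORBIDDEN_TOKENS = (
        special_tokens.IMAGE_PLACEHOLDER,
        special_tokens.START_OF_IMAGE,
        special_tokens.END_OF_IMAGE,
        special_tokens.AUDIO_PLACEHOLDER,
        special_tokens.START_OF_AUDIO,
        special_tokens.END_OF_AUDIO,
    )


# In Gemma 4 every entry is a distinct id, so none of the six is redundant.
assert len(set(Gemma4TokenizerSketch.FORBIDDEN_TOKENS)) == 6

# In the Gemma 3 analogue the two names collapse to a single id.
gemma3_ids = {
    Gemma3SpecialTokensSketch.IMAGE_PLACEHOLDER,
    Gemma3SpecialTokensSketch.START_OF_IMAGE,
}
assert len(gemma3_ids) == 1
```

The contrast between the two assertions is the reason the Gemma 3 pattern cannot be copied as-is: forbidding only the ids that Gemma 3 lists would leave several Gemma 4 placeholders unmasked.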

Test Plan

  • Added test_gemma4_tokenizer_forbids_multimodal_placeholder_tokens in gemma/gm/text/_tokenizer_test.py. It iterates over all six expected multimodal ids and asserts each is present in Gemma4Tokenizer.FORBIDDEN_TOKENS. This is a class-attribute test, so it does not need to load the actual SentencePiece model.
  • Verified via AST that the new FORBIDDEN_TOKENS tuple contains exactly the six expected special_tokens.* entries.
  • No regression for Gemma3Tokenizer / Gemma3nTokenizer (their FORBIDDEN_TOKENS is unchanged).
  • No em-dashes / no AI-tells / Google CLA already signed (covered via the mujoco / dm_control umbrella).
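
The AST check in the second bullet could be done along these lines. This is a sketch, not the script that was actually run: `SOURCE` is a hypothetical excerpt standing in for the tokenizer module (the real file layout may differ), and `forbidden_token_exprs` is an illustrative helper. It relies only on the standard-library `ast` module (`ast.unparse` needs Python 3.9+).

```python
import ast

# Hypothetical excerpt standing in for the tokenizer module; the real file
# may lay out the tuple differently.
SOURCE = '''
class Gemma4Tokenizer:
    FORBIDDEN_TOKENS = (
        special_tokens.IMAGE_PLACEHOLDER,
        special_tokens.START_OF_IMAGE,
        special_tokens.END_OF_IMAGE,
        special_tokens.AUDIO_PLACEHOLDER,
        special_tokens.START_OF_AUDIO,
        special_tokens.END_OF_AUDIO,
    )
'''

EXPECTED = {
    "special_tokens.IMAGE_PLACEHOLDER",
    "special_tokens.START_OF_IMAGE",
    "special_tokens.END_OF_IMAGE",
    "special_tokens.AUDIO_PLACEHOLDER",
    "special_tokens.START_OF_AUDIO",
    "special_tokens.END_OF_AUDIO",
}


def forbidden_token_exprs(source, class_name):
    """Returns the source text of each element of class_name.FORBIDDEN_TOKENS."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not (isinstance(node, ast.ClassDef) and node.name == class_name):
            continue
        for stmt in node.body:
            if not isinstance(stmt, ast.Assign):
                continue
            names = [t.id for t in stmt.targets if isinstance(t, ast.Name)]
            if "FORBIDDEN_TOKENS" in names and isinstance(stmt.value, ast.Tuple):
                # Unparse every element so unexpected entries are reported too.
                return [ast.unparse(elt) for elt in stmt.value.elts]
    return []


exprs = forbidden_token_exprs(SOURCE, "Gemma4Tokenizer")

# "Exactly six" means: no duplicates, nothing missing, nothing extra.
assert len(exprs) == 6
assert set(exprs) == EXPECTED
```

Because the check works on source text, it needs neither the SentencePiece model nor an importable gemma package, which matches the class-attribute nature of the unit test in the first bullet.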

Fixes #613.

Credit to @ac12644 for the diagnosis and proposed fix in the issue.

…olders

`Gemma4Tokenizer` did not define `FORBIDDEN_TOKENS`, so it inherited
the base class default of `()`. The sampler at
`gemma/gm/text/_sampler.py:501` adds
`self.tokenizer.FORBIDDEN_TOKENS` to the per-call forbidden set; for
Gemma 4 that meant nothing was added, and text-only sampling could
emit raw multimodal placeholder tokens (`<|image|>`,
`<|image>`, `<image|>`, `<|audio|>`, `<|audio>`,
`<audio|>`), producing corrupted text-only output.

Mirror the existing `Gemma3Tokenizer` / `Gemma3nTokenizer` pattern
on the new class, but include all six distinct ids because Gemma 4
assigns different token ids to each placeholder (Gemma 3 reuses
`IMAGE_PLACEHOLDER == START_OF_IMAGE == 255999`, so listing both
there is redundant; in Gemma 4 all six are distinct per
`_Gemma4SpecialTokens` and all six must be forbidden).

Regression test added in `_tokenizer_test.py` asserts that every one
of the six multimodal ids appears in
`Gemma4Tokenizer.FORBIDDEN_TOKENS`.

Fixes google-deepmind#613.
Development

Successfully merging this pull request may close these issues.

Bug: Gemma4Tokenizer missing FORBIDDEN_TOKENS — multimodal placeholder tokens can leak into text-only output
