Bug: Gemma4Tokenizer missing FORBIDDEN_TOKENS — multimodal placeholder tokens can leak into text-only output #613

@ac12644

Description


Gemma4Tokenizer does not define FORBIDDEN_TOKENS, unlike Gemma3Tokenizer and Gemma3nTokenizer, which both forbid multimodal placeholder tokens (START_OF_IMAGE and END_OF_IMAGE) from being generated during sampling.

As a result, when sampling with a Gemma 4 model in text-only mode, the sampler places no restriction on generating raw image/audio placeholder tokens (<|image|>, <|image>, <image|>, <|audio|>, <|audio>, <audio|>), any of which would corrupt the text output.

Comparison

# Gemma3Tokenizer (line ~440) — correctly forbids image tokens:
FORBIDDEN_TOKENS = (
    special_tokens.START_OF_IMAGE,
    special_tokens.END_OF_IMAGE,
)

# Gemma3nTokenizer (line ~465) — same:
FORBIDDEN_TOKENS = (
    special_tokens.START_OF_IMAGE,
    special_tokens.END_OF_IMAGE,
)

# Gemma4Tokenizer (line ~475) — MISSING, inherits empty tuple from base:
# (no FORBIDDEN_TOKENS defined)

How it's used

In gemma/gm/text/_sampler.py:501:

forbidden_tokens += self.tokenizer.FORBIDDEN_TOKENS

For Gemma4, this adds nothing, so multimodal tokens are never masked out.
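
To illustrate why an empty tuple matters, here is a minimal sketch of forbidden-token masking with greedy sampling. It is not the code in _sampler.py: the helper names `mask_forbidden` and `greedy_pick`, the toy logits, and the token ids are all hypothetical.

```python
import math


def mask_forbidden(logits, forbidden_ids):
    """Return a copy of `logits` with every forbidden token id set to -inf."""
    masked = list(logits)
    for token_id in forbidden_ids:
        masked[token_id] = -math.inf
    return masked


def greedy_pick(logits):
    """Return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy 4-token vocabulary. Id 2 stands in for an image placeholder token
# (hypothetical id, chosen only for illustration).
logits = [0.1, 1.5, 3.0, 0.7]

# Empty tuple, as inherited by Gemma4Tokenizer today: nothing is masked,
# so the placeholder token can win.
unmasked_choice = greedy_pick(mask_forbidden(logits, ()))

# With the placeholder id forbidden, the next-best token is picked instead.
masked_choice = greedy_pick(mask_forbidden(logits, (2,)))
```

In this toy setup the unmasked pick is the placeholder id, while the masked pick falls back to an ordinary text token.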

Proposed Fix

Add FORBIDDEN_TOKENS to Gemma4Tokenizer covering all multimodal placeholder tokens:

class Gemma4Tokenizer(Tokenizer):
  ...
  FORBIDDEN_TOKENS = (
      special_tokens.IMAGE_PLACEHOLDER,
      special_tokens.START_OF_IMAGE,
      special_tokens.END_OF_IMAGE,
      special_tokens.AUDIO_PLACEHOLDER,
      special_tokens.START_OF_AUDIO,
      special_tokens.END_OF_AUDIO,
  )
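
A regression check along the following lines could guard against the same omission in future tokenizers. This is only a sketch with stand-in classes: `SpecialTokens`, its ids, and the helper `missing_forbidden` are hypothetical and do not mirror the real definitions in gemma/gm/text/_tokenizer.py.

```python
import enum


class SpecialTokens(enum.IntEnum):
    # Hypothetical ids, for illustration only.
    IMAGE_PLACEHOLDER = 1
    START_OF_IMAGE = 2
    END_OF_IMAGE = 3
    AUDIO_PLACEHOLDER = 4
    START_OF_AUDIO = 5
    END_OF_AUDIO = 6


class Tokenizer:
    # Stand-in base class: forbids nothing by default.
    FORBIDDEN_TOKENS = ()


class Gemma4Tokenizer(Tokenizer):
    # Stand-in for the proposed fix.
    FORBIDDEN_TOKENS = (
        SpecialTokens.IMAGE_PLACEHOLDER,
        SpecialTokens.START_OF_IMAGE,
        SpecialTokens.END_OF_IMAGE,
        SpecialTokens.AUDIO_PLACEHOLDER,
        SpecialTokens.START_OF_AUDIO,
        SpecialTokens.END_OF_AUDIO,
    )


def missing_forbidden(tokenizer_cls, required):
    """Return the names of required tokens absent from FORBIDDEN_TOKENS."""
    forbidden = set(tokenizer_cls.FORBIDDEN_TOKENS)
    return sorted(tok.name for tok in required if tok not in forbidden)


required = tuple(SpecialTokens)
fixed_gaps = missing_forbidden(Gemma4Tokenizer, required)
unfixed_gaps = missing_forbidden(Tokenizer, required)
```

With the fix applied the first check reports no gaps, while a class that inherits the empty tuple reports all six tokens as missing.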

Location

  • File: gemma/gm/text/_tokenizer.py
  • Class: Gemma4Tokenizer (around line 475)
