
fix: Gemma 4 audio — mel preprocessing, weight loading, feature extractor#931

Merged
Blaizzy merged 5 commits into Blaizzy:main from stephencox-ict:fix/gemma4-audio on Apr 7, 2026

Conversation

@stephencox-ict
Contributor

Overview

Fix four bugs preventing Gemma 4 audio (E2B/E4B) from producing correct transcriptions. These were discovered while implementing and debugging the Gemma 4 audio conformer encoder for llama.cpp (ggml-org/llama.cpp#21421), where we traced tensors between PyTorch and C++ to identify divergences.

Fixes

1. Missing semicausal left-padding (audio_feature_extractor.py)

The HF reference prepends frame_length // 2 (160) zero samples before the unfold, centering the first frame at t=0. Without this:

  • The mel spectrogram is misaligned relative to what the model was trained on
  • The frame count comes out systematically lower, causing the token count mismatch that triggers the broadcast shapes error (Gemma4 Value Error - Broadcast Shapes #923)
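A minimal NumPy sketch of the padding step, to make the frame-count difference concrete. The helper name `frame_signal` and the use of `sliding_window_view` as the "unfold" are assumptions for illustration, not the code in audio_feature_extractor.py; the reference implementation's exact unfold size may differ.

```python
import numpy as np


def frame_signal(samples, frame_length=320, hop_length=160, semicausal_pad=True):
    """Split a 1-D waveform into overlapping frames (hypothetical helper)."""
    if semicausal_pad:
        # Prepend frame_length // 2 zeros so the first frame is centered at t=0.
        pad = np.zeros(frame_length // 2, dtype=samples.dtype)
        samples = np.concatenate([pad, samples])
    if len(samples) < frame_length:
        return np.zeros((0, frame_length), dtype=samples.dtype)
    # Stand-in for the unfold: every window of frame_length samples,
    # then keep one window per hop_length samples.
    windows = np.lib.stride_tricks.sliding_window_view(samples, frame_length)
    return windows[::hop_length]


# One second of audio at 16 kHz (values 1..16000 so zeros can only come from padding).
audio = np.arange(1, 16001, dtype=np.float32)
padded = frame_signal(audio, semicausal_pad=True)
unpadded = frame_signal(audio, semicausal_pad=False)
print(padded.shape[0], unpadded.shape[0])
```

With these assumed sizes the padded path yields one more frame than the unpadded path, which is the kind of off-by-one that shows up downstream as a token count mismatch.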

2. Wrong Hann window formula (audio_feature_extractor.py)

The window was computed with cos(2π(n+0.5)/N) instead of the periodic Hann term cos(2πn/N), i.e. w[n] = 0.5 − 0.5·cos(2πn/N). The +0.5 phase shift produces meaningfully different spectral values. Verified against the HF Transformers reference, which uses window_function(frame_length) = periodic Hann.
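The two window formulas can be compared directly. This is a sketch assuming the buggy window had the usual 0.5 − 0.5·cos(...) Hann form with the shifted argument; the function names are hypothetical and N = 320 is the 20 ms frame at 16 kHz.

```python
import numpy as np


def periodic_hann(n_points):
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / N)."""
    n = np.arange(n_points)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_points)


def shifted_hann(n_points):
    """Assumed form of the buggy window: same shape, argument shifted by +0.5."""
    n = np.arange(n_points)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * (n + 0.5) / n_points)


N = 320
good = periodic_hann(N)
bad = shifted_hann(N)
max_diff = float(np.max(np.abs(good - bad)))
print(f"w_periodic[0]={good[0]:.6f}  w_shifted[0]={bad[0]:.6f}  max|diff|={max_diff:.6f}")
```

The periodic window starts at exactly zero, while the shifted one does not, so every frame is weighted slightly differently before the FFT.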

3. sanitize() double-nests model. prefix (gemma4.py) — fixes #912

HF checkpoint keys like model.language_model.model.embed_tokens.weight become language_model.model.embed_tokens.weight once the outer model. prefix is stripped, which already matches the MLX attribute path. The unconditional .model. insertion then produced language_model.model.model.*, so all language model weights loaded as zero.
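The double nesting is easy to reproduce on the example key. Both functions below are hypothetical reconstructions of the behavior described above, not the actual sanitize() code in gemma4.py.

```python
OUTER_PREFIX = "model."


def strip_outer_prefix(key):
    """Drop the outer 'model.' prefix if present."""
    return key[len(OUTER_PREFIX):] if key.startswith(OUTER_PREFIX) else key


def remap_buggy(key):
    """Assumed old behavior: strip the prefix, then always insert '.model.'."""
    key = strip_outer_prefix(key)
    return key.replace("language_model.", "language_model.model.", 1)


def remap_fixed(key):
    """Assumed new behavior: the stripped key already matches the MLX path."""
    return strip_outer_prefix(key)


hf_key = "model.language_model.model.embed_tokens.weight"
print(remap_buggy(hf_key))
print(remap_fixed(hf_key))
```

A key that matches no module attribute is never assigned, which is consistent with the weights appearing as zeros rather than raising an error.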

4. Feature extractor not instantiated (processing_gemma4.py) — fixes #903

The feature extractor was only created when processor_config.json contained a "feature_extractor" key, which standard HF checkpoints don't include. It is now instantiated unconditionally with the USM defaults (the parameters are fixed for all Gemma 4 audio models: 16 kHz, 128 mel bins, 20 ms frame, 10 ms hop, HTK scale).
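A sketch of the fallback logic using only the parameters listed above. The class `AudioFeatureDefaults`, its field names, and `resolve_feature_config` are hypothetical; the real class in processing_gemma4.py has its own signature, and honoring overrides from the config is an assumption made here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFeatureDefaults:
    """Hypothetical container for the USM defaults named in the PR."""
    sampling_rate: int = 16_000
    num_mel_bins: int = 128
    frame_ms: int = 20
    hop_ms: int = 10
    mel_scale: str = "htk"

    @property
    def frame_length(self) -> int:
        # 20 ms at 16 kHz -> 320 samples
        return self.sampling_rate * self.frame_ms // 1000

    @property
    def hop_length(self) -> int:
        # 10 ms at 16 kHz -> 160 samples
        return self.sampling_rate * self.hop_ms // 1000


def resolve_feature_config(processor_config: dict) -> AudioFeatureDefaults:
    """Always return a config; a missing 'feature_extractor' key means defaults."""
    overrides = processor_config.get("feature_extractor") or {}
    return AudioFeatureDefaults(**overrides)


cfg = resolve_feature_config({})  # standard HF checkpoint: no such key
print(cfg.frame_length, cfg.hop_length, cfg.num_mel_bins)
```

The point of the design is that the absence of the key no longer disables audio preprocessing.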

Additional information

Fixes #903, #912, #923

Note: I don't have an Apple Silicon machine to test this on, so these fixes are based on code review and our experience fixing the same issues in the llama.cpp C++ implementation. The mel preprocessing fixes (semicausal padding, Hann window) were verified against PyTorch with mel cosine similarity of 0.9998.
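For anyone reproducing the comparison, one plausible way to compute a mel cosine similarity is to flatten both spectrograms and take the normalized dot product. This is an assumption about the metric; the helper name and the flattening choice are illustrative, not the exact script used for the 0.9998 figure.

```python
import numpy as np


def mel_cosine_similarity(mel_a, mel_b):
    """Cosine similarity between two mel spectrograms of identical shape."""
    a = np.asarray(mel_a, dtype=np.float64)
    b = np.asarray(mel_b, dtype=np.float64)
    if a.shape != b.shape:
        # A frame-count mismatch (e.g. missing padding) surfaces here.
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    a, b = a.ravel(), b.ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero input")
    return float(np.dot(a, b) / denom)


# Synthetic stand-ins for a reference and a candidate spectrogram (frames x mel bins).
rng = np.random.default_rng(0)
reference = rng.standard_normal((100, 128))
candidate = reference + 1e-3 * rng.standard_normal((100, 128))
similarity = mel_cosine_similarity(reference, candidate)
print(f"{similarity:.6f}")
```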

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES — Claude Code was used to assist with code review and identifying the bugs. All fixes are based on manual tracing against the HuggingFace Transformers reference implementation.

stephencox and others added 5 commits April 5, 2026 19:42
…ctor

Fix four bugs preventing Gemma 4 audio from working:

1. Missing semicausal left-padding in audio feature extractor.
   The HF reference prepends frame_length//2 (160) zero samples before
   the unfold, centering the first frame at t=0. Without this, the mel
   spectrogram is misaligned and the frame count is wrong, which also
   causes the broadcast shapes error (issue Blaizzy#923).

2. Wrong Hann window formula. Used cos(2*pi*(n+0.5)/N) instead of the
   correct periodic Hann cos(2*pi*n/N). The +0.5 phase shift produces
   meaningfully different spectral values from what the model was
   trained on.

3. sanitize() double-nests language_model weights (issue Blaizzy#912).
   HF keys like model.language_model.model.embed_tokens.weight become
   language_model.model.embed_tokens.weight after stripping model.,
   which already matches the MLX path. The unconditional insertion of
   .model. created language_model.model.model.*, so all LM weights
   loaded as zero.

4. Feature extractor not instantiated (issue Blaizzy#903). Only created when
   processor_config.json contains a "feature_extractor" key, which
   standard HF checkpoints don't include. Now instantiates with USM
   defaults unconditionally.

Fixes Blaizzy#903, Blaizzy#912, Blaizzy#923
@Blaizzy Blaizzy merged commit 3472132 into Blaizzy:main Apr 7, 2026
1 check passed
@Blaizzy
Owner

Blaizzy commented Apr 7, 2026

Thanks!

Development

Successfully merging this pull request may close these issues.

  • Gemma 4: sanitize() duplicates 'model.' prefix, all weights load as zero
  • Gemma 4 E2B/E4B audio produces gibberish — two issues found
