fix: Gemma 4 audio - mel preprocessing, weight loading, feature extractor #931
Merged
…ctor

Fix four bugs preventing Gemma 4 audio from working:

1. Missing semicausal left-padding in audio feature extractor. The HF reference prepends frame_length//2 (160) zero samples before the unfold, centering the first frame at t=0. Without this, the mel spectrogram is misaligned and the frame count is wrong, which also causes the broadcast shapes error (issue Blaizzy#923).

2. Wrong Hann window formula. Used cos(2*pi*(n+0.5)/N) instead of the correct periodic Hann cos(2*pi*n/N). The +0.5 phase shift produces meaningfully different spectral values from what the model was trained on.

3. sanitize() double-nests language_model weights (issue Blaizzy#912). HF keys like model.language_model.model.embed_tokens.weight become language_model.model.embed_tokens.weight after stripping model., which already matches the MLX path. The unconditional insertion of .model. created language_model.model.model.*, so all LM weights loaded as zero.

4. Feature extractor not instantiated (issue Blaizzy#903). Only created when processor_config.json contains a "feature_extractor" key, which standard HF checkpoints don't include. Now instantiates with USM defaults unconditionally.

Fixes Blaizzy#903, Blaizzy#912, Blaizzy#923
Owner
Thanks!
Overview
Fix four bugs preventing Gemma 4 audio (E2B/E4B) from producing correct transcriptions. These were discovered while implementing and debugging the Gemma 4 audio conformer encoder for llama.cpp (ggml-org/llama.cpp#21421), where we traced tensors between PyTorch and C++ to identify divergences.
Fixes
1. Missing semicausal left-padding (`audio_feature_extractor.py`)

The HF reference prepends `frame_length // 2` (160) zero samples before the unfold, centering the first frame at t=0. Without this:

- the mel spectrogram is misaligned
- the frame count is wrong, which also causes the broadcast shapes error (#923)

2. Wrong Hann window formula (`audio_feature_extractor.py`)

Used `cos(2π(n+0.5)/N)` instead of the correct periodic Hann `cos(2πn/N)`. The `+0.5` phase shift produces meaningfully different spectral values. Verified against the HF Transformers reference, which uses `window_function(frame_length)` = periodic Hann.

3. `sanitize()` double-nests `model.` prefix (`gemma4.py`); fixes #912

HF checkpoint keys like `model.language_model.model.embed_tokens.weight` become `language_model.model.embed_tokens.weight` after stripping the outer `model.` prefix, which already matches the MLX attribute path. The unconditional `.model.` insertion created `language_model.model.model.*`, so all language model weights loaded as zero.

4. Feature extractor not instantiated (`processing_gemma4.py`); fixes #903

The feature extractor was only created when `processor_config.json` contains a `"feature_extractor"` key, which standard HF checkpoints don't include. Now instantiates with USM defaults unconditionally (the parameters are fixed for all Gemma 4 audio models: 16 kHz, 128 mel bins, 20 ms frame, 10 ms hop, HTK scale).

Additional information
Fixes #903, #912, #923
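For reviewers who want to see the #912 key problem in isolation, here is a minimal sketch of the two remapping behaviours described in fix 3. The helper names `strip_outer_prefix` and `remap_with_unconditional_insert` are hypothetical and are not the actual `sanitize()` code in `gemma4.py`, which may handle more cases.

```python
# Hypothetical helpers illustrating the key remapping from fix 3.
# They are NOT the real sanitize() implementation.

OUTER_PREFIX = "model."
LM_HEAD = "language_model."


def strip_outer_prefix(key: str) -> str:
    """Drop the outer 'model.' prefix from an HF-style checkpoint key."""
    if key.startswith(OUTER_PREFIX):
        return key[len(OUTER_PREFIX):]
    return key


def remap_with_unconditional_insert(key: str) -> str:
    """Old behaviour as described in the PR: always insert '.model.'
    after 'language_model', even when the key already contains it."""
    key = strip_outer_prefix(key)
    if key.startswith(LM_HEAD):
        key = LM_HEAD + "model." + key[len(LM_HEAD):]
    return key


hf_key = "model.language_model.model.embed_tokens.weight"
print(strip_outer_prefix(hf_key))               # already the MLX path
print(remap_with_unconditional_insert(hf_key))  # double-nested path
```

The second call yields a `language_model.model.model.*` key, which is the double nesting that the PR removes.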
Note: I don't have an Apple Silicon machine to test this on, so these fixes are based on code review and our experience fixing the same issues in the llama.cpp C++ implementation. The mel preprocessing fixes (semicausal padding, Hann window) were verified against PyTorch with mel cosine similarity of 0.9998.
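The two mel preprocessing fixes can be illustrated with a small NumPy sketch. It is not the code from `audio_feature_extractor.py`: the function names are hypothetical, the frame and hop lengths are derived from the 16 kHz / 20 ms / 10 ms parameters listed above, and steps such as preemphasis, the FFT and the mel filterbank are left out. The cosine similarity helper shows the kind of comparison used for verification, assuming the two spectrograms are available as arrays of the same shape.

```python
import numpy as np

FRAME_LENGTH = 320  # 20 ms at 16 kHz
HOP_LENGTH = 160    # 10 ms at 16 kHz


def periodic_hann(n_points: int) -> np.ndarray:
    """Periodic Hann window: 0.5 - 0.5 * cos(2*pi*n / N)."""
    n = np.arange(n_points)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * n / n_points)


def shifted_hann(n_points: int) -> np.ndarray:
    """The incorrect variant from fix 2, with the +0.5 phase shift."""
    n = np.arange(n_points)
    return 0.5 - 0.5 * np.cos(2.0 * np.pi * (n + 0.5) / n_points)


def frame_signal(samples, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH,
                 semicausal=True):
    """Split a 1-D signal into overlapping frames.

    With semicausal=True, frame_length // 2 zeros are prepended first,
    as described in fix 1.
    """
    if semicausal:
        pad = np.zeros(frame_length // 2, dtype=samples.dtype)
        samples = np.concatenate([pad, samples])
    num_frames = 1 + (len(samples) - frame_length) // hop_length
    starts = hop_length * np.arange(num_frames)[:, None]
    idx = starts + np.arange(frame_length)[None, :]
    return samples[idx]


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two arrays, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


one_second = np.ones(16000, dtype=np.float32)
print(frame_signal(one_second, semicausal=False).shape)
print(frame_signal(one_second, semicausal=True).shape)
```

In this sketch, one second of audio gives 99 frames without the left-padding and 100 with it, which is the kind of frame count mismatch that can surface as a shape error downstream.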