Add Gemma 4 Audio Processing Primitives and Feature Extractor#471
Open
ghchinoy wants to merge 3 commits into ml-explore:main from
Conversation
- Encapsulate 2D `causalValidMask` generation natively inside `AudioEncoder`.
- Fix array slicing logic to slice purely on the time axis using explicit multidimensional slice notation.
- Replace brittle `einsum` operation with a stable combination of `.transposed()` and `.matmul()` to ensure robust matrix dimension broadcasting.
- Update tests and model wrapper template to reflect the cleaner `AudioEncoder` API that no longer requires an external causal mask.
- Add `MLXFast` dependency to the `Gemma4Audio` package target.
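The einsum-to-matmul change can be illustrated with a minimal sketch, assuming the MLX Swift API; the shapes and function name here are illustrative, not the PR's actual code.

```swift
import MLX

// A contraction like einsum("bct,bch->bth", x, y) can be expressed as an
// explicit transpose followed by a broadcast matmul:
//   x: [B, C, T], y: [B, C, H]  ->  result: [B, T, H]
func contract(_ x: MLXArray, _ y: MLXArray) -> MLXArray {
    // Move the contraction axis (C) to the inner position, then let matmul
    // broadcast over the leading batch axis: [B, T, C] x [B, C, H] -> [B, T, H]
    matmul(x.transposed(0, 2, 1), y)
}
```

Because `matmul` broadcasts over leading batch dimensions, the transpose makes the contraction axis explicit and avoids the dimension-inference pitfalls of a string-specified `einsum`.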
Description
This PR introduces the core acoustic features for the multimodal Gemma 4 model into the `mlx-swift-examples` repository. It provides the native Swift implementations for Gemma 4's audio encoder and DSP preprocessing.

Motivation
As Apple MLX and Swift grow to support multimodal tasks, developers need native implementations of complex architectures like Gemma 4's audio encoder to run models efficiently on iOS and macOS. This PR ports the necessary `MLXNN.Module` structs and `MLXFFT` implementations from `ml-explore/mlx-vlm`.

Added Components
This PR introduces a standalone `Gemma4Audio` library containing the following additions:

- `AudioRMSNorm`: A specific normalization layer applying learnable weights directly, without offsets.
- `SSCPConvBlock` & `SubSampleConvProjection`: 2D convolutional modules with proper symmetric padding to downsample the time/frequency domains of the mel-spectrograms.
- `ClippableLinear`: A linear projection layer that supports safetensors-based dynamic input/output bounds clipping (`MLX.clip`), required for Gemma 4 stability.
- `ConformerFeedForward` & `ConformerLightConv1d`: Macaron-style FFN with residual scaling and causally padded depthwise 1D convolutions leveraging `MLXNN.Conv1d(groups: hiddenSize)`.
- `AudioRelativePositionEmbedding` & `AudioAttention`: The core mechanism using dynamically generated sinusoidal timing signals, chunked local contexts, and 5D tensor `MLX.einsum` operations (`"bnuwc,bucnh->buwnh"`) with logit softcapping.
- `ConformerBlock`, `AudioEncoder` & `MultimodalEmbedder`: The wrapper modules chaining the layers into the full neural backbone.
- `Gemma4AudioFeatureExtractor`: A native `MLXFFT.rfft` implementation to process raw audio `.wav` waveforms into mel-spectrograms, supporting HTK-style preemphasis, filtering, and `asStrided` overlapping frames.
- `Gemma4AudioModel` (Template): A template wrapper class demonstrating how to conform to `LLMModel`, projecting the audio features and concatenating them with text embeddings before passing them to the base `LanguageModel` (e.g. `Gemma3TextModel`).

Ecosystem Staging Plan & Next Steps
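To make the `AudioRMSNorm` behavior concrete, here is a minimal sketch of an offset-free RMS normalization layer, assuming the MLX Swift API; the PR's actual implementation may differ in detail.

```swift
import MLX
import MLXNN

// Sketch: RMS normalization with a learnable scale and no bias/offset term.
class AudioRMSNorm: Module {
    let weight: MLXArray
    let eps: Float

    init(dimensions: Int, eps: Float = 1e-6) {
        self.weight = MLXArray.ones([dimensions])
        self.eps = eps
    }

    func callAsFunction(_ x: MLXArray) -> MLXArray {
        // Normalize by the root-mean-square over the feature axis, then apply
        // the learnable weight directly -- no additive offset, as noted above.
        let norm = rsqrt(mean(x * x, axis: -1, keepDims: true) + eps)
        return x * norm * weight
    }
}
```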
Since `mlx-swift-examples` serves as the primary hub for testing new multimodal architectures, I'm incubating `Gemma4Audio` here first. This is the audio "ears" and a bridge to the model.

The immediate next step I'm looking forward to is collaborating with the Apple MLX engineering team to integrate this architecture into the official `ml-explore/mlx-swift-lm` repository: specifically, using the `Gemma4AudioModel` template as a blueprint while waiting for the official boilerplate for the `LLMModel` conformance and autoregressive generation loop (`MLXLM.generate()`) to be implemented upstream, so developers can use it seamlessly without writing custom tokenization and KV cache logic.

Once the API proves stable and the core language models are updated to support it natively, I plan to open, or look forward to the community opening, a follow-up PR to migrate these components fully into the `ml-explore/mlx-swift-lm` repository (under `MLXLLM` or `MLXVLM`).

Testing
Comprehensive unit tests have been written (`Gemma4AudioTests.swift`) to validate configuration, convolutional striding, matrix dimensional shapes (e.g. `[Batch, DecimatedTime, HiddenSize]`), and feature extractor correctness on a sample 16 kHz audio sequence. All tests compile and run properly.

Checklist
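A shape-pipeline test in the spirit described above might look like the following sketch; the `AudioEncoder` initializer, config, and input dimensions here are hypothetical stand-ins, not the PR's exact test code.

```swift
import XCTest
import MLX

final class AudioShapeTests: XCTestCase {
    func testEncoderOutputShape() {
        // Hypothetical construction; the real initializer and config differ.
        let encoder = AudioEncoder(config: .init())
        // A dummy mel-spectrogram batch: [Batch, Time, MelBins].
        let mel = MLXArray.zeros([1, 128, 80])
        let out = encoder(mel)
        // The encoder should emit [Batch, DecimatedTime, HiddenSize].
        XCTAssertEqual(out.shape.count, 3)
        XCTAssertEqual(out.shape[0], 1)
    }
}
```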
- `AudioFeatureExtractor` to compute mel filterbanks natively.
- `XCTest` suite to validate tensor shape pipelines.
- `swift-format` on all new source files.

🥇 Thank you to the `mlx-vlm` and `mlx-swift` community for the foundations!

Resolves ml-explore/mlx-swift-lm#207