
Add Gemma 4 Audio Processing Primitives and Feature Extractor#471

Open
ghchinoy wants to merge 3 commits into ml-explore:main from ghchinoy:main

Conversation


@ghchinoy ghchinoy commented Apr 12, 2026

Description

This PR introduces the core audio-processing components of the multimodal Gemma 4 model into the mlx-swift-examples repository: native Swift implementations of Gemma 4's audio encoder and DSP preprocessing.

Motivation

As Apple's MLX Swift ecosystem grows its support for multimodal tasks, developers need native implementations of complex architectures like Gemma 4's audio encoder to run these models efficiently on iOS and macOS. This PR ports the necessary MLXNN.Module structs and MLXFFT implementations from ml-explore/mlx-vlm.

Added Components

This PR introduces a standalone Gemma4Audio library containing the following additions:

  1. AudioRMSNorm: A specific normalization layer applying learnable weights directly without offsets.
  2. SSCPConvBlock & SubSampleConvProjection: 2D convolutional modules with proper symmetric padding to downsample the time/frequency domains of the mel-spectrograms.
  3. ClippableLinear: A linear projection layer that clips its inputs and outputs to dynamic bounds loaded from the safetensors checkpoint (via MLX.clip), required for Gemma 4 numerical stability.
  4. ConformerFeedForward & ConformerLightConv1d: Macaron-style FFN with residual scaling and causally-padded depthwise 1D convolutions leveraging MLXNN.Conv1d(groups: hiddenSize).
  5. AudioRelativePositionEmbedding & AudioAttention: The core mechanism using dynamically generated sinusoidal timing signals, chunked local contexts, and 5D tensor MLX.einsum operations ("bnuwc,bucnh->buwnh") with logit softcapping.
  6. ConformerBlock, AudioEncoder & MultimodalEmbedder: The wrapper modules chaining the layers into the full neural backbone.
  7. Gemma4AudioFeatureExtractor: A native MLXFFT.rfft implementation to process raw audio .wav waveforms into mel-spectrograms, supporting HTK-style preemphasis, filtering, and asStrided overlapping frames.
  8. Gemma4AudioModel (Template): A template wrapper class demonstrating how to conform to LLMModel, projecting the audio features and concatenating them with text embeddings before passing them to the base LanguageModel (e.g. Gemma3TextModel).
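For readers unfamiliar with the offset-free normalization in item 1, the math reduces to a few lines. This is a NumPy sketch of the formula only (the actual module is a Swift MLXNN.Module); the function name `audio_rms_norm` and the `eps` value are illustrative, not names from the PR:

```python
import numpy as np

def audio_rms_norm(x, weight, eps=1e-6):
    """RMS-normalize over the last axis, then scale by a learnable
    weight with no additive offset (bias). `eps` is an illustrative value."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# A row with RMS sqrt(12.5) comes out with unit RMS after normalization.
y = audio_rms_norm(np.array([[3.0, 4.0]]), np.ones(2))
```

Dropping the additive offset keeps the layer a pure rescaling, which is what distinguishes it from a full LayerNorm.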
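The logit softcapping mentioned in item 5 is a tanh-based squashing of attention scores before the softmax. A minimal NumPy sketch, assuming an illustrative cap of 50.0 (the real constant comes from the model config, not this PR):

```python
import numpy as np

def softcap(logits, cap=50.0):
    """Squash attention logits smoothly into (-cap, cap) via tanh.
    `cap=50.0` is an assumed example value, not the Gemma 4 constant."""
    return cap * np.tanh(logits / cap)

# Small logits pass through nearly unchanged; large ones saturate at the cap,
# which keeps the softmax numerically stable.
capped = softcap(np.array([0.1, 1000.0]))
```

Because tanh is smooth, gradients still flow for moderately large logits, unlike a hard clip.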
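The front half of the feature-extractor pipeline in item 7 (preemphasis, strided framing, windowed rFFT) can be sketched in NumPy, where `as_strided` plays the role of MLX's asStrided. All names, frame sizes, and the 0.97 coefficient below are illustrative assumptions; the mel filterbank and log step are omitted:

```python
import numpy as np

def frames_via_strides(x, frame_len, hop):
    """Overlapping analysis frames as a zero-copy strided view
    (the NumPy analogue of MLX's asStrided)."""
    n = 1 + (len(x) - frame_len) // hop
    s = x.strides[0]
    return np.lib.stride_tricks.as_strided(x, (n, frame_len), (hop * s, s))

def spectrogram_sketch(wave, frame_len=400, hop=160, coeff=0.97):
    """Illustrative pipeline: HTK-style preemphasis, framing, Hann window,
    magnitude rFFT. Mel filtering and log compression are left out."""
    pre = np.append(wave[0], wave[1:] - coeff * wave[:-1])  # preemphasis
    f = frames_via_strides(pre, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(f, axis=-1))  # shape: [frames, freq_bins]

spec = spectrogram_sketch(np.random.randn(16000))  # 1 s of 16 kHz audio
```

With these example parameters, one second of 16 kHz audio yields 98 frames of 201 frequency bins each.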

Ecosystem Staging Plan & Next Steps

Since mlx-swift-examples serves as the primary hub for testing new multimodal architectures, I'm incubating Gemma4Audio here first. It provides the model's audio "ears" and a bridge into the language backbone.

The immediate next step is collaborating with the Apple MLX engineering team to integrate this architecture into the official ml-explore/mlx-swift-lm repository. Specifically, the Gemma4AudioModel template can serve as a blueprint once the boilerplate for LLMModel conformance and the autoregressive generation loop (MLXLM.generate()) lands upstream, so developers can use it seamlessly without writing custom tokenization and KV-cache logic.

Once the API proves stable and the core language models are updated to support it natively, I plan to open, or look forward to the community opening, a follow-up PR to migrate these components fully into the ml-explore/mlx-swift-lm repository (under MLXLLM or MLXVLM).

Testing

Comprehensive unit tests (Gemma4AudioTests.swift) validate configuration, convolutional striding, tensor shapes (e.g. [Batch, DecimatedTime, HiddenSize]), and feature-extractor correctness on a sample 16 kHz audio sequence. All tests compile and pass.

Checklist

  • Ported all Gemma 4 acoustic modules to Swift MLX.
  • Ported DSP AudioFeatureExtractor to compute mel filterbanks natively.
  • Written XCTest suite to validate tensor shape pipelines.
  • Executed swift-format on all new source files.

🥇 Thank you to the mlx-vlm and mlx-swift community for the foundations!

Resolves ml-explore/mlx-swift-lm#207

- Encapsulate 2D `causalValidMask` generation natively inside `AudioEncoder`.
- Fix array slicing logic to slice purely on the time axis using explicit multidimensional slice notation.
- Replace brittle `einsum` operation with a stable combination of `.transposed()` and `.matmul()` to ensure robust matrix dimension broadcasting.
- Update tests and model wrapper template to reflect the cleaner `AudioEncoder` API that no longer requires an external causal mask.
- Add `MLXFast` dependency to `Gemma4Audio` package target.
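The einsum-to-matmul replacement in the third bullet can be sanity-checked numerically. The NumPy sketch below (with small illustrative dimensions) expresses the same `bnuwc,bucnh->buwnh` contraction with transposes plus a batched matmul, mirroring the `.transposed()`/`.matmul()` rewrite:

```python
import numpy as np

rng = np.random.default_rng(0)
b, n, u, w, c, h = 2, 3, 4, 5, 6, 7           # small illustrative dims
A = rng.standard_normal((b, n, u, w, c))      # "bnuwc"
B = rng.standard_normal((b, u, c, n, h))      # "bucnh"

ref = np.einsum("bnuwc,bucnh->buwnh", A, B)

# Same contraction via transposes + batched matmul:
# align both operands to [b, u, n, w|h, c] and contract over c.
A2 = A.transpose(0, 2, 1, 3, 4)               # -> (b, u, n, w, c)
B2 = B.transpose(0, 1, 3, 2, 4)               # -> (b, u, n, c, h)
out = (A2 @ B2).transpose(0, 1, 3, 2, 4)      # (b, u, n, w, h) -> (b, u, w, n, h)

print(np.allclose(ref, out))  # True
```

The matmul form makes the batch dimensions explicit, which avoids the broadcasting ambiguity the commit message calls brittle.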
