
Add Gemma 4 Audio Processing Primitives and Feature Extractor#471

Open
ghchinoy wants to merge 3 commits into ml-explore:main from ghchinoy:main

Conversation


@ghchinoy ghchinoy commented Apr 12, 2026

Description

This PR introduces the core audio-processing components of the multimodal Gemma 4 model into the mlx-swift-examples repository: native Swift implementations of Gemma 4's audio encoder and DSP preprocessing.

Motivation

As Apple's MLX Swift ecosystem grows its support for multimodal tasks, developers need native implementations of complex architectures like Gemma 4's audio encoder to run these models efficiently on iOS and macOS. This PR ports the necessary MLXNN.Module structs and MLXFFT implementations from ml-explore/mlx-vlm.

Added Components

This PR introduces a standalone Gemma4Audio library containing the following additions:

  1. AudioRMSNorm: A specific normalization layer applying learnable weights directly without offsets.
  2. SSCPConvBlock & SubSampleConvProjection: 2D convolutional modules with proper symmetric padding to downsample the time/frequency domains of the mel-spectrograms.
  3. ClippableLinear: A linear projection layer that clips its inputs and outputs to dynamic bounds loaded from the safetensors checkpoint (via MLX.clip), required for Gemma 4 numerical stability.
  4. ConformerFeedForward & ConformerLightConv1d: Macaron-style FFN with residual scaling and causally-padded depthwise 1D convolutions leveraging MLXNN.Conv1d(groups: hiddenSize).
  5. AudioRelativePositionEmbedding & AudioAttention: The core mechanism using dynamically generated sinusoidal timing signals, chunked local contexts, and 5D tensor MLX.einsum operations ("bnuwc,bucnh->buwnh") with logit softcapping.
  6. ConformerBlock, AudioEncoder & MultimodalEmbedder: The wrapper modules chaining the layers into the full neural backbone.
  7. Gemma4AudioFeatureExtractor: A native MLXFFT.rfft implementation to process raw audio .wav waveforms into mel-spectrograms, supporting HTK-style preemphasis, filtering, and asStrided overlapping frames.
  8. Gemma4AudioModel (Template): A template wrapper class demonstrating how to conform to LLMModel, projecting the audio features and concatenating them with text embeddings before passing them to the base LanguageModel (e.g. Gemma3TextModel).
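For readers unfamiliar with the offset-free normalization in item 1, the math reduces to a few lines. This is a NumPy sketch of the formula only (the actual module is a Swift MLXNN.Module); the function name `audio_rms_norm` and the `eps` value are illustrative, not names from the PR:

```python
import numpy as np

def audio_rms_norm(x, weight, eps=1e-6):
    """RMS-normalize over the last axis, then scale by a learnable
    weight with no additive offset (bias). `eps` is an illustrative value."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# A row with RMS sqrt(12.5) comes out with unit RMS after normalization.
y = audio_rms_norm(np.array([[3.0, 4.0]]), np.ones(2))
```

Dropping the additive offset keeps the layer a pure rescaling, which is what distinguishes it from a full LayerNorm.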
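The logit softcapping mentioned in item 5 is a tanh-based squashing of attention scores before the softmax. A minimal NumPy sketch, assuming an illustrative cap of 50.0 (the real constant comes from the model config, not this PR):

```python
import numpy as np

def softcap(logits, cap=50.0):
    """Squash attention logits smoothly into (-cap, cap) via tanh.
    `cap=50.0` is an assumed example value, not the Gemma 4 constant."""
    return cap * np.tanh(logits / cap)

# Small logits pass through nearly unchanged; large ones saturate at the cap,
# which keeps the softmax numerically stable.
capped = softcap(np.array([0.1, 1000.0]))
```

Because tanh is smooth, gradients still flow for moderately large logits, unlike a hard clip.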
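The front half of the feature-extractor pipeline in item 7 (preemphasis, strided framing, windowed rFFT) can be sketched in NumPy, where `as_strided` plays the role of MLX's asStrided. All names, frame sizes, and the 0.97 coefficient below are illustrative assumptions; the mel filterbank and log step are omitted:

```python
import numpy as np

def frames_via_strides(x, frame_len, hop):
    """Overlapping analysis frames as a zero-copy strided view
    (the NumPy analogue of MLX's asStrided)."""
    n = 1 + (len(x) - frame_len) // hop
    s = x.strides[0]
    return np.lib.stride_tricks.as_strided(x, (n, frame_len), (hop * s, s))

def spectrogram_sketch(wave, frame_len=400, hop=160, coeff=0.97):
    """Illustrative pipeline: HTK-style preemphasis, framing, Hann window,
    magnitude rFFT. Mel filtering and log compression are left out."""
    pre = np.append(wave[0], wave[1:] - coeff * wave[:-1])  # preemphasis
    f = frames_via_strides(pre, frame_len, hop) * np.hanning(frame_len)
    return np.abs(np.fft.rfft(f, axis=-1))  # shape: [frames, freq_bins]

spec = spectrogram_sketch(np.random.randn(16000))  # 1 s of 16 kHz audio
```

With these example parameters, one second of 16 kHz audio yields 98 frames of 201 frequency bins each.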

Ecosystem Staging Plan & Next Steps

Since mlx-swift-examples serves as the primary hub for testing new multimodal architectures, I'm incubating Gemma4Audio here first. It provides the model's audio "ears" and a bridge into the language backbone.

The immediate next step is collaborating with the Apple MLX engineering team to integrate this architecture into the official ml-explore/mlx-swift-lm repository. Specifically, the Gemma4AudioModel template can serve as a blueprint once the boilerplate for LLMModel conformance and the autoregressive generation loop (MLXLM.generate()) lands upstream, so developers can use it seamlessly without writing custom tokenization and KV-cache logic.

Once the API proves stable and the core language models are updated to support it natively, I plan to open, or look forward to the community opening, a follow-up PR to migrate these components fully into the ml-explore/mlx-swift-lm repository (under MLXLLM or MLXVLM).

Testing

Comprehensive unit tests (Gemma4AudioTests.swift) validate configuration, convolutional striding, tensor shapes (e.g. [Batch, DecimatedTime, HiddenSize]), and feature-extractor correctness on a sample 16 kHz audio sequence. All tests compile and pass.

Checklist

  • Ported all Gemma 4 acoustic modules to Swift MLX.
  • Ported DSP AudioFeatureExtractor to compute mel filterbanks natively.
  • Written XCTest suite to validate tensor shape pipelines.
  • Executed swift-format on all new source files.

🥇 Thank you to the mlx-vlm and mlx-swift community for the foundations!

Resolves ml-explore/mlx-swift-lm#207

- Encapsulate 2D `causalValidMask` generation natively inside `AudioEncoder`.
- Fix array slicing logic to slice purely on the time axis using explicit multidimensional slice notation.
- Replace brittle `einsum` operation with a stable combination of `.transposed()` and `.matmul()` to ensure robust matrix dimension broadcasting.
- Update tests and model wrapper template to reflect the cleaner `AudioEncoder` API that no longer requires an external causal mask.
- Add `MLXFast` dependency to `Gemma4Audio` package target.
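The einsum-to-matmul replacement in the third bullet can be sanity-checked numerically. The NumPy sketch below (with small illustrative dimensions) expresses the same `bnuwc,bucnh->buwnh` contraction with transposes plus a batched matmul, mirroring the `.transposed()`/`.matmul()` rewrite:

```python
import numpy as np

rng = np.random.default_rng(0)
b, n, u, w, c, h = 2, 3, 4, 5, 6, 7           # small illustrative dims
A = rng.standard_normal((b, n, u, w, c))      # "bnuwc"
B = rng.standard_normal((b, u, c, n, h))      # "bucnh"

ref = np.einsum("bnuwc,bucnh->buwnh", A, B)

# Same contraction via transposes + batched matmul:
# align both operands to [b, u, n, w|h, c] and contract over c.
A2 = A.transpose(0, 2, 1, 3, 4)               # -> (b, u, n, w, c)
B2 = B.transpose(0, 1, 3, 2, 4)               # -> (b, u, n, c, h)
out = (A2 @ B2).transpose(0, 1, 3, 2, 4)      # (b, u, n, w, h) -> (b, u, w, n, h)

print(np.allclose(ref, out))  # True
```

The matmul form makes the batch dimensions explicit, which avoids the broadcasting ambiguity the commit message calls brittle.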
