Add Gemma 3n E4B audio encoder (Conformer) support by vahsaechao · Pull Request #194 · ml-explore/mlx-swift-lm

vahsaechao · 2026-04-07T22:57:53Z

Proposed changes

Adds the audio encoder for Gemma 3n E4B multimodal models: a 12-layer USM-style Conformer that converts mel spectrograms into embeddings for the language model.

This is the first audio modality support in mlx-swift-lm. The encoder was ported from the Hugging Face Python implementation and tested on an iPhone 17 Pro Max.

Architecture:

mel [1, T, 128] → SubSampleConvProjection (4x reduction)
  → 12 Conformer blocks (chunked local attention, depthwise conv1d, cumulative group norm)
  → Temporal reduction (4x)
  → AudioEmbedder (1536 → 2048)
  → LM token stream
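To make the diagram above easier to follow, here is a minimal shape trace in plain Swift. It is a sketch, not code from this PR: the `Stage` type and `traceShapes` helper are hypothetical, ceiling division at each downsampling step is an assumption, and the 1536-wide output of the subsampling projection is inferred from the embedder's 1536 → 2048 mapping. The 48-frame input is arbitrary.

```swift
// Hypothetical shape trace for the pipeline above; not the PR's implementation.
struct Stage {
    let name: String
    let timeFactor: Int  // temporal downsampling applied by this stage
    let features: Int    // feature width after this stage
}

// Ceiling division: an assumption about how partial windows are handled.
func ceilDiv(_ a: Int, _ b: Int) -> Int { (a + b - 1) / b }

// Returns the [batch, time, features] shape after each stage, batch fixed at 1.
func traceShapes(frames: Int, stages: [Stage]) -> [[Int]] {
    var t = frames
    return stages.map { stage -> [Int] in
        t = ceilDiv(t, stage.timeFactor)
        return [1, t, stage.features]
    }
}

let stages = [
    Stage(name: "SubSampleConvProjection", timeFactor: 4, features: 1536),
    Stage(name: "12 Conformer blocks", timeFactor: 1, features: 1536),
    Stage(name: "Temporal reduction", timeFactor: 4, features: 1536),
    Stage(name: "AudioEmbedder", timeFactor: 1, features: 2048),
]

let shapes = traceShapes(frames: 48, stages: stages)
for (stage, shape) in zip(stages, shapes) {
    print(stage.name, shape)
}
```

Under these assumptions a 48-frame input leaves the temporal reduction as [1, 3, 1536] and reaches the LM as [1, 3, 2048].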

New files:

Gemma3nAudio.swift - Full Conformer encoder: chunked local attention, depthwise conv1d, cumulative group norm, temporal reduction, sub sample conv projection
Gemma3nAudioConfig.swift - 28 audio encoder configuration parameters with Codable
Gemma3nVLM.swift - Top level VLM wrapper with audio embedding injection and multimodal embedder
Gemma3nAudioTests.swift - Configuration decoding and encoder shape tests
Factory registration in VLMModelFactory.swift
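For readers unfamiliar with Codable-based configs, the following sketch decodes a small subset of audio parameters. `AudioConfigSketch` is a hypothetical stand-in for the PR's 28-field struct, and the JSON key names are assumptions modeled on the Hugging Face config; only the values (1536, 128, 12, 4) come from the description above.

```swift
import Foundation

// Hypothetical four-field subset; the PR's Gemma3nAudioConfig has 28 parameters.
struct AudioConfigSketch: Codable {
    let hiddenSize: Int
    let inputFeatSize: Int
    let confNumHiddenLayers: Int
    let confReductionFactor: Int

    // Key names are assumed, not copied from the PR.
    enum CodingKeys: String, CodingKey {
        case hiddenSize = "hidden_size"
        case inputFeatSize = "input_feat_size"
        case confNumHiddenLayers = "conf_num_hidden_layers"
        case confReductionFactor = "conf_reduction_factor"
    }
}

let json = """
{
  "hidden_size": 1536,
  "input_feat_size": 128,
  "conf_num_hidden_layers": 12,
  "conf_reduction_factor": 4
}
"""

// try! keeps the sketch short; production code should handle the error.
let config = try! JSONDecoder().decode(AudioConfigSketch.self, from: Data(json.utf8))
print(config.confNumHiddenLayers, config.hiddenSize)
```

Explicit `CodingKeys` keep the snake_case names from the checkpoint's JSON visible in the Swift source, which makes missing or renamed keys easier to spot in a decoding test.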

Updated files:

Gemma3nAudio.swift - split the attention, norm, conv, and encoder classes into separate files:
Gemma3nAudioAttention.swift - relative position embeddings, chunked attention
Gemma3nAudioNorm.swift - cumulative group normalization
Gemma3nAudioConv.swift - SSCP conv blocks, subsampling projection
Gemma3nAudio.swift - conformer blocks, top-level encoder
Gemma3nVLM.swift - preconditionFailure on the unimplemented processor stub
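As an illustration of what "cumulative" means for the normalization in Gemma3nAudioNorm.swift, here is a simplified sketch in plain Swift. It is not the PR's implementation: it works on an unbatched [T][F] array, assumes non-empty frames, uses a running mean and variance over all values seen up to and including the current frame, and omits the learned scale and any masking. The function name and the `eps` default are hypothetical.

```swift
// Simplified cumulative normalization over time; input is [T][F].
func cumulativeNorm(_ x: [[Float]], eps: Float = 1e-3) -> [[Float]] {
    var count: Float = 0
    var sum: Float = 0
    var sumSq: Float = 0
    return x.map { frame -> [Float] in
        // Fold this frame into the running statistics before normalizing it.
        count += Float(frame.count)
        sum += frame.reduce(0, +)
        sumSq += frame.reduce(0) { $0 + $1 * $1 }
        let mean = sum / count
        let variance = max(sumSq / count - mean * mean, 0)
        let scale = 1 / (variance + eps).squareRoot()
        return frame.map { ($0 - mean) * scale }
    }
}

let normalized = cumulativeNorm([[1, 3], [5, 7]])
print(normalized)
```

Because the statistics only look backward, the output for a given frame does not change when later frames are appended, which is the property that makes this kind of normalization usable in a streaming setting.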

Tests:

Test Suite 'Selected tests' started at 2026-04-15 11:32:48.538.
Test Suite 'mlx-swift-lmPackageTests.xctest' started at 2026-04-15 11:32:48.538.
Test Suite 'Gemma3nAudioTests' started at 2026-04-15 11:32:48.538.
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioConfigDecoding]' started.
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioConfigDecoding]' passed (0.001 seconds).
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioEncoderShapes]' started.
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioEncoderShapes]' passed (0.054 seconds).
Test Case '-[MLXLMTests.Gemma3nAudioTests testMultimodalConfigDecoding]' started.
Test Case '-[MLXLMTests.Gemma3nAudioTests testMultimodalConfigDecoding]' passed (0.000 seconds).
Test Suite 'Gemma3nAudioTests' passed at 2026-04-15 11:32:48.593.
	 Executed 3 tests, with 0 failures (0 unexpected) in 0.055 (0.055) seconds
Test Suite 'mlx-swift-lmPackageTests.xctest' passed at 2026-04-15 11:32:48.593.
	 Executed 3 tests, with 0 failures (0 unexpected) in 0.055 (0.055) seconds
Test Suite 'Selected tests' passed at 2026-04-15 11:32:48.593.
	 Executed 3 tests, with 0 failures (0 unexpected) in 0.055 (0.056) seconds
Audio encoder output shape: [1, 3, 1536]
Audio encoder mask shape: [1, 3]
◇ Test run started.
↳ Testing Library Version: 1501
↳ Target Platform: arm64e-apple-macos14.0
✔ Test run with 0 tests in 0 suites passed after 0.001 seconds.

iPhone 17 Pro Max results: 15.8 s of audio → 99 tokens in 0.48 s (Apple Neural Engine)
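The figures above can be sanity-checked with a little arithmetic. The sketch below assumes a 10 ms mel hop, which is not stated in this PR, and combines the two 4x stages into a single 16x ceiling division. The result agrees with the reported 99 tokens, which is consistent with that assumption but does not confirm it.

```swift
import Foundation

// Assumption, not stated in the PR: one mel frame every 10 ms.
let hopSeconds = 0.010
let audioSeconds = 15.8   // from the reported result
let encodeSeconds = 0.48  // from the reported result
let totalReduction = 4 * 4  // subsampling (4x) times temporal reduction (4x)

let melFrames = Int((audioSeconds / hopSeconds).rounded())        // 1580
let tokens = (melFrames + totalReduction - 1) / totalReduction    // ceil(1580 / 16) = 99
let realTimeFactor = audioSeconds / encodeSeconds                 // roughly 33x real time

print("mel frames:", melFrames)
print("tokens:", tokens)
print(String(format: "real-time factor: %.1fx", realTimeFactor))
```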

Note: The processor includes a stub for mel spectrogram preprocessing. A follow-up PR can add the full audio processor pipeline (WAV → mel → encoder). The encoder itself is complete and tested.
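A stub like this is safer when the unimplemented path fails loudly. The sketch below shows one way to do that; `AudioInputSketch` and `prepareMel` are hypothetical names for illustration and do not reflect the PR's processor API.

```swift
// Hypothetical input type; the PR's processor API may differ.
enum AudioInputSketch {
    case mel([[Float]])     // precomputed mel frames, [T][128]
    case waveform([Float])  // raw PCM samples, not supported yet
}

func prepareMel(_ input: AudioInputSketch) -> [[Float]] {
    switch input {
    case .mel(let frames):
        // Already in the form the encoder expects; pass through.
        return frames
    case .waveform:
        // Trap instead of silently returning empty or wrong features.
        preconditionFailure("Waveform to mel preprocessing is not implemented yet")
    }
}

let frames = prepareMel(.mel([[0, 1], [2, 3]]))
print(frames.count)
```

`preconditionFailure` traps in debug and `-O` builds, but in `-Ounchecked` builds the optimizer may assume it is never reached; `fatalError` stops execution in every build mode, so it is the stricter choice if that matters.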

Checklist

I have read the CONTRIBUTING document
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
I have added tests that prove my fix is effective or that my feature works
I have updated the necessary documentation (if needed)

vahsaechao

Audio modality in mlx-swift-lm. The Conformer encoder is fully ported and tested on an iPhone 17 Pro Max.

davidkoski · 2026-04-13T19:06:49Z

See also #192, maybe work with @antmanler on combining these on top of #180 (merged into main)? Thank you!

12-layer USM-style Conformer audio encoder for Gemma 3n E4B multimodal models. Enables on-device audio encoding: mel spectrogram → Conformer → embedder pipeline.

New files:
- Gemma3nAudio.swift: Full Conformer port (chunked local attention, depthwise conv1d, cumulative group norm, temporal reduction, sub-sample conv projection)
- Gemma3nAudioConfig.swift: 28 audio encoder configuration parameters
- Gemma3nVLM.swift: Top-level VLM wrapper with audio embedding injection
- Gemma3nAudioTests.swift: Configuration decoding tests

Architecture: mel [1,T,128] → SubSampleConv (4x) → 12 Conformer blocks → temporal reduction (4x) → AudioEmbedder (1536→2048) → LM token stream

Tested on iPhone: 15.8s audio → 99 tokens in 0.48s

vahsaechao · 2026-04-15T09:06:31Z

Thanks @davidkoski, I've synced with @antmanler on #192 and we agreed to merge both PRs independently and consolidate the audio extractor PR as a follow-up. Rebased on main.

- Gemma3nAudioAttention.swift: relative position embeddings, chunked attention
- Gemma3nAudioNorm.swift: cumulative group normalization
- Gemma3nAudioConv.swift: SSCP conv blocks, subsampling projection
- Gemma3nAudio.swift: conformer blocks, top-level encoder
- Gemma3nVLM.swift: preconditionFailure on unimplemented processor stub
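As a rough picture of the chunked attention listed for Gemma3nAudioAttention.swift, the sketch below computes, for each chunk of query frames, the range of key frames it would need under a fixed left and right context. The function name, parameters, and example values are hypothetical, and the PR's actual chunk size and context widths are not stated in this thread.

```swift
// For each chunk of queries, return the key range that chunk attends to.
// Hypothetical helper; parameter values below are illustrative only.
func chunkKeyRanges(length: Int, chunkSize: Int, leftContext: Int, rightContext: Int)
    -> [(queries: Range<Int>, keys: Range<Int>)]
{
    stride(from: 0, to: length, by: chunkSize).map {
        start -> (queries: Range<Int>, keys: Range<Int>) in
        let end = min(start + chunkSize, length)
        // The first query in the chunk reaches back leftContext frames;
        // the last query reaches forward rightContext frames.
        let keyStart = max(0, start - leftContext)
        let keyEnd = min(length, end + rightContext)
        return (queries: start..<end, keys: keyStart..<keyEnd)
    }
}

let ranges = chunkKeyRanges(length: 10, chunkSize: 4, leftContext: 2, rightContext: 0)
for r in ranges {
    print("queries \(r.queries) attend to keys \(r.keys)")
}
```

Each chunk only touches a bounded window of keys, so attention cost grows linearly with sequence length instead of quadratically, which is the usual motivation for chunked local attention on long audio.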

vahsaechao · 2026-04-15T18:49:59Z

COMMENT — one guard needed

Impressive first audio modality contribution. Thorough code comments explaining the math. Clean separation of audio config, encoder, and VLM wrapper.

Requested fixes:

[MEDIUM] Processor contains a stub for mel spectrogram preprocessing. Please add a preconditionFailure or fatalError guard on the stub path to prevent silent misuse before the follow-up PR lands.

[MEDIUM] Gemma3nAudio.swift is 991 lines long. Consider splitting the attention, norm, conv, and encoder classes into separate files.

[LOW] No integration test that loads real weights

Thanks for the review! Addressed all three items in the latest push:

[MEDIUM] Processor stub guard. Replaced the silent no-op in Gemma3nAudioVLMProcessor.prepare() with preconditionFailure.

[MEDIUM] File split. Gemma3nAudio.swift is now split into:

Gemma3nAudioAttention.swift - relative position embeddings, chunked attention
Gemma3nAudioNorm.swift - cumulative group normalization
Gemma3nAudioConv.swift - SSCP conv blocks, subsampling projection
Gemma3nAudio.swift - conformer blocks, top-level encoder

[LOW] Tests. All 3 existing Swift tests pass after the split.
Encoder output shape verified: [1, 3, 1536], correct after 16x temporal reduction.

davidkoski · 2026-05-13T23:12:17Z

OK, I think we probably need to make it through #194 first then -- we need the basic support for audio added (e.g. in UserInput).

davidkoski · 2026-05-18T18:13:04Z

vahsaechao wants to merge 2 commits into ml-explore:main from vahsaechao:model/mlx-swift-gemma3n-e4b
