Add Gemma 3n E4B audio encoder (Conformer) support#194

Open
vahsaechao wants to merge 2 commits into
ml-explore:mainfrom
vahsaechao:model/mlx-swift-gemma3n-e4b

@vahsaechao vahsaechao commented Apr 7, 2026

Proposed changes

Adds the audio encoder for Gemma 3n E4B multimodal models. It is a 12-layer USM-style Conformer that converts mel spectrograms into embeddings for the language model.

This is the first audio modality support in mlx-swift-lm. The encoder was ported from the HuggingFace Python implementation and tested on an iPhone 17 Pro Max.

Architecture:

mel [1, T, 128] → SubSampleConvProjection (4x reduction)
  → 12 Conformer blocks (chunked local attention, depthwise conv1d, cumulative group norm)
  → Temporal reduction (4x)
  → AudioEmbedder (1536 → 2048)
  → LM token stream
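
The two 4x stages compose into a 16x reduction along the time axis. A minimal sketch of that arithmetic follows. It assumes each stage rounds up (the real convolution padding may round differently) and a 10 ms mel hop, neither of which is stated in this PR; `expectedAudioTokens` is a hypothetical helper, not part of the change.

```swift
/// Hypothetical helper: estimates the encoder output length for `melFrames` input frames.
/// Assumes each 4x stage rounds up; the actual conv padding may behave differently.
func expectedAudioTokens(melFrames: Int, subsample: Int = 4, reduction: Int = 4) -> Int {
    precondition(melFrames >= 0 && subsample > 0 && reduction > 0)
    // Stage 1: SubSampleConvProjection (ceiling division by the subsample factor).
    let afterSubsample = (melFrames + subsample - 1) / subsample
    // Stage 2: temporal reduction after the Conformer blocks.
    return (afterSubsample + reduction - 1) / reduction
}

// 15.8 s of audio at an assumed 10 ms hop is 1580 mel frames.
let estimatedTokens = expectedAudioTokens(melFrames: 1580)
print(estimatedTokens)
```

Under these assumptions the estimate is 99, which agrees with the token count reported for the 15.8 s clip.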

New files:

  • Gemma3nAudio.swift - Full Conformer encoder: chunked local attention, depthwise conv1d, cumulative group norm, temporal reduction, sub sample conv projection
  • Gemma3nAudioConfig.swift - 28 audio encoder configuration parameters, decoded via Codable
  • Gemma3nVLM.swift - Top level VLM wrapper with audio embedding injection and multimodal embedder
  • Gemma3nAudioTests.swift - Configuration decoding and encoder shape tests
  • Factory registration in VLMModelFactory.swift
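
As an illustration of Codable-based config decoding, here is a minimal sketch with four fields. The struct name and the JSON key names are assumptions modelled on Hugging Face style snake_case configs; the PR's actual struct has 28 parameters and may name them differently.

```swift
import Foundation

/// Hypothetical subset of the audio encoder configuration.
struct AudioConfigSketch: Codable {
    let hiddenSize: Int
    let inputFeatSize: Int
    let numHiddenLayers: Int
    let reductionFactor: Int

    // Key names are assumptions, not confirmed by this PR.
    enum CodingKeys: String, CodingKey {
        case hiddenSize = "hidden_size"
        case inputFeatSize = "input_feat_size"
        case numHiddenLayers = "conf_num_hidden_layers"
        case reductionFactor = "conf_reduction_factor"
    }
}

// Values taken from the architecture summary: 1536-dim encoder output,
// 128 mel bins, 12 Conformer blocks, 4x temporal reduction.
let configJSON = """
    {"hidden_size": 1536, "input_feat_size": 128,
     "conf_num_hidden_layers": 12, "conf_reduction_factor": 4}
    """
let config = try JSONDecoder().decode(AudioConfigSketch.self, from: Data(configJSON.utf8))
print(config.hiddenSize, config.numHiddenLayers)
```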

Updated files:

  • Gemma3nAudio.swift - attention, norm, conv, and encoder classes split into separate files:
  • Gemma3nAudioAttention.swift - relative position embeddings, chunked attention
  • Gemma3nAudioNorm.swift - cumulative group normalization
  • Gemma3nAudioConv.swift - SSCP conv blocks, subsampling projection
  • Gemma3nAudio.swift - conformer blocks, top-level encoder
  • Gemma3nVLM.swift - preconditionFailure on unimplemented processor stub
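
For readers unfamiliar with cumulative group normalization, the idea can be sketched on plain arrays. This is a simplified illustration, not the code in Gemma3nAudioNorm.swift: it assumes a single group over all channels, omits masking and the learned scale, and uses an assumed `eps` default.

```swift
/// Sketch of a cumulative (causal) group norm over a [T][C] feature matrix.
/// Statistics for frame t are accumulated over frames 0...t and all channels,
/// so no frame ever depends on later frames.
func cumulativeGroupNorm(_ x: [[Float]], eps: Float = 1e-3) -> [[Float]] {
    var runningSum: Float = 0
    var runningSqDiff: Float = 0
    var count: Float = 0
    var output: [[Float]] = []
    for frame in x {
        count += Float(frame.count)
        for value in frame { runningSum += value }
        let mean = runningSum / count
        // Squared deviations use the running mean available at this frame.
        for value in frame { runningSqDiff += (value - mean) * (value - mean) }
        let invStd = 1 / (runningSqDiff / count + eps).squareRoot()
        output.append(frame.map { ($0 - mean) * invStd })
    }
    return output
}

let normalized = cumulativeGroupNorm([[1, 3], [2, 2]])
print(normalized)
```

Because the statistics only look backwards, this kind of norm suits streaming-style encoders where padding at the end of a clip should not change earlier frames.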

Tests:

Test Suite 'Selected tests' started at 2026-04-15 11:32:48.538.
Test Suite 'mlx-swift-lmPackageTests.xctest' started at 2026-04-15 11:32:48.538.
Test Suite 'Gemma3nAudioTests' started at 2026-04-15 11:32:48.538.
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioConfigDecoding]' started.
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioConfigDecoding]' passed (0.001 seconds).
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioEncoderShapes]' started.
Test Case '-[MLXLMTests.Gemma3nAudioTests testAudioEncoderShapes]' passed (0.054 seconds).
Test Case '-[MLXLMTests.Gemma3nAudioTests testMultimodalConfigDecoding]' started.
Test Case '-[MLXLMTests.Gemma3nAudioTests testMultimodalConfigDecoding]' passed (0.000 seconds).
Test Suite 'Gemma3nAudioTests' passed at 2026-04-15 11:32:48.593.
	 Executed 3 tests, with 0 failures (0 unexpected) in 0.055 (0.055) seconds
Test Suite 'mlx-swift-lmPackageTests.xctest' passed at 2026-04-15 11:32:48.593.
	 Executed 3 tests, with 0 failures (0 unexpected) in 0.055 (0.055) seconds
Test Suite 'Selected tests' passed at 2026-04-15 11:32:48.593.
	 Executed 3 tests, with 0 failures (0 unexpected) in 0.055 (0.056) seconds
Audio encoder output shape: [1, 3, 1536]
Audio encoder mask shape: [1, 3]
◇ Test run started.
↳ Testing Library Version: 1501
↳ Target Platform: arm64e-apple-macos14.0
✔ Test run with 0 tests in 0 suites passed after 0.001 seconds.

iPhone 17 Pro Max results: 15.8s audio → 99 tokens in 0.48s (Apple Neural Engine)
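
The derived rates implied by these numbers can be worked out directly. The sketch below only restates the reported figures; the variable names are illustrative.

```swift
// Reported figures from the on-device run.
let audioSeconds = 15.8
let encodeSeconds = 0.48
let tokenCount = 99.0

// Seconds of audio encoded per second of wall-clock time.
let realTimeFactor = audioSeconds / encodeSeconds
// Audio tokens emitted per second of input audio.
let tokensPerAudioSecond = tokenCount / audioSeconds
// Milliseconds of audio represented by one token.
let millisecondsPerToken = 1000 * audioSeconds / tokenCount

print(realTimeFactor, tokensPerAudioSecond, millisecondsPerToken)
```

That is roughly 33x faster than real time, with each token covering about 160 ms of audio.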

Note: The processor includes a stub for mel spectrogram preprocessing. A follow-up PR can add the full audio processor pipeline (WAV → mel → encoder). The encoder itself is complete and tested.

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Author

@vahsaechao vahsaechao left a comment

This adds the audio modality to mlx-swift-lm. The Conformer encoder is fully ported and tested on an iPhone 17 Pro Max.

@vahsaechao vahsaechao marked this pull request as ready for review April 7, 2026 23:08
@davidkoski
Collaborator

See also #192, maybe work with @antmanler on combining these on top of #180 (merged into main)? Thank you!

12-layer USM-style Conformer audio encoder for Gemma 3n E4B multimodal models.
Enables on-device audio encoding: mel spectrogram → Conformer → embedder pipeline.

New files:
- Gemma3nAudio.swift: Full Conformer port (chunked local attention, depthwise conv1d,
  cumulative group norm, temporal reduction, sub-sample conv projection)
- Gemma3nAudioConfig.swift: 28 audio encoder configuration parameters
- Gemma3nVLM.swift: Top-level VLM wrapper with audio embedding injection
- Gemma3nAudioTests.swift: Configuration decoding tests

Architecture: mel [1,T,128] → SubSampleConv (4x) → 12 Conformer blocks →
temporal reduction (4x) → AudioEmbedder (1536→2048) → LM token stream

Tested on iPhone: 15.8s audio → 99 tokens in 0.48s
@vahsaechao vahsaechao force-pushed the model/mlx-swift-gemma3n-e4b branch from d77ea77 to 6eb0f22 Compare April 15, 2026 08:54
@vahsaechao
Author

Thanks @davidkoski, I've synced with @antmanler on #192 and we agreed to merge both PRs independently and consolidate the audio extractor PR as a followup. Rebased on main.

JaeminKim-amoz

This comment was marked as spam.

  - Gemma3nAudioAttention.swift: relative position embeddings, chunked attention
  - Gemma3nAudioNorm.swift: cumulative group normalization
  - Gemma3nAudioConv.swift: SSCP conv blocks, subsampling projection
  - Gemma3nAudio.swift: conformer blocks, top-level encoder
  - Gemma3nVLM.swift: preconditionFailure on unimplemented processor stub
@vahsaechao
Author

COMMENT — one guard needed

Impressive first audio modality contribution. Thorough code comments explaining the math. Clean separation of audio config, encoder, and VLM wrapper.

Requested fixes:

  • [MEDIUM] Processor contains a stub for mel spectrogram preprocessing — please add a preconditionFailure or fatalError guard on the stub path to prevent silent misuse before the follow-up PR lands
  • [MEDIUM] Gemma3nAudio.swift at 991 lines — consider splitting attention, norm, conv, and encoder classes into separate files
  • [LOW] No integration test that loads real weights

Thanks for the review! Addressed all three items in the latest push:

[MEDIUM] Processor stub guard. Replaced the silent no-op in Gemma3nAudioVLMProcessor.prepare() with preconditionFailure.
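
A minimal sketch of this kind of guard is shown below. The types are hypothetical stand-ins, not the PR's `Gemma3nAudioVLMProcessor`; the point is only that the unimplemented path traps loudly instead of returning empty features.

```swift
/// Hypothetical input type: precomputed mel features or raw PCM samples.
enum AudioInputSketch {
    case melFeatures([[Float]])  // [T][melBins]
    case rawSamples([Float])
}

/// Hypothetical processor used only to illustrate the guard.
struct AudioProcessorSketch {
    func prepare(_ input: AudioInputSketch) -> [[Float]] {
        switch input {
        case .melFeatures(let mel):
            // Already preprocessed, pass through unchanged.
            return mel
        case .rawSamples:
            // The mel pipeline is not implemented; fail fast rather than
            // silently producing unusable output.
            preconditionFailure("Mel spectrogram preprocessing is not implemented yet")
        }
    }
}

let passthrough = AudioProcessorSketch().prepare(.melFeatures([[0.0, 1.0]]))
print(passthrough.count)
```

Note that `preconditionFailure` terminates the process even in optimized builds, so callers cannot recover from it; a throwing API would be the alternative if recovery mattered.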

[MEDIUM] File split. Gemma3nAudio.swift is now split into:

  • Gemma3nAudioAttention.swift: relative position embeddings, chunked attention
  • Gemma3nAudioNorm.swift: cumulative group normalization
  • Gemma3nAudioConv.swift: SSCP conv blocks, subsampling projection
  • Gemma3nAudio.swift: conformer blocks, top-level encoder
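
For context on the attention file, local attention restricts each query frame to a window of nearby key frames. The sketch below builds the dense mask for that pattern; the function and parameter names are hypothetical, the window sizes are illustrative, and it deliberately ignores chunking, which a real implementation would use to avoid materializing the full matrix.

```swift
/// Sketch of a local attention mask: query i may attend key j when
/// i - pastHorizon <= j <= i + futureHorizon.
func localAttentionMask(length: Int, pastHorizon: Int, futureHorizon: Int) -> [[Bool]] {
    (0..<length).map { i in
        (0..<length).map { j in
            j >= i - pastHorizon && j <= i + futureHorizon
        }
    }
}

// Each frame sees itself and one frame of history, and no future frames.
let attentionMask = localAttentionMask(length: 4, pastHorizon: 1, futureHorizon: 0)
for row in attentionMask { print(row) }
```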

[LOW] Tests. All 3 existing Swift tests pass after the split.
Encoder output shape verified: [1, 3, 1536], as expected after the 16x temporal reduction.

@davidkoski
Collaborator

OK, I think we probably need to make it through #194 first then -- we need the basic support for audio added (e.g. in UserInput).

@davidkoski davidkoski mentioned this pull request May 18, 2026
4 tasks
@davidkoski
Collaborator

See also #298


3 participants