Add Gemma 3n E4B audio encoder (Conformer) support#194
vahsaechao
left a comment
This adds audio modality support to mlx-swift-lm. The Conformer encoder is fully ported and tested on an iPhone 17 Pro Max.
See also #192. Maybe work with @antmanler on combining these on top of #180 (merged into main)? Thank you!
12-layer USM-style Conformer audio encoder for Gemma 3n E4B multimodal models. Enables on-device audio encoding: mel spectrogram → Conformer → embedder pipeline.

New files:
- Gemma3nAudio.swift: Full Conformer port (chunked local attention, depthwise conv1d, cumulative group norm, temporal reduction, sub-sample conv projection)
- Gemma3nAudioConfig.swift: 28 audio encoder configuration parameters
- Gemma3nVLM.swift: Top-level VLM wrapper with audio embedding injection
- Gemma3nAudioTests.swift: Configuration decoding tests

Architecture: mel [1,T,128] → SubSampleConv (4x) → 12 Conformer blocks → temporal reduction (4x) → AudioEmbedder (1536→2048) → LM token stream

Tested on iPhone: 15.8s audio → 99 tokens in 0.48s
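As an illustration of what a configuration decoding test exercises, here is a minimal sketch of a Codable audio config. The struct name `AudioConfigSketch`, its four fields, and the JSON keys are assumptions made up for this example; the PR's Gemma3nAudioConfig has 28 parameters that are not listed on this page. The values 12, 128, 1536, and 4 are taken from the architecture line in the commit message.

```swift
import Foundation

// Hypothetical, trimmed-down stand-in for Gemma3nAudioConfig.
// Field names and JSON keys are assumptions, not the PR's actual definitions.
struct AudioConfigSketch: Codable, Equatable {
    let numLayers: Int        // 12 Conformer blocks
    let melBins: Int          // 128 mel features per frame
    let hiddenSize: Int       // 1536-wide encoder output
    let reductionFactor: Int  // 4x temporal reduction

    // Map snake_case JSON keys onto Swift property names.
    enum CodingKeys: String, CodingKey {
        case numLayers = "num_layers"
        case melBins = "mel_bins"
        case hiddenSize = "hidden_size"
        case reductionFactor = "reduction_factor"
    }
}

// Decode a config from a JSON string; throws if a key is missing or mistyped.
func decodeAudioConfig(from json: String) throws -> AudioConfigSketch {
    try JSONDecoder().decode(AudioConfigSketch.self, from: Data(json.utf8))
}

let sampleJSON = """
{"num_layers": 12, "mel_bins": 128, "hidden_size": 1536, "reduction_factor": 4}
"""
let config = try decodeAudioConfig(from: sampleJSON)
print(config.numLayers, config.hiddenSize)
```

A decoding test in this style would assert on each decoded field and on the error thrown when a required key is absent.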
Force-pushed: d77ea77 to 6eb0f22
Thanks @davidkoski, I've synced with @antmanler on #192 and we agreed to merge both PRs independently and consolidate the audio extractor PR as a follow-up. Rebased on main.
- Gemma3nAudioAttention.swift: relative position embeddings, chunked attention
- Gemma3nAudioNorm.swift: cumulative group normalization
- Gemma3nAudioConv.swift: SSCP conv blocks, subsampling projection
- Gemma3nAudio.swift: conformer blocks, top-level encoder
- Gemma3nVLM.swift: preconditionFailure on unimplemented processor stub
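For readers unfamiliar with cumulative group normalization, the following is a simplified sketch of the general idea, not the code in Gemma3nAudioNorm.swift. It assumes a single group, no learned scale, no padding mask, and a plain `[time][channel]` array instead of an MLX tensor; the function name `cumulativeNorm` is hypothetical. At each frame, the mean and variance are accumulated over all frames up to and including the current one, so the statistics never depend on future frames.

```swift
// Simplified cumulative normalization over a [time][channel] array.
// Assumptions: one group, no learned scale, no masking of padded frames.
func cumulativeNorm(_ x: [[Double]], eps: Double = 1e-3) -> [[Double]] {
    var cumSum = 0.0      // running sum of all values seen so far
    var cumSqDiff = 0.0   // running sum of squared deviations
    var cumCount = 0.0    // running count of values seen so far
    var out: [[Double]] = []
    for frame in x {
        cumSum += frame.reduce(0, +)
        cumCount += Double(frame.count)
        let count = max(cumCount, 1)  // guard against division by zero
        let mean = cumSum / count
        // Deviations of the current frame from the cumulative mean.
        cumSqDiff += frame.reduce(0) { $0 + ($1 - mean) * ($1 - mean) }
        let variance = cumSqDiff / count
        let scale = 1 / (variance + eps).squareRoot()
        out.append(frame.map { ($0 - mean) * scale })
    }
    return out
}

let normalized = cumulativeNorm([[1, 3], [2, 2]], eps: 0)
print(normalized)
```

Because each frame only uses statistics from earlier frames, this style of normalization suits streaming or chunked processing.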
Thanks for the review! Addressed all three items in the latest push:
[MEDIUM] Processor stub guard. Replaced the silent no-op in Gemma3nAudioVLMProcessor.prepare() with preconditionFailure.
[MEDIUM] File split. Split the attention, norm, conv, and encoder classes out of Gemma3nAudio.swift into separate files.
[LOW] Tests. All 3 existing Swift tests pass after the split.
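The stub guard can be pictured roughly as follows. This is a sketch under assumptions: the type `AudioProcessorSketch`, the `prepare(waveform:)` signature, and the message text are hypothetical and do not reproduce Gemma3nAudioVLMProcessor.

```swift
// Hypothetical stand-in for the processor stub; not the PR's actual type.
struct AudioProcessorSketch {
    // A silent no-op here would return empty features and hide the missing
    // preprocessing step. Trapping makes the unimplemented path obvious.
    func prepare(waveform: [Float]) -> [[Float]] {
        // preconditionFailure returns Never, so no return value is needed.
        preconditionFailure("Audio preprocessing (WAV → mel) is not implemented yet")
    }
}

// Constructing the processor is safe; only calling prepare(waveform:) traps.
let processor = AudioProcessorSketch()
print(type(of: processor))
```

`preconditionFailure` stops execution in both debug and optimized (`-O`) builds, whereas `assertionFailure` is only checked in debug builds, which is presumably why it is the stricter choice for a path that must not be reached silently.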
OK, I think we probably need to make it through #194 first then. We need the basic support for audio added (e.g. in UserInput).
See also #298
Proposed changes
Adds the audio encoder for Gemma 3n E4B multimodal models: a 12-layer USM-style Conformer that converts mel spectrograms into embeddings for the language model.
This is the first audio modality support in mlx-swift-lm. The encoder was ported from the Hugging Face Python implementation and tested on an iPhone 17 Pro Max.
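Architecture:
mel [1,T,128] → SubSampleConv (4x) → 12 Conformer blocks → temporal reduction (4x) → AudioEmbedder (1536→2048) → LM token stream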
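The two 4x stages in this PR's pipeline (sub-sample conv, then temporal reduction) shrink the time axis by 16x overall, which is consistent with the reported 15.8s → 99 tokens. A back-of-the-envelope sketch follows; the 10 ms mel hop and the round-up behaviour at each stage are assumptions that the PR does not state, and the function names are hypothetical.

```swift
// Ceiling division for positive integers.
func ceilDiv(_ a: Int, _ b: Int) -> Int { (a + b - 1) / b }

// Estimate how many audio tokens reach the language model.
// Assumptions: one mel frame per 10 ms hop, and each stage rounds up.
func estimatedAudioTokens(durationMs: Int, hopMs: Int = 10,
                          subsample: Int = 4, reduction: Int = 4) -> Int {
    let melFrames = durationMs / hopMs                // T in mel [1, T, 128]
    let afterSubsample = ceilDiv(melFrames, subsample) // SubSampleConv (4x)
    return ceilDiv(afterSubsample, reduction)          // temporal reduction (4x)
}

let tokens = estimatedAudioTokens(durationMs: 15_800)
print(tokens) // 99
```

Under these assumptions each token covers about 160 ms of audio.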
New files:
- Gemma3nAudio.swift - Full Conformer encoder: chunked local attention, depthwise conv1d, cumulative group norm, temporal reduction, sub-sample conv projection
- Gemma3nAudioConfig.swift - 28 audio encoder configuration parameters with Codable
- Gemma3nVLM.swift - Top-level VLM wrapper with audio embedding injection and multimodal embedder
- Gemma3nAudioTests.swift - Configuration decoding and encoder shape tests
- VLMModelFactory.swift
Updated files:
- Gemma3nAudio.swift - split attention, norm, conv, and encoder classes into separate files
- Gemma3nAudioAttention.swift: relative position embeddings, chunked attention
- Gemma3nAudioNorm.swift: cumulative group normalization
- Gemma3nAudioConv.swift: SSCP conv blocks, subsampling projection
- Gemma3nAudio.swift: conformer blocks, top-level encoder
- Gemma3nVLM.swift: preconditionFailure on unimplemented processor stub
Tests:
iPhone 17 Pro Max results: 15.8s audio → 99 tokens in 0.48s (Apple Neural Engine)
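To put the reported numbers in perspective, the arithmetic below derives the real-time factor and token rate from the three figures in the result line. Only the constants come from the PR; the variable names are illustrative.

```swift
// Figures reported in the PR.
let audioSeconds = 15.8
let encodeSeconds = 0.48
let tokenCount = 99.0

// Seconds of audio encoded per second of wall-clock time (about 32.9).
let realTimeFactor = audioSeconds / encodeSeconds
// Tokens produced per second of wall-clock time (about 206).
let tokensPerSecond = tokenCount / encodeSeconds
// Seconds of audio represented by one token (about 0.16).
let audioSecondsPerToken = audioSeconds / tokenCount

print(realTimeFactor, tokensPerSecond, audioSecondsPerToken)
```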
Note: The processor includes a stub for mel spectrogram preprocessing. A follow-up PR can add the full audio processor pipeline (WAV → mel → encoder). The encoder itself is complete and tested.
Checklist
I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes