Commit d77ea77
Add Gemma 3n E4B audio encoder (Conformer) support
12-layer USM-style Conformer audio encoder for Gemma 3n E4B multimodal models.
Enables on-device audio encoding: mel spectrogram → Conformer → embedder pipeline.
New files:
- Gemma3nAudio.swift: Full Conformer port (chunked local attention, depthwise conv1d,
cumulative group norm, temporal reduction, sub-sample conv projection)
- Gemma3nAudioConfig.swift: 28 audio encoder configuration parameters
- Gemma3nVLM.swift: Top-level VLM wrapper with audio embedding injection
- Gemma3nAudioTests.swift: Configuration decoding tests
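
A configuration decoding test of the kind listed above might look roughly like the following sketch. The struct `AudioConfigSketch` and its JSON key names are hypothetical stand-ins, not the actual contents of Gemma3nAudioConfig.swift; only the values (128 mel bins, 1536-wide encoder output, 12 Conformer layers) come from this commit message.

```swift
import Foundation

// Hypothetical three-field subset of the audio encoder config.
// The real Gemma3nAudioConfig has 28 parameters; the key names here are assumptions.
struct AudioConfigSketch: Codable, Equatable {
    let inputFeatSize: Int      // mel bins per frame
    let hiddenSize: Int         // Conformer width fed to the AudioEmbedder
    let numHiddenLayers: Int    // number of Conformer blocks

    enum CodingKeys: String, CodingKey {
        case inputFeatSize = "input_feat_size"
        case hiddenSize = "hidden_size"
        case numHiddenLayers = "conf_num_hidden_layers"
    }
}

// Inline JSON standing in for the audio section of a model config file.
let sampleJSON = Data("""
{"input_feat_size": 128, "hidden_size": 1536, "conf_num_hidden_layers": 12}
""".utf8)

// try! keeps the sketch short; a real test would use XCTAssertNoThrow or `try` in a throwing test.
let decodedConfig = try! JSONDecoder().decode(AudioConfigSketch.self, from: sampleJSON)
print(decodedConfig)
```

Explicit `CodingKeys` keep the Swift property names in camelCase while matching snake_case keys, which avoids relying on a decoder-wide key strategy.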
Architecture: mel [1,T,128] → SubSampleConv (4x) → 12 Conformer blocks →
temporal reduction (4x) → AudioEmbedder (1536→2048) → LM token stream
Tested on iPhone: 15.8 s audio → 99 tokens in 0.48 s
1 parent 8c9dd63
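
The two 4x stages in the architecture line imply a 16x overall reduction from mel frames to audio tokens. A minimal sketch of that arithmetic follows, assuming a 10 ms mel hop and ceiling rounding; neither is stated in this commit, and the type and function names are illustrative only. Under those assumptions the result agrees with the reported figure of 99 tokens for 15.8 s of audio.

```swift
// Sketch only: the 10 ms hop and the ceiling rounding are assumptions,
// and these names do not come from Gemma3nAudio.swift.
struct AudioPipelineShape {
    var melHopSeconds: Double = 0.010      // assumed hop between mel frames
    var subSampleFactor: Int = 4           // SubSampleConv (4x)
    var temporalReductionFactor: Int = 4   // temporal reduction (4x)

    /// Number of mel frames T for a clip of the given duration.
    func melFrames(forSeconds seconds: Double) -> Int {
        Int((seconds / melHopSeconds).rounded())
    }

    /// Audio tokens emitted into the LM stream, rounding partial windows up.
    func tokenCount(melFrames: Int) -> Int {
        let totalStride = subSampleFactor * temporalReductionFactor
        return (melFrames + totalStride - 1) / totalStride
    }
}

let shape = AudioPipelineShape()
let frames = shape.melFrames(forSeconds: 15.8)    // 1580 frames at a 10 ms hop
let tokens = shape.tokenCount(melFrames: frames)  // 99 tokens
let realTimeFactor = 15.8 / 0.48                  // audio seconds per second of encoding
print(frames, tokens, realTimeFactor)
```

The reported timing corresponds to roughly 33x faster than real time for the encoder on that device.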
5 files changed
Lines changed: 2025 additions & 0 deletions
File tree
- Libraries/MLXVLM
- Models
- Tests/MLXLMTests