Commit d77ea77

Add Gemma 3n E4B audio encoder (Conformer) support

12-layer USM-style Conformer audio encoder for Gemma 3n E4B multimodal models. Enables on-device audio encoding: mel spectrogram → Conformer → embedder pipeline.

New files:
- Gemma3nAudio.swift: Full Conformer port (chunked local attention, depthwise conv1d, cumulative group norm, temporal reduction, sub-sample conv projection)
- Gemma3nAudioConfig.swift: 28 audio encoder configuration parameters
- Gemma3nVLM.swift: Top-level VLM wrapper with audio embedding injection
- Gemma3nAudioTests.swift: Configuration decoding tests

Architecture:
mel [1,T,128] → SubSampleConv (4x) → 12 Conformer blocks → temporal reduction (4x) → AudioEmbedder (1536→2048) → LM token stream

Tested on iPhone: 15.8s audio → 99 tokens in 0.48s
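As a rough sanity check on the pipeline described above, the token count can be estimated from the two 4x reductions. This is only a sketch. The function name, the 10 ms mel hop, and the use of ceiling division at each stage are assumptions for illustration, not details taken from the commit. The real frame count depends on the mel front end and on the padding used by the sub-sample convolution.

```swift
// Hypothetical helper, not part of the commit.
// Assumes a 10 ms mel hop and ceiling division at each reduction stage.
func estimatedAudioTokens(
    seconds: Double,
    hopSeconds: Double = 0.010,   // assumed mel frame hop
    subsample: Int = 4,           // SubSampleConv (4x)
    reduction: Int = 4            // temporal reduction (4x)
) -> Int {
    // Number of mel frames T in the [1, T, 128] input.
    let frames = Int((seconds / hopSeconds).rounded())
    // Frames remaining after the sub-sample convolution.
    let afterConv = (frames + subsample - 1) / subsample
    // Frames remaining after temporal reduction, i.e. tokens for the LM stream.
    return (afterConv + reduction - 1) / reduction
}

let tokens = estimatedAudioTokens(seconds: 15.8)
print(tokens)
```

Under these assumptions, 15.8 s gives 1580 frames, 395 after the sub-sample convolution, and 99 after temporal reduction, which agrees with the token count reported for the iPhone test.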

1 parent 8c9dd63, commit d77ea77

5 files changed

Libraries/MLXVLM
- Models
- VLMModelFactory.swift
Tests/MLXLMTests
- Gemma3nAudioTests.swift
