whisper : optional ANEForge encoder backend (Apple Neural Engine) #3905
sbryngelson wants to merge 1 commit
Runs the Whisper encoder directly on the Apple Neural Engine via ANEForge, an
alternative to the CoreML encoder backend. About 2x faster than the CoreML encoder
and faster than the Metal GPU encoder on M-series, at ~5x lower energy, with the
same transcripts (cosine 0.999 vs the reference encoder).
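As a rough illustration of the cosine figure quoted above, here is a minimal pure-Python sketch of cosine similarity between two flattened encoder outputs. The function name and the choice to flatten the whole output into one vector are assumptions for illustration; the PR does not say how its 0.999 figure was aggregated.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Hypothetical helper: here `a` and `b` stand for two encoder outputs
    flattened to 1-D (an assumption, not the PR's stated procedure).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A value near 1.0 means the two outputs point in almost the same direction, so small fp16 rounding differences between backends should barely move it.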
Mirrors the existing CoreML/OpenVINO seam: fills embd_enc from the external aneforge
package via the same whisper_encode_external path, gated at runtime by the
ANEFORGE_ENCODER env var. With the variable unset, the build behaves exactly like
stock whisper.cpp: the backend dlopens its dispatch dylib only when enabled, so there
is no new link-time dependency. Encoder only; the decoder is untouched.
Adds src/aneforge/whisper-aneforge.{h,cpp}, one line to src/CMakeLists.txt, and ~30
lines to src/whisper.cpp (include, state field, encode branch, init, free).
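The gating pattern described above (do nothing unless the env var is set, and load the library only at that point) can be sketched as follows. This is a Python illustration of the pattern only; the PR's backend is C++ and is not reproduced here. The library file name is a placeholder, and treating any non-empty value of ANEFORGE_ENCODER as "enabled" is an assumption, since the accepted values are not stated.

```python
import ctypes
import os
from typing import Callable, Mapping, Optional

ENV_VAR = "ANEFORGE_ENCODER"
# Placeholder name for illustration; the real dylib name is not given in the PR.
HYPOTHETICAL_DYLIB = "libaneforge_dispatch.dylib"


def backend_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Assumption: any non-empty value of the env var enables the backend."""
    return bool(environ.get(ENV_VAR))


def load_backend(
    environ: Mapping[str, str] = os.environ,
    loader: Callable[[str], object] = ctypes.CDLL,
) -> Optional[object]:
    """Return a library handle, or None to stay on the stock encoder path."""
    if not backend_enabled(environ):
        return None  # the loader is never called, so nothing new is loaded
    try:
        return loader(HYPOTHETICAL_DYLIB)  # ctypes.CDLL wraps dlopen on macOS
    except OSError:
        return None  # library missing: fall back to the stock path
```

Because the load happens at run time rather than link time, a build without the library installed still starts and behaves as before.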
Quite interesting work.
Have you investigated how the number of tiles affects the performance? Is 3 tiles always the best option for all model sizes and hardware?
@ggerganov Sharp question, and there was something to learn here for both of us. The count isn't fixed at 3: it's n_tiles = max(1, (S+256)//512), so it scales with the query length, and the encoder's S=1500 gets 3.

I swept it on an M5 Pro across the model sizes (S=1500, head_dim=64, n_head 6 to 20), timing the attention core as one fused program. 3 is close but not optimal: 5 is consistently 5-10% faster here. The first-order effect, though, is tiling at all. At high head counts the untiled [H,1500,1500] score is about 3x slower (it stops pipelining on the engine), while for tiny and base it fits in cache and tiling is roughly flat. The optimum is broad (3 to 8 tiles are all within 10%), and it is bounded by L2 residency and pipelining, so it moves with the chip, as you noticed.

Rather than a constant, the tile count is now autotuned in ANEForge (I merged a PR for this): it measures the best count once per (chip, S, T, heads, head_dim), caches it, and falls back to the heuristic until a tuned value exists. On the M5 Pro it picks 5; on an M1 or M2 it may pick something else, which is fine, since the autotuner picks that up with no code change.
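To make the heuristic and the autotune fallback from the reply above concrete, here is a minimal sketch. Only the formula max(1, (S+256)//512) and the cache key (chip, S, T, heads, head_dim) come from the comment; the class name, the `measure` callback, and the candidate range of 1 to 8 tiles are hypothetical and do not reflect ANEForge's actual API.

```python
from typing import Callable, Dict, Iterable, Tuple

# Cache key as described in the comment: (chip, S, T, heads, head_dim).
TuneKey = Tuple[str, int, int, int, int]


def heuristic_n_tiles(S: int) -> int:
    """Fallback tile count from the quoted formula."""
    return max(1, (S + 256) // 512)


class TileAutotuner:
    """Hypothetical sketch: measure once per key, cache, else use the heuristic."""

    def __init__(
        self,
        measure: Callable[[TuneKey, int], float],  # returns a latency for n tiles
        candidates: Iterable[int] = range(1, 9),   # assumed search range
    ) -> None:
        self._measure = measure
        self._candidates = tuple(candidates)
        self._cache: Dict[TuneKey, int] = {}

    def n_tiles(self, key: TuneKey) -> int:
        """Tuned value if one exists, otherwise the heuristic for S = key[1]."""
        tuned = self._cache.get(key)
        return tuned if tuned is not None else heuristic_n_tiles(key[1])

    def tune(self, key: TuneKey) -> int:
        """Time every candidate once and cache the fastest; later calls are free."""
        if key not in self._cache:
            self._cache[key] = min(
                self._candidates, key=lambda n: self._measure(key, n)
            )
        return self._cache[key]
```

For the encoder's S=1500 the heuristic gives (1500+256)//512 = 3, which matches the count in the question; a tuned entry then overrides it per chip without touching the calling code.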
As offered in #3903, this adds an optional encoder backend that runs the Whisper encoder directly on the Apple Neural Engine through ANEForge, as an alternative to the CoreML encoder. On M-series it is about 2x faster than the CoreML encoder and faster than the Metal GPU encoder, at ~5x lower energy, with the same transcripts.
Per-call encoder latency, trained checkpoints, measured with whisper-bench (CoreML and Metal are the existing backends, fp16):

Cosine 0.999 against the reference encoder; the transcript of jfk.wav is unchanged across tiny/small/medium, and a 199 s real-speech clip matches the stock Metal encoder to within 3 words out of 370 (the expected variation between two fp16 encoders).

How it hooks in
It follows the existing CoreML/OpenVINO seam exactly. The backend fills embd_enc through the same whisper_encode_external path, gated at runtime by the ANEFORGE_ENCODER env var:
- Adds src/aneforge/whisper-aneforge.{h,cpp} (the backend), one line in src/CMakeLists.txt, and ~30 lines in src/whisper.cpp (an include, a state field, the encode branch, init, and free).
- With ANEFORGE_ENCODER unset the build behaves exactly like stock whisper.cpp. The backend dlopens the ANEForge dispatch dylib only when enabled, so there is no new link-time dependency and nothing changes for existing users.

Notes / caveats
- Depends on the external aneforge package (pip), which reaches the ANE through Apple private Espresso e5rt symbols, the same stack CoreML uses internally. These symbols are undocumented and can change with macOS, so this is a research path, gated behind the env var.
- n_audio_ctx=1500, so it does not honor --audio-ctx.
- master (verified whisper-cli).

Export script, the full benchmark, and reproduction steps are in the companion repo: https://github.com/sbryngelson/whisper-aneforge
Happy to adjust the seam, naming, or gating to fit how you would prefer an optional backend to land.