whisper : optional ANEForge encoder backend (Apple Neural Engine) by sbryngelson · Pull Request #3905 · ggml-org/whisper.cpp

sbryngelson · 2026-06-23T00:24:50Z

As offered in #3903, this adds an optional encoder backend that runs the Whisper encoder directly on the Apple Neural Engine through ANEForge, as an alternative to the CoreML encoder. On M-series it is about 2x faster than the CoreML encoder and at least as fast as the Metal GPU encoder (1.0x to 1.3x), at ~5x lower energy, with the same transcripts.

Per-call encoder latency, trained checkpoints, measured with whisper-bench (CoreML and Metal are the existing backends, fp16):

model	ANEForge (ANE)	CoreML (ANE)	Metal (GPU)	vs CoreML	vs Metal
tiny	5.7 ms	11.2 ms	7.3 ms	2.0x	1.3x
base	12.2 ms	23.0 ms	13.5 ms	1.9x	1.1x
small	40.3 ms	77.2 ms	40.9 ms	1.9x	1.0x
medium	117.6 ms	236.0 ms	120.0 ms	2.0x	1.0x
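
The two ratio columns follow directly from the latency columns. A minimal sketch that recomputes them, assuming each ratio is simply the other backend's latency divided by the ANEForge latency and rounded to one decimal (the dictionary keys are illustrative names, not identifiers from the PR):

```python
# Per-call encoder latencies in ms, copied from the table above.
latency_ms = {
    "tiny":   {"aneforge": 5.7,   "coreml": 11.2,  "metal": 7.3},
    "base":   {"aneforge": 12.2,  "coreml": 23.0,  "metal": 13.5},
    "small":  {"aneforge": 40.3,  "coreml": 77.2,  "metal": 40.9},
    "medium": {"aneforge": 117.6, "coreml": 236.0, "metal": 120.0},
}


def speedups(row):
    """Return (vs CoreML, vs Metal) latency ratios rounded to one decimal."""
    return (
        round(row["coreml"] / row["aneforge"], 1),
        round(row["metal"] / row["aneforge"], 1),
    )


for model, row in latency_ms.items():
    vs_coreml, vs_metal = speedups(row)
    print(f"{model:<6} {vs_coreml:.1f}x vs CoreML, {vs_metal:.1f}x vs Metal")
```

For small and medium the Metal ratio rounds to 1.0x, so the ANE lead over the GPU there is within a few percent.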

Cosine similarity is 0.999 against the reference encoder; the transcript of jfk.wav is unchanged across tiny/small/medium, and a 199 s real-speech clip matches the stock Metal encoder to within 3 words out of 370 (the expected variation between two fp16 encoders).
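
For readers who want to reproduce this kind of check, here is a minimal sketch of a cosine-similarity comparison between two encoder outputs. It assumes both outputs are flattened to equal-length lists of floats; the PR does not say how the 0.999 figure was aggregated, and the sample vectors below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up values standing in for flattened encoder outputs.
reference = [0.12, -0.50, 0.33, 0.91]
candidate = [0.121, -0.498, 0.331, 0.909]
print(cosine_similarity(reference, candidate))
```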

How it hooks in

It follows the existing CoreML/OpenVINO seam exactly. The backend fills embd_enc through the same whisper_encode_external path, gated at runtime by the ANEFORGE_ENCODER env var:

The change is src/aneforge/whisper-aneforge.{h,cpp} (the backend), one line in src/CMakeLists.txt, and ~30 lines in src/whisper.cpp (an include, a state field, the encode branch, init, and free).
With ANEFORGE_ENCODER unset the build behaves exactly like stock whisper.cpp. The backend dlopens the ANEForge dispatch dylib only when enabled, so there is no new link-time dependency and nothing changes for existing users.
Encoder only; the autoregressive decoder is untouched.

ANEFORGE_DYLIB=... ANEFORGE_ENCODER=./whisper-small-ane \
  ./build/bin/whisper-cli -m models/ggml-small.bin -f samples/jfk.wav
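
The gating described above (load nothing unless the variable is set) can be sketched as follows. This is an illustration of the pattern in Python with ctypes, not the backend's code, which is C++ and uses dlopen. The function name, the injectable `loader` parameter, and the error raised when ANEFORGE_DYLIB is missing are all assumptions made for the example.

```python
import ctypes
import os


def load_aneforge_backend(environ=os.environ, loader=ctypes.CDLL):
    """Return a library handle if the backend is enabled, else None.

    Hypothetical helper: only the two environment variable names come
    from the PR text; everything else is illustrative.
    """
    bundle = environ.get("ANEFORGE_ENCODER")
    if not bundle:
        # Variable unset: stay on the stock path and load nothing.
        return None
    dylib = environ.get("ANEFORGE_DYLIB")
    if not dylib:
        # Assumed behavior for this sketch, not confirmed by the PR.
        raise RuntimeError("ANEFORGE_ENCODER is set but ANEFORGE_DYLIB is not")
    # The dynamic library is opened only on the enabled path.
    return loader(dylib)


print(load_aneforge_backend(environ={}))  # None: backend disabled
```

Because the library is opened lazily, a build without the variable set never touches the dylib, which matches the "no new link-time dependency" point above.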

Notes / caveats

Apple Silicon only. The ANE work lives in the external aneforge package (pip), which reaches the ANE through Apple private Espresso e5rt symbols, the same stack CoreML uses internally. These symbols are undocumented and can change between macOS releases, so this is a research path, gated behind the env var.
The exported encoder bundle is built on the target Mac at load (~0.6 s) since it is keyed to the OS build; it is fixed to n_audio_ctx=1500, so it does not honor --audio-ctx.
Builds cleanly on current master (verified with whisper-cli).

Export script, the full benchmark, and reproduction steps are in the companion repo: https://github.com/sbryngelson/whisper-aneforge

Happy to adjust the seam, naming, or gating to fit how you would prefer an optional backend to land.

Runs the Whisper encoder directly on the Apple Neural Engine via ANEForge, an alternative to the CoreML encoder backend. About 2x faster than the CoreML encoder and faster than the Metal GPU encoder on M-series, at ~5x lower energy, with the same transcripts (cosine 0.999 vs the reference encoder). Mirrors the existing CoreML/OpenVINO seam: fills embd_enc from the external aneforge package via the same whisper_encode_external path, gated at runtime by the ANEFORGE_ENCODER env var. With the variable unset the build behaves exactly like stock whisper.cpp (the backend dlopens its dispatch dylib only when enabled, no new link-time dependency). Encoder only; the decoder is untouched. Adds src/aneforge/whisper-aneforge.{h,cpp}, one line to src/CMakeLists.txt, and ~30 lines to src/whisper.cpp (include, state field, encode branch, init, free).

ggerganov · 2026-06-23T08:49:00Z

Quite interesting work.

Query-tiled attention. The score matrix is never fully materialized. The query axis is split into tiles, so each head computes [S, S/3] score tiles rather than the full [S, S] (the flash-attention idea, written as einsum so there is no transpose). This roughly halves attention time, is exact (each tile attends to all keys), and is what puts the ANE ahead of the GPU rather than even with it.

Have you investigated how the number of tiles affects the performance? Is 3 tiles always the best option for all model sizes and hardware?

sbryngelson · 2026-06-23T16:11:58Z

The count isn't fixed at 3, @ggerganov. Sharp question, and some learning for both of us here. It's n_tiles = max(1, (S+256)//512), so it scales with the query length, and the encoder's S=1500 gets 3. I swept it on an M5 Pro across the model sizes (S=1500, head_dim=64, n_head 6 to 20), timing the attention core as one fused program:

size    H | t=1  t=2  t=3  t=4  t=5  t=6  t=8   (ms)
tiny    6 | 2.50 2.65 2.57 2.54 2.41 2.68 2.51
base    8 | 3.28 3.47 3.33 3.33 3.06 3.51 3.17
small  12 | 5.07 5.45 5.12 5.15 4.90 5.22 5.00
medium 16 |20.31 7.16 6.84 7.12 6.52 6.91 6.59
large  20 |26.83 8.99 8.77 8.60 8.03 8.59 8.20

3 is close, but not optimal: 5 is the fastest count for every size here, by roughly 4-9% over 3. And the first-order effect is just tiling at all. At high head count (medium and large), the untiled [H,1500,1500] score is about 3x slower (it stops pipelining on the engine), while for tiny and base, it fits in cache and tiling is roughly flat.
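
The query tiling discussed in this thread can be sketched in plain Python to show why it is exact: each query row's softmax runs over all keys, so splitting the query axis changes only how the work is grouped. This is a toy single-head illustration with made-up inputs, not the ANEForge einsum program, and it says nothing about performance.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_rows, k, v):
    """Scaled dot-product attention for a block of queries against all keys."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for q in q_rows:
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in k]
        weights = softmax(scores)
        out.append([sum(w * val[d] for w, val in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def tiled_attention(q, k, v, n_tiles):
    """Split the query axis into n_tiles chunks; every chunk sees every key."""
    bounds = [i * len(q) // n_tiles for i in range(n_tiles + 1)]
    out = []
    for lo, hi in zip(bounds, bounds[1:]):
        out.extend(attend(q[lo:hi], k, v))  # one score tile at a time
    return out


# Small deterministic inputs (sequence length 12, head_dim 4).
S, D = 12, 4
q = [[math.sin(i + 0.5 * j) for j in range(D)] for i in range(S)]
k = [[math.cos(0.3 * i + j) for j in range(D)] for i in range(S)]
v = [[0.1 * i - 0.2 * j for j in range(D)] for i in range(S)]

full = attend(q, k, v)
print(all(tiled_attention(q, k, v, n) == full for n in (2, 3, 5)))
```

Only one tile of scores exists at a time, which is the memory effect the tiling is after; the outputs are the same as the untiled computation.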

The optimum is broad (3 to 8 tiles are within about 10% of the best, apart from t=6 on tiny and base), and it's bounded by L2 residency and pipelining, so it moves with the chip, as you noticed.
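
The sweep can be summarized with a few lines of Python. The timings are copied from the table above; the helper names are illustrative and not part of ANEForge.

```python
# Tile counts and timings (ms) copied from the M5 Pro sweep above.
tile_counts = [1, 2, 3, 4, 5, 6, 8]
sweep_ms = {
    "tiny":   [2.50, 2.65, 2.57, 2.54, 2.41, 2.68, 2.51],
    "base":   [3.28, 3.47, 3.33, 3.33, 3.06, 3.51, 3.17],
    "small":  [5.07, 5.45, 5.12, 5.15, 4.90, 5.22, 5.00],
    "medium": [20.31, 7.16, 6.84, 7.12, 6.52, 6.91, 6.59],
    "large":  [26.83, 8.99, 8.77, 8.60, 8.03, 8.59, 8.20],
}


def best_tiles(times):
    """Tile count with the lowest measured time."""
    return tile_counts[times.index(min(times))]


def slowdown_vs_best(times, t):
    """Fractional slowdown of tile count t relative to the best count."""
    return times[tile_counts.index(t)] / min(times) - 1.0


for size, times in sweep_ms.items():
    print(f"{size:<6} best={best_tiles(times)} "
          f"t=3 is {slowdown_vs_best(times, 3):.1%} slower, "
          f"t=1 is {slowdown_vs_best(times, 1):.1%} slower")
```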

Rather than a constant, the tile count is now autotuned in ANEForge (I merged a PR for this): it measures the best count once per (chip, S, T, heads, head_dim), caches it, and falls back to the heuristic until a tuned value exists. On the M5 Pro, it picks 5; on the M1/M2, it may pick something else, which is fine. The autotune picks that up with no code change.
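
The measure-once, cache, and fall-back behavior described above can be sketched like this. It is a hypothetical in-memory illustration, not ANEForge's implementation: the class, the `time_fn` callback, and the candidate list are assumptions, and T is carried through as an opaque part of the key. Only the heuristic formula is taken from this thread.

```python
def heuristic_tiles(S):
    """Fallback from the thread: n_tiles = max(1, (S+256)//512)."""
    return max(1, (S + 256) // 512)


class TileAutotuner:
    """Hypothetical cache of tuned tile counts, one entry per key."""

    def __init__(self, time_fn, candidates=(1, 2, 3, 4, 5, 6, 8)):
        self.time_fn = time_fn      # assumed signature: (key, n_tiles) -> time
        self.candidates = candidates
        self.cache = {}             # key -> tuned tile count

    def get(self, key):
        """key = (chip, S, T, heads, head_dim); tuned value, else heuristic."""
        if key in self.cache:
            return self.cache[key]
        return heuristic_tiles(key[1])

    def tune(self, key):
        """Time each candidate once and remember the fastest for this key."""
        if key not in self.cache:
            self.cache[key] = min(self.candidates,
                                  key=lambda t: self.time_fn(key, t))
        return self.cache[key]


# Stand-in timer that replays the "medium" row of the sweep above.
medium_ms = {1: 20.31, 2: 7.16, 3: 6.84, 4: 7.12, 5: 6.52, 6: 6.91, 8: 6.59}
tuner = TileAutotuner(lambda key, t: medium_ms[t])
key = ("M5 Pro", 1500, 1500, 16, 64)  # the T value here is a placeholder
print(tuner.get(key), tuner.tune(key), tuner.get(key))
```

A real autotuner would also need to persist the cache across runs; that part is omitted here.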

sbryngelson wants to merge 1 commit into ggml-org:master from sbryngelson:aneforge-encoder-backend
