whisper : optional ANEForge encoder backend (Apple Neural Engine) #3905

Open: sbryngelson wants to merge 1 commit into ggml-org:master from sbryngelson:aneforge-encoder-backend

@sbryngelson

As offered in #3903, this adds an optional encoder backend that runs the Whisper encoder directly on the Apple Neural Engine through ANEForge, as an alternative to the CoreML encoder. On M-series chips it is about 2x faster than the CoreML encoder and on par with or faster than the Metal GPU encoder (1.0x to 1.3x in the table below), at ~5x lower energy, with the same transcripts.

Per-call encoder latency, trained checkpoints, measured with whisper-bench (CoreML and Metal are the existing backends, fp16):

model    ANEForge (ANE)   CoreML (ANE)   Metal (GPU)   vs CoreML   vs Metal
tiny             5.7 ms        11.2 ms        7.3 ms        2.0x       1.3x
base            12.2 ms        23.0 ms       13.5 ms        1.9x       1.1x
small           40.3 ms        77.2 ms       40.9 ms        1.9x       1.0x
medium         117.6 ms       236.0 ms      120.0 ms        2.0x       1.0x

Cosine similarity against the reference encoder output is 0.999. The transcript of jfk.wav is unchanged across tiny/small/medium, and on a 199 s real-speech clip the transcript matches the stock Metal encoder to within 3 words out of 370 (the expected variation between two fp16 encoders).
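For readers who want to reproduce the similarity check, here is a minimal sketch of a cosine comparison between two encoder outputs. It assumes the metric is computed over the flattened embedding; the PR does not spell that out, and the vectors below are hypothetical placeholders rather than real encoder data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical flattened outputs; a real comparison would use the full
# encoder embedding from each backend for the same audio input.
reference = [0.12, -0.50, 0.33, 0.91]
candidate = [0.12, -0.49, 0.33, 0.90]
print(cosine_similarity(reference, candidate))
```

A value near 1.0 only says the two embeddings point in the same direction; the transcript comparison above remains the end-to-end check.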

How it hooks in

It follows the existing CoreML/OpenVINO seam exactly. The backend fills embd_enc through the same whisper_encode_external path, gated at runtime by the ANEFORGE_ENCODER env var:

  • src/aneforge/whisper-aneforge.{h,cpp} (the backend), one line in src/CMakeLists.txt, and ~30 lines in src/whisper.cpp (an include, a state field, the encode branch, init, and free).
  • With ANEFORGE_ENCODER unset the build behaves exactly like stock whisper.cpp. The backend dlopens the ANEForge dispatch dylib only when enabled, so there is no new link-time dependency and nothing changes for existing users.
  • Encoder only; the autoregressive decoder is untouched.

    ANEFORGE_DYLIB=... ANEFORGE_ENCODER=./whisper-small-ane \
      ./build/bin/whisper-cli -m models/ggml-small.bin -f samples/jfk.wav
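The gating described in the list above (nothing is loaded unless the env var is set) can be illustrated with a short sketch. This is Python for brevity, not the C++ in the patch; `load_optional_backend` and its `loader` parameter are hypothetical names, and treating a missing ANEFORGE_DYLIB as an error is an assumption, not documented behavior.

```python
import ctypes
import os

def load_optional_backend(env=None, loader=ctypes.CDLL):
    """Return a loaded dispatch library, or None when the backend is disabled.

    The library is opened lazily (dlopen-style) so that nothing is linked or
    loaded unless ANEFORGE_ENCODER is set.
    """
    env = os.environ if env is None else env
    bundle = env.get("ANEFORGE_ENCODER")
    if not bundle:
        return None  # backend disabled: caller keeps the stock encoder path
    dylib = env.get("ANEFORGE_DYLIB")
    if not dylib:
        # Assumption for this sketch: an explicit dylib path is required.
        raise RuntimeError("ANEFORGE_ENCODER is set but ANEFORGE_DYLIB is not")
    return loader(dylib)  # ctypes.CDLL opens the shared library at runtime

# With an empty environment the backend stays off and nothing is loaded.
print(load_optional_backend(env={}))
```

Taking the environment and loader as parameters keeps the decision testable without touching a real dylib.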

Notes / caveats

  • Apple Silicon only. The ANE work lives in the external aneforge package (pip), which reaches the ANE through Apple private Espresso e5rt symbols, the same stack CoreML uses internally. These symbols are undocumented and can change between macOS releases, so this is a research path, gated behind the env var.
  • The exported encoder bundle is built on the target Mac at load (~0.6 s) since it is keyed to the OS build; it is fixed to n_audio_ctx=1500, so it does not honor --audio-ctx.
  • Builds cleanly on current master (verified with whisper-cli).

Export script, the full benchmark, and reproduction steps are in the companion repo: https://github.com/sbryngelson/whisper-aneforge

Happy to adjust the seam, naming, or gating to fit how you would prefer an optional backend to land.

Runs the Whisper encoder directly on the Apple Neural Engine via ANEForge, an
alternative to the CoreML encoder backend. About 2x faster than the CoreML encoder
and faster than the Metal GPU encoder on M-series, at ~5x lower energy, with the
same transcripts (cosine 0.999 vs the reference encoder).

Mirrors the existing CoreML/OpenVINO seam: fills embd_enc from the external aneforge
package via the same whisper_encode_external path, gated at runtime by the
ANEFORGE_ENCODER env var. With the variable unset the build behaves exactly like
stock whisper.cpp (the backend dlopens its dispatch dylib only when enabled, no new
link-time dependency). Encoder only; the decoder is untouched.

Adds src/aneforge/whisper-aneforge.{h,cpp}, one line to src/CMakeLists.txt, and ~30
lines to src/whisper.cpp (include, state field, encode branch, init, free).
@ggerganov

Member

Quite interesting work.

Query-tiled attention. The score matrix is never fully materialized. The query axis is split into tiles, so each head computes [S, S/3] score tiles rather than the full [S, S] (the flash-attention idea, written as einsum so there is no transpose). This roughly halves attention time, is exact (each tile attends to all keys), and is what puts the ANE ahead of the GPU rather than even with it.

Have you investigated how the number of tiles affects the performance? Is 3 tiles always the best option for all model sizes and hardware?
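The query-tiling idea in the quoted paragraph can be sketched in a few lines. This is a single-head, pure-Python illustration of why tiling the query axis is exact, not the ANEForge einsum kernel; the function names and tensor layout are assumptions made for the example.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_rows, k, v):
    """Scaled dot-product attention for a block of queries over all keys."""
    scale = 1.0 / math.sqrt(len(k[0]))
    out = []
    for q in q_rows:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in k]
        weights = softmax(scores)
        out.append([sum(w * val[d] for w, val in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def tiled_attention(q, k, v, n_tiles):
    """Split the query axis into up to n_tiles contiguous tiles.

    Each tile still sees every key, so its softmax is complete and only a
    (tile x S) block of scores exists at any one time.
    """
    s = len(q)
    tile = math.ceil(s / n_tiles)
    out = []
    for start in range(0, s, tile):
        out.extend(attend(q[start:start + tile], k, v))
    return out

# Small deterministic inputs: S=7 positions, head_dim=4.
q = [[math.sin(i + j) for j in range(4)] for i in range(7)]
k = [[math.cos(i * j) for j in range(4)] for i in range(7)]
v = [[float(i - j) for j in range(4)] for i in range(7)]
assert tiled_attention(q, k, v, 3) == attend(q, k, v)
```

Because each query row's softmax depends only on that row and the full key set, the tile count changes the memory footprint and scheduling but not the result.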

@sbryngelson

Author

The count isn't fixed at 3, @ggerganov. Sharp question, and there was something to learn here for both of us. The heuristic is n_tiles = max(1, (S+256)//512), so it scales with the query length, and the encoder's S=1500 gets 3. I swept it on an M5 Pro across the model sizes (S=1500, head_dim=64, n_head from 6 to 20), timing the attention core as one fused program:

size    H | t=1  t=2  t=3  t=4  t=5  t=6  t=8   (ms)
tiny    6 | 2.50 2.65 2.57 2.54 2.41 2.68 2.51
base    8 | 3.28 3.47 3.33 3.33 3.06 3.51 3.17
small  12 | 5.07 5.45 5.12 5.15 4.90 5.22 5.00
medium 16 |20.31 7.16 6.84 7.12 6.52 6.91 6.59
large  20 |26.83 8.99 8.77 8.60 8.03 8.59 8.20

3 is close, but not optimal: 5 is consistently 5-10% faster here. The first-order effect, though, is whether you tile at all. At high head count, the untiled [H,1500,1500] score is about 3x slower (it stops pipelining on the engine), while for tiny and base it fits in cache and tiling makes little difference.

The optimum is broad (3 to 8 tiles all land within about 15% of the best in the table above), and it's bounded by L2 residency and pipelining, so it moves with the chip, as you noticed.
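To make the reading of the sweep easy to check, the sketch below recomputes the best tile count per model size, the time saved by 5 tiles relative to 3, and the untiled slowdown. The timings are copied from the table above; the helper name `summarize` is illustrative only.

```python
# Attention-core timings in ms, copied from the sweep table (M5 Pro, S=1500).
TILE_COUNTS = [1, 2, 3, 4, 5, 6, 8]
TIMINGS_MS = {
    "tiny":   [2.50, 2.65, 2.57, 2.54, 2.41, 2.68, 2.51],
    "base":   [3.28, 3.47, 3.33, 3.33, 3.06, 3.51, 3.17],
    "small":  [5.07, 5.45, 5.12, 5.15, 4.90, 5.22, 5.00],
    "medium": [20.31, 7.16, 6.84, 7.12, 6.52, 6.91, 6.59],
    "large":  [26.83, 8.99, 8.77, 8.60, 8.03, 8.59, 8.20],
}

def summarize(row):
    """Return (best tile count, fraction of time saved by t=5 vs t=3,
    slowdown of the untiled t=1 case relative to the best)."""
    by_tiles = dict(zip(TILE_COUNTS, row))
    best = min(by_tiles, key=by_tiles.get)
    saved_5_vs_3 = 1.0 - by_tiles[5] / by_tiles[3]
    untiled_slowdown = by_tiles[1] / by_tiles[best]
    return best, saved_5_vs_3, untiled_slowdown

for size, row in TIMINGS_MS.items():
    best, saved, slowdown = summarize(row)
    print(f"{size:6s} best t={best}  t=5 saves {saved:.1%} vs t=3  "
          f"untiled is {slowdown:.2f}x the best")
```

These are single-machine numbers from one sweep, so the ranking between neighboring tile counts should be read as indicative rather than definitive.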

Rather than a constant, the tile count is now autotuned in ANEForge (I merged a PR for this): it measures the best count once per (chip, S, T, heads, head_dim), caches it, and falls back to the heuristic until a tuned value exists. On the M5 Pro, it picks 5; on the M1/M2, it may pick something else, which is fine. The autotune picks that up with no code change.
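A minimal sketch of that autotune-with-fallback pattern follows. It is not the ANEForge implementation: the class name, the candidate list, and the `measure` callback (which would time the fused attention program for a given tile count) are all assumptions, and `T` is carried through as an opaque part of the key since the thread does not define it.

```python
def heuristic_tiles(s):
    """Fallback from the thread: n_tiles = max(1, (S + 256) // 512)."""
    return max(1, (s + 256) // 512)

class TileAutotuner:
    """Caches the fastest tile count per (chip, S, T, heads, head_dim)."""

    def __init__(self, candidates=(1, 2, 3, 4, 5, 6, 8)):
        self.candidates = candidates
        self.cache = {}  # key tuple -> tuned tile count

    def lookup(self, chip, s, t, heads, head_dim):
        """Tuned value if one exists, otherwise the heuristic."""
        key = (chip, s, t, heads, head_dim)
        return self.cache.get(key, heuristic_tiles(s))

    def tune(self, chip, s, t, heads, head_dim, measure):
        """Measure each candidate once per key and cache the fastest.

        `measure(n_tiles)` is a caller-supplied function returning a cost
        (for example, milliseconds); it is not called again once cached.
        """
        key = (chip, s, t, heads, head_dim)
        if key not in self.cache:
            self.cache[key] = min(self.candidates, key=measure)
        return self.cache[key]

tuner = TileAutotuner()
print(tuner.lookup("M5 Pro", 1500, 1500, 12, 64))  # heuristic until tuned
```

Keying the cache on the chip is what lets a different machine settle on a different count without any change to the calling code.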
