QVAC-18192 parakeet-cpp: route compute through ggml_backend_sched (per-op CPU fallback)#74
pratiknarola-t wants to merge 3 commits into
Review Status
Current Status: ❌ PENDING
Pending reviews: needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.
5aea69e to ef54b4a
Cached encoder graph poisoned by real scheduler fallback: FIXED
Confirmed at the source:
Fix: snapshot every compute node's
Verified (not assumed): a temporary harness pinned one interior encoder node to CPU via
Lower-risk notes
Regression matrix (all == pre-change CPU golden)
CTC / TDT / EOU / Sortformer on Mac CPU, Mac Metal, Android CPU, Android OpenCL (Adreno 740), Android Vulkan (Adreno 740). AOSC streaming (
Net change:
c95b4a0 to e39a9cd
…r-op CPU fallback)

Migrate the Parakeet encoder, subsampling, and Sortformer head from direct single-backend ggml_backend_graph_compute to a shared ggml_backend_sched with the CPU backend last, giving genuine per-op CPU fallback for ops the active GPU backend cannot run (the mechanism that makes the fabric llama.cpp stack robust).

- Add a per-model ggml_backend_sched over [active, CPU] (op_offload=false), created at load and freed before the backends it references.
- Flag every encoder graph input (mel / masks / PE / att_mask / pre_encode) with ggml_set_input so the scheduler keeps them allocated for post-alloc upload.
- run_encoder / run_encoder_bypass_pre_encode / run_subsampling: replace the per-graph gallocr with sched reset (at the head) -> alloc -> compute; outputs are still downloaded to host before the next reset.
- Sortformer head runs through the sched, except the Mali-Vulkan force-CPU correctness route, which still computes directly on the CPU backend (the scheduler would route those ops back to the GPU and reproduce the block-0 NaN).

The TDT autoregressive decoder is intentionally left on direct compute (it already routes its only unsupported op, ARGMAX, to host).

Verified byte-identical to the pre-change CPU output for CTC / TDT / EOU / Sortformer + AOSC streaming on CPU, Metal, Android CPU, Android OpenCL (Adreno 740) and Android Vulkan (Adreno 740).
… fallback

The encoder graph is cached and reused across runs, but ggml_backend_sched rewrites node->src[j] in place when a per-op CPU fallback inserts a cross-backend copy. That copy lives in the scheduler's per-run context, which is freed at the head of the next allocation, so reusing the cached graph after a real fallback dereferences freed copies (latent today: every op is supported on all shipping backends, so the scheduler produces one split and no copies).

Snapshot each compute node's source pointers when the graph is built and restore them before each allocation. Keyed by node pointer, not array index, so it is unaffected by backends (Metal, Vulkan) that reorder the node array in place during graph optimization.

Also give the sched-allocation failure its own error code (was a duplicate) and mark run_subsampling's output with ggml_set_output for consistency.
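The snapshot-and-restore idea in this commit message can be sketched in isolation. The block below is a minimal illustration under stated assumptions, not the PR's code: `Node`, `kMaxSrc`, `SrcSnapshot`, `snapshot_sources`, `restore_sources`, and `demo_restore_survives_reorder` are hypothetical names, and `Node` is a toy stand-in rather than ggml's tensor type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy stand-in for a graph node (hypothetical; NOT ggml's ggml_tensor).
constexpr std::size_t kMaxSrc = 4;
struct Node {
    std::array<Node *, kMaxSrc> src{};  // source pointers, nullptr when unused
};

// Snapshot keyed by node address, so reordering the node array does not matter.
using SrcSnapshot = std::unordered_map<Node *, std::array<Node *, kMaxSrc>>;

// Record every node's source pointers right after the graph is built.
SrcSnapshot snapshot_sources(const std::vector<Node *> & nodes) {
    SrcSnapshot snap;
    for (Node * n : nodes) {
        snap[n] = n->src;
    }
    return snap;
}

// Put the recorded source pointers back before the graph is allocated again.
void restore_sources(const std::vector<Node *> & nodes, const SrcSnapshot & snap) {
    for (Node * n : nodes) {
        auto it = snap.find(n);
        if (it != snap.end()) {
            n->src = it->second;
        }
    }
}

// Simulates one "run": a source pointer is rewritten to a per-run copy and the
// node array is reordered; restore must still recover the original wiring.
bool demo_restore_survives_reorder() {
    Node a, b, c, per_run_copy;
    b.src[0] = &a;
    c.src[0] = &b;
    std::vector<Node *> nodes = {&a, &b, &c};

    const SrcSnapshot snap = snapshot_sources(nodes);

    c.src[0] = &per_run_copy;       // simulated in-place rewrite by a scheduler
    std::swap(nodes[0], nodes[2]);  // simulated backend reordering of the array

    restore_sources(nodes, snap);
    return c.src[0] == &b && b.src[0] == &a;
}
```

An index-keyed snapshot would restore the wrong node after the swap above, which is the reason the commit gives for keying by pointer.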
…force-CPU safety)

The Sortformer force-CPU path (the Mali-Vulkan miscompute workaround) allocated and computed on the caller-supplied backend, so a caller passing the active GPU backend (as test_sortformer_parity did) would defeat the workaround on Mali and drive the CPU-resident head weights through the GPU. Both production engine callers passed the correct backend, but the contract was a footgun.

Resolve the head backend internally via model_sortformer_backend(model) (CPU on Mali-Vulkan, the active backend otherwise) and drop the caller-supplied backend parameter from sortformer_diarize_ggml and sortformer_aosc_step so the contract cannot be violated. Make model_sortformer_backend const and add an internal null-backend guard; delete the now-orphaned caller locals.
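The contract change described here (resolve the backend inside the callee instead of trusting a caller-supplied one) can be shown with a toy model. This is only a sketch of the pattern: `ToyBackendKind`, `ToyModel`, `toy_head_backend`, and `toy_diarize` are hypothetical and do not mirror the real signatures of model_sortformer_backend or sortformer_diarize_ggml.

```cpp
#include <cassert>

// Hypothetical, simplified backend identity (not a ggml type).
enum class ToyBackendKind { Cpu, Gpu };

// Hypothetical model state: the active backend plus a flag marking the
// device class that needs the force-CPU workaround.
struct ToyModel {
    ToyBackendKind active;
    bool           needs_force_cpu;
};

// Single source of truth for where the head runs.
ToyBackendKind toy_head_backend(const ToyModel & model) {
    return model.needs_force_cpu ? ToyBackendKind::Cpu : model.active;
}

// The entry point takes no backend parameter, so a caller cannot pass the
// wrong one. It returns the backend it used so the choice can be inspected.
ToyBackendKind toy_diarize(const ToyModel & model) {
    const ToyBackendKind backend = toy_head_backend(model);
    // ... allocate and compute the head on `backend` ...
    return backend;
}
```

With the parameter gone, a test helper that previously passed the active GPU backend has no way to bypass the workaround.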
239f3e5 to 6420f0c
What
Migrate the Parakeet engine from direct single-backend ggml_backend_graph_compute to a shared ggml_backend_sched with the CPU backend last, giving genuine per-op CPU fallback for ops the active GPU backend cannot run. This is the same mechanism that makes the fabric llama.cpp stack robust (analysis: docs/parakeet-gpu-failure-cpu-fallback-analysis.md).

On the currently supported backends every op is supported, so the scheduler runs everything on the GPU (1 split, 0 copies) and behaviour is unchanged today. The win is automatic per-op fallback for future or unsupported ops, and it subsumes hand-coded routes under one general mechanism.
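To make "1 split, 0 copies" concrete, here is a toy model of per-op fallback over an ordered backend list with the CPU last. It is a deliberately simplified sketch for a linear chain of ops, not ggml's scheduler (which uses a more elaborate assignment pass); `ToyBackend`, `ToyPlan`, and `plan_linear_graph` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical backend description: a name and the ops it cannot run.
struct ToyBackend {
    std::string           name;
    std::set<std::string> unsupported;
    bool supports_op(const std::string & op) const { return unsupported.count(op) == 0; }
};

// Result of planning: which backend runs each op, and how many contiguous
// same-backend runs (splits) the chain was cut into.
struct ToyPlan {
    std::vector<std::size_t> backend_of_op;
    std::size_t              splits = 0;
    // In a linear chain, each boundary between splits needs one copy.
    std::size_t copies() const { return splits > 0 ? splits - 1 : 0; }
};

// Assign each op to the first backend that supports it; the last backend
// (CPU in this sketch) is the unconditional fallback.
ToyPlan plan_linear_graph(const std::vector<std::string> & ops,
                          const std::vector<ToyBackend> &  backends) {
    assert(!backends.empty());
    ToyPlan plan;
    for (const std::string & op : ops) {
        std::size_t chosen = backends.size() - 1;
        for (std::size_t b = 0; b < backends.size(); ++b) {
            if (backends[b].supports_op(op)) {
                chosen = b;
                break;
            }
        }
        if (plan.backend_of_op.empty() || plan.backend_of_op.back() != chosen) {
            ++plan.splits;  // backend changed, so a new split starts here
        }
        plan.backend_of_op.push_back(chosen);
    }
    return plan;
}
```

When the first backend supports every op, the plan has one split and no copies, matching the unchanged behaviour described above. Marking a single op as unsupported on the first backend cuts the chain into three splits with two copies in this toy model.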
Changes (parakeet-cpp only; no ggml change)
- Add a per-model ggml_backend_sched over [active, CPU] (op_offload=false), created at load, freed before the backends it references.
- Flag every encoder graph input (mel / masks / pe / att_mask / pre_encode) with ggml_set_input so the scheduler keeps them allocated for post-alloc upload.
- run_encoder / run_encoder_bypass_pre_encode / run_subsampling: replace the per-graph ggml_gallocr with sched_reset (at the head of each call) → alloc_graph → graph_compute. Outputs are still downloaded to host before the next reset (the existing host round-trip means decoders are fully decoupled, so no persistence mechanism is needed).
- … supports_op returns true there).

Scope
The TDT autoregressive decoder is intentionally left on direct compute — it already routes its only unsupported op (ARGMAX) to host, and its per-token persistent-state design is out of scope here. It rides the migrated shared encoder.
Verification — byte-identical to the pre-change CPU output on all 5 backends
Golden = pre-change CPU transcript/diarization per model; each backend compared to it.
Audio: jfk.wav (CTC/TDT/EOU), diarization-sample-16k.wav (Sortformer + AOSC streaming). Android device: Adreno 740 (SD 8 Gen 2), built against ggmlspeech.
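A byte-identical check against a golden output is stricter than numeric equality. The helper below is a minimal sketch of such a comparison for float buffers, assuming outputs are available as `std::vector<float>`; `byte_identical` is a hypothetical name and the PR's actual comparison (on transcripts and diarization output) may be implemented differently.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Returns true only if both buffers have the same length and the same bytes.
// Unlike operator==, this treats 0.0f and -0.0f as different and compares
// NaN values by their bit patterns.
bool byte_identical(const std::vector<float> & golden, const std::vector<float> & out) {
    if (golden.size() != out.size()) {
        return false;
    }
    if (golden.empty()) {
        return true;  // avoid calling memcmp with empty buffers
    }
    return std::memcmp(golden.data(), out.data(), golden.size() * sizeof(float)) == 0;
}
```

Comparing every backend against one pre-change CPU golden, rather than against each other, keeps a single reference point for the whole matrix.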