QVAC-18192 parakeet-cpp: route compute through ggml_backend_sched (per-op CPU fallback) by pratiknarola-t · Pull Request #74 · tetherto/qvac-ext-lib-whisper.cpp

pratiknarola-t · 2026-06-30T10:09:32Z

What

Migrate the Parakeet engine from direct single-backend ggml_backend_graph_compute to a shared ggml_backend_sched with the CPU backend last, giving genuine per-op CPU fallback for ops the active GPU backend cannot run — the same mechanism that makes the fabric llama.cpp stack robust (analysis: docs/parakeet-gpu-failure-cpu-fallback-analysis.md).

On the currently-supported backends every op is supported, so the scheduler runs everything on the GPU (1 split, 0 copies) — behaviour is unchanged today; the win is automatic per-op fallback for future/unsupported ops, and it subsumes hand-coded routes under one general mechanism.

Changes (parakeet-cpp only; no ggml change)

Add a per-model ggml_backend_sched over [active, CPU] (op_offload=false), created at load, freed before the backends it references.
Flag every encoder graph input (mel / masks / pe / att_mask / pre_encode) with ggml_set_input so the scheduler keeps them allocated for post-alloc upload.
run_encoder / run_encoder_bypass_pre_encode / run_subsampling: replace the per-graph ggml_gallocr with sched_reset (at the head of each call) → alloc_graph → graph_compute. Outputs are still downloaded to host before the next reset (the existing host round-trip means decoders are fully decoupled — no persistence mechanism needed).
Sortformer head runs through the scheduler, except the Mali-Vulkan force-CPU correctness route, which still computes directly on the CPU backend with the CPU-resident weights (the scheduler would route those ops back to the GPU and reproduce the block-0 NaN, since supports_op returns true there).
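The per-op fallback described above can be illustrated with a toy model. This is a sketch only: `ToyBackend`, `assign_backends`, and `count_splits` are hypothetical names, not ggml API. It assumes the simplest possible policy, where each op goes to the first backend in priority order that supports it. The real ggml_backend_sched also weighs other factors (such as where weights live), so treat this as an intuition aid rather than a description of its algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a backend: a name plus the ops it can run.
// NOT the ggml API; it only models priority-order fallback.
struct ToyBackend {
    std::string name;
    std::vector<std::string> supported_ops;

    bool supports(const std::string & op) const {
        for (const auto & s : supported_ops) {
            if (s == op) return true;
        }
        return false;
    }
};

// Assign each op to the first backend (in priority order) that supports it.
// Returns one backend index per op, or -1 if no backend supports the op.
std::vector<int> assign_backends(const std::vector<ToyBackend> & backends,
                                 const std::vector<std::string> & ops) {
    std::vector<int> assignment;
    assignment.reserve(ops.size());
    for (const auto & op : ops) {
        int chosen = -1;
        for (std::size_t b = 0; b < backends.size(); ++b) {
            if (backends[b].supports(op)) {
                chosen = static_cast<int>(b);
                break;
            }
        }
        assignment.push_back(chosen);
    }
    return assignment;
}

// A split is a maximal run of consecutive ops assigned to the same backend.
int count_splits(const std::vector<int> & assignment) {
    int splits = 0;
    for (std::size_t i = 0; i < assignment.size(); ++i) {
        if (i == 0 || assignment[i] != assignment[i - 1]) ++splits;
    }
    return splits;
}
```

With the CPU listed last, a graph whose ops are all supported by the first backend yields a single split, while one unsupported interior op yields a GPU, CPU, GPU sequence of three splits.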

Scope

The TDT autoregressive decoder is intentionally left on direct compute — it already routes its only unsupported op (ARGMAX) to host, and its per-token persistent-state design is out of scope here. It rides the migrated shared encoder.

Verification — byte-identical to the pre-change CPU output on all 5 backends

Golden = pre-change CPU transcript/diarization per model; each backend compared to it.

Backend	CTC	TDT	EOU	Sortformer	AOSC streaming
macOS CPU	✅	✅	✅	✅	✅
macOS Metal	✅	✅	✅	✅	✅
Android CPU	✅	✅	✅	✅	✅
Android OpenCL (Adreno 740)	✅	✅	✅	✅	✅
Android Vulkan (Adreno 740)	✅	✅	✅	✅	✅

Audio: jfk.wav (CTC/TDT/EOU), diarization-sample-16k.wav (Sortformer + AOSC streaming). Android device: Adreno 740 (SD 8 Gen 2), built against ggml speech.

github-actions · 2026-06-30T10:09:50Z

Review Status

Current Status: ❌ PENDING
Approvals so far: none

Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.

pratiknarola-t · 2026-06-30T14:45:00Z

Cached encoder graph poisoned by real scheduler fallback — FIXED

Confirmed at the source: ggml_backend_sched_split_graph() rewrites node->src[j] in place on the caller's graph (ggml-backend.cpp:1370) to a copy tensor allocated in sched->ctx, and sched->ctx is freed + recreated at the head of the next split (:1026-1028). Because run_encoder / run_encoder_bypass_pre_encode reuse a cached g.cgraph, the first real per-op CPU fallback leaves the cached graph pointing at freed copies on the next run. Latent today only because every op is supported on all shipping backends (1 split, 0 cross-backend copies) — but it breaks exactly the fallback path this PR exists to enable.

Fix: snapshot every compute node's src pointers when the graph is built, and restore them before each allocation. Keyed by node pointer, not array index — Metal and Vulkan graph_optimize reorder cgraph->nodes[] in place, so an index-keyed restore would target the wrong node after the first run.
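A minimal sketch of the snapshot-and-restore idea, using a mock node type rather than ggml_tensor. `ToyNode`, `kMaxSrc`, `snapshot_sources`, and `restore_sources` are hypothetical names chosen for illustration; the actual change in parakeet_ctc.cpp may be structured differently.

```cpp
#include <array>
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

constexpr int kMaxSrc = 4;  // hypothetical bound on sources per node

// Mock graph node: only the source pointers matter for this sketch.
struct ToyNode {
    std::array<ToyNode *, kMaxSrc> src{};
};

// Snapshot keyed by node pointer, so it stays valid if the node array is reordered.
using SrcSnapshot = std::unordered_map<ToyNode *, std::array<ToyNode *, kMaxSrc>>;

// Record every node's source pointers right after the graph is built.
SrcSnapshot snapshot_sources(const std::vector<ToyNode *> & nodes) {
    SrcSnapshot snap;
    for (ToyNode * n : nodes) {
        snap[n] = n->src;
    }
    return snap;
}

// Put the original source pointers back before the graph is allocated again.
void restore_sources(const std::vector<ToyNode *> & nodes, const SrcSnapshot & snap) {
    for (ToyNode * n : nodes) {
        auto it = snap.find(n);
        if (it != snap.end()) {
            n->src = it->second;
        }
    }
}
```

Because the lookup goes through the node's address, swapping entries in the node array between the snapshot and the restore does not change which sources each node gets back. An index-keyed snapshot would hand node `i` the sources of whatever node used to sit at position `i`.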

Verified (not assumed): a temporary harness pinned one interior encoder node to CPU via ggml_backend_sched_set_tensor_backend (GGML_SCHED_DEBUG=2 confirmed a real GPU→CPU→GPU 3-split with cross-backend copies), then reused the cached graph 20× (streaming simulation):

without the fix: runs 2 onward produce wrong or empty transcripts (max logit drift ≈ 64, i.e. full garbage);
with the fix: all 20 runs produce the correct transcript. Run-to-run logit drift is bounded and non-compounding at ≈ 0.5–1.3, about 1–2 % of the inherent Metal-vs-CPU logit gap (≈ 64), i.e. benign mixed-path FP variance, not corruption. In production this path never executes (all backends: 1 split / 0 copies), so output stays byte-deterministic.
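The comment does not say how logit drift was computed. One plausible reading, assumed here, is the maximum absolute element-wise difference between a reference logit vector and a later run. `max_abs_drift` is a hypothetical helper, not a function from the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute element-wise difference between two logit vectors.
// Assumes both runs produced the same number of logits.
float max_abs_drift(const std::vector<float> & reference,
                    const std::vector<float> & run) {
    assert(reference.size() == run.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        worst = std::max(worst, std::fabs(reference[i] - run[i]));
    }
    return worst;
}
```

Under this reading, a value that stays in a narrow band across all 20 reuses points to floating-point variance between backends, while a jump of the same order as the logit range itself points to corrupted inputs.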

Lower-risk notes

Duplicate return 13 — fixed; the sched-allocation failure now has its own code.
Missing ggml_set_output in run_subsampling — added.
Destructor order — agreed, not a bug (at teardown the sched isn't mid-run and ggml_backend_sched_free doesn't dereference caller-graph tensors); left as-is.
Sortformer force-CPU reset — intentional lifecycle cleanup; left as-is.

Regression matrix (all == pre-change CPU golden)

CTC / TDT / EOU / Sortformer on Mac CPU, Mac Metal, Android CPU, Android OpenCL (Adreno 740), Android Vulkan (Adreno 740). AOSC streaming (run_subsampling + bypass) PASS on CPU and Metal GPU.

Net change: parakeet_ctc.cpp +59/−1, parakeet-cpp only, no ggml change.

…r-op CPU fallback)

Migrate the Parakeet encoder, subsampling, and Sortformer head from direct single-backend ggml_backend_graph_compute to a shared ggml_backend_sched with the CPU backend last, giving genuine per-op CPU fallback for ops the active GPU backend cannot run (the mechanism that makes the fabric llama.cpp stack robust).

- Add a per-model ggml_backend_sched over [active, CPU] (op_offload=false), created at load and freed before the backends it references.
- Flag every encoder graph input (mel / masks / PE / att_mask / pre_encode) with ggml_set_input so the scheduler keeps them allocated for post-alloc upload.
- run_encoder / run_encoder_bypass_pre_encode / run_subsampling: replace the per-graph gallocr with sched reset (at the head) -> alloc -> compute; outputs are still downloaded to host before the next reset.
- Sortformer head runs through the sched, except the Mali-Vulkan force-CPU correctness route, which still computes directly on the CPU backend (the scheduler would route those ops back to the GPU and reproduce the block-0 NaN).

The TDT autoregressive decoder is intentionally left on direct compute (it already routes its only unsupported op, ARGMAX, to host).

Verified byte-identical to the pre-change CPU output for CTC / TDT / EOU / Sortformer + AOSC streaming on CPU, Metal, Android CPU, Android OpenCL (Adreno 740) and Android Vulkan (Adreno 740).

… fallback

The encoder graph is cached and reused across runs, but ggml_backend_sched rewrites node->src[j] in place when a per-op CPU fallback inserts a cross-backend copy. That copy lives in the scheduler's per-run context, which is freed at the head of the next allocation, so reusing the cached graph after a real fallback dereferences freed copies (latent today: every op is supported on all shipping backends, so the scheduler produces one split and no copies).

Snapshot each compute node's source pointers when the graph is built and restore them before each allocation. Keyed by node pointer, not array index, so it is unaffected by backends (Metal, Vulkan) that reorder the node array in place during graph optimization.

Also give the sched-allocation failure its own error code (was a duplicate) and mark run_subsampling's output with ggml_set_output for consistency.

…force-CPU safety)

The Sortformer force-CPU path (the Mali-Vulkan miscompute workaround) allocated and computed on the caller-supplied backend, so a caller passing the active GPU backend (as test_sortformer_parity did) would defeat the workaround on Mali and drive the CPU-resident head weights through the GPU. Both production engine callers passed the correct backend, but the contract was a footgun.

Resolve the head backend internally via model_sortformer_backend(model) (CPU on Mali-Vulkan, the active backend otherwise) and drop the caller-supplied backend parameter from sortformer_diarize_ggml and sortformer_aosc_step so the contract cannot be violated. Make model_sortformer_backend const and add an internal null-backend guard; delete the now-orphaned caller locals.
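The contract change in the last commit can be sketched with mock types. Everything below is hypothetical (`ToyDevice`, `ToyModel`, `head_backend`, `run_head`, and the `force_cpu_head` flag are illustrative names, not the parakeet-cpp definitions); it only shows the shape of the idea: the entry point resolves the backend from the model instead of accepting one from the caller.

```cpp
#include <cassert>

// Hypothetical stand-ins; none of these types exist in parakeet-cpp or ggml.
enum class ToyDevice { Cpu, Gpu };

struct ToyModel {
    const ToyDevice * active = nullptr;  // backend the model was loaded on
    const ToyDevice * cpu    = nullptr;  // always-available CPU backend
    bool force_cpu_head      = false;    // assumed to be decided once, at load time
};

// The single place that decides where the head runs.
const ToyDevice * head_backend(const ToyModel & model) {
    return model.force_cpu_head ? model.cpu : model.active;
}

// No backend parameter: a caller cannot override the workaround by accident.
// Returns false when no backend could be resolved (the null-backend guard).
bool run_head(const ToyModel & model, ToyDevice * used) {
    const ToyDevice * backend = head_backend(model);
    if (backend == nullptr) {
        return false;
    }
    *used = *backend;
    return true;
}
```

Removing the parameter, rather than validating it, means a test harness that would previously have passed the GPU backend no longer has a way to do so.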

pratiknarola-t requested review from a team as code owners June 30, 2026 10:09

pratiknarola-t force-pushed the QVAC-18192-parakeet-sched branch from 5aea69e to ef54b4a Compare June 30, 2026 12:33

pratiknarola-t force-pushed the QVAC-18192-parakeet-sched branch from c95b4a0 to e39a9cd Compare June 30, 2026 14:46

pratiknarola-t added 3 commits June 30, 2026 20:59

pratiknarola-t force-pushed the QVAC-18192-parakeet-sched branch from 239f3e5 to 6420f0c Compare June 30, 2026 15:29

pratiknarola-t mentioned this pull request Jun 30, 2026

[DO NOT MERGE] transcription-parakeet: CI overlay → parakeet-cpp @ PR #74 (sched CPU-fallback) tetherto/qvac#2966

Open

pratiknarola-t wants to merge 3 commits into master from QVAC-18192-parakeet-sched
