Skip to content

QVAC-18192 parakeet-cpp: route compute through ggml_backend_sched (per-op CPU fallback)#74

Open
pratiknarola-t wants to merge 3 commits into
masterfrom
QVAC-18192-parakeet-sched
Open

QVAC-18192 parakeet-cpp: route compute through ggml_backend_sched (per-op CPU fallback)#74
pratiknarola-t wants to merge 3 commits into
masterfrom
QVAC-18192-parakeet-sched

Conversation

@pratiknarola-t

Copy link
Copy Markdown

What

Migrate the Parakeet engine from direct single-backend ggml_backend_graph_compute to a shared ggml_backend_sched with the CPU backend last, giving genuine per-op CPU fallback for ops the active GPU backend cannot run — the same mechanism that makes the fabric llama.cpp stack robust (analysis: docs/parakeet-gpu-failure-cpu-fallback-analysis.md).

On the currently-supported backends every op is supported, so the scheduler runs everything on the GPU (1 split, 0 copies) — behaviour is unchanged today; the win is automatic per-op fallback for future/unsupported ops, and it subsumes hand-coded routes under one general mechanism.

Changes (parakeet-cpp only; no ggml change)

  • Add a per-model ggml_backend_sched over [active, CPU] (op_offload=false), created at load, freed before the backends it references.
  • Flag every encoder graph input (mel / masks / pe / att_mask / pre_encode) with ggml_set_input so the scheduler keeps them allocated for post-alloc upload.
  • run_encoder / run_encoder_bypass_pre_encode / run_subsampling: replace the per-graph ggml_gallocr with sched_reset (at the head of each call) → alloc_graphgraph_compute. Outputs are still downloaded to host before the next reset (the existing host round-trip means decoders are fully decoupled — no persistence mechanism needed).
  • Sortformer head runs through the scheduler, except the Mali-Vulkan force-CPU correctness route, which still computes directly on the CPU backend with the CPU-resident weights (the scheduler would route those ops back to the GPU and reproduce the block-0 NaN, since supports_op returns true there).

Scope

The TDT autoregressive decoder is intentionally left on direct compute — it already routes its only unsupported op (ARGMAX) to host, and its per-token persistent-state design is out of scope here. It rides the migrated shared encoder.

Verification — byte-identical to the pre-change CPU output on all 5 backends

Golden = pre-change CPU transcript/diarization per model; each backend compared to it.

Backend CTC TDT EOU Sortformer AOSC streaming
macOS CPU
macOS Metal
Android CPU
Android OpenCL (Adreno 740)
Android Vulkan (Adreno 740)

Audio: jfk.wav (CTC/TDT/EOU), diarization-sample-16k.wav (Sortformer + AOSC streaming). Android device: Adreno 740 (SD 8 Gen 2), built against ggml speech.

@pratiknarola-t pratiknarola-t requested review from a team as code owners June 30, 2026 10:09
@github-actions

Copy link
Copy Markdown

Review Status

Current Status: ❌ PENDING
Approvals so far: none

Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.

@pratiknarola-t pratiknarola-t force-pushed the QVAC-18192-parakeet-sched branch from 5aea69e to ef54b4a Compare June 30, 2026 12:33
@pratiknarola-t

pratiknarola-t commented Jun 30, 2026

Copy link
Copy Markdown
Author

Cached encoder graph poisoned by real scheduler fallback — FIXED

Confirmed at the source: ggml_backend_sched_split_graph() rewrites node->src[j] in place on the caller's graph (ggml-backend.cpp:1370) to a copy tensor allocated in sched->ctx, and sched->ctx is freed + recreated at the head of the next split (:1026-1028). Because run_encoder / run_encoder_bypass_pre_encode reuse a cached g.cgraph, the first real per-op CPU fallback leaves the cached graph pointing at freed copies on the next run. Latent today only because every op is supported on all shipping backends (1 split, 0 cross-backend copies) — but it breaks exactly the fallback path this PR exists to enable.

Fix: snapshot every compute node's src pointers when the graph is built, and restore them before each allocation. Keyed by node pointer, not array index — Metal and Vulkan graph_optimize reorder cgraph->nodes[] in place, so an index-keyed restore would target the wrong node after the first run.

Verified (not assumed): a temporary harness pinned one interior encoder node to CPU via ggml_backend_sched_set_tensor_backend (GGML_SCHED_DEBUG=2 confirmed a real GPU→CPU→GPU 3-split with cross-backend copies), then reused the cached graph 20× (streaming simulation):

  • without the fix: run 2+ produce wrong/empty transcripts (max logit drift ≈ 64 = full garbage);
  • with the fix: all 20 runs produce the correct transcript. Run-to-run logit drift is bounded and non-compounding at ≈ 0.5–1.3 — about 1–2 % of the inherent Metal-vs-CPU logit gap (≈ 64), i.e. benign mixed-path FP variance, not corruption. In production this path never executes (all backends: 1 split / 0 copies), so output stays byte-deterministic.

Lower-risk notes

  • Duplicate return 13 — fixed; the sched-allocation failure now has its own code.
  • Missing ggml_set_output in run_subsampling — added.
  • Destructor order — agreed, not a bug (at teardown the sched isn't mid-run and ggml_backend_sched_free doesn't dereference caller-graph tensors); left as-is.
  • Sortformer force-CPU reset — intentional lifecycle cleanup; left as-is.

Regression matrix (all == pre-change CPU golden)

CTC / TDT / EOU / Sortformer on Mac CPU, Mac Metal, Android CPU, Android OpenCL (Adreno 740), Android Vulkan (Adreno 740). AOSC streaming (run_subsampling + bypass) PASS on CPU and Metal GPU.

Net change: parakeet_ctc.cpp +59/−1, parakeet-cpp only, no ggml change.

@pratiknarola-t pratiknarola-t force-pushed the QVAC-18192-parakeet-sched branch from c95b4a0 to e39a9cd Compare June 30, 2026 14:46
…r-op CPU fallback)

Migrate the Parakeet encoder, subsampling, and Sortformer head from direct
single-backend ggml_backend_graph_compute to a shared ggml_backend_sched with
the CPU backend last, giving genuine per-op CPU fallback for ops the active GPU
backend cannot run (the mechanism that makes the fabric llama.cpp stack robust).

- Add a per-model ggml_backend_sched over [active, CPU] (op_offload=false),
  created at load and freed before the backends it references.
- Flag every encoder graph input (mel / masks / PE / att_mask / pre_encode)
  with ggml_set_input so the scheduler keeps them allocated for post-alloc upload.
- run_encoder / run_encoder_bypass_pre_encode / run_subsampling: replace the
  per-graph gallocr with sched reset (at the head) -> alloc -> compute; outputs
  are still downloaded to host before the next reset.
- Sortformer head runs through the sched, except the Mali-Vulkan force-CPU
  correctness route, which still computes directly on the CPU backend (the
  scheduler would route those ops back to the GPU and reproduce the block-0 NaN).

The TDT autoregressive decoder is intentionally left on direct compute (it
already routes its only unsupported op, ARGMAX, to host).

Verified byte-identical to the pre-change CPU output for CTC / TDT / EOU /
Sortformer + AOSC streaming on CPU, Metal, Android CPU, Android OpenCL
(Adreno 740) and Android Vulkan (Adreno 740).
… fallback

The encoder graph is cached and reused across runs, but ggml_backend_sched
rewrites node->src[j] in place when a per-op CPU fallback inserts a
cross-backend copy. That copy lives in the scheduler's per-run context, which
is freed at the head of the next allocation, so reusing the cached graph after
a real fallback dereferences freed copies (latent today: every op is supported
on all shipping backends, so the scheduler produces one split and no copies).

Snapshot each compute node's source pointers when the graph is built and
restore them before each allocation. Keyed by node pointer, not array index,
so it is unaffected by backends (Metal, Vulkan) that reorder the node array
in place during graph optimization.

Also give the sched-allocation failure its own error code (was a duplicate)
and mark run_subsampling's output with ggml_set_output for consistency.
…force-CPU safety)

The Sortformer force-CPU path (the Mali-Vulkan miscompute workaround) allocated and
computed on the caller-supplied backend, so a caller passing the active GPU backend
(as test_sortformer_parity did) would defeat the workaround on Mali and drive the
CPU-resident head weights through the GPU. Both production engine callers passed the
correct backend, but the contract was a footgun.

Resolve the head backend internally via model_sortformer_backend(model) (CPU on
Mali-Vulkan, the active backend otherwise) and drop the caller-supplied backend
parameter from sortformer_diarize_ggml and sortformer_aosc_step so the contract
cannot be violated. Make model_sortformer_backend const and add an internal
null-backend guard; delete the now-orphaned caller locals.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant