tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast) by ogad-tether · Pull Request #71 · tetherto/qvac-ext-lib-whisper.cpp

ogad-tether · 2026-06-29T12:39:02Z

Problem

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on Metal with a quantized (q8_0) KV cache (unsupported op 'CONT' in eval_step_mtl) — reported from the field on macOS Apple Silicon with useGPU:true on @qvac/tts-ggml@0.3.x. #2527 made q8 KV the default for all variants, exposing a never-Metal-tested MTL path.

Root cause (not flash-attention)

flash_attn_ext reads the q8 strided K/V cache fine on Metal. Walking the decode graph, the only Metal-unsupported op was the per-(layer,head) alignment probe (build_llama_block), which ggml_cont'd a strided view of the q8 K cache to feed a mul_mat. ggml-metal has no quantized CONT kernel, and the MTL path runs a single-backend graph_compute (no scheduler fallback), so it crashed at encode. The resolve probe validated flash_attn_ext but not this CONT.

Fix

Replace the same-type ggml_cont with a dequantizing ggml_cast(…→f32) — Metal supports a dequant copy of a strided quantized view (verified vs ggml-metal source: Q8_0→F32 takes the supported CPY path; the old q8→q8 cont hit default: return false). Recovers q8 KV on the GPU — pure memory savings, no compute cost (Metal flash-attn runs its matmul at f16 internally regardless of KV storage dtype).
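To make the cont-versus-cast distinction concrete, here is a CPU-side sketch of what a dequantizing copy of a strided q8_0 view computes. It is not the ggml-metal kernel. `BlockQ8`, `dequant_strided_rows`, and the parameter names are hypothetical, and the scale is held as a `float` for simplicity (ggml stores it as fp16).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a q8_0 block: 32 int8 quants sharing one scale.
// Assumption: a float scale keeps the sketch simple; ggml uses an fp16 scale.
constexpr std::size_t kBlockSize = 32;

struct BlockQ8 {
    float scale;
    std::array<std::int8_t, kBlockSize> quants;
};

// Dequantizing copy of a strided view: read n_rows rows that sit
// row_stride_blocks blocks apart in the cache and emit one contiguous
// f32 buffer, which is the layout a following matmul can consume.
std::vector<float> dequant_strided_rows(const std::vector<BlockQ8>& cache,
                                        std::size_t first_block,
                                        std::size_t blocks_per_row,
                                        std::size_t row_stride_blocks,
                                        std::size_t n_rows) {
    std::vector<float> out;
    out.reserve(n_rows * blocks_per_row * kBlockSize);
    for (std::size_t r = 0; r < n_rows; ++r) {
        const std::size_t row_start = first_block + r * row_stride_blocks;
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            const BlockQ8& blk = cache.at(row_start + b);  // bounds-checked read
            for (std::int8_t q : blk.quants) {
                out.push_back(blk.scale * static_cast<float>(q));
            }
        }
    }
    return out;
}
```

The output is contiguous f32 regardless of how the source rows are strided, so the gather and the dequantization happen in one copy. On the storage side, ggml's q8_0 block is 34 bytes for 32 values, roughly 1.06 bytes per value against 2 for f16 and 4 for f32.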

chatterbox_mtl_resolve_kv_type probes that exact cast op per-backend and falls back to f32 where a GPU backend can't encode it (thin-op OpenCL/Adreno, Mali-Vulkan) — replacing the blanket "f32 on any non-CPU backend" guard (which also blocked Metal, the whole bug). Vulkan q8 stays force-f32'd (coopmat2).
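The fallback decision described above can be sketched as a pure function. This is a minimal illustration under stated assumptions: `KvType` and `kv_type_for_cast_support` are hypothetical stand-ins (the real code presumably works on `ggml_type` and obtains `cast_supported` from a per-backend op probe), and only q8_0 is treated as quantized here.

```cpp
// Hypothetical KV storage types; the real code presumably uses ggml_type.
enum class KvType { F32, F16, Q8_0 };

// Assumption: q8_0 is the only quantized type of interest in this sketch.
inline bool is_quantized(KvType t) { return t == KvType::Q8_0; }

// Pure decision: keep the requested type unless it is quantized and the
// backend cannot encode the align-probe cast (strided quantized view -> f32).
inline KvType kv_type_for_cast_support(KvType requested, bool cast_supported) {
    if (is_quantized(requested) && !cast_supported) {
        return KvType::F32;  // degrade instead of aborting at encode time
    }
    return requested;
}
```

Keeping the decision free of backend handles is what lets the fallback branch run in a CPU-only unit test: the caller supplies the probe result as a plain bool.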

This PR is now standalone off master and supersedes the f32-fallback stopgap — #70 is closed, and the guard add/remove churn is squashed out.

Tests

test_kv_cache_type: resolve pass-through + the cross-backend fallback branch (cast-unsupported → f32) via a pure helper.
test_metal_ops (gpu): CAST(q8_0 strided→f32) supported / CONT(q8_0) not — same 2D strided shape as the align probe + resolve probe (all three mirror).
test_multilingual_synth: --kv-cache-type + mtl-synth-q8-<lang> variants (en/ar/ru/hi); missing GGUFs → SKIP(77) not fail; explicit flag (no env fallback that could flip f32 baselines).
test-eos-roundtrip-q8-kv: CER/ramble round-trip under q8 KV → catches alignment/EOS drift from the dequant.
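The "missing GGUFs → SKIP(77)" behaviour can be sketched as a preflight step that runs before any synthesis. The function name, the accepted flag values, and the pass/fail codes are assumptions for illustration; only the skip code 77 comes from the list above. Whether CTest maps 77 to "skipped" depends on the test's `SKIP_RETURN_CODE` property being set, which is assumed here.

```cpp
#include <fstream>
#include <string>

constexpr int kExitPass = 0;
constexpr int kExitFail = 1;
constexpr int kExitSkip = 77;  // reported as "skipped" when SKIP_RETURN_CODE is 77

// Hypothetical preflight: reject a bad --kv-cache-type value outright, and
// skip (rather than fail) when the model file is not present on this machine.
int preflight_exit_code(const std::string& gguf_path,
                        const std::string& kv_cache_type) {
    // Assumption: these are the accepted flag values.
    if (kv_cache_type != "f32" && kv_cache_type != "f16" &&
        kv_cache_type != "q8_0") {
        return kExitFail;
    }
    std::ifstream model(gguf_path, std::ios::binary);
    if (!model.good()) {
        return kExitSkip;  // missing GGUF: not a regression, just not runnable here
    }
    return kExitPass;
}
```

Taking the cache type only from the explicit argument, with no environment lookup, matches the intent stated above: an inherited variable cannot silently turn an f32 baseline run into a q8 run.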

Validation

| Configuration | Result |
| --- | --- |
| macOS Metal (M2), q8 MTL, es/fr/de/pt | ✅ synthesizes, no crash |
| iOS device (iPhone 17 Pro Max, A19 Pro Metal) | ✅ `runChatterboxMtlTest: PASS (2/2)`, q8 flash-attn kernel confirmed, zero CONT |
| perf (before/after `cont`→`cast`, f32/f16) | within run-to-run noise; change is perf-neutral; q8 ≈ f16 |

Refs QVAC-19557

🤖 Generated with Claude Code

ogad-tether · 2026-06-29T12:57:04Z

✅ macOS / Metal validation (local, M2)

Validated the q8-on-GPU path end-to-end via an Engine-level WAV harness (real synthesize() → 16-bit WAV → audio stats), with the guard removed:

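| lang | script | q8 dur | q8 rms | peak | nans/clip |
| --- | --- | --- | --- | --- | --- |
| en | Latin | 2.80s | .033 | .26 | 0/0 |
| es | Latin | 4.20s | .038 | .68 | 0/0 |
| fr | Latin | 4.64s | .037 | .33 | 0/0 |
| de | Latin | 3.24s | .044 | .40 | 0/0 |
| ar | Arabic | 5.12s | .051 | .42 | 0/0 |
| ru | Cyrillic | 3.96s | .033 | .33 | 0/0 |
| hi | Devanagari | 3.68s | .042 | .29 | 0/0 |
| ko | Hangul | 4.68s | .037 | .34 | 0/0 |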

All produce real, finite, non-clipped audio with no crash. q8-vs-f32 alignment tracks closely (e.g. ar q8 5.12s/rms .051 vs f32 5.00s/rms .051; ru q8 3.96s/.033 vs f32 4.40s/.034) — consistent with the fix making the align probe read the same q8 keys attention already uses.
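For reference, the columns in the table above (duration, rms, peak, NaN and clip counts) can be computed along these lines. This is a sketch, not the harness used for the numbers: `AudioStats`, `compute_audio_stats`, and `looks_like_real_audio` are hypothetical names, samples are assumed to be floats normalized to [-1, 1], and "clipped" is assumed to mean an absolute value at or above full scale.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct AudioStats {
    double duration_s;
    double rms;
    double peak;
    std::size_t nans;     // non-finite samples (NaN or infinity)
    std::size_t clipped;  // samples at or beyond full scale
};

// Compute basic sanity statistics over normalized float samples.
AudioStats compute_audio_stats(const std::vector<float>& samples, int sample_rate) {
    AudioStats s{0.0, 0.0, 0.0, 0, 0};
    double sum_sq = 0.0;
    std::size_t finite = 0;
    for (float x : samples) {
        if (!std::isfinite(x)) {
            ++s.nans;
            continue;  // keep non-finite values out of rms and peak
        }
        const double a = std::fabs(static_cast<double>(x));
        if (a >= 1.0) ++s.clipped;
        if (a > s.peak) s.peak = a;
        sum_sq += a * a;
        ++finite;
    }
    if (finite > 0) s.rms = std::sqrt(sum_sq / static_cast<double>(finite));
    if (sample_rate > 0) {
        s.duration_s = static_cast<double>(samples.size()) / sample_rate;
    }
    return s;
}

// "Real, finite, non-clipped audio": non-empty, non-silent, no NaNs, no clipping.
bool looks_like_real_audio(const AudioStats& s) {
    return s.duration_s > 0.0 && s.rms > 0.0 && s.nans == 0 && s.clipped == 0;
}
```

A check like this catches crashes, silence, and numeric blow-ups, but it cannot tell whether the speech is correct.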

Added committed coverage (this PR)

test_multilingual_synth: --kv-cache-type passthrough (+ CHATTERBOX_KV_CACHE_TYPE env) → mtl-synth-q8-<lang> ctest variants (en/ar/ru/hi). On a Metal fleet these hit the exact pre-fix SIGABRT path and validate the WAV is real audio.
test_metal_ops (gpu): asserts cast(q8_0 strided → f32) is supported on Metal (the op the fix relies on).

Still pending

On-device iOS + macOS run on the real ggml-speech addon build (next).

…guard-removal gap)

An adversarial audit of PR #71 flagged that fully removing chatterbox_mtl_guard_kv_type deleted the blanket "force f32 on any non-CPU backend" net, so a quantized KV request now reaches ALL GPU backends for the MTL variant. The shared chatterbox_resolve_kv_type only probes flash_attn_ext, NOT the dequantizing ggml_cast(q8_0 strided -> f32) the alignment probe emits every decode step. A GPU backend with thin op coverage (e.g. some OpenCL/Adreno or Mali-Vulkan builds) can advertise q8 flash-attn yet be unable to encode that cast, and because the MTL path runs a single-backend graph_compute (no scheduler fallback) it would SIGABRT at compute; i.e. removing the guard could trade the Metal crash for a crash on another backend.

Fix: chatterbox_mtl_resolve_kv_type wraps the shared resolve and additionally probes the strided q8->f32 cast via ggml_backend_supports_op, falling back to f32 only when the backend can't encode it. This is per-backend-correct: Metal (which supports the cast, verified) keeps q8 on the GPU, and any backend lacking the kernel safely degrades to f32 instead of crashing. Replaces the blunt "non-CPU -> f32" guard, which also blocked Metal (the original bug).

Validated (stock ggml Metal, M2): q8 MTL on Metal still retains q8 (no fallback, no crash, byte-identical sample count). test_kv_cache_type extended for the new resolve (cpu retains q8 / null -> f32 / f32 stays f32).

Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

GustavoA1604

Also

change target branch to master since we don't need to merge #70
squash history since some things were added and then removed
provide run of passing CI end-to-end

…probe cast)

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend with a quantized (q8_0) KV cache:

ggml_metal_op_encode_impl: error: unsupported op 'CONT' (in eval_step_mtl)

Root cause (it was NOT flash-attention): flash_attn_ext reads the q8 strided K/V cache fine on Metal. Walking the decode graph showed the only Metal-unsupported op was the per-(layer,head) alignment probe in build_llama_block, which ggml_cont'd a strided view of the q8 K cache to feed a mul_mat. ggml-metal has no CONT kernel for quantized tensors, and the MTL path runs a single-backend graph_compute (no scheduler fallback), so it crashed at encode time. The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not this CONT, which is why ggml-org#2527 (q8 KV as the default) shipped a broken MTL Metal path undetected.

Fix: replace that same-type ggml_cont with a dequantizing ggml_cast(...->f32). Metal supports a dequantizing copy of a strided quantized view (verified against ggml-metal source: Q8_0->F32 routes through the supported CPY path, with no contiguity check, whereas the old q8->q8 cont hit `default: return false`). For an f32/f16 cache the cast degrades to a cheap cont/upcast. This recovers q8 KV on the GPU: pure memory savings, no compute cost (ggml-metal's flash-attn runs its matmul at f16 internally regardless of KV storage dtype).

Cross-backend safety net: removing the blanket "f32 on any non-CPU backend" guard exposes q8 KV to all GPU backends. chatterbox_mtl_resolve_kv_type now probes the exact align-probe cast op per-backend and falls back to f32 when the backend can't encode it (e.g. thin-op OpenCL/Adreno or Mali-Vulkan builds), instead of a name/type check. Vulkan quantized K/V stays force-f32'd in chatterbox_resolve_kv_type (coopmat2). The pure decision is factored into chatterbox_mtl_kv_type_for_cast_support so the fallback branch is unit-testable.

Tests:
- test_kv_cache_type: chatterbox_mtl_resolve_kv_type pass-through + the cross-backend fallback branch (cast unsupported -> f32) via the pure helper.
- test_metal_ops (gpu): CAST(q8_0 strided -> f32) is supported on Metal and CONT(q8_0) is not; same 2D strided shape the align probe and resolve probe use, so the sentinel mirrors the real op.
- test_multilingual_synth: --kv-cache-type passthrough + mtl-synth-q8-<lang> ctest variants (en/ar/ru/hi). Missing GGUFs -> SKIP (77), not fail.
- test-eos-roundtrip-q8-kv: CER/ramble round-trip under a q8_0 KV cache to catch alignment/EOS drift from the dequant (WAV-sanity tests can't see it).

Validated on macOS Metal (M2) and on-device iOS (iPhone 17 Pro Max, A19 Pro Metal): q8 MTL synthesizes across es/fr/de/pt with no CONT crash; q8-vs-f32 perf is within run-to-run noise (the change is perf-neutral on the f32/f16 paths and q8 KV is no slower than f16).

Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ogad-tether · 2026-06-29T14:35:41Z

Thanks for the review — all five inline comments addressed (fixed + replied + resolved). On the three summary points:

Target → master ✅: base retargeted to master; #70 ("tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (fix Metal q8 CONT SIGABRT)") closed as superseded.
Squash ✅ — the branch is now one clean commit off master. The guard add→remove churn is squashed out; verified the net diff touches only the 8 q8 files (no chatterbox_mtl_guard_kv_type left, and the lavasr enhancer work on master is untouched — the branch was rebased onto current master).
End-to-end validation — this repo has no on-PR CI (tts-cpp is validated downstream via the @qvac/tts-ggml release), so here's the evidence:
- Local ctest: test_kv_cache_type all pass (incl. the new cross-backend fallback-branch tests); test_metal_ops gpu sentinel ok with the aligned 2D shape.
- macOS Metal (M2): q8 MTL synthesizes across es/fr/de/pt through the real addon + ggml-speech — no CONT crash.
- On-device iOS (iPhone 17 Pro Max, A19 Pro Metal): the addon built from this exact branch → runChatterboxMtlTest: PASS (2/2), q8 flash-attn kernel confirmed on-device, zero CONT crash.
- Perf: before/after cont→cast on f32/f16 is within run-to-run noise (perf-neutral), and q8 ≈ f16 — proven with an A/B where the cont build still crashes on q8 (confirming the swap was real).

Ready for re-review.

github-actions · 2026-06-29T14:35:51Z

Review Status

Current Status: ❌ PENDING
Approvals so far: none

Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.

…ual gap)

Follow-up to the q8-KV-on-GPU change in this PR. The OpenCL (Adreno) backend has the same advertise-vs-actual supports_op gap already guarded for Vulkan: it reports both the q8_0 flash-attn and the align-probe's strided q8->f32 cast as supported, but the driver SIGSEGVs on the quantized cache at model load (clEnqueueWriteBuffer inside tts_cpp::chatterbox::Engine::Engine). Removing the old blanket "f32 on any non-CPU backend" guard (so Metal could run q8 KV) re-exposed q8 as the default on every GPU backend, including Adreno.

Device-farm confirmed (QVAC-19557): the multilingual GPU load SIGSEGVs on a Samsung Galaxy S25 Ultra (Adreno) with a q8_0 KV cache, while the identical f16/f32 cache passes and a Pixel 9 (Mali-Vulkan, already force-f32'd above) loads all 10 tests fine. iOS/Metal (validated) and the Vulkan coopmat2 path were already covered; OpenCL/Adreno was the one unguarded GPU family.

Mirror the Vulkan guard in chatterbox_resolve_kv_type: force quantized K/V to f32 on OpenCL via backend_is_opencl(). q8 KV stays on Metal (validated); f16/f32 are unaffected. Like the Vulkan guard this is inline (no GPU backend in the linux-x64 cpp-tests); its authoritative test is the Android device-farm E2E (S25 Ultra + Pixel 9).

Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
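The ordering this commit describes, a backend-family guard applied before any capability probe is trusted, can be sketched as below. All types and names are hypothetical; the real code checks the backend through helpers such as backend_is_opencl() and queries ggml for op support. The sketch only illustrates why an advertised capability alone is not sufficient on these backends.

```cpp
// Hypothetical types for illustration; not the project's API.
enum class KvType { F32, F16, Q8_0 };
enum class BackendFamily { Cpu, Metal, Vulkan, OpenCL };

struct BackendInfo {
    BackendFamily family;
    bool advertises_q8_flash_attn;  // what the backend reports for flash_attn_ext
    bool advertises_q8_cast;        // what it reports for the strided q8 -> f32 cast
};

KvType resolve_kv_type(KvType requested, const BackendInfo& backend) {
    if (requested != KvType::Q8_0) {
        return requested;  // f16/f32 requests are unaffected
    }
    // Family guard first: these backends have been seen to report support
    // and still crash on a quantized cache, so their answers are not trusted.
    if (backend.family == BackendFamily::Vulkan ||
        backend.family == BackendFamily::OpenCL) {
        return KvType::F32;
    }
    // Everywhere else, believe the probes and fall back if either op is missing.
    if (!backend.advertises_q8_flash_attn || !backend.advertises_q8_cast) {
        return KvType::F32;
    }
    return KvType::Q8_0;
}
```

With this ordering, a backend that claims both ops (the Adreno case above) still resolves to f32, while Metal keeps q8 because it passes the probes and is not in a guarded family.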

ogad-tether requested review from a team as code owners June 29, 2026 12:39

GustavoA1604 requested changes Jun 29, 2026

Comment thread tts-cpp/src/main.cpp Outdated

Comment thread tts-cpp/src/t3_mtl.cpp

Comment thread tts-cpp/test/test_multilingual_synth.cpp Outdated

Comment thread tts-cpp/src/main.cpp

Comment thread tts-cpp/CMakeLists.txt

ogad-tether force-pushed the feat/qvac-19557-mtl-q8-gpu-real-fix branch from 073c7b7 to 12749a0 Compare June 29, 2026 14:32

ogad-tether changed the title ~~tts-cpp: chatterbox-mtl — real q8-on-GPU fix (dequant align probe; drop f32 guard)~~ tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast) Jun 29, 2026

ogad-tether changed the base branch from feat/qvac-19557-mtl-metal-q8-guard to master June 29, 2026 14:33

ogad-tether mentioned this pull request Jun 29, 2026

tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (fix Metal q8 CONT SIGABRT) #70

Closed

ogad-tether requested a review from GustavoA1604 June 29, 2026 15:03

ogad-tether self-assigned this Jun 29, 2026

ogad-tether wants to merge 2 commits into master from feat/qvac-19557-mtl-q8-gpu-real-fix
