tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast)#71
tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast)#71ogad-tether wants to merge 2 commits into
Conversation
✅ macOS / Metal validation (local, M2)Validated the q8-on-GPU path end-to-end via an Engine-level WAV harness (real
All produce real, finite, non-clipped audio with no crash. q8-vs-f32 alignment tracks closely (e.g. ar q8 5.12s/rms .051 vs f32 5.00s/rms .051; ru q8 3.96s/.033 vs f32 4.40s/.034) — consistent with the fix making the align probe read the same q8 keys attention already uses. Added committed coverage (this PR)
Still pending
|
…guard-removal gap) An adversarial audit of PR #71 flagged that fully removing chatterbox_mtl_guard_kv_type deleted the blanket "force f32 on any non-CPU backend" net, so a quantized KV request now reaches ALL GPU backends for the MTL variant. The shared chatterbox_resolve_kv_type only probes flash_attn_ext — NOT the dequantizing ggml_cast(q8_0 strided -> f32) the alignment probe emits every decode step. A GPU backend with thin op coverage (e.g. some OpenCL/Adreno or Mali-Vulkan builds) can advertise q8 flash-attn yet be unable to encode that cast, and because the MTL path runs a single-backend graph_compute (no scheduler fallback) it would SIGABRT at compute — i.e. removing the guard could trade the Metal crash for a crash on another backend. Fix: chatterbox_mtl_resolve_kv_type wraps the shared resolve and additionally probes the strided q8->f32 cast via ggml_backend_supports_op, falling back to f32 only when the backend can't encode it. This is per-backend-correct: Metal (which supports the cast — verified) keeps q8 on the GPU, and any backend lacking the kernel safely degrades to f32 instead of crashing. Replaces the blunt "non-CPU -> f32" guard, which also blocked Metal (the original bug). Validated (stock ggml Metal, M2): q8 MTL on Metal still retains q8 (no fallback, no crash, byte-identical sample count). test_kv_cache_type extended for the new resolve (cpu retains q8 / null -> f32 / f32 stays f32). Refs QVAC-19557 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GustavoA1604
left a comment
There was a problem hiding this comment.
Also
- change target branch to master since we don't need to merge #70
- squash history since some things were added and then removed
- provide run of passing CI end-to-end
…probe cast)
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the
Metal backend with a quantized (q8_0) KV cache:
ggml_metal_op_encode_impl: error: unsupported op 'CONT' (in eval_step_mtl)
Root cause (it was NOT flash-attention): flash_attn_ext reads the q8 strided
K/V cache fine on Metal. Walking the decode graph showed the only
Metal-unsupported op was the per-(layer,head) alignment probe in
build_llama_block, which ggml_cont'd a strided view of the q8 K cache to feed a
mul_mat. ggml-metal has no CONT kernel for quantized tensors, and the MTL path
runs a single-backend graph_compute (no scheduler fallback), so it crashed at
encode time. The capability probe in chatterbox_resolve_kv_type only validates
flash_attn_ext, not this CONT — which is why ggml-org#2527 (q8 KV as the default)
shipped a broken MTL Metal path undetected.
Fix: replace that same-type ggml_cont with a dequantizing ggml_cast(...->f32).
Metal supports a dequantizing copy of a strided quantized view (verified against
ggml-metal source: Q8_0->F32 routes through the supported CPY path, with no
contiguity check, whereas the old q8->q8 cont hit `default: return false`).
For an f32/f16 cache the cast degrades to a cheap cont/upcast. This recovers q8
KV on the GPU — pure memory savings, no compute cost (ggml-metal's flash-attn
runs its matmul at f16 internally regardless of KV storage dtype).
Cross-backend safety net: removing the blanket "f32 on any non-CPU backend"
guard exposes q8 KV to all GPU backends. chatterbox_mtl_resolve_kv_type now
probes the exact align-probe cast op per-backend and falls back to f32 when the
backend can't encode it (e.g. thin-op OpenCL/Adreno or Mali-Vulkan builds),
instead of a name/type check. Vulkan quantized K/V stays force-f32'd in
chatterbox_resolve_kv_type (coopmat2). The pure decision is factored into
chatterbox_mtl_kv_type_for_cast_support so the fallback branch is unit-testable.
Tests:
- test_kv_cache_type: chatterbox_mtl_resolve_kv_type pass-through + the
cross-backend fallback branch (cast unsupported -> f32) via the pure helper.
- test_metal_ops (gpu): CAST(q8_0 strided -> f32) is supported on Metal and
CONT(q8_0) is not — same 2D strided shape the align probe and resolve probe
use, so the sentinel mirrors the real op.
- test_multilingual_synth: --kv-cache-type passthrough + mtl-synth-q8-<lang>
ctest variants (en/ar/ru/hi). Missing GGUFs -> SKIP (77), not fail.
- test-eos-roundtrip-q8-kv: CER/ramble round-trip under a q8_0 KV cache to
catch alignment/EOS drift from the dequant (WAV-sanity tests can't see it).
Validated on macOS Metal (M2) and on-device iOS (iPhone 17 Pro Max, A19 Pro
Metal): q8 MTL synthesizes across es/fr/de/pt with no CONT crash; q8-vs-f32 perf
is within run-to-run noise (the change is perf-neutral on the f32/f16 paths and
q8 KV is no slower than f16).
Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
073c7b7 to
12749a0
Compare
|
Thanks for the review — all five inline comments addressed (fixed + replied + resolved). On the three summary points:
Ready for re-review. |
Review StatusCurrent Status: ❌ PENDING Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member. |
…ual gap) Follow-up to the q8-KV-on-GPU change in this PR. The OpenCL (Adreno) backend has the same advertise-vs-actual supports_op gap already guarded for Vulkan: it reports both the q8_0 flash-attn and the align-probe's strided q8->f32 cast as supported, but the driver SIGSEGVs on the quantized cache at model load (clEnqueueWriteBuffer inside tts_cpp::chatterbox::Engine::Engine). Removing the old blanket "f32 on any non-CPU backend" guard (so Metal could run q8 KV) re-exposed q8 as the default on every GPU backend, including Adreno. Device-farm confirmed (QVAC-19557): the multilingual GPU load SIGSEGVs on a Samsung Galaxy S25 Ultra (Adreno) with a q8_0 KV cache, while the identical f16/f32 cache passes and a Pixel 9 (Mali-Vulkan, already force-f32'd above) loads all 10 tests fine. iOS/Metal (validated) and the Vulkan coopmat2 path were already covered; OpenCL/Adreno was the one unguarded GPU family. Mirror the Vulkan guard in chatterbox_resolve_kv_type: force quantized K/V to f32 on OpenCL via backend_is_opencl(). q8 KV stays on Metal (validated); f16/f32 are unaffected. Like the Vulkan guard this is inline (no GPU backend in the linux-x64 cpp-tests); its authoritative test is the Android device-farm E2E (S25 Ultra + Pixel 9). Refs QVAC-19557 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on Metal with a quantized (q8_0) KV cache (
unsupported op 'CONT'ineval_step_mtl) — reported from the field on macOS Apple Silicon withuseGPU:trueon@qvac/tts-ggml@0.3.x.#2527made q8 KV the default for all variants, exposing a never-Metal-tested MTL path.Root cause (not flash-attention)
flash_attn_extreads the q8 strided K/V cache fine on Metal. Walking the decode graph, the only Metal-unsupported op was the per-(layer,head) alignment probe (build_llama_block), whichggml_cont'd a strided view of the q8 K cache to feed amul_mat. ggml-metal has no quantizedCONTkernel, and the MTL path runs a single-backendgraph_compute(no scheduler fallback), so it crashed at encode. The resolve probe validatedflash_attn_extbut not this CONT.Fix
Replace the same-type
ggml_contwith a dequantizingggml_cast(…→f32)— Metal supports a dequant copy of a strided quantized view (verified vs ggml-metal source:Q8_0→F32takes the supportedCPYpath; the oldq8→q8cont hitdefault: return false). Recovers q8 KV on the GPU — pure memory savings, no compute cost (Metal flash-attn runs its matmul at f16 internally regardless of KV storage dtype).chatterbox_mtl_resolve_kv_typeprobes that exact cast op per-backend and falls back to f32 where a GPU backend can't encode it (thin-op OpenCL/Adreno, Mali-Vulkan) — replacing the blanket "f32 on any non-CPU backend" guard (which also blocked Metal, the whole bug). Vulkan q8 stays force-f32'd (coopmat2).Tests
test_kv_cache_type: resolve pass-through + the cross-backend fallback branch (cast-unsupported → f32) via a pure helper.test_metal_ops(gpu):CAST(q8_0 strided→f32)supported /CONT(q8_0)not — same 2D strided shape as the align probe + resolve probe (all three mirror).test_multilingual_synth:--kv-cache-type+mtl-synth-q8-<lang>variants (en/ar/ru/hi); missing GGUFs → SKIP(77) not fail; explicit flag (no env fallback that could flip f32 baselines).test-eos-roundtrip-q8-kv: CER/ramble round-trip under q8 KV → catches alignment/EOS drift from the dequant.Validation
runChatterboxMtlTest: PASS (2/2), q8 flash-attn kernel confirmed, zero CONTcont→cast, f32/f16)Refs QVAC-19557
🤖 Generated with Claude Code