tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast)#71

Open
ogad-tether wants to merge 2 commits into master from feat/qvac-19557-mtl-q8-gpu-real-fix
Conversation

@ogad-tether ogad-tether commented Jun 29, 2026

Problem

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on Metal with a quantized (q8_0) KV cache (unsupported op 'CONT' in eval_step_mtl) — reported from the field on macOS Apple Silicon with useGPU:true on @qvac/tts-ggml@0.3.x. #2527 made q8 KV the default for all variants, exposing a never-Metal-tested MTL path.

Root cause (not flash-attention)

flash_attn_ext reads the q8 strided K/V cache fine on Metal. Walking the decode graph, the only Metal-unsupported op was the per-(layer,head) alignment probe (build_llama_block), which ggml_cont'd a strided view of the q8 K cache to feed a mul_mat. ggml-metal has no quantized CONT kernel, and the MTL path runs a single-backend graph_compute (no scheduler fallback), so it crashed at encode. The resolve probe validated flash_attn_ext but not this CONT.

Fix

Replace the same-type ggml_cont with a dequantizing ggml_cast(…→f32) — Metal supports a dequant copy of a strided quantized view (verified vs ggml-metal source: Q8_0→F32 takes the supported CPY path; the old q8→q8 cont hit default: return false). Recovers q8 KV on the GPU — pure memory savings, no compute cost (Metal flash-attn runs its matmul at f16 internally regardless of KV storage dtype).
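
To make "dequantizing copy of a strided quantized view" concrete, here is a minimal standalone sketch of what such a cast computes. It does not use ggml: `BlockQ8`, `quantize_row`, and `cast_strided_to_f32` are hypothetical names, and the block layout is simplified (the scale is a `float` here for brevity, which is an assumption and not ggml's actual storage format).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a q8_0 block: 32 int8 quants sharing one scale.
// Assumption: a float scale is used here for brevity.
constexpr int kBlock = 32;
struct BlockQ8 {
    float  d;           // per-block scale
    int8_t qs[kBlock];  // quantized values
};

// Quantize n floats (n must be a multiple of kBlock) into n / kBlock blocks.
inline void quantize_row(const float* x, BlockQ8* out, int n) {
    assert(n % kBlock == 0);
    for (int b = 0; b < n / kBlock; ++b) {
        float amax = 0.0f;
        for (int i = 0; i < kBlock; ++i) {
            amax = std::fmax(amax, std::fabs(x[b * kBlock + i]));
        }
        const float d  = amax / 127.0f;
        const float id = d > 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < kBlock; ++i) {
            out[b].qs[i] = static_cast<int8_t>(std::lround(x[b * kBlock + i] * id));
        }
    }
}

// Dequantizing copy of a strided 2D view: row r starts at
// base + r * row_stride_blocks. The result is dense (contiguous) f32,
// which is what a following matmul can consume directly.
inline std::vector<float> cast_strided_to_f32(const BlockQ8* base, int n_rows,
                                              int n_cols, size_t row_stride_blocks) {
    assert(n_cols % kBlock == 0);
    std::vector<float> dst(static_cast<size_t>(n_rows) * n_cols);
    for (int r = 0; r < n_rows; ++r) {
        const BlockQ8* row = base + static_cast<size_t>(r) * row_stride_blocks;
        for (int b = 0; b < n_cols / kBlock; ++b) {
            for (int i = 0; i < kBlock; ++i) {
                dst[static_cast<size_t>(r) * n_cols + b * kBlock + i] =
                    row[b].d * static_cast<float>(row[b].qs[i]);
            }
        }
    }
    return dst;
}
```

With a cache laid out as one block per (token, head), selecting a single head is a view with a row stride of `n_head` blocks. The copy gathers and dequantizes in one pass, so no quantized-to-quantized contiguous copy is needed.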

chatterbox_mtl_resolve_kv_type probes that exact cast op per-backend and falls back to f32 where a GPU backend can't encode it (thin-op OpenCL/Adreno, Mali-Vulkan) — replacing the blanket "f32 on any non-CPU backend" guard (which also blocked Metal, the whole bug). Vulkan q8 stays force-f32'd (coopmat2).

This PR is now standalone off master and supersedes the f32-fallback stopgap — #70 is closed, and the guard add/remove churn is squashed out.

Tests

  • test_kv_cache_type: resolve pass-through + the cross-backend fallback branch (cast-unsupported → f32) via a pure helper.
  • test_metal_ops (gpu): CAST(q8_0 strided→f32) supported / CONT(q8_0) not — same 2D strided shape as the align probe + resolve probe (all three mirror).
  • test_multilingual_synth: --kv-cache-type + mtl-synth-q8-<lang> variants (en/ar/ru/hi); missing GGUFs → SKIP(77) not fail; explicit flag (no env fallback that could flip f32 baselines).
  • test-eos-roundtrip-q8-kv: CER/ramble round-trip under q8 KV → catches alignment/EOS drift from the dequant.
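
The "missing GGUFs → SKIP(77)" behaviour can be sketched as below. This is an illustration only: `run_or_skip` and `file_readable` are hypothetical helpers, and it assumes the ctest entry maps exit code 77 to "skipped" (CTest supports this through the `SKIP_RETURN_CODE` test property).

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Exit codes for the test binary. 77 is the skip code named in the PR.
constexpr int kExitPass = 0;
constexpr int kExitFail = 1;
constexpr int kExitSkip = 77;

// True when the file can be opened for reading.
inline bool file_readable(const std::string& path) {
    std::ifstream f(path, std::ios::binary);
    return f.good();
}

// Runs run_synth(model_path) only when the model file exists; otherwise
// reports "skip" so a missing optional asset is not counted as a failure.
// run_synth is a hypothetical callable returning true on success.
template <class Fn>
int run_or_skip(const std::string& model_path, Fn&& run_synth) {
    if (!file_readable(model_path)) {
        return kExitSkip;
    }
    return run_synth(model_path) ? kExitPass : kExitFail;
}
```

A test `main` would return the value of `run_or_skip(...)` directly, so the skip decision is made before any model loading is attempted.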

Validation

| | result |
|---|---|
| macOS Metal (M2), q8 MTL, es/fr/de/pt | ✅ synthesizes, no crash |
| iOS device (iPhone 17 Pro Max, A19 Pro Metal) | runChatterboxMtlTest: PASS (2/2), q8 flash-attn kernel confirmed, zero CONT |
| perf (before/after cont→cast, f32/f16) | within run-to-run noise; change is perf-neutral; q8 ≈ f16 |

Refs QVAC-19557

🤖 Generated with Claude Code

@ogad-tether ogad-tether requested review from a team as code owners June 29, 2026 12:39
@ogad-tether

Author

✅ macOS / Metal validation (local, M2)

Validated the q8-on-GPU path end-to-end via an Engine-level WAV harness (real synthesize() → 16-bit WAV → audio stats), with the guard removed:

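| lang | script | q8 dur | q8 rms | peak | nans/clip |
|---|---|---|---|---|---|
| en | Latin | 2.80s | .033 | .26 | 0/0 |
| es | Latin | 4.20s | .038 | .68 | 0/0 |
| fr | Latin | 4.64s | .037 | .33 | 0/0 |
| de | Latin | 3.24s | .044 | .40 | 0/0 |
| ar | Arabic | 5.12s | .051 | .42 | 0/0 |
| ru | Cyrillic | 3.96s | .033 | .33 | 0/0 |
| hi | Devanagari | 3.68s | .042 | .29 | 0/0 |
| ko | Hangul | 4.68s | .037 | .34 | 0/0 |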

All produce real, finite, non-clipped audio with no crash. q8-vs-f32 alignment tracks closely (e.g. ar q8 5.12s/rms .051 vs f32 5.00s/rms .051; ru q8 3.96s/.033 vs f32 4.40s/.034) — consistent with the fix making the align probe read the same q8 keys attention already uses.
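
For readers who want to reproduce the columns above, the following is a rough sketch of how duration, RMS, peak, and the non-finite/clip counts could be computed from float PCM. The harness used for this PR is not shown in the thread, so `AudioStats` and `audio_stats` are hypothetical, and treating `|sample| >= 1.0` as clipped is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct AudioStats {
    double duration_s;  // total samples / sample rate
    double rms;         // root mean square over finite samples
    double peak;        // max |sample| over finite samples
    size_t non_finite;  // NaN or +/-inf samples
    size_t clipped;     // samples at or beyond full scale (assumed |v| >= 1.0)
};

// Computes sanity statistics for mono float PCM in the nominal [-1, 1] range.
inline AudioStats audio_stats(const std::vector<float>& pcm, int sample_rate) {
    assert(sample_rate > 0);
    AudioStats s{};
    double sum_sq = 0.0;
    size_t finite = 0;
    for (float v : pcm) {
        if (!std::isfinite(v)) {
            ++s.non_finite;
            continue;
        }
        const double a = std::fabs(static_cast<double>(v));
        if (a >= 1.0) ++s.clipped;
        s.peak = std::max(s.peak, a);
        sum_sq += a * a;
        ++finite;
    }
    s.duration_s = static_cast<double>(pcm.size()) / sample_rate;
    s.rms = finite ? std::sqrt(sum_sq / static_cast<double>(finite)) : 0.0;
    return s;
}
```

A "real audio" check in this style would require zero non-finite and clipped samples plus a non-trivial RMS, which is the shape of evidence the table reports.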

Added committed coverage (this PR)

  • test_multilingual_synth: --kv-cache-type passthrough (+ CHATTERBOX_KV_CACHE_TYPE env) → mtl-synth-q8-<lang> ctest variants (en/ar/ru/hi). On a Metal fleet these hit the exact pre-fix SIGABRT path and validate the WAV is real audio.
  • test_metal_ops (gpu): asserts cast(q8_0 strided → f32) is supported on Metal (the op the fix relies on).

Still pending

  • On-device iOS + macOS run on the real ggml-speech addon build (next).

ogad-tether added a commit that referenced this pull request Jun 29, 2026
…guard-removal gap)

An adversarial audit of PR #71 flagged that fully removing chatterbox_mtl_guard_kv_type
deleted the blanket "force f32 on any non-CPU backend" net, so a quantized KV
request now reaches ALL GPU backends for the MTL variant. The shared
chatterbox_resolve_kv_type only probes flash_attn_ext — NOT the dequantizing
ggml_cast(q8_0 strided -> f32) the alignment probe emits every decode step. A GPU
backend with thin op coverage (e.g. some OpenCL/Adreno or Mali-Vulkan builds) can
advertise q8 flash-attn yet be unable to encode that cast, and because the MTL
path runs a single-backend graph_compute (no scheduler fallback) it would SIGABRT
at compute — i.e. removing the guard could trade the Metal crash for a crash on
another backend.

Fix: chatterbox_mtl_resolve_kv_type wraps the shared resolve and additionally
probes the strided q8->f32 cast via ggml_backend_supports_op, falling back to f32
only when the backend can't encode it. This is per-backend-correct: Metal (which
supports the cast — verified) keeps q8 on the GPU, and any backend lacking the
kernel safely degrades to f32 instead of crashing. Replaces the blunt
"non-CPU -> f32" guard, which also blocked Metal (the original bug).

Validated (stock ggml Metal, M2): q8 MTL on Metal still retains q8 (no fallback,
no crash, byte-identical sample count). test_kv_cache_type extended for the new
resolve (cpu retains q8 / null -> f32 / f32 stays f32).

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@GustavoA1604 GustavoA1604 left a comment

Also

  1. change target branch to master since we don't need to merge #70
  2. squash history since some things were added and then removed
  3. provide run of passing CI end-to-end

Comment thread tts-cpp/src/main.cpp Outdated
Comment thread tts-cpp/src/t3_mtl.cpp
Comment thread tts-cpp/test/test_multilingual_synth.cpp Outdated
Comment thread tts-cpp/src/main.cpp
Comment thread tts-cpp/CMakeLists.txt
…probe cast)

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the
Metal backend with a quantized (q8_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT'   (in eval_step_mtl)

Root cause (it was NOT flash-attention): flash_attn_ext reads the q8 strided
K/V cache fine on Metal. Walking the decode graph showed the only
Metal-unsupported op was the per-(layer,head) alignment probe in
build_llama_block, which ggml_cont'd a strided view of the q8 K cache to feed a
mul_mat. ggml-metal has no CONT kernel for quantized tensors, and the MTL path
runs a single-backend graph_compute (no scheduler fallback), so it crashed at
encode time. The capability probe in chatterbox_resolve_kv_type only validates
flash_attn_ext, not this CONT — which is why ggml-org#2527 (q8 KV as the default)
shipped a broken MTL Metal path undetected.

Fix: replace that same-type ggml_cont with a dequantizing ggml_cast(...->f32).
Metal supports a dequantizing copy of a strided quantized view (verified against
ggml-metal source: Q8_0->F32 routes through the supported CPY path, with no
contiguity check, whereas the old q8->q8 cont hit `default: return false`).
For an f32/f16 cache the cast degrades to a cheap cont/upcast. This recovers q8
KV on the GPU — pure memory savings, no compute cost (ggml-metal's flash-attn
runs its matmul at f16 internally regardless of KV storage dtype).

Cross-backend safety net: removing the blanket "f32 on any non-CPU backend"
guard exposes q8 KV to all GPU backends. chatterbox_mtl_resolve_kv_type now
probes the exact align-probe cast op per-backend and falls back to f32 when the
backend can't encode it (e.g. thin-op OpenCL/Adreno or Mali-Vulkan builds),
instead of a name/type check. Vulkan quantized K/V stays force-f32'd in
chatterbox_resolve_kv_type (coopmat2). The pure decision is factored into
chatterbox_mtl_kv_type_for_cast_support so the fallback branch is unit-testable.
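
As a rough illustration of why factoring out the decision makes it unit-testable, the sketch below separates the pure choice from the backend probe. The real signature of chatterbox_mtl_kv_type_for_cast_support is not shown in this thread, so `KvType` and `kv_type_for_cast_support` here are hypothetical stand-ins.

```cpp
#include <cassert>

// Hypothetical stand-in for the KV cache storage types discussed above.
enum class KvType { F32, F16, Q8_0 };

constexpr bool is_quantized(KvType t) { return t == KvType::Q8_0; }

// Pure decision: keep the requested type unless it is quantized and the
// backend cannot encode the strided q8 -> f32 cast; in that case use f32.
// The caller obtains cast_supported from a backend probe, so this function
// needs no GPU and can be tested on any host.
constexpr KvType kv_type_for_cast_support(KvType requested, bool cast_supported) {
    return (is_quantized(requested) && !cast_supported) ? KvType::F32 : requested;
}

static_assert(kv_type_for_cast_support(KvType::Q8_0, true) == KvType::Q8_0,
              "q8 is retained when the cast is supported");
static_assert(kv_type_for_cast_support(KvType::Q8_0, false) == KvType::F32,
              "q8 degrades to f32 when the cast is unsupported");
```

Because the probe result arrives as a plain `bool`, the fallback branch can be exercised in CPU-only CI without a backend that actually lacks the kernel.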

Tests:
  - test_kv_cache_type: chatterbox_mtl_resolve_kv_type pass-through + the
    cross-backend fallback branch (cast unsupported -> f32) via the pure helper.
  - test_metal_ops (gpu): CAST(q8_0 strided -> f32) is supported on Metal and
    CONT(q8_0) is not — same 2D strided shape the align probe and resolve probe
    use, so the sentinel mirrors the real op.
  - test_multilingual_synth: --kv-cache-type passthrough + mtl-synth-q8-<lang>
    ctest variants (en/ar/ru/hi). Missing GGUFs -> SKIP (77), not fail.
  - test-eos-roundtrip-q8-kv: CER/ramble round-trip under a q8_0 KV cache to
    catch alignment/EOS drift from the dequant (WAV-sanity tests can't see it).
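
For context on the CER metric used by the round-trip test, here is a minimal sketch of character error rate as Levenshtein distance over code points divided by the reference length. How the actual test transcribes audio and which threshold it applies are not stated in this thread; `edit_distance` and `cer` are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance over UTF-32 code points, using two rolling rows.
inline size_t edit_distance(const std::u32string& a, const std::u32string& b) {
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j) {
            const size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            cur[j] = std::min({prev[j] + 1,          // deletion
                               cur[j - 1] + 1,       // insertion
                               prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Character error rate: edits needed to turn the hypothesis into the
// reference, normalized by the reference length. Can exceed 1.0.
inline double cer(const std::u32string& reference, const std::u32string& hypothesis) {
    assert(!reference.empty());
    return static_cast<double>(edit_distance(reference, hypothesis)) /
           static_cast<double>(reference.size());
}
```

A run that rambles past the intended end of speech produces a hypothesis much longer than the reference, so its CER grows past 1.0. That is a failure mode a WAV sanity check (finite, non-clipped samples) would not detect.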

Validated on macOS Metal (M2) and on-device iOS (iPhone 17 Pro Max, A19 Pro
Metal): q8 MTL synthesizes across es/fr/de/pt with no CONT crash; q8-vs-f32 perf
is within run-to-run noise (the change is perf-neutral on the f32/f16 paths and
q8 KV is no slower than f16).

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ogad-tether ogad-tether force-pushed the feat/qvac-19557-mtl-q8-gpu-real-fix branch from 073c7b7 to 12749a0 Compare June 29, 2026 14:32
@ogad-tether ogad-tether changed the title tts-cpp: chatterbox-mtl — real q8-on-GPU fix (dequant align probe; drop f32 guard) tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast) Jun 29, 2026
@ogad-tether ogad-tether changed the base branch from feat/qvac-19557-mtl-metal-q8-guard to master June 29, 2026 14:33
@ogad-tether

Author

Thanks for the review — all five inline comments addressed (fixed + replied + resolved). On the three summary points:

  1. Target → master ✅ — base retargeted to master; tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (fix Metal q8 CONT SIGABRT) #70 closed as superseded.

  2. Squash ✅ — the branch is now one clean commit off master. The guard add→remove churn is squashed out; verified the net diff touches only the 8 q8 files (no chatterbox_mtl_guard_kv_type left, and the lavasr enhancer work on master is untouched — the branch was rebased onto current master).

  3. End-to-end validation — this repo has no on-PR CI (tts-cpp is validated downstream via the @qvac/tts-ggml release), so here's the evidence:

    • Local ctest: test_kv_cache_type all pass (incl. the new cross-backend fallback-branch tests); test_metal_ops gpu sentinel ok with the aligned 2D shape.
    • macOS Metal (M2): q8 MTL synthesizes across es/fr/de/pt through the real addon + ggml-speech — no CONT crash.
    • On-device iOS (iPhone 17 Pro Max, A19 Pro Metal): the addon built from this exact branch → runChatterboxMtlTest: PASS (2/2), q8 flash-attn kernel confirmed on-device, zero CONT crash.
    • Perf: before/after cont→cast on f32/f16 is within run-to-run noise (perf-neutral), and q8 ≈ f16. This was proven with an A/B where the cont build still crashes on q8 (confirming the swap was real).

Ready for re-review.

@github-actions


Review Status

Current Status: ❌ PENDING
Approvals so far: none

Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.

@ogad-tether ogad-tether requested a review from GustavoA1604 June 29, 2026 15:03
@ogad-tether ogad-tether self-assigned this Jun 29, 2026
…ual gap)

Follow-up to the q8-KV-on-GPU change in this PR. The OpenCL (Adreno) backend
has the same advertise-vs-actual supports_op gap already guarded for Vulkan:
it reports both the q8_0 flash-attn and the align-probe's strided q8->f32 cast
as supported, but the driver SIGSEGVs on the quantized cache at model load
(clEnqueueWriteBuffer inside tts_cpp::chatterbox::Engine::Engine).

Removing the old blanket "f32 on any non-CPU backend" guard (so Metal could
run q8 KV) re-exposed q8 as the default on every GPU backend, including Adreno.
Device-farm confirmed (QVAC-19557): the multilingual GPU load SIGSEGVs on a
Samsung Galaxy S25 Ultra (Adreno) with a q8_0 KV cache, while the identical
f16/f32 cache passes and a Pixel 9 (Mali-Vulkan, already force-f32'd above)
loads all 10 tests fine. iOS/Metal (validated) and the Vulkan coopmat2 path
were already covered; OpenCL/Adreno was the one unguarded GPU family.

Mirror the Vulkan guard in chatterbox_resolve_kv_type: force quantized K/V to
f32 on OpenCL via backend_is_opencl(). q8 KV stays on Metal (validated);
f16/f32 are unaffected. Like the Vulkan guard this is inline (no GPU backend in
the linux-x64 cpp-tests); its authoritative test is the Android device-farm E2E
(S25 Ultra + Pixel 9).
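
The guard described above can be pictured with the following standalone sketch. It is not the code in chatterbox_resolve_kv_type: the `Backend` enum stands in for checks such as backend_is_opencl(), and falling back to f32 when the flash-attn probe fails is an assumption made for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins; the real code inspects a ggml backend handle.
enum class Backend { CPU, Metal, Vulkan, OpenCL };
enum class KvType { F32, F16, Q8_0 };

constexpr bool is_quantized(KvType t) { return t == KvType::Q8_0; }

// Backend families whose supports_op answer is not trusted for quantized K/V
// in this PR, so they are guarded by identity and not by a probe.
constexpr bool forces_f32_for_quantized(Backend b) {
    return b == Backend::Vulkan || b == Backend::OpenCL;
}

// Quantized requests survive only on backends that are not guarded and that
// pass the flash-attn probe. Non-quantized requests pass through unchanged.
constexpr KvType resolve_kv_type(Backend b, KvType requested, bool flash_attn_supported) {
    if (!is_quantized(requested)) return requested;
    if (forces_f32_for_quantized(b)) return KvType::F32;
    if (!flash_attn_supported) return KvType::F32;  // assumed fallback target
    return requested;
}
```

The identity check runs before the probe on purpose: on a backend that over-reports support, the probe result cannot be trusted, so consulting it first would defeat the guard.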

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>