tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast)

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the
Metal backend with a quantized (q8_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT' (in eval_step_mtl)

Root cause (it was NOT flash-attention): flash_attn_ext reads the q8 strided
K/V cache fine on Metal. Walking the decode graph showed the only
Metal-unsupported op was the per-(layer,head) alignment probe in
build_llama_block, which ggml_cont'd a strided view of the q8 K cache to feed a
mul_mat. ggml-metal has no CONT kernel for quantized tensors, and the MTL path
runs a single-backend graph_compute (no scheduler fallback), so it crashed at
encode time. The capability probe in chatterbox_resolve_kv_type only validates
flash_attn_ext, not this CONT — which is why ggml-org#2527 (q8 KV as the default)
shipped a broken MTL Metal path undetected.

Fix: replace that same-type ggml_cont with a dequantizing ggml_cast(...->f32).
Metal supports a dequantizing copy of a strided quantized view (verified against
ggml-metal source: Q8_0->F32 routes through the supported CPY path, with no
contiguity check, whereas the old q8->q8 cont hit `default: return false`).
For an f32/f16 cache the cast degrades to a cheap cont/upcast. This recovers q8
KV on the GPU: the memory savings come at no measurable compute cost, since
ggml-metal's flash-attn runs its matmul at f16 internally regardless of the KV
storage dtype.
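
For illustration only, here is a minimal CPU-side sketch of what a dequantizing
copy of a strided q8_0 view computes. It is not the ggml-metal kernel. BlockQ8
and dequant_strided_rows are hypothetical names, and the per-block scale is a
float here for portability, whereas ggml's q8_0 block stores it as fp16.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a q8_0 block: 32 int8 quants sharing one scale.
// Assumption: float scale instead of ggml's fp16, to keep the sketch portable.
constexpr int QK = 32;
struct BlockQ8 {
    float  d;       // per-block scale
    int8_t qs[QK];  // quantized values
};

// Dequantize a 2D strided view (n_rows x n_cols values) into contiguous f32.
// row_stride_blocks is the distance between consecutive rows, in blocks, and
// may be larger than n_cols / QK (that gap is what makes the view strided).
std::vector<float> dequant_strided_rows(const BlockQ8 * base,
                                        std::size_t n_rows,
                                        std::size_t n_cols,
                                        std::size_t row_stride_blocks) {
    assert(n_cols % QK == 0);
    const std::size_t blocks_per_row = n_cols / QK;
    std::vector<float> out(n_rows * n_cols);
    for (std::size_t r = 0; r < n_rows; ++r) {
        const BlockQ8 * row = base + r * row_stride_blocks;
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            for (int i = 0; i < QK; ++i) {
                // value = scale * quant; the output is written densely
                out[r * n_cols + b * QK + i] = row[b].d * float(row[b].qs[i]);
            }
        }
    }
    return out;
}
```

The output is contiguous f32 whatever the input stride, which is why the
downstream mul_mat no longer needs a separate same-type CONT.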

Cross-backend safety net: removing the blanket "f32 on any non-CPU backend"
guard exposes q8 KV to all GPU backends. Instead of a name/type check,
chatterbox_mtl_resolve_kv_type now probes the exact align-probe cast op
per-backend and falls back to f32 when the backend can't encode it (e.g.
thin-op OpenCL/Adreno or Mali-Vulkan builds). Vulkan quantized K/V stays
force-f32'd in chatterbox_resolve_kv_type (coopmat2). The pure decision is
factored into chatterbox_mtl_kv_type_for_cast_support so the fallback branch is
unit-testable.
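
A minimal sketch of how such a pure decision helper might look, assuming it
takes the requested KV type plus the result of the per-backend cast probe. The
KvType enum and the kv_type_for_cast_support_sketch name and signature are
hypothetical and may differ from the real helper.

```cpp
#include <cassert>

// Hypothetical stand-in for the ggml type enum used by the KV cache.
enum class KvType { F32, F16, Q8_0 };

// Pure decision, no backend access: keep the requested type when the backend
// can encode the align-probe cast, otherwise fall back to f32.
// cast_supported is assumed to come from a per-backend op-support probe.
inline KvType kv_type_for_cast_support_sketch(KvType requested, bool cast_supported) {
    return cast_supported ? requested : KvType::F32;
}
```

Keeping the backend query outside the function is what lets a unit test drive
the fallback branch without a GPU.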

Tests:
- test_kv_cache_type: chatterbox_mtl_resolve_kv_type pass-through + the
  cross-backend fallback branch (cast unsupported -> f32) via the pure helper.
- test_metal_ops (gpu): CAST(q8_0 strided -> f32) is supported on Metal and
  CONT(q8_0) is not (same 2D strided shape the align probe and resolve probe
  use, so the sentinel mirrors the real op).
- test_multilingual_synth: --kv-cache-type passthrough + mtl-synth-q8-<lang>
  ctest variants (en/ar/ru/hi). Missing GGUFs -> SKIP (77), not fail.
- test-eos-roundtrip-q8-kv: CER/ramble round-trip under a q8_0 KV cache to
  catch alignment/EOS drift from the dequant (WAV-sanity tests can't see it).
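
For readers unfamiliar with the metric, the CER (character error rate) that the
round-trip test relies on can be sketched as edit distance divided by reference
length. This is an illustrative byte-level version, not the test's actual
implementation. edit_distance and cer are hypothetical names, and a real check
on ar/ru/hi text would need to compare Unicode code points, not bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance over bytes, using two rolling rows of the DP table.
std::size_t edit_distance(const std::string & a, const std::string & b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (a[i - 1] != b[j - 1] ? 1 : 0);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// CER = edits / reference length. It exceeds 1.0 when the hypothesis rambles
// well past the reference, which is the failure mode an EOS drift would cause.
double cer(const std::string & ref, const std::string & hyp) {
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;
    return double(edit_distance(ref, hyp)) / double(ref.size());
}
```

A transcript that keeps going after the intended text pushes CER up sharply
even if every generated WAV is individually well-formed.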

Validated on macOS Metal (M2) and on-device iOS (iPhone 17 Pro Max, A19 Pro
Metal): q8 MTL synthesizes across es/fr/de/pt with no CONT crash; q8-vs-f32 perf
is within run-to-run noise (the change is perf-neutral on the f32/f16 paths and
q8 KV is no slower than f16).

Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>