Commit 1a43ae6

ogad-tether and claude committed
tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (no quantized CONT kernel)
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend with a quantized (q8_0/q4_0) KV cache:

  ggml_metal_op_encode_impl: error: unsupported op 'CONT'
  ...ggml-metal-ops.cpp:203: unsupported op
  (in eval_step_mtl)

Root cause: the MTL variant's batched-CFG (B=2) decode reads the token-major K/V cache as a 4D strided view (build_llama_block), which the GPU flash-attn path materialises through a CONT. ggml-metal has no CONT kernel for quantized tensors, so any quantized KV cache crashes on Metal. The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not the downstream CONT, so it cannot catch this. That is why ggml-org#2527 (q8 KV as the default) shipped a broken MTL Metal path undetected.

Fix: for the MTL variant, restrict a quantized KV cache to the CPU backend (where the quantized CONT is supported) and fall back to f32 on any GPU backend. Gating on "not CPU" rather than a backend name is deliberately robust across ggml builds whose Metal registry name differs ("Metal" vs "MTL"), and composes with the existing Vulkan force-f32 inside chatterbox_resolve_kv_type.

Scope: MTL variant only. The Turbo variant (separate eval path, in the gated SDK e2e) is unaffected and keeps quantized KV on GPU (verified on Metal).

This is a correctness stopgap: it stops the crash but gives back the q8 KV memory saving on MTL-GPU (f32 is 4x the q8 footprint, allocated eagerly at n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG attention to avoid the strided-quantized-view CONT, together with the missing MTL x backend x kv-type e2e coverage.

Repro (local):

  eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf ref.wav 99 q8_0
  -> SIGABRT before fix, synthesizes (f32 fallback) after.

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 586268b commit 1a43ae6
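
To put the memory trade-off from the commit message in concrete terms, here is a minimal sketch of the per-type footprint arithmetic. It assumes ggml's usual block layouts (q8_0: 32 values in 34 bytes, q4_0: 32 values in 18 bytes); the names KvKind, BlockLayout, block_layout, kv_bytes and f32_ratio are hypothetical helpers for illustration, not part of tts-cpp or ggml. Under those assumptions the "4x" figure is a rounding of roughly 3.76x for q8_0.

```cpp
#include <cstddef>

// Hypothetical stand-in for ggml_type, limited to the KV types discussed here.
enum class KvKind { F32, F16, Q8_0, Q4_0 };

// Elements per block and bytes per block (assumed ggml layouts).
struct BlockLayout {
    std::size_t elems;
    std::size_t bytes;
};

static BlockLayout block_layout(KvKind k) {
    switch (k) {
        case KvKind::F32:  return {1, 4};
        case KvKind::F16:  return {1, 2};
        case KvKind::Q8_0: return {32, 34};  // 32 int8 values + one fp16 scale
        case KvKind::Q4_0: return {32, 18};  // 32 4-bit values + one fp16 scale
    }
    return {1, 4};
}

// Bytes needed for n_elements cache entries.
// Assumes n_elements is a multiple of the block size.
static std::size_t kv_bytes(KvKind k, std::size_t n_elements) {
    const BlockLayout b = block_layout(k);
    return n_elements / b.elems * b.bytes;
}

// How many times larger an f32 cache is than a cache of type k.
static double f32_ratio(KvKind k, std::size_t n_elements) {
    return static_cast<double>(kv_bytes(KvKind::F32, n_elements)) /
           static_cast<double>(kv_bytes(k, n_elements));
}
```

Because the cache is allocated eagerly at n_ctx, this ratio applies to the whole allocation rather than only to the tokens actually decoded.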

1 file changed: 21 additions & 0 deletions

tts-cpp/src/t3_mtl.cpp
@@ -1830,6 +1830,27 @@ bool load_model_gguf_mtl(const std::string & path,
     // attention with the requested quantized/f16 K/V.
     hp.kv_type = chatterbox_resolve_kv_type(model.backend, kv_type,
                                             hp.head_dim, hp.n_head, hp.n_kv_head);
+    // QVAC-19557: the MTL variant's batched-CFG (B=2) decode reads the
+    // token-major K/V cache as a 4D strided view (build_llama_block), which
+    // the GPU flash-attn path materialises through a CONT. ggml-metal has no
+    // CONT kernel for quantized tensors, so a quantized KV cache SIGABRTs at
+    // eval_step_mtl ("unsupported op 'CONT'") on Metal. The capability probe
+    // in chatterbox_resolve_kv_type only validates flash_attn_ext support, not
+    // the downstream CONT, so it can't catch this. Quantized KV for the MTL
+    // variant is only validated on the CPU backend (where the quantized CONT
+    // is supported); fall back to f32 on any GPU backend (Vulkan is also
+    // force-f32'd inside chatterbox_resolve_kv_type). Gating on "not CPU"
+    // rather than a backend name keeps this robust across ggml builds whose
+    // Metal registry name varies ("Metal" vs "MTL"). The Turbo variant uses
+    // a separate eval path that does not hit the offending CONT and keeps
+    // quantized KV on GPU (verified on Metal).
+    if (ggml_is_quantized(hp.kv_type) &&
+        !::tts_cpp::detail::backend_is_cpu(model.backend)) {
+        fprintf(stderr, "chatterbox(mtl): quantized (%s) KV cache is only supported on the "
+                        "CPU backend for the multilingual variant (GPU CONT on quantized "
+                        "K/V is unsupported); using f32 KV cache\n", ggml_type_name(hp.kv_type));
+        hp.kv_type = GGML_TYPE_F32;
+    }
     ggml_init_params kv_params = { ggml_tensor_overhead() * 4, nullptr, true };
     model.ctx_kv = ggml_init(kv_params);
     const int64_t kv_elements_b2 =
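
The gate in this hunk can be exercised in isolation. The following is a self-contained sketch of the same decision, not the code from t3_mtl.cpp: KvType, Backend, kv_is_quantized, kv_type_name and resolve_mtl_kv_type are hypothetical stand-ins for ggml_type, the backend handle, ggml_is_quantized, ggml_type_name and the inline check. It shows why keying on "is CPU" rather than on the registry name treats "Metal" and "MTL" identically.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for ggml_type; not the real ggml enum.
enum class KvType { F32, F16, Q8_0, Q4_0 };

// Hypothetical backend description for this sketch.
struct Backend {
    std::string name;    // registry name, e.g. "CPU", "Metal", "MTL"
    bool        is_cpu;  // what a backend_is_cpu()-style check would report
};

static bool kv_is_quantized(KvType t) {
    return t == KvType::Q8_0 || t == KvType::Q4_0;
}

static const char * kv_type_name(KvType t) {
    switch (t) {
        case KvType::F32:  return "f32";
        case KvType::F16:  return "f16";
        case KvType::Q8_0: return "q8_0";
        case KvType::Q4_0: return "q4_0";
    }
    return "unknown";
}

// Quantized KV is kept only on the CPU backend; every other backend gets f32.
// Non-quantized requests (f32, f16) pass through unchanged.
static KvType resolve_mtl_kv_type(KvType requested, const Backend & backend) {
    if (kv_is_quantized(requested) && !backend.is_cpu) {
        std::fprintf(stderr, "mtl sketch: %s KV on backend '%s' is unsupported; using f32\n",
                     kv_type_name(requested), backend.name.c_str());
        return KvType::F32;
    }
    return requested;
}
```

For example, resolve_mtl_kv_type(KvType::Q8_0, Backend{"MTL", false}) and the same call with the name "Metal" both return KvType::F32, while a CPU backend keeps KvType::Q8_0. The name field is used only for the log line, never for the decision.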
