Commit 3917edd

ogad-tether and claude committed
tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (no quantized CONT kernel)
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend with a quantized (q8_0/q4_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT'
    ...ggml-metal-ops.cpp:203: unsupported op
    (in eval_step_mtl)

Root cause: the MTL variant's batched-CFG (B=2) decode reads the token-major K/V cache as a 4D strided view (build_llama_block), which the GPU flash-attn path materialises through a CONT. ggml-metal has no CONT kernel for quantized tensors. Because the MTL path runs a single-backend graph_compute (not the multi-backend scheduler), ggml_backend_supports_op is never consulted at runtime, so the CONT is placed on Metal unconditionally and rejected at encode time. The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not the downstream CONT, which is why ggml-org#2527 (q8 KV as the default) shipped a broken MTL Metal path undetected.

Fix: chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU backend (where the quantized CONT is supported) and returns f32 on any GPU backend. Gating on "not CPU" by device type rather than a backend name is deliberately robust across ggml builds whose Metal registry name differs ("Metal" vs "MTL"; the latter is what stock ggml reports, so a name-based check silently never matched). Composes with the existing Vulkan force-f32 inside chatterbox_resolve_kv_type.

Scope: MTL variant only. The Turbo variant (separate eval path, in the gated SDK e2e) is unaffected and keeps quantized KV on GPU (verified on Metal).

Tests (this is the coverage gap that let it ship):
- test_kv_cache_type: unit-tests chatterbox_mtl_guard_kv_type's pass-through branches (CPU keeps q8/f16/f32; null backend is a no-op); runs on every fleet.
- test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal, CONT(q8_0 strided)) == false (the exact limitation behind the crash) and that the f32 fallback target + CPU quantized CONT stay supported. Trips the day ggml grows a quantized CONT kernel, signalling the guard can be relaxed.

This is a correctness stopgap: it stops the crash but gives back the q8 KV memory saving on MTL-GPU (f32 is 4x the q8 footprint, allocated eagerly at n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG attention to avoid the strided-quantized-view CONT, together with an end-to-end MTL x backend x kv-type synthesis matrix on the device fleets.

Repro (local):

    eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf ref.wav 99 q8_0
    -> SIGABRT before fix, synthesizes (f32 fallback) after.

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 586268b commit 3917edd
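
To make the memory cost of the f32 fallback concrete, here is a minimal sketch of the footprint arithmetic. It relies on ggml's q8_0 layout (32 int8 values plus one fp16 scale per 34-byte block); the names `TypeLayout`, `tensor_bytes`, and `kv_elems` are hypothetical helpers for illustration, and the cache shape in `kv_elems` is an assumed one rather than the real Chatterbox allocation.

```cpp
#include <cstdint>

// Storage layout of a tensor type: elements per block and bytes per block.
// (Hypothetical helper type, not part of ggml.)
struct TypeLayout {
    std::int64_t block_elems;
    std::int64_t block_bytes;
};

// f32 and f16 are unblocked; q8_0 stores 32 int8 values plus one fp16 scale
// per block, i.e. 34 bytes for every 32 elements.
constexpr TypeLayout kF32  = {1, 4};
constexpr TypeLayout kF16  = {1, 2};
constexpr TypeLayout kQ8_0 = {32, 34};

// Bytes needed for n_elems elements; n_elems must be a multiple of block_elems.
constexpr std::int64_t tensor_bytes(TypeLayout t, std::int64_t n_elems) {
    return n_elems / t.block_elems * t.block_bytes;
}

// Assumed cache shape (NOT taken from the commit): one K and one V tensor per
// layer, each holding n_ctx slots of n_kv_head * head_dim values per batch row.
constexpr std::int64_t kv_elems(std::int64_t n_layer, std::int64_t n_ctx,
                                std::int64_t n_kv_head, std::int64_t head_dim,
                                std::int64_t batch) {
    return 2 * n_layer * n_ctx * n_kv_head * head_dim * batch;
}
```

Under this layout, 32 elements take 128 bytes in f32 and 34 bytes in q8_0, a ratio of 64/17 (about 3.76), so the "4x" in the message is a round figure. Because the cache is allocated eagerly at n_ctx, the extra bytes are paid up front regardless of how many tokens are actually decoded.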

5 files changed

Lines changed: 107 additions & 0 deletions

tts-cpp/src/chatterbox_t3_internal.h

Lines changed: 13 additions & 0 deletions
@@ -119,6 +119,19 @@ ggml_type chatterbox_kv_type_from_str(const std::string & s);
 ggml_type chatterbox_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
                                      int head_dim, int n_head, int n_kv_head);
 
+// MTL-variant-only guard (QVAC-19557): the multilingual variant's batched-CFG
+// (B=2) decode reads the token-major K/V cache as a 4D strided view, which the
+// GPU flash-attn path materialises through a CONT. ggml-metal has no CONT
+// kernel for quantized tensors, so a quantized KV cache SIGABRTs at encode time
+// on Metal (the MTL path runs a single-backend graph_compute, so the scheduler
+// never gets to fall the op back to CPU). This restricts a quantized `kv_type`
+// to the CPU backend and returns GGML_TYPE_F32 on any GPU backend; non-quantized
+// types and a null/CPU backend pass through unchanged. Pure (no I/O) so the
+// caller logs the downgrade and so it stays unit-testable. The Turbo variant
+// uses a different eval path that does not hit the CONT and must NOT be routed
+// through this guard.
+ggml_type chatterbox_mtl_guard_kv_type(ggml_backend_t backend, ggml_type kv_type);
+
 struct gpt2_layer {
     ggml_tensor * ln_1_g = nullptr;
     ggml_tensor * ln_1_b = nullptr;
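
The contract documented in the header comment can be modelled without ggml at all. The following is a minimal sketch under stated assumptions: `DeviceKind`, `KvType`, `Backend`, `is_quantized`, and `guard_kv_type` are hypothetical stand-ins for `ggml_backend_t`, `ggml_type`, `ggml_is_quantized`, and `chatterbox_mtl_guard_kv_type`, chosen only to show the decision table.

```cpp
// Hypothetical stand-ins for illustration; these are NOT ggml types.
enum class DeviceKind { Cpu, Gpu };
enum class KvType { F32, F16, Q8_0, Q4_0 };

struct Backend {
    DeviceKind kind;
};

// Sample backends; a null pointer models the "no backend" case.
constexpr Backend kCpuBackend = {DeviceKind::Cpu};
constexpr Backend kGpuBackend = {DeviceKind::Gpu};

constexpr bool is_quantized(KvType t) {
    return t == KvType::Q8_0 || t == KvType::Q4_0;
}

// Quantized type on a non-null, non-CPU backend -> F32.
// Everything else (non-quantized type, CPU backend, null backend) is returned
// unchanged. The function is pure, so any logging belongs to the caller.
constexpr KvType guard_kv_type(const Backend * backend, KvType kv_type) {
    return (is_quantized(kv_type) && backend != nullptr &&
            backend->kind != DeviceKind::Cpu)
               ? KvType::F32
               : kv_type;
}
```

With these definitions, `guard_kv_type(&kGpuBackend, KvType::Q8_0)` yields `KvType::F32`, while the same request against `&kCpuBackend` or `nullptr` comes back as `KvType::Q8_0`. Keeping the guard free of side effects is what lets the pass-through branches be checked on runners that have no GPU.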

tts-cpp/src/main.cpp

Lines changed: 14 additions & 0 deletions
@@ -402,6 +402,20 @@ ggml_type chatterbox_resolve_kv_type(ggml_backend_t backend, ggml_type requested
     return requested;
 }
 
+ggml_type chatterbox_mtl_guard_kv_type(ggml_backend_t backend, ggml_type kv_type) {
+    // Quantized K/V is only safe on CPU for the MTL variant: the GPU flash-attn
+    // path CONTs the strided quantized K/V cache, and ggml-metal has no CONT
+    // kernel for quantized tensors (the resolve probe above validates
+    // flash_attn_ext but not the downstream CONT, so it can't catch this). Gate
+    // on "not CPU" by device type rather than a backend name so it stays robust
+    // across ggml builds whose Metal registry name differs ("Metal" vs "MTL").
+    if (ggml_is_quantized(kv_type) && backend &&
+        !::tts_cpp::detail::backend_is_cpu(backend)) {
+        return GGML_TYPE_F32;
+    }
+    return kv_type;
+}
+
 bool load_model_gguf(const std::string & path, chatterbox_model & model, int requested_ctx, int n_gpu_layers, ggml_type kv_type) {
     {
         gguf_init_params peek_params = { /*.no_alloc=*/ true, /*.ctx=*/ nullptr };
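
The comment above argues for gating on device type instead of a backend name. A small sketch of why the name-based variant is fragile, assuming the two registry spellings quoted in the commit message ("Metal" and "MTL"): `DevKind`, `FakeBackend`, `str_eq`, `is_metal_by_name`, and `is_gpu_by_kind` are all hypothetical names for illustration, and the body of the real `backend_is_cpu` helper is not shown in this diff.

```cpp
// Hypothetical stand-ins for illustration; these are NOT ggml types.
enum class DevKind { Cpu, Gpu };

struct FakeBackend {
    const char * registry_name;  // e.g. "CPU", "Metal", "MTL"
    DevKind      kind;
};

// Exact, case-sensitive C-string comparison usable in constant expressions.
constexpr bool str_eq(const char * a, const char * b) {
    while (*a != '\0' && *a == *b) {
        ++a;
        ++b;
    }
    return *a == *b;
}

// Fragile: matches a single spelling of the registry name only.
constexpr bool is_metal_by_name(const FakeBackend & b) {
    return str_eq(b.registry_name, "Metal");
}

// Robust: anything that is not a CPU device is treated as a GPU-side backend.
constexpr bool is_gpu_by_kind(const FakeBackend & b) {
    return b.kind != DevKind::Cpu;
}

// The same physical device under the two spellings, plus a CPU backend.
constexpr FakeBackend kNamedMetal = {"Metal", DevKind::Gpu};
constexpr FakeBackend kNamedMTL   = {"MTL", DevKind::Gpu};
constexpr FakeBackend kNamedCpu   = {"CPU", DevKind::Cpu};
```

Here `is_metal_by_name(kNamedMTL)` is false even though the backend is a GPU, which is the silent mismatch the commit message describes, whereas `is_gpu_by_kind` classifies both spellings the same way. The device-type check also covers other GPU backends, at the cost of being more conservative than strictly required for backends that might support the op.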

tts-cpp/src/t3_mtl.cpp

Lines changed: 17 additions & 0 deletions
@@ -1830,6 +1830,23 @@ bool load_model_gguf_mtl(const std::string & path,
         // attention with the requested quantized/f16 K/V.
         hp.kv_type = chatterbox_resolve_kv_type(model.backend, kv_type,
                                                 hp.head_dim, hp.n_head, hp.n_kv_head);
+        // QVAC-19557: the MTL variant's batched-CFG (B=2) decode CONTs the
+        // strided quantized K/V cache, which ggml-metal can't do (no quantized
+        // CONT kernel) — so a quantized KV cache SIGABRTs at eval_step_mtl
+        // ("unsupported op 'CONT'") on Metal. The resolve probe above only
+        // validates flash_attn_ext, not the downstream CONT, so the guard below
+        // restricts quantized K/V to the CPU backend. See
+        // chatterbox_mtl_guard_kv_type for the full rationale; it is pure so we
+        // log the downgrade here.
+        {
+            const ggml_type guarded = chatterbox_mtl_guard_kv_type(model.backend, hp.kv_type);
+            if (guarded != hp.kv_type) {
+                fprintf(stderr, "chatterbox(mtl): quantized (%s) KV cache is only supported on the "
+                        "CPU backend for the multilingual variant (GPU CONT on quantized "
+                        "K/V is unsupported); using f32 KV cache\n", ggml_type_name(hp.kv_type));
+                hp.kv_type = guarded;
+            }
+        }
         ggml_init_params kv_params = { ggml_tensor_overhead() * 4, nullptr, true };
         model.ctx_kv = ggml_init(kv_params);
         const int64_t kv_elements_b2 =

tts-cpp/test/test_kv_cache_type.cpp

Lines changed: 16 additions & 0 deletions
@@ -66,6 +66,22 @@ int main() {
     CHECK(chatterbox_resolve_kv_type(cpu, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
           == GGML_TYPE_Q8_0, "cpu retains q8_0 KV");
 
+    // ---- MTL guard (QVAC-19557): quantized K/V only on CPU ----
+    // The multilingual variant's batched-CFG decode CONTs the strided quantized
+    // K/V cache, which ggml-metal can't do; the guard restricts quantized K/V to
+    // the CPU backend. Here we cover the pass-through branches that hold on any
+    // runner; the GPU->f32 downgrade is covered (Metal) in test_metal_ops.cpp.
+    CHECK(chatterbox_mtl_guard_kv_type(cpu, GGML_TYPE_Q8_0) == GGML_TYPE_Q8_0,
+          "mtl guard: cpu keeps q8_0 (cpu has the quantized CONT kernel)");
+    CHECK(chatterbox_mtl_guard_kv_type(cpu, GGML_TYPE_F16) == GGML_TYPE_F16,
+          "mtl guard: cpu keeps f16");
+    CHECK(chatterbox_mtl_guard_kv_type(cpu, GGML_TYPE_F32) == GGML_TYPE_F32,
+          "mtl guard: cpu keeps f32");
+    // Non-quantized types are never downgraded regardless of backend, and a null
+    // backend is a no-op (null->f32 is chatterbox_resolve_kv_type's job upstream).
+    CHECK(chatterbox_mtl_guard_kv_type(nullptr, GGML_TYPE_Q8_0) == GGML_TYPE_Q8_0,
+          "mtl guard: null backend is a no-op");
+
     ggml_backend_free(cpu);
 
     if (g_failures) {

tts-cpp/test/test_metal_ops.cpp

Lines changed: 47 additions & 0 deletions
@@ -335,6 +335,52 @@ static int test_mul_mm_fused(ggml_backend_t cpu, ggml_backend_t gpu,
     return 1;
 }
 
+// QVAC-19557: regression sentinel for the MTL Metal q8-KV SIGABRT. The
+// multilingual Chatterbox variant's batched-CFG (B=2) decode reads the
+// token-major K/V cache as a strided 4D view, which the GPU flash-attn path
+// materialises through a CONT. ggml-metal has no CONT kernel for quantized
+// tensors, so that op is unsupported on Metal — and because the MTL path runs a
+// single-backend graph_compute (no scheduler fallback) it crashes at encode
+// time. chatterbox_mtl_guard_kv_type exists precisely for this; here we assert
+// the underlying ggml limitation directly so this test TRIPS the day ggml grows
+// a quantized CONT kernel, at which point the guard can be relaxed and GPU q8 KV
+// revisited. The guard's fallback target (f32 CONT) and the CPU quantized CONT
+// must both stay supported.
+static int test_quantized_cont_unsupported(ggml_backend_t cpu, ggml_backend_t gpu) {
+    fprintf(stderr, "[quantized_cont] ");
+    auto supports_cont = [](ggml_backend_t b, ggml_type t) {
+        ggml_init_params p = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
+        ggml_context * ctx = ggml_init(p);
+        // Strided 4D view of a quantized src -> cont, mirroring the MTL
+        // batched-CFG (B=2) token-major K/V read in build_llama_block.
+        ggml_tensor * src = ggml_new_tensor_4d(ctx, t, 64, 256, 16, 2);
+        ggml_tensor * view = ggml_view_4d(ctx, src, 64, 256, 16, 2,
+                                          src->nb[1], src->nb[2] * 2, src->nb[3], 0);
+        bool sup = ggml_backend_supports_op(b, ggml_cont(ctx, view));
+        ggml_free(ctx);
+        return sup;
+    };
+    int fails = 0;
+    if (supports_cont(gpu, GGML_TYPE_Q8_0)) {
+        fprintf(stderr, "\n FAIL: Metal now advertises CONT(q8_0) — revisit the MTL KV guard "
+                        "(chatterbox_mtl_guard_kv_type); GPU q8 KV may be possible again\n");
+        ++fails;
+    }
+    if (!supports_cont(gpu, GGML_TYPE_F32)) {
+        fprintf(stderr, "\n FAIL: Metal CONT(f32) unsupported — the MTL guard's f32 fallback target is broken\n");
+        ++fails;
+    }
+    if (!supports_cont(cpu, GGML_TYPE_Q8_0)) {
+        fprintf(stderr, "\n FAIL: CPU CONT(q8_0) unsupported — MTL keeps q8 KV on CPU and would break\n");
+        ++fails;
+    }
+    if (!fails) {
+        fprintf(stderr, "ok (Metal CONT(q8_0) unsupported, as the MTL KV guard assumes)\n");
+        return 0;
+    }
+    return 1;
+}
+
 int main() {
     ggml_backend_t cpu = ggml_backend_cpu_init();
     if (!cpu) { fprintf(stderr, "CPU backend init failed\n"); return 1; }
@@ -350,6 +396,7 @@ int main() {
     }
 
     int rc = 0;
+    rc |= test_quantized_cont_unsupported(cpu, gpu);
     rc |= test_diag_mask_inf(cpu, gpu);
     rc |= test_pad_ext(cpu, gpu);
     // HiFT-sized shapes:
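
The sentinel test builds a view whose dim-2 byte stride is doubled, which is what forces a CONT before the data can be consumed as a dense tensor. A minimal sketch of that layout reasoning follows, using a simplified dense-layout check for an unblocked element type; `Shape4`, `dense`, `is_dense`, and `stride2_view` are hypothetical helpers modelled loosely on ggml's `ne`/`nb` arrays, not ggml functions, and quantized types would use the per-block row size instead of a per-element size.

```cpp
#include <cstdint>

// Simplified tensor descriptor: ne = extent per dim, nb = byte stride per dim.
// (Hypothetical type for illustration, not ggml_tensor.)
struct Shape4 {
    std::int64_t ne[4];
    std::int64_t nb[4];
};

// Densely packed layout for an unblocked element type of elem_bytes bytes.
constexpr Shape4 dense(std::int64_t elem_bytes, std::int64_t n0, std::int64_t n1,
                       std::int64_t n2, std::int64_t n3) {
    return Shape4{{n0, n1, n2, n3},
                  {elem_bytes, elem_bytes * n0, elem_bytes * n0 * n1,
                   elem_bytes * n0 * n1 * n2}};
}

// True when every stride equals the packed size of the dimension below it.
constexpr bool is_dense(const Shape4 & s, std::int64_t elem_bytes) {
    return s.nb[0] == elem_bytes &&
           s.nb[1] == s.nb[0] * s.ne[0] &&
           s.nb[2] == s.nb[1] * s.ne[1] &&
           s.nb[3] == s.nb[2] * s.ne[2];
}

// Same extents as src, but with the dim-2 stride doubled, mirroring the
// (nb[1], nb[2] * 2, nb[3]) strides passed to ggml_view_4d in the test above.
constexpr Shape4 stride2_view(const Shape4 & src) {
    return Shape4{{src.ne[0], src.ne[1], src.ne[2], src.ne[3]},
                  {src.nb[0], src.nb[1], src.nb[2] * 2, src.nb[3]}};
}
```

For the 64 x 256 x 16 x 2 shape used in the test with 4-byte elements, the source is dense while the view's dim-2 stride (131072 bytes) no longer equals `nb[1] * ne[1]` (65536 bytes), so the view is not dense. A backend that cannot copy such a view for a given type has to reject the CONT, which is the condition the sentinel asserts for q8_0 on Metal.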
