tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (no quantized CONT kernel)
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the
Metal backend with a quantized (q8_0/q4_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT'
    ...ggml-metal-ops.cpp:203: unsupported op (in eval_step_mtl)

Root cause: the MTL variant's batched-CFG (B=2) decode reads the token-major
K/V cache as a 4D strided view (build_llama_block), which the GPU flash-attn
path materialises through a CONT. ggml-metal has no CONT kernel for quantized
tensors. Because the MTL path runs a single-backend graph_compute (not the
multi-backend scheduler), ggml_backend_supports_op is never consulted at
runtime — so the CONT is placed on Metal unconditionally and rejected at encode
time. The capability probe in chatterbox_resolve_kv_type only validates
flash_attn_ext, not the downstream CONT, which is why ggml-org#2527 (q8 KV as the
default) shipped a broken MTL Metal path undetected.
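
The probe gap described above can be illustrated with a toy model. This is a
minimal sketch, not ggml code: Op, Type, Node, gpu_supports and both probe
functions are hypothetical names, and the capability table only mirrors the one
limitation named in this message (no quantized CONT on the GPU backend).

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model; none of these types are the ggml API.
enum class Op { FLASH_ATTN_EXT, CONT, MUL_MAT };
enum class Type { F32, Q8_0 };
struct Node { Op op; Type src_type; };

// Toy capability table: the GPU backend cannot run CONT on a quantized source.
bool gpu_supports(const Node& n) {
    return !(n.op == Op::CONT && n.src_type == Type::Q8_0);
}

// Probe that validates only the attention op (the shape of the gap).
bool probe_attention_only(const std::vector<Node>& graph) {
    for (const Node& n : graph) {
        if (n.op == Op::FLASH_ATTN_EXT && !gpu_supports(n)) return false;
    }
    return true;
}

// Probe that validates every node that will be placed on the backend.
bool probe_all_nodes(const std::vector<Node>& graph) {
    for (const Node& n : graph) {
        if (!gpu_supports(n)) return false;
    }
    return true;
}
```

In this model, a graph holding a quantized CONT followed by FLASH_ATTN_EXT
passes the attention-only probe and fails the all-nodes probe, which is the
pattern that let the broken path go undetected.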
Fix: chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU
backend (where the quantized CONT is supported) and returns f32 on any GPU
backend. Gating on "not CPU" by device type rather than a backend name is
deliberately robust across ggml builds whose Metal registry name differs
("Metal" vs "MTL"; the latter is what stock ggml reports, so a name-based
check silently never matched). The guard composes with the existing Vulkan
force-f32 inside chatterbox_resolve_kv_type.
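
The guard logic can be sketched roughly as follows. This is an illustration
under stated assumptions, not the code in this commit: DeviceType, KvType,
Backend and guard_kv_type are hypothetical stand-ins for the ggml types and for
chatterbox_mtl_guard_kv_type, and the sketch assumes non-quantized types pass
through unchanged on GPU.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ggml types; not the real ggml API.
enum class DeviceType { CPU, GPU };
enum class KvType { F32, F16, Q8_0, Q4_0 };
struct Backend { DeviceType device_type; };

static bool is_quantized(KvType t) {
    return t == KvType::Q8_0 || t == KvType::Q4_0;
}

// Keep the requested type unless it is quantized and the backend is not CPU.
// The decision keys on device type, never on a backend name string.
KvType guard_kv_type(const Backend* backend, KvType requested) {
    if (backend == nullptr) return requested;        // null backend: no-op
    if (!is_quantized(requested)) return requested;  // assumed pass-through
    if (backend->device_type == DeviceType::CPU) return requested;
    return KvType::F32;                              // any GPU: fall back
}
```

Keying on the device type means the check does not depend on whether a given
ggml build registers the backend as "Metal" or "MTL".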
Scope: MTL variant only. The Turbo variant (separate eval path, in the gated
SDK e2e) is unaffected and keeps quantized KV on GPU — verified on Metal.
Tests (this is the coverage gap that let it ship):
- test_kv_cache_type: unit-tests chatterbox_mtl_guard_kv_type's pass-through
branches (CPU keeps q8/f16/f32; null backend is a no-op) — runs on every
fleet.
- test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal,
CONT(q8_0 strided)) == false — the exact limitation behind the crash — and
that the f32 fallback target + CPU quantized CONT stay supported. The
assertion starts failing the day ggml grows a quantized CONT kernel, signalling
that the guard can be relaxed.
This is a correctness stopgap: it stops the crash but gives back the q8 KV
memory saving on MTL-GPU (f32 is roughly 4x the q8 footprint, allocated eagerly at
n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG
attention to avoid the strided-quantized-view CONT, together with an end-to-end
MTL x backend x kv-type synthesis matrix on the device fleets.
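
The footprint cost of the fallback can be checked with quick arithmetic. The
sketch below relies on the q8_0 block layout (32 int8 values plus one fp16
scale, 34 bytes per block), which puts f32 at about 3.76x the q8_0 size. The
helper names and the kv_elems shape formula are illustrative assumptions, not
taken from this codebase.

```cpp
#include <cstddef>

// f32 stores 4 bytes per element.
constexpr std::size_t f32_bytes(std::size_t n_elems) { return n_elems * 4; }

// q8_0 packs 32 int8 values plus one fp16 scale into a 34-byte block.
// n_elems is assumed to be a multiple of 32.
constexpr std::size_t q8_0_bytes(std::size_t n_elems) { return (n_elems / 32) * 34; }

// Assumed cache shape: K and V, each n_layer * n_ctx * n_embd_kv elements.
constexpr std::size_t kv_elems(std::size_t n_layer, std::size_t n_ctx,
                               std::size_t n_embd_kv) {
    return 2 * n_layer * n_ctx * n_embd_kv;
}

// Per 32-element block: 128 bytes in f32 versus 34 bytes in q8_0.
static_assert(f32_bytes(32) == 128, "f32 block size");
static_assert(q8_0_bytes(32) == 34, "q8_0 block size");
```

Because the cache is allocated eagerly at n_ctx, the ratio applies to the whole
allocation regardless of how many tokens are actually decoded.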
Repro (local):

    eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf ref.wav 99 q8_0

SIGABRT before fix, synthesizes (f32 fallback) after.
Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>