tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (fix Metal q8 CONT SIGABRT) #70
Closed
ogad-tether wants to merge 1 commit into
Review Status: ❌ PENDING. Pending reviews: needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.
…NT kernel)
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the
Metal backend with a quantized (q8_0/q4_0) KV cache:
ggml_metal_op_encode_impl: error: unsupported op 'CONT'
...ggml-metal-ops.cpp:203: unsupported op (in eval_step_mtl)
Root cause: the MTL variant's batched-CFG (B=2) decode reads the token-major
K/V cache as a 4D strided view (build_llama_block), which the GPU flash-attn
path materialises through a CONT. ggml-metal has no CONT kernel for quantized
tensors. Because the MTL path runs a single-backend graph_compute (not the
multi-backend scheduler), ggml_backend_supports_op is never consulted at
runtime — so the CONT is placed on Metal unconditionally and rejected at encode
time. The capability probe in chatterbox_resolve_kv_type only validates
flash_attn_ext, not the downstream CONT, which is why ggml-org#2527 (q8 KV as the
default) shipped a broken MTL Metal path undetected.
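To make the gap concrete: a single-backend compute call runs whatever nodes it is handed, so an unsupported op surfaces only at encode time unless the caller checks beforehand. The sketch below is a self-contained model of such a pre-flight check. `Op`, `DType`, `Node`, `toy_metal_supports`, and `first_unsupported` are hypothetical stand-ins, not ggml types or functions; in real code the per-node query would be `ggml_backend_supports_op`, and the toy predicate encodes only the one limitation named above (no CONT for quantized tensors).

```cpp
// Hypothetical stand-ins for ggml ops and tensor types (illustration only).
enum class Op { FlashAttnExt, Cont, MulMat };
enum class DType { F32, F16, Q8_0, Q4_0 };

struct Node {
    Op    op;
    DType src_type;  // type of the tensor the op reads
};

constexpr bool is_quantized(DType t) {
    return t == DType::Q8_0 || t == DType::Q4_0;
}

// Stand-in for a backend's "supports op" query.
using SupportsOp = bool (*)(const Node &);

// Toy model of the Metal limitation described above: CONT on a quantized
// source is rejected. Every other op is assumed supported for this sketch.
constexpr bool toy_metal_supports(const Node & n) {
    return !(n.op == Op::Cont && is_quantized(n.src_type));
}

// Walks the node list and returns the index of the first node the backend
// rejects, or -1 if every node is supported.
constexpr int first_unsupported(const Node * nodes, int n_nodes, SupportsOp supports) {
    for (int i = 0; i < n_nodes; ++i) {
        if (!supports(nodes[i])) {
            return i;
        }
    }
    return -1;
}
```

Under this model, a graph that feeds a q8_0 view into a CONT is flagged before compute, while the same graph over f32 passes. That is the check the single-backend path never performs.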
Fix: chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU
backend (where the quantized CONT is supported) and returns f32 on any GPU
backend. Gating on "not CPU" by device type rather than a backend name is
deliberately robust across ggml builds whose Metal registry name differs
("Metal" vs "MTL" — the latter is what stock ggml reports, so a name-based
check silently never matched). Composes with the existing Vulkan force-f32
inside chatterbox_resolve_kv_type.
Scope: MTL variant only. The Turbo variant (separate eval path, in the gated
SDK e2e) is unaffected and keeps quantized KV on GPU — verified on Metal.
Tests (this is the coverage gap that let it ship):
- test_kv_cache_type: unit-tests chatterbox_mtl_guard_kv_type's pass-through
branches (CPU keeps q8/f16/f32; null backend is a no-op) — runs on every
fleet.
- test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal,
CONT(q8_0 strided)) == false — the exact limitation behind the crash — and
that the f32 fallback target + CPU quantized CONT stay supported. Trips the
day ggml grows a quantized CONT kernel, signalling the guard can be relaxed.
This is a correctness stopgap: it stops the crash but gives back the q8 KV
memory saving on MTL-GPU (f32 is roughly 4x the q8 footprint, allocated eagerly at
n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG
attention to avoid the strided-quantized-view CONT, together with an end-to-end
MTL x backend x kv-type synthesis matrix on the device fleets.
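The size of that trade-off can be estimated with a little arithmetic. ggml's q8_0 format stores 32 int8 values plus one fp16 scale per block (34 bytes), while f32 needs 4 bytes per element, so the ratio is 128/34, about 3.76. The sketch below computes both footprints for a KV cache; the helper names and every dimension passed to them are hypothetical placeholders, not the real Chatterbox MTL parameters.

```cpp
#include <cstdint>

// q8_0 block layout in ggml: 32 int8 values + one fp16 scale = 34 bytes.
constexpr std::uint64_t kQ80BlockElems = 32;
constexpr std::uint64_t kQ80BlockBytes = 34;

// Total K + V element count for a cache allocated eagerly at n_ctx.
// All arguments are illustrative; substitute the model's real dimensions.
constexpr std::uint64_t kv_elems(std::uint64_t n_layer, std::uint64_t n_ctx,
                                 std::uint64_t n_embd_kv, std::uint64_t batch) {
    return 2 * n_layer * n_ctx * n_embd_kv * batch;  // factor 2 = K and V
}

// f32 stores 4 bytes per element.
constexpr std::uint64_t bytes_f32(std::uint64_t elems) {
    return elems * 4;
}

// q8_0 stores 34 bytes per 32 elements.
// Assumes elems is a multiple of the 32-element block size.
constexpr std::uint64_t bytes_q8_0(std::uint64_t elems) {
    return elems / kQ80BlockElems * kQ80BlockBytes;
}
```

For any element count that is a multiple of 32, `bytes_f32(e) * 34 == bytes_q8_0(e) * 128`, which is where the "roughly 4x" figure comes from.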
Repro (local): eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf
ref.wav 99 q8_0 -> SIGABRT before fix, synthesizes (f32 fallback) after.
Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1a43ae6 to 3917edd
Problem
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend whenever the KV cache is quantized (q8_0/q4_0):
ggml_metal_op_encode_impl: error: unsupported op 'CONT'
...ggml-metal-ops.cpp:203: unsupported op (in eval_step_mtl)
Reported from the field on macOS Apple Silicon / arm64 with useGPU: true (@qvac/tts-ggml@0.3.x):
- nGpuLayers 0/32/99 all crash identically; loadModel succeeds; CPU (useGPU: false) works.
- 0.2.5 GPU worked because q8 KV was opt-in.
Root cause
The MTL variant's batched-CFG (B=2) decode reads the token-major K/V cache as a 4D strided view (build_llama_block, ggml_view_4d at t3_mtl.cpp:783), which the GPU flash-attn path materialises through a CONT. ggml-metal has no CONT kernel for quantized tensors. Because the MTL path runs a single-backend ggml_backend_graph_compute (not the multi-backend scheduler), ggml_backend_supports_op is never consulted at runtime, so the CONT is placed on Metal unconditionally and rejected at encode time.
The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not the downstream CONT, which is why tetherto/qvac#2527 (q8 KV as the default for all variants) shipped a broken MTL Metal path undetected. The gated SDK e2e (consumer.ts) covers Turbo only, so the MTL path was never exercised on Metal.
Fix
chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU backend (where the quantized CONT is supported) and returns f32 on any GPU backend:
- Gates on !backend_is_cpu(...) by device type rather than a backend name. This is deliberately robust across ggml builds whose Metal registry name differs ("Metal" vs "MTL"; stock ggml reports MTL, so a name-based check silently never matched).
- Composes with the existing Vulkan force-f32 inside chatterbox_resolve_kv_type.
Scope: MTL variant only. The Turbo variant (separate eval path, in the gated SDK e2e) is unaffected and keeps quantized KV on GPU (verified on Metal).
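The guard's decision table can be sketched as follows. This is a minimal model, not the code in this PR: `KvType`, `DeviceKind`, `Backend`, and `mtl_guard_kv_type` are hypothetical stand-ins for the ggml type enum, the backend handle, and chatterbox_mtl_guard_kv_type. It also assumes that non-quantized requests (f16/f32) pass through unchanged on GPU, a branch the description does not spell out.

```cpp
// Hypothetical stand-ins; the real guard works on ggml types and a ggml
// backend handle.
enum class KvType { F32, F16, Q8_0, Q4_0 };
enum class DeviceKind { Cpu, Gpu };

struct Backend {
    DeviceKind kind;  // device type, not the registry name ("Metal" vs "MTL")
};

constexpr bool is_quantized(KvType t) {
    return t == KvType::Q8_0 || t == KvType::Q4_0;
}

// Decision table as described in the text:
//   null backend -> no-op, requested type returned unchanged
//   CPU backend  -> requested type kept (q8/f16/f32 pass through)
//   GPU backend  -> quantized requests fall back to f32
// Assumption: f16/f32 requests on GPU are returned unchanged.
constexpr KvType mtl_guard_kv_type(const Backend * backend, KvType requested) {
    if (backend == nullptr) {
        return requested;
    }
    if (backend->kind == DeviceKind::Cpu) {
        return requested;
    }
    return is_quantized(requested) ? KvType::F32 : requested;
}
```

Keying on the device kind means the result does not depend on how a given build names its Metal backend, which is the failure mode of the name-based check described above.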
Tests (the coverage gap that let this ship)
- test_kv_cache_type (unit, all fleets): unit-tests chatterbox_mtl_guard_kv_type's pass-through branches: CPU keeps q8/f16/f32, null backend is a no-op. Protects the guard against being removed or broken.
- test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal, CONT(q8_0 strided)) == false, the exact limitation behind the crash, plus that the f32 fallback target and CPU quantized CONT stay supported. Trips the day ggml grows a quantized CONT kernel, signalling the guard can be relaxed and GPU q8 KV revisited.
Validation (local, stock ggml 0.13.0 Metal, M2)
…CONT) | test_kv_cache_type | test_metal_ops | quantized-cont sentinel | ok (Metal CONT(q8_0) unsupported…) | ✅
Trade-off / follow-up
This is a correctness stopgap: it stops the crash but gives back the q8 KV memory saving on MTL-GPU (f32 is ~4× the q8 footprint, allocated eagerly at n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG attention to avoid the strided-quantized-view CONT, together with an end-to-end MTL × backend × kv-type synthesis matrix on the device fleets (the unit/op tests here cover the guard logic and the ggml invariant, not full on-device synthesis).
Refs QVAC-19557
🤖 Generated with Claude Code