tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (fix Metal q8 CONT SIGABRT)#70

Closed
ogad-tether wants to merge 1 commit into master from feat/qvac-19557-mtl-metal-q8-guard

@ogad-tether ogad-tether commented Jun 26, 2026

Problem

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend whenever the KV cache is quantized (q8_0/q4_0):

ggml_metal_op_encode_impl: error: unsupported op 'CONT'
…ggml-metal-ops.cpp:203: unsupported op            (in chatterbox::detail::eval_step_mtl)

Reported from the field on macOS Apple Silicon (arm64) with useGPU: true (@qvac/tts-ggml@0.3.x). nGpuLayers values of 0, 32, and 99 all crash identically. loadModel succeeds, and CPU (useGPU: false) works. GPU worked in 0.2.5 because q8 KV was opt-in there.

Root cause

The MTL variant's batched-CFG (B=2) decode reads the token-major K/V cache as a 4D strided view (build_llama_block, ggml_view_4d at t3_mtl.cpp:783), which the GPU flash-attn path materialises through a CONT. ggml-metal has no CONT kernel for quantized tensors. Because the MTL path runs a single-backend ggml_backend_graph_compute (not the multi-backend scheduler), ggml_backend_supports_op is never consulted at runtime — so the CONT is placed on Metal unconditionally and rejected at encode time.

The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not the downstream CONT — which is why tetherto/qvac#2527 (q8 KV as the default for all variants) shipped a broken MTL Metal path undetected. The gated SDK e2e (consumer.ts) covers Turbo only, so the MTL path was never exercised on Metal.
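
To make the missing step concrete, here is a minimal sketch of a pre-flight check that walks every node of a graph and asks the backend whether it can run it, which is roughly what the multi-backend scheduler does and the single-backend path skips. The `Node` and `Backend` types and both function names are hypothetical stand-ins written for this illustration, not ggml types or the code in this PR.

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins for a graph node and a backend; not ggml types.
struct Node {
    std::string op;         // e.g. "FLASH_ATTN_EXT", "CONT"
    bool        quantized;  // true when the op's source tensor is q8_0/q4_0
};

struct Backend {
    std::string name;
    std::function<bool(const Node &)> supports_op;
};

// Returns the first node the backend cannot run, or nullopt if every node is supported.
std::optional<Node> first_unsupported(const Backend & be, const std::vector<Node> & graph) {
    for (const Node & n : graph) {
        if (!be.supports_op(n)) {
            return n;
        }
    }
    return std::nullopt;
}

// Models the limitation described above: no CONT kernel for quantized tensors.
Backend make_gpu_like_backend() {
    return Backend{"gpu-like", [](const Node & n) {
        return !(n.op == "CONT" && n.quantized);
    }};
}
```

A probe that inspects only the flash-attention node would pass on this model backend, while a walk over the whole graph reports the quantized CONT before anything is encoded.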

Fix

chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU backend (where the quantized CONT is supported) and returns f32 on any GPU backend:

  • Gates on !backend_is_cpu(...) by device type rather than a backend name — deliberately robust across ggml builds whose Metal registry name differs ("Metal" vs "MTL"; stock ggml reports MTL, so a name-based check silently never matched).
  • Composes with the existing Vulkan force-f32 inside chatterbox_resolve_kv_type.

Scope: MTL variant only. The Turbo variant (separate eval path, in the gated SDK e2e) is unaffected and keeps quantized KV on GPU — verified on Metal.
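
The guard's decision table can be sketched as below. This is an illustration only: the `KvType` and `DeviceType` enums and the `guard_kv_type` signature are hypothetical, and the sketch assumes that non-quantized types pass through unchanged on GPU, which the description does not state explicitly.

```cpp
#include <optional>

// Hypothetical enums for illustration; the real code works with ggml's own types.
enum class KvType { F32, F16, Q8_0, Q4_0 };
enum class DeviceType { CPU, GPU };

bool is_quantized(KvType t) {
    return t == KvType::Q8_0 || t == KvType::Q4_0;
}

// Sketch of the guard: a quantized KV cache is kept only on CPU.
// An absent device models the "null backend" case and leaves the request untouched.
KvType guard_kv_type(KvType requested, std::optional<DeviceType> device) {
    if (!device.has_value()) {
        return requested;  // null backend: no-op
    }
    if (*device == DeviceType::CPU) {
        return requested;  // CPU supports the quantized CONT
    }
    // Any non-CPU device: fall back to f32 for quantized requests.
    // Assumption: f16/f32 requests are returned as-is.
    return is_quantized(requested) ? KvType::F32 : requested;
}
```

Keying the check on the device type rather than a name string means the result does not depend on whether a given build registers the backend as "Metal" or "MTL".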

Tests (the coverage gap that let this ship)

  • test_kv_cache_type (unit, all fleets): unit-tests chatterbox_mtl_guard_kv_type's pass-through branches — CPU keeps q8/f16/f32, null backend is a no-op. Protects the guard against being removed/broken.
  • test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal, CONT(q8_0 strided)) == false — the exact limitation behind the crash — plus that the f32 fallback target and CPU quantized CONT stay supported. Trips the day ggml grows a quantized CONT kernel, signalling the guard can be relaxed and GPU q8 KV revisited.

Validation (local, stock ggml 0.13.0 Metal, M2)

| Path | Before | After |
| --- | --- | --- |
| MTL + Metal + q8_0 | SIGABRT (CONT) | f32 fallback, synthesizes ✅ |
| MTL + CPU + q8_0 | works | unchanged (keeps q8) ✅ |
| Turbo + Metal + q8_0 | works | unchanged (keeps q8) ✅ |
| test_kv_cache_type | | all checks pass ✅ |
| test_metal_ops | | quantized-cont sentinel ok (Metal CONT(q8_0) unsupported…) |
eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf ref.wav 99 q8_0
# before: SIGABRT 'CONT'   after: "…using f32 KV cache" + synth1=72000 samples

Trade-off / follow-up

This is a correctness stopgap: it stops the crash but gives back the q8 KV memory saving on MTL-GPU (f32 is ~4× the q8 footprint, allocated eagerly at n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG attention to avoid the strided-quantized-view CONT, together with an end-to-end MTL × backend × kv-type synthesis matrix on the device fleets (the unit/op tests here cover the guard logic and the ggml invariant, not full on-device synthesis).
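
The "~4×" figure follows from the block layouts: q8_0 stores 32 values in a 34-byte block (a 2-byte fp16 scale plus 32 int8 values), while f32 needs 128 bytes for the same 32 values. A small sketch of that arithmetic follows; the `kv_elements` shape formula and its parameter names are assumptions for illustration, not the model's actual cache dimensions.

```cpp
#include <cstddef>

// Number of values per quantization block for q8_0 and q4_0.
constexpr std::size_t kBlock = 32;

// Bytes needed to store n elements in each cache type (n must be a multiple of 32
// for the quantized types).
// q8_0: 34-byte block = 2-byte fp16 scale + 32 int8 values.
// q4_0: 18-byte block = 2-byte fp16 scale + 16 bytes of packed 4-bit values.
constexpr std::size_t bytes_f32(std::size_t n)  { return n * 4; }
constexpr std::size_t bytes_q8_0(std::size_t n) { return n / kBlock * 34; }
constexpr std::size_t bytes_q4_0(std::size_t n) { return n / kBlock * 18; }

// Hypothetical cache shape: K and V, each n_layer * n_ctx * n_embd elements per batch entry.
constexpr std::size_t kv_elements(std::size_t n_layer, std::size_t n_ctx,
                                  std::size_t n_embd, std::size_t batch) {
    return 2 * n_layer * n_ctx * n_embd * batch;
}
```

The ratio is 128 / 34 ≈ 3.76 regardless of the cache shape, and because the cache is allocated eagerly at n_ctx the full f32 cost is paid up front.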

Refs QVAC-19557

🤖 Generated with Claude Code

@ogad-tether ogad-tether requested review from a team as code owners June 26, 2026 19:26
@github-actions

Review Status

Current Status: ❌ PENDING
Approvals so far: none

Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.

…NT kernel)

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the
Metal backend with a quantized (q8_0/q4_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT'
    ...ggml-metal-ops.cpp:203: unsupported op   (in eval_step_mtl)

Root cause: the MTL variant's batched-CFG (B=2) decode reads the token-major
K/V cache as a 4D strided view (build_llama_block), which the GPU flash-attn
path materialises through a CONT. ggml-metal has no CONT kernel for quantized
tensors. Because the MTL path runs a single-backend graph_compute (not the
multi-backend scheduler), ggml_backend_supports_op is never consulted at
runtime — so the CONT is placed on Metal unconditionally and rejected at encode
time. The capability probe in chatterbox_resolve_kv_type only validates
flash_attn_ext, not the downstream CONT, which is why tetherto/qvac#2527 (q8 KV as the
default) shipped a broken MTL Metal path undetected.

Fix: chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU
backend (where the quantized CONT is supported) and returns f32 on any GPU
backend. Gating on "not CPU" by device type rather than a backend name is
deliberately robust across ggml builds whose Metal registry name differs
("Metal" vs "MTL" — the latter is what stock ggml reports, so a name-based
check silently never matched). Composes with the existing Vulkan force-f32
inside chatterbox_resolve_kv_type.

Scope: MTL variant only. The Turbo variant (separate eval path, in the gated
SDK e2e) is unaffected and keeps quantized KV on GPU — verified on Metal.

Tests (this is the coverage gap that let it ship):
  - test_kv_cache_type: unit-tests chatterbox_mtl_guard_kv_type's pass-through
    branches (CPU keeps q8/f16/f32; null backend is a no-op) — runs on every
    fleet.
  - test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal,
    CONT(q8_0 strided)) == false — the exact limitation behind the crash — and
    that the f32 fallback target + CPU quantized CONT stay supported. Trips the
    day ggml grows a quantized CONT kernel, signalling the guard can be relaxed.

This is a correctness stopgap: it stops the crash but gives back the q8 KV
memory saving on MTL-GPU (f32 is roughly 4x the q8 footprint, allocated eagerly at
n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG
attention to avoid the strided-quantized-view CONT, together with an end-to-end
MTL x backend x kv-type synthesis matrix on the device fleets.

Repro (local): eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf
ref.wav 99 q8_0 -> SIGABRT before fix, synthesizes (f32 fallback) after.

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ogad-tether

Superseded by #71, which is now a standalone fix off master (the real q8-on-GPU fix via the align-probe dequant cast, replacing this f32-fallback stopgap). Closing in favor of #71.
