tts-cpp: chatterbox-mtl — fall back to f32 KV on GPU (fix Metal q8 CONT SIGABRT) by ogad-tether · Pull Request #70 · tetherto/qvac-ext-lib-whisper.cpp

ogad-tether · 2026-06-26T19:26:28Z

Problem

The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend whenever the KV cache is quantized (q8_0/q4_0):

ggml_metal_op_encode_impl: error: unsupported op 'CONT'
…ggml-metal-ops.cpp:203: unsupported op            (in chatterbox::detail::eval_step_mtl)

Reported from the field on macOS Apple Silicon (arm64) with useGPU: true (@qvac/tts-ggml@0.3.x). nGpuLayers values of 0, 32 and 99 all crash identically; loadModel succeeds, and CPU (useGPU: false) works. GPU worked in 0.2.5 because q8 KV was opt-in there.

Root cause

The MTL variant's batched-CFG (B=2) decode reads the token-major K/V cache as a 4D strided view (build_llama_block, ggml_view_4d at t3_mtl.cpp:783), which the GPU flash-attn path materialises through a CONT. ggml-metal has no CONT kernel for quantized tensors. Because the MTL path runs a single-backend ggml_backend_graph_compute (not the multi-backend scheduler), ggml_backend_supports_op is never consulted at runtime — so the CONT is placed on Metal unconditionally and rejected at encode time.

The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not the downstream CONT — which is why tetherto/qvac#2527 (q8 KV as the default for all variants) shipped a broken MTL Metal path undetected. The gated SDK e2e (consumer.ts) covers Turbo only, so the MTL path was never exercised on Metal.
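To make the gap concrete, here is a minimal, self-contained sketch of the difference between probing a single op and checking every node of the graph before it is handed to one backend. Everything in it is a hypothetical stand-in: `Op`, `Type`, `Node`, `first_unsupported` and `toy_metal_supports` are illustrative names, not ggml or tts-cpp API, and the toy predicate only encodes the limitation described above (no CONT for quantized tensors).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for graph nodes; not ggml types.
enum class Op { FlashAttnExt, Cont, MulMat };
enum class Type { F32, Q8_0 };

struct Node {
    Op   op;
    Type type;
};

using SupportsOp = std::function<bool(const Node&)>;

// Walk the whole graph and return the index of the first node the backend
// rejects, or -1 if every node is supported. A probe that looks at only one
// op kind (for example flash attention) cannot see a later unsupported node.
inline int first_unsupported(const std::vector<Node>& graph, const SupportsOp& supports) {
    assert(static_cast<bool>(supports));  // a predicate must be provided
    for (std::size_t i = 0; i < graph.size(); ++i) {
        if (!supports(graph[i])) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Toy model of the limitation described above: CONT on a quantized tensor
// is the only unsupported combination.
inline bool toy_metal_supports(const Node& n) {
    return !(n.op == Op::Cont && n.type == Type::Q8_0);
}
```

With a graph of `{FlashAttnExt, Q8_0}` followed by `{Cont, Q8_0}`, a flash-attention-only probe passes, while `first_unsupported` reports index 1. A multi-backend scheduler would normally make this kind of per-node decision; a single-backend compute call does not.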

Fix

chatterbox_mtl_guard_kv_type() restricts a quantized KV cache to the CPU backend (where the quantized CONT is supported) and returns f32 on any GPU backend:

- Gates on !backend_is_cpu(...) by device type rather than by backend name. This is deliberately robust across ggml builds whose Metal registry name differs ("Metal" vs "MTL"): stock ggml reports MTL, so a name-based check silently never matched.
- Composes with the existing Vulkan force-f32 inside chatterbox_resolve_kv_type.
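The decision the guard makes can be sketched as below. This is not the code from the PR: `KvType`, `DeviceType`, `Backend` and `mtl_guard_kv_type` are hypothetical stand-ins so the example compiles on its own, and the real function presumably asks ggml for the device type instead. One assumption beyond the description: non-quantized requests (f16, f32) on a GPU are passed through unchanged.

```cpp
#include <cassert>

// Hypothetical stand-ins for ggml's tensor types and device classes.
enum class KvType { F32, F16, Q8_0, Q4_0 };
enum class DeviceType { CPU, GPU };

// Stand-in for a backend handle; a null pointer means "no backend".
struct Backend {
    DeviceType device;
};

constexpr bool is_quantized(KvType t) {
    return t == KvType::Q8_0 || t == KvType::Q4_0;
}

// Gate on the device type, never on a backend name string.
inline KvType mtl_guard_kv_type(const Backend* backend, KvType requested) {
    if (backend == nullptr) {
        return requested;  // null backend: no-op
    }
    if (backend->device == DeviceType::CPU) {
        return requested;  // CPU keeps q8/f16/f32
    }
    // Any non-CPU backend: quantized KV falls back to f32.
    // Assumption: non-quantized requests are left as they are.
    return is_quantized(requested) ? KvType::F32 : requested;
}

// Usage example covering the branches described above.
inline void guard_examples() {
    const Backend cpu{DeviceType::CPU};
    const Backend gpu{DeviceType::GPU};
    assert(mtl_guard_kv_type(&cpu, KvType::Q8_0) == KvType::Q8_0);
    assert(mtl_guard_kv_type(&gpu, KvType::Q8_0) == KvType::F32);
    assert(mtl_guard_kv_type(nullptr, KvType::Q4_0) == KvType::Q4_0);
}
```

Keying on the device class means the check does not depend on whether a build registers the backend as "Metal" or "MTL".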

Scope: MTL variant only. The Turbo variant (separate eval path, in the gated SDK e2e) is unaffected and keeps quantized KV on GPU — verified on Metal.

Tests (the coverage gap that let this ship)

- test_kv_cache_type (unit, all fleets): unit-tests the pass-through branches of chatterbox_mtl_guard_kv_type: CPU keeps q8/f16/f32, and a null backend is a no-op. Protects the guard against being removed or broken.
- test_metal_ops (gpu label): asserts ggml_backend_supports_op(metal, CONT(q8_0 strided)) == false, the exact limitation behind the crash, and also that the f32 fallback target and the CPU quantized CONT stay supported. It trips the day ggml grows a quantized CONT kernel, signalling that the guard can be relaxed and GPU q8 KV revisited.

Validation (local, stock ggml 0.13.0 Metal, M2)

| Path | Before | After |
|---|---|---|
| MTL + Metal + q8_0 | SIGABRT (`CONT`) | f32 fallback, synthesizes ✅ |
| MTL + CPU + q8_0 | works | unchanged (keeps q8) ✅ |
| Turbo + Metal + q8_0 | works | unchanged (keeps q8) ✅ |
| `test_kv_cache_type` | n/a | all checks pass ✅ |
| `test_metal_ops` quantized-cont sentinel | n/a | `ok (Metal CONT(q8_0) unsupported…)` ✅ |

eng_footprint_driver chatterbox-t3-mtl.gguf chatterbox-s3gen-mtl.gguf ref.wav 99 q8_0
# before: SIGABRT 'CONT'   after: "…using f32 KV cache" + synth1=72000 samples

Trade-off / follow-up

This is a correctness stopgap: it stops the crash but gives back the q8 KV memory saving on MTL-GPU (f32 is ~4× the q8 footprint, allocated eagerly at n_ctx). A follow-up will recover GPU q8 for MTL by reworking the batched-CFG attention to avoid the strided-quantized-view CONT, together with an end-to-end MTL × backend × kv-type synthesis matrix on the device fleets (the unit/op tests here cover the guard logic and the ggml invariant, not full on-device synthesis).
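The "~4×" figure follows from ggml's block layouts: q8_0 stores 32 values as one fp16 scale plus 32 int8 values (34 bytes), while f32 needs 128 bytes for the same 32 values, a ratio of 128/34 ≈ 3.76. The sketch below works that out and sizes a cache for given dimensions. The helper names and the dimension parameters (`n_layer`, `n_ctx`, `n_embd_kv`) are illustrative assumptions, not values taken from the Chatterbox models.

```cpp
#include <cassert>
#include <cstddef>

// ggml quantized types pack 32 elements per block.
constexpr std::size_t kBlockElems    = 32;
constexpr std::size_t kF32BlockBytes = 32 * 4;  // 32 floats = 128 bytes
constexpr std::size_t kQ8BlockBytes  = 2 + 32;  // fp16 scale + 32 int8 = 34 bytes
constexpr std::size_t kQ4BlockBytes  = 2 + 16;  // fp16 scale + 16 packed bytes = 18 bytes

// Bytes for a K+V cache allocated eagerly at n_ctx.
// n_embd_kv is the per-token K (or V) width; it must be a multiple of 32.
inline std::size_t kv_cache_bytes(std::size_t n_layer, std::size_t n_ctx,
                                  std::size_t n_embd_kv, std::size_t block_bytes) {
    assert(n_embd_kv % kBlockElems == 0);
    const std::size_t blocks_per_row = n_embd_kv / kBlockElems;
    return 2 * n_layer * n_ctx * blocks_per_row * block_bytes;  // 2 = K and V
}

// How many times larger f32 is than a quantized layout.
inline double f32_over(std::size_t quant_block_bytes) {
    return static_cast<double>(kF32BlockBytes) / static_cast<double>(quant_block_bytes);
}
```

For example, `f32_over(kQ8BlockBytes)` is about 3.76 and `f32_over(kQ4BlockBytes)` is about 7.11, so the fallback costs proportionally more for anyone who had configured q4_0.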

Refs QVAC-19557

🤖 Generated with Claude Code

github-actions · 2026-06-26T19:26:38Z

Review Status

Current Status: ❌ PENDING
Approvals so far: none

Pending reviews: Needs 1 Management or Team Lead, and 1 more from Management, Team Lead, or Member.


ogad-tether · 2026-06-29T14:33:24Z

Superseded by #71, which is now a standalone fix off master (the real q8-on-GPU fix via the align-probe dequant cast, replacing this f32-fallback stopgap). Closing in favor of #71.

ogad-tether requested review from a team as code owners June 26, 2026 19:26

ogad-tether force-pushed the feat/qvac-19557-mtl-metal-q8-guard branch from 1a43ae6 to 3917edd Compare June 26, 2026 19:35

ogad-tether mentioned this pull request Jun 29, 2026

tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast) #71

Open

ogad-tether closed this Jun 29, 2026
