tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast)

ogad-tether · claude · ogad-tether · commit 12749a05019f · 2026-06-29T15:32:52.000+01:00
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend with a quantized (q8_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT' (in eval_step_mtl)

Root cause (it was NOT flash-attention): flash_attn_ext reads the q8 strided K/V cache fine on Metal. Walking the decode graph showed that the only Metal-unsupported op was the per-(layer,head) alignment probe in build_llama_block, which ggml_cont'd a strided view of the q8 K cache to feed a mul_mat. ggml-metal has no CONT kernel for quantized tensors, and the MTL path runs a single-backend graph_compute (no scheduler fallback), so it crashed at encode time. The capability probe in chatterbox_resolve_kv_type validates only flash_attn_ext, not this CONT, which is why ggml-org#2527 (q8 KV as the default) shipped a broken MTL Metal path undetected.

Fix: replace that same-type ggml_cont with a dequantizing ggml_cast(...->f32). Metal supports a dequantizing copy of a strided quantized view (verified against ggml-metal source: Q8_0->F32 routes through the supported CPY path, with no contiguity check, whereas the old q8->q8 cont hit `default: return false`). For an f32/f16 cache the cast degrades to a cheap cont/upcast. This recovers q8 KV on the GPU: pure memory savings, no compute cost (ggml-metal's flash-attn runs its matmul at f16 internally regardless of KV storage dtype).

Cross-backend safety net: removing the blanket "f32 on any non-CPU backend" guard exposes q8 KV to all GPU backends. Instead of a name/type check, chatterbox_mtl_resolve_kv_type now probes the exact align-probe cast op per backend and falls back to f32 when the backend can't encode it (e.g. thin-op OpenCL/Adreno or Mali-Vulkan builds). Vulkan quantized K/V stays force-f32'd in chatterbox_resolve_kv_type (coopmat2). The pure decision is factored into chatterbox_mtl_kv_type_for_cast_support so the fallback branch is unit-testable.
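To make "a dequantizing copy of a strided quantized view" concrete, here is a minimal standalone sketch of what such a copy computes. It does not call ggml. BlockQ8, quantize_row, dequant_strided and demo_max_error are hypothetical names used only for illustration, and the block keeps its scale as a float for simplicity (ggml's q8_0 block stores an fp16 scale next to 32 int8 values). The assumed layout is the token-major cache described above: one token row holds head_dim * n_kv_head elements, and the probe reads the head_dim elements of one head from each of several token rows.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a q8_0 block: 32 int8 quants sharing one scale.
// Assumption: the scale is a float here; ggml stores it as fp16.
constexpr int QK = 32;
struct BlockQ8 {
    float  d;
    int8_t qs[QK];
};

// Quantize n floats (n must be a multiple of QK) into n / QK blocks.
void quantize_row(const float * x, BlockQ8 * y, int n) {
    for (int b = 0; b < n / QK; ++b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) amax = std::fmax(amax, std::fabs(x[b * QK + i]));
        const float d  = amax / 127.0f;
        const float id = d > 0.0f ? 1.0f / d : 0.0f;
        y[b].d = d;
        for (int i = 0; i < QK; ++i) y[b].qs[i] = (int8_t) std::lround(x[b * QK + i] * id);
    }
}

// Dequantizing strided copy: gather n_rows rows of head_dim elements, where
// consecutive rows are row_stride_blocks apart in the cache, and write them
// out as contiguous f32. The source rows are NOT contiguous with each other.
std::vector<float> dequant_strided(const BlockQ8 * cache, int head_dim, int n_rows,
                                   size_t row_stride_blocks, size_t first_block) {
    std::vector<float> out((size_t) head_dim * n_rows);
    for (int r = 0; r < n_rows; ++r) {
        const BlockQ8 * row = cache + first_block + (size_t) r * row_stride_blocks;
        for (int b = 0; b < head_dim / QK; ++b)
            for (int i = 0; i < QK; ++i)
                out[(size_t) r * head_dim + b * QK + i] = row[b].d * (float) row[b].qs[i];
    }
    return out;
}

// Build a small token-major cache, pull one head's slice for a few tokens,
// and return the largest absolute round-trip error.
float demo_max_error() {
    const int head_dim = 64, n_kv_head = 4, n_tok = 8, head = 2, n_rows = 4;
    const int row_elems = head_dim * n_kv_head;  // elements in one token row
    std::vector<float> x((size_t) row_elems * n_tok);
    for (size_t i = 0; i < x.size(); ++i) x[i] = std::sin(0.1f * (float) i);

    std::vector<BlockQ8> cache(x.size() / QK);
    quantize_row(x.data(), cache.data(), (int) x.size());

    const std::vector<float> out = dequant_strided(
        cache.data(), head_dim, n_rows,
        (size_t) row_elems / QK,              // stride: one token row, in blocks
        (size_t) head * head_dim / QK);       // offset of the chosen head, in blocks

    float err = 0.0f;
    for (int r = 0; r < n_rows; ++r)
        for (int j = 0; j < head_dim; ++j)
            err = std::fmax(err, std::fabs(out[(size_t) r * head_dim + j]
                                           - x[(size_t) r * row_elems + head * head_dim + j]));
    return err;
}
```

demo_max_error() returns the largest absolute difference between the original floats and the dequantized slice. With inputs in [-1, 1] that difference is bounded by about half a quantization step (roughly 1/254), which is the kind of small perturbation the q8 round-trip test is meant to keep from shifting alignment.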
Tests:
- test_kv_cache_type: chatterbox_mtl_resolve_kv_type pass-through + the cross-backend fallback branch (cast unsupported -> f32) via the pure helper.
- test_metal_ops (gpu): CAST(q8_0 strided -> f32) is supported on Metal and CONT(q8_0) is not. Uses the same 2D strided shape as the align probe and the resolve probe, so the sentinel mirrors the real op.
- test_multilingual_synth: --kv-cache-type passthrough + mtl-synth-q8-<lang> ctest variants (en/ar/ru/hi). Missing GGUFs -> SKIP (77), not fail.
- test-eos-roundtrip-q8-kv: CER/ramble round-trip under a q8_0 KV cache to catch alignment/EOS drift from the dequant (WAV-sanity tests can't see it).

Validated on macOS Metal (M2) and on-device iOS (iPhone 17 Pro Max, A19 Pro Metal): q8 MTL synthesizes across es/fr/de/pt with no CONT crash; q8-vs-f32 perf is within run-to-run noise (the change is perf-neutral on the f32/f16 paths and q8 KV is no slower than f16).

Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
diff --git a/tts-cpp/CMakeLists.txt b/tts-cpp/CMakeLists.txt
@@ -733,6 +733,7 @@ if (TTS_CPP_BUILD_TESTS)
         )
         set_tests_properties(mtl-synth-${_lang} PROPERTIES
             LABELS "multilingual;mtl-${_tier}"
+            SKIP_RETURN_CODE 77
             TIMEOUT 180
         )
         if (_mtl_asr_enabled)
@@ -754,6 +755,44 @@ if (TTS_CPP_BUILD_TESTS)
         endif()
     endforeach()
 
+    # QVAC-19557: q8_0 KV-cache regression for the MTL variant.  A quantized KV
+    # cache used to SIGABRT on Metal ("unsupported op 'CONT'") via the alignment
+    # probe; the dequantizing-cast fix recovers q8 KV on the GPU.  These run the
+    # full synth with --kv-cache-type q8_0 and validate the WAV is real audio
+    # (RMS/peak/clipping/duration) — an end-to-end regression the op-level
+    # test_metal_ops sentinel can't cover.  A diverse script subset
+    # (Latin/Arabic/Cyrillic/Devanagari) doubles as cross-language coverage.
+    # IMPORTANT: these only exercise the *original crash path* on a Metal/GPU
+    # fleet — on a CPU-only runner q8 KV never crashed, so q8-vs-f32 there only
+    # checks the dequant-cast is numerically sane, not the SIGABRT.  Labelled
+    # `mtl-q8;gpu` so CI selects them on a Metal runner (`ctest -L mtl-q8`); the
+    # alignment-drift regression (CER) is covered by test-eos-roundtrip-q8-kv
+    # below.  Missing GGUFs -> SKIP (77), not fail, like the f32 tests above.
+    set(_mtl_q8_phrases
+        "en|Hello, this is a multilingual text-to-speech test."
+        "ar|مرحبًا، هذا اختبار متعدد اللغات لتحويل النص إلى كلام."
+        "ru|Привет, это многоязычный тест синтеза речи."
+        "hi|नमस्ते, यह एक बहुभाषी वाक् संश्लेषण परीक्षण है।"
+    )
+    foreach(_entry IN LISTS _mtl_q8_phrases)
+        string(REPLACE "|" ";" _parts "${_entry}")
+        list(GET _parts 0 _lang)
+        list(GET _parts 1 _text)
+        add_test(
+            NAME mtl-synth-q8-${_lang}
+            COMMAND $<TARGET_FILE:test-multilingual-synth>
+                --lang           "${_lang}"
+                --text           "${_text}"
+                --kv-cache-type  q8_0
+                --out            "${_mtl_out_dir}/${_lang}-q8.wav"
+        )
+        set_tests_properties(mtl-synth-q8-${_lang} PROPERTIES
+            LABELS "multilingual;mtl-q8;gpu"
+            SKIP_RETURN_CODE 77
+            TIMEOUT 180
+        )
+    endforeach()
+
     # End-to-end EOS round-trip regression.  Drives tts-cli to
     # synthesize a set of English phrases, transcribes with whisper-cli, and
     # asserts the transcription is close to the input (CER guard -> catches
@@ -790,6 +829,25 @@ if (TTS_CPP_BUILD_TESTS)
                      --lang          "en"
                      --tmp           "${_mtl_out_dir}"
             REQUIRES "${_tcb_t3_mtl_q4_gguf}" "${_tcb_s3gen_mtl_gguf}")
+
+        # QVAC-19557: same EOS round-trip but with a q8_0 KV CACHE (not a quantized
+        # T3).  The align-probe fix feeds the alignment mul_mat *dequantized* q8
+        # keys (ggml_cast), which could shift the argmax/softmax over text columns
+        # and thus EOS timing — something the WAV-sanity mtl-synth-q8-* tests can't
+        # see.  This drives the CER/ramble guards under q8 KV to catch alignment
+        # drift (clipping / early-cutoff / rambling) vs the f32 baseline above.
+        tts_cpp_register_test(test-eos-roundtrip-q8-kv
+            EXE      test-eos-roundtrip
+            LABEL    "fixture;asr"
+            ARGS     --tts-cli       "$<TARGET_FILE:tts-cli>"
+                     --t3            "${_tcb_t3_mtl_gguf}"
+                     --s3gen         "${_tcb_s3gen_mtl_gguf}"
+                     --whisper-cli   "${WHISPER_CLI}"
+                     --whisper-model "${WHISPER_MODEL}"
+                     --lang          "en"
+                     --kv-cache-type q8_0
+                     --tmp           "${_mtl_out_dir}"
+            REQUIRES "${_tcb_t3_mtl_gguf}" "${_tcb_s3gen_mtl_gguf}")
     else()
         message(STATUS "tts-cpp: test-eos-roundtrip built but not registered (needs whisper-cli + model + tts-cli)")
     endif()
diff --git a/tts-cpp/src/chatterbox_t3_internal.h b/tts-cpp/src/chatterbox_t3_internal.h
@@ -119,6 +119,21 @@ ggml_type chatterbox_kv_type_from_str(const std::string & s);
 ggml_type chatterbox_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
                                      int head_dim, int n_head, int n_kv_head);
 
+// MTL-variant resolve: chatterbox_resolve_kv_type plus a probe of the extra
+// quantized-cache op the multilingual decode graph emits — the alignment
+// probe's dequantizing cast of a strided q8 K-cache view to f32
+// (build_llama_block).  Returns f32 when the backend can't encode that cast, so
+// q8 KV stays enabled on backends that support it (Metal) and safely degrades on
+// those that don't, without the single-backend MTL graph SIGABRT'ing at compute.
+ggml_type chatterbox_mtl_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
+                                         int head_dim, int n_head, int n_kv_head);
+
+// Pure decision behind chatterbox_mtl_resolve_kv_type's cross-backend safety net,
+// exposed for unit testing: returns GGML_TYPE_F32 when `resolved` is quantized
+// and the backend cannot encode the align-probe's strided q8->f32 cast, else
+// returns `resolved` unchanged.
+ggml_type chatterbox_mtl_kv_type_for_cast_support(ggml_type resolved, bool cast_supported);
+
 struct gpt2_layer {
     ggml_tensor * ln_1_g = nullptr;
     ggml_tensor * ln_1_b = nullptr;
diff --git a/tts-cpp/src/main.cpp b/tts-cpp/src/main.cpp
@@ -402,6 +402,54 @@ ggml_type chatterbox_resolve_kv_type(ggml_backend_t backend, ggml_type requested
     return requested;
 }
 
+ggml_type chatterbox_mtl_kv_type_for_cast_support(ggml_type resolved, bool cast_supported) {
+    // Pure decision split out of chatterbox_mtl_resolve_kv_type so the
+    // cross-backend safety net is unit-testable without a GPU that actually
+    // lacks the cast: a quantized KV type is downgraded to f32 when the backend
+    // cannot encode the align-probe's strided q8->f32 cast; non-quantized types
+    // (and the cast-supported case) pass through unchanged.
+    if (ggml_is_quantized(resolved) && !cast_supported) return GGML_TYPE_F32;
+    return resolved;
+}
+
+ggml_type chatterbox_mtl_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
+                                         int head_dim, int n_head, int n_kv_head) {
+    // Start from the shared resolve (flash_attn_ext probe + Vulkan coopmat2
+    // force-f32).  The MTL decode graph emits one MORE quantized-cache op the
+    // shared probe doesn't cover: the per-(layer,head) alignment probe
+    // dequantizes a STRIDED view of the quantized K cache via ggml_cast(...->f32)
+    // (build_llama_block).  ggml-metal supports that cast (which is why q8 KV now
+    // runs on Metal), but a GPU backend with thinner op coverage
+    // (e.g. some OpenCL/Adreno or Mali-Vulkan builds) can advertise q8 flash-attn
+    // yet be unable to encode the strided q8->f32 cast — and the MTL path runs a
+    // single-backend graph_compute with no scheduler fallback, so that would
+    // SIGABRT at compute.  Probe the cast op directly and fall back to f32 when
+    // the backend can't encode it, instead of the old blanket "force f32 on any
+    // non-CPU backend" guard (which also blocked Metal, the whole bug).
+    ggml_type t = chatterbox_resolve_kv_type(backend, requested, head_dim, n_head, n_kv_head);
+    if (!ggml_is_quantized(t) || !backend) return t;
+
+    bool cast_ok = false;
+    ggml_init_params pp = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
+    if (ggml_context * pc = ggml_init(pp)) {
+        // Mirror the align probe: a strided [head_dim, k] view of the token-major
+        // q8 cache, cast to f32.  Strides come from ggml_row_size so the view is
+        // block-aligned exactly as build_llama_block builds it.
+        const size_t tok_row = ggml_row_size(t, (size_t) head_dim * n_kv_head);
+        ggml_tensor * cache  = ggml_new_tensor_1d(pc, t, (int64_t) head_dim * n_kv_head * 8);
+        ggml_tensor * view   = ggml_view_2d(pc, cache, head_dim, 4, tok_row, 0);
+        ggml_tensor * cast   = ggml_cast(pc, view, GGML_TYPE_F32);
+        cast_ok = (cast != nullptr) && ggml_backend_supports_op(backend, cast);
+        ggml_free(pc);
+    }
+    const ggml_type eff = chatterbox_mtl_kv_type_for_cast_support(t, cast_ok);
+    if (eff != t) {
+        fprintf(stderr, "chatterbox(mtl): backend cannot encode the quantized-KV alignment "
+                        "cast (%s strided -> f32); using f32 KV cache\n", ggml_type_name(t));
+    }
+    return eff;
+}
+
 bool load_model_gguf(const std::string & path, chatterbox_model & model, int requested_ctx, int n_gpu_layers, ggml_type kv_type) {
     {
         gguf_init_params peek_params = { /*.no_alloc=*/ true, /*.ctx=*/ nullptr };
diff --git a/tts-cpp/src/t3_mtl.cpp b/tts-cpp/src/t3_mtl.cpp
@@ -714,7 +714,16 @@ ggml_tensor * build_llama_block(ggml_context * ctx, ggml_cgraph * gf,
                     layer_off + (size_t) probe_head * kv_head_row
                               + (size_t) ti * kv_tok_row);
             }
-            k_text = ggml_cont(ctx, k_text);
+            // Dequantize-and-pack the text-key slice to contiguous f32 for the
+            // mul_mat.  When the KV cache is quantized (q8_0), a plain ggml_cont
+            // would produce a quantized CONT, which ggml-metal can't encode (no
+            // quantized CONT kernel) — it SIGABRTs the whole decode.  ggml_cast
+            // to f32 is a dequantizing copy that Metal *does* support for a
+            // strided quantized view (QVAC-19557), and for an f32/f16 cache it
+            // degrades to a cheap cont/upcast.  This slice is tiny (HD × n_text
+            // for one head) and off the hot path, so the dequant cost is
+            // negligible.
+            k_text = ggml_cast(ctx, k_text, GGML_TYPE_F32);
             ggml_tensor * scores = ggml_mul_mat(ctx, k_text, q_h);        // (n_text, N)
             scores = ggml_scale(ctx, scores, 1.0f / std::sqrt((float) HD));
             ggml_tensor * aprobs = ggml_soft_max(ctx, scores);           // softmax over n_text
@@ -1828,8 +1837,18 @@ bool load_model_gguf_mtl(const std::string & path,
         // kv_layer_elems * sizeof(float).
         // Fall back to F32 KV if the resolved backend can't run flash
         // attention with the requested quantized/f16 K/V.
-        hp.kv_type = chatterbox_resolve_kv_type(model.backend, kv_type,
-                                                hp.head_dim, hp.n_head, hp.n_kv_head);
+        // QVAC-19557: a quantized (q8_0) KV cache used to SIGABRT on Metal
+        // ("unsupported op 'CONT'").  The cause was NOT flash-attention (which
+        // reads the q8 strided cache fine on Metal) but the per-(layer,head)
+        // alignment probe in build_llama_block, which ggml_cont'd a strided view
+        // of the quantized K cache to feed a mul_mat — and ggml-metal has no CONT
+        // kernel for quantized tensors.  That cont is now a dequantizing
+        // ggml_cast to f32 (Metal-supported), so quantized K/V runs on the GPU.
+        // chatterbox_mtl_resolve_kv_type probes that cast per-backend and falls
+        // back to f32 on any GPU backend that can't encode it (Vulkan coopmat2 is
+        // separately force-f32'd inside the shared resolve).
+        hp.kv_type = chatterbox_mtl_resolve_kv_type(model.backend, kv_type,
+                                                    hp.head_dim, hp.n_head, hp.n_kv_head);
         ggml_init_params kv_params = { ggml_tensor_overhead() * 4, nullptr, true };
         model.ctx_kv = ggml_init(kv_params);
         const int64_t kv_elements_b2 =
diff --git a/tts-cpp/test/test_eos_roundtrip.cpp b/tts-cpp/test/test_eos_roundtrip.cpp
@@ -81,6 +81,7 @@ size_t levenshtein(const std::vector<char> & a, const std::vector<char> & b) {
 struct Args {
     std::string tts_cli, t3, s3gen, ref_dir, whisper_cli, whisper_model;
     std::string lang = "en";
+    std::string kv_cache_type;  // "" -> tts-cli default (f32); "q8_0"/"f16" stress the quantized-KV align path
     std::string tmp = "/tmp";
     int    gpu_layers = 0;
     int    seed = 0;
@@ -107,6 +108,7 @@ bool parse_args(int argc, char ** argv, Args & a) {
         else if (k == "--seed")          { auto v = val(k.c_str()); if (!v) return false; a.seed = std::atoi(v); }
         else if (k == "--max-cer")       { auto v = val(k.c_str()); if (!v) return false; a.max_cer = std::atof(v); }
         else if (k == "--max-ramble")    { auto v = val(k.c_str()); if (!v) return false; a.max_ramble = std::atof(v); }
+        else if (k == "--kv-cache-type") { auto v = val(k.c_str()); if (!v) return false; a.kv_cache_type = v; }
         else { fprintf(stderr, "unknown arg: %s\n", k.c_str()); return false; }
     }
     return !a.tts_cli.empty() && !a.t3.empty() && !a.s3gen.empty() &&
@@ -136,6 +138,7 @@ int synth(const Args & a, const std::string & text, const std::string & wav) {
         + " --threads 16";
     if (!a.ref_dir.empty()) cmd += " --ref-dir " + sh_quote(a.ref_dir);
     if (a.gpu_layers > 0)   cmd += " --n-gpu-layers " + std::to_string(a.gpu_layers);
+    if (!a.kv_cache_type.empty()) cmd += " --kv-cache-type " + sh_quote(a.kv_cache_type);
     cmd += " >/dev/null 2>&1";
     return std::system(cmd.c_str());
 }
diff --git a/tts-cpp/test/test_kv_cache_type.cpp b/tts-cpp/test/test_kv_cache_type.cpp
@@ -66,6 +66,30 @@ int main() {
     CHECK(chatterbox_resolve_kv_type(cpu, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
               == GGML_TYPE_Q8_0, "cpu retains q8_0 KV");
 
+    // ---- MTL resolve (QVAC-19557): also probes the align-probe cast(q8->f32) ----
+    // The CPU backend supports the strided q8->f32 cast, so q8 is retained; a
+    // backend lacking that cast kernel would be downgraded to f32 (the branch
+    // that stops the single-backend MTL graph SIGABRT'ing at compute).  f32
+    // requests are unaffected.
+    CHECK(chatterbox_mtl_resolve_kv_type(cpu, GGML_TYPE_F32, head_dim, n_head, n_kv_head)
+              == GGML_TYPE_F32, "mtl resolve: f32 stays f32 on cpu");
+    CHECK(chatterbox_mtl_resolve_kv_type(cpu, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
+              == GGML_TYPE_Q8_0, "mtl resolve: cpu retains q8_0 (supports the cast)");
+    CHECK(chatterbox_mtl_resolve_kv_type(nullptr, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
+              == GGML_TYPE_F32, "mtl resolve: null backend -> f32");
+
+    // The cross-backend safety net — quantized + backend-can't-encode-the-cast ->
+    // f32 — can't be reached with a CPU/Metal backend (both encode the cast), so
+    // exercise the pure decision directly to cover the OpenCL/Adreno/Mali case.
+    CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_Q8_0, /*cast_supported=*/false)
+              == GGML_TYPE_F32, "mtl cast-fallback: q8 + no cast -> f32 (the non-Metal-GPU net)");
+    CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_Q8_0, /*cast_supported=*/true)
+              == GGML_TYPE_Q8_0, "mtl cast-fallback: q8 + cast supported -> q8");
+    CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_F16, /*cast_supported=*/false)
+              == GGML_TYPE_F16, "mtl cast-fallback: f16 (non-quantized) unaffected");
+    CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_F32, /*cast_supported=*/false)
+              == GGML_TYPE_F32, "mtl cast-fallback: f32 unaffected");
+
     ggml_backend_free(cpu);
 
     if (g_failures) {
diff --git a/tts-cpp/test/test_metal_ops.cpp b/tts-cpp/test/test_metal_ops.cpp
@@ -335,6 +335,73 @@ static int test_mul_mm_fused(ggml_backend_t cpu, ggml_backend_t gpu,
     return 1;
 }
 
+// QVAC-19557: regression sentinel for the MTL Metal q8-KV SIGABRT.  With a
+// quantized KV cache, the multilingual Chatterbox variant's per-(layer,head)
+// alignment probe (build_llama_block) read a strided view of the q8 K cache and
+// CONT'd it to feed a mul_mat.  ggml-metal has no CONT kernel for quantized
+// tensors, so that op is unsupported on Metal — and because the MTL path runs a
+// single-backend graph_compute (no scheduler fallback) it crashed at encode
+// time.  The fix replaced that ggml_cont with a dequantizing ggml_cast to f32.
+// This test pins the two ggml facts the fix depends on:
+//   1. CONT(q8_0 strided) is STILL unsupported on Metal — i.e. the plain cont we
+//      removed really would crash (if this ever flips, the cast can become a
+//      cheaper cont again).
+//   2. CAST(q8_0 strided -> f32) IS supported on Metal — the op the fix relies
+//      on.  If this ever regresses, the align probe would crash again, so the
+//      test must fail loudly.
+// CPU must support both (the MTL variant also runs on CPU).
+static int test_quantized_cont_unsupported(ggml_backend_t cpu, ggml_backend_t gpu) {
+    fprintf(stderr, "[quantized_cont] ");
+    // Mirror the REAL op exactly: the align probe (build_llama_block) and the
+    // chatterbox_mtl_resolve_kv_type probe both build a strided 2D
+    // [head_dim, n_text] view of the token-major K cache, stride = one token row
+    // (ggml_row_size(t, head_dim * n_kv_head)).  Keep all three shapes identical.
+    auto make_view = [](ggml_context * ctx, ggml_type t) {
+        const int HD = 64, NKV = 16;
+        const size_t tok_row = ggml_row_size(t, (size_t) HD * NKV);
+        ggml_tensor * cache = ggml_new_tensor_1d(ctx, t, (int64_t) HD * NKV * 8);
+        return ggml_view_2d(ctx, cache, HD, 4, tok_row, 0);
+    };
+    auto supports_cont = [&](ggml_backend_t b, ggml_type t) {
+        ggml_init_params p = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
+        ggml_context * ctx = ggml_init(p);
+        bool sup = ggml_backend_supports_op(b, ggml_cont(ctx, make_view(ctx, t)));
+        ggml_free(ctx);
+        return sup;
+    };
+    auto supports_cast_f32 = [&](ggml_backend_t b, ggml_type t) {
+        ggml_init_params p = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
+        ggml_context * ctx = ggml_init(p);
+        bool sup = ggml_backend_supports_op(b, ggml_cast(ctx, make_view(ctx, t), GGML_TYPE_F32));
+        ggml_free(ctx);
+        return sup;
+    };
+    int fails = 0;
+    if (supports_cont(gpu, GGML_TYPE_Q8_0)) {
+        fprintf(stderr, "\n  NOTE: Metal now advertises CONT(q8_0) — the align-probe cast "
+                        "could be simplified back to a cont (not a failure, but revisit)\n");
+        // informational only; not a hard failure
+    }
+    if (!supports_cast_f32(gpu, GGML_TYPE_Q8_0)) {
+        fprintf(stderr, "\n  FAIL: Metal CAST(q8_0 strided -> f32) unsupported — the align-probe "
+                        "dequant fix (build_llama_block) would SIGABRT again\n");
+        ++fails;
+    }
+    if (!supports_cast_f32(cpu, GGML_TYPE_Q8_0)) {
+        fprintf(stderr, "\n  FAIL: CPU CAST(q8_0 strided -> f32) unsupported — MTL on CPU would break\n");
+        ++fails;
+    }
+    if (!supports_cont(cpu, GGML_TYPE_Q8_0)) {
+        fprintf(stderr, "\n  FAIL: CPU CONT(q8_0) unsupported (unexpected)\n");
+        ++fails;
+    }
+    if (!fails) {
+        fprintf(stderr, "ok (Metal CAST(q8_0->f32) supported; the align-probe dequant fix holds)\n");
+        return 0;
+    }
+    return 1;
+}
+
 int main() {
     ggml_backend_t cpu = ggml_backend_cpu_init();
     if (!cpu) { fprintf(stderr, "CPU backend init failed\n"); return 1; }
@@ -350,6 +417,7 @@ int main() {
     }
 
     int rc = 0;
+    rc |= test_quantized_cont_unsupported(cpu, gpu);
     rc |= test_diag_mask_inf(cpu, gpu);
     rc |= test_pad_ext(cpu, gpu);
     // HiFT-sized shapes:
diff --git a/tts-cpp/test/test_multilingual_synth.cpp b/tts-cpp/test/test_multilingual_synth.cpp