Commit 12749a0

ogad-tether and claude committed
tts-cpp: chatterbox-mtl — run quantized KV on the GPU (dequant align-probe cast)
The multilingual (MTL) Chatterbox variant SIGABRTs during synthesis on the Metal backend with a quantized (q8_0) KV cache:

    ggml_metal_op_encode_impl: error: unsupported op 'CONT'   (in eval_step_mtl)

Root cause (it was NOT flash-attention): flash_attn_ext reads the q8 strided K/V cache fine on Metal. Walking the decode graph showed the only Metal-unsupported op was the per-(layer,head) alignment probe in build_llama_block, which ggml_cont'd a strided view of the q8 K cache to feed a mul_mat. ggml-metal has no CONT kernel for quantized tensors, and the MTL path runs a single-backend graph_compute (no scheduler fallback), so it crashed at encode time. The capability probe in chatterbox_resolve_kv_type only validates flash_attn_ext, not this CONT, which is why ggml-org#2527 (q8 KV as the default) shipped a broken MTL Metal path undetected.

Fix: replace that same-type ggml_cont with a dequantizing ggml_cast(...->f32). Metal supports a dequantizing copy of a strided quantized view (verified against ggml-metal source: Q8_0->F32 routes through the supported CPY path, with no contiguity check, whereas the old q8->q8 cont hit `default: return false`). For an f32/f16 cache the cast degrades to a cheap cont/upcast. This recovers q8 KV on the GPU: pure memory savings, no compute cost (ggml-metal's flash-attn runs its matmul at f16 internally regardless of KV storage dtype).

Cross-backend safety net: removing the blanket "f32 on any non-CPU backend" guard exposes q8 KV to all GPU backends. chatterbox_mtl_resolve_kv_type now probes the exact align-probe cast op per-backend, instead of doing a name/type check, and falls back to f32 when the backend can't encode it (e.g. thin-op OpenCL/Adreno or Mali-Vulkan builds). Vulkan quantized K/V stays force-f32'd in chatterbox_resolve_kv_type (coopmat2). The pure decision is factored into chatterbox_mtl_kv_type_for_cast_support so the fallback branch is unit-testable.

Tests:
- test_kv_cache_type: chatterbox_mtl_resolve_kv_type pass-through + the cross-backend fallback branch (cast unsupported -> f32) via the pure helper.
- test_metal_ops (gpu): CAST(q8_0 strided -> f32) is supported on Metal and CONT(q8_0) is not. It uses the same 2D strided shape as the align probe and the resolve probe, so the sentinel mirrors the real op.
- test_multilingual_synth: --kv-cache-type passthrough + mtl-synth-q8-<lang> ctest variants (en/ar/ru/hi). Missing GGUFs -> SKIP (77), not fail.
- test-eos-roundtrip-q8-kv: CER/ramble round-trip under a q8_0 KV cache to catch alignment/EOS drift from the dequant (WAV-sanity tests can't see it).

Validated on macOS Metal (M2) and on-device iOS (iPhone 17 Pro Max, A19 Pro Metal): q8 MTL synthesizes across es/fr/de/pt with no CONT crash; q8-vs-f32 perf is within run-to-run noise (the change is perf-neutral on the f32/f16 paths and q8 KV is no slower than f16).

Refs QVAC-19557

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
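To make the "dequantizing copy of a strided quantized view" concrete, here is a minimal, self-contained sketch of what such a copy computes. It is not ggml code: `BlockQ8`, `kBlock` and `dequant_strided` are hypothetical names, and the scale is stored as a plain `float` for portability (ggml's q8_0 block stores it as fp16). The point is only that each output element is `scale * quant`, read through a row stride and written to a packed f32 buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for a q8_0 block: 32 int8 quants sharing one scale.
// ASSUMPTION: the scale is a float here; ggml stores it as fp16.
constexpr int kBlock = 32;

struct BlockQ8 {
    float  d;           // per-block scale
    int8_t qs[kBlock];  // quantized values
};

// Dequantize `rows` rows of `cols` elements from a strided block array into a
// packed (contiguous) f32 buffer. `row_stride_blocks` is the distance between
// the starts of consecutive rows, measured in blocks; `cols` must be a
// multiple of kBlock so that every row starts on a block boundary.
std::vector<float> dequant_strided(const BlockQ8 * src, int cols, int rows,
                                   size_t row_stride_blocks) {
    std::vector<float> out;
    out.reserve((size_t) cols * rows);
    for (int r = 0; r < rows; ++r) {
        const BlockQ8 * row = src + (size_t) r * row_stride_blocks;
        for (int c = 0; c < cols; ++c) {
            const BlockQ8 & b = row[c / kBlock];
            out.push_back(b.d * (float) b.qs[c % kBlock]);
        }
    }
    return out;
}
```

With a cache of two token rows of two blocks each, calling `dequant_strided(cache, 32, 2, 2)` picks the first block of every row, which is the same kind of "one head out of a token-major row" slice the alignment probe reads.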
1 parent 4c8767a commit 12749a0

8 files changed

Lines changed: 262 additions & 5 deletions

tts-cpp/CMakeLists.txt

Lines changed: 58 additions & 0 deletions
@@ -733,6 +733,7 @@ if (TTS_CPP_BUILD_TESTS)
         )
         set_tests_properties(mtl-synth-${_lang} PROPERTIES
             LABELS "multilingual;mtl-${_tier}"
+            SKIP_RETURN_CODE 77
             TIMEOUT 180
         )
         if (_mtl_asr_enabled)
@@ -754,6 +755,44 @@ if (TTS_CPP_BUILD_TESTS)
         endif()
     endforeach()

+    # QVAC-19557: q8_0 KV-cache regression for the MTL variant. A quantized KV
+    # cache used to SIGABRT on Metal ("unsupported op 'CONT'") via the alignment
+    # probe; the dequantizing-cast fix recovers q8 KV on the GPU. These run the
+    # full synth with --kv-cache-type q8_0 and validate the WAV is real audio
+    # (RMS/peak/clipping/duration) — an end-to-end regression the op-level
+    # test_metal_ops sentinel can't cover. A diverse script subset
+    # (Latin/Arabic/Cyrillic/Devanagari) doubles as cross-language coverage.
+    # IMPORTANT: these only exercise the *original crash path* on a Metal/GPU
+    # fleet — on a CPU-only runner q8 KV never crashed, so q8-vs-f32 there only
+    # checks the dequant-cast is numerically sane, not the SIGABRT. Labelled
+    # `mtl-q8;gpu` so CI selects them on a Metal runner (`ctest -L mtl-q8`); the
+    # alignment-drift regression (CER) is covered by test-eos-roundtrip-q8-kv
+    # below. Missing GGUFs -> SKIP (77), not fail, like the f32 tests above.
+    set(_mtl_q8_phrases
+        "en|Hello, this is a multilingual text-to-speech test."
+        "ar|مرحبًا، هذا اختبار متعدد اللغات لتحويل النص إلى كلام."
+        "ru|Привет, это многоязычный тест синтеза речи."
+        "hi|नमस्ते, यह एक बहुभाषी वाक् संश्लेषण परीक्षण है।"
+    )
+    foreach(_entry IN LISTS _mtl_q8_phrases)
+        string(REPLACE "|" ";" _parts "${_entry}")
+        list(GET _parts 0 _lang)
+        list(GET _parts 1 _text)
+        add_test(
+            NAME mtl-synth-q8-${_lang}
+            COMMAND $<TARGET_FILE:test-multilingual-synth>
+                --lang "${_lang}"
+                --text "${_text}"
+                --kv-cache-type q8_0
+                --out "${_mtl_out_dir}/${_lang}-q8.wav"
+        )
+        set_tests_properties(mtl-synth-q8-${_lang} PROPERTIES
+            LABELS "multilingual;mtl-q8;gpu"
+            SKIP_RETURN_CODE 77
+            TIMEOUT 180
+        )
+    endforeach()
+
     # End-to-end EOS round-trip regression. Drives tts-cli to
     # synthesize a set of English phrases, transcribes with whisper-cli, and
     # asserts the transcription is close to the input (CER guard -> catches
@@ -790,6 +829,25 @@ if (TTS_CPP_BUILD_TESTS)
                  --lang "en"
                  --tmp "${_mtl_out_dir}"
             REQUIRES "${_tcb_t3_mtl_q4_gguf}" "${_tcb_s3gen_mtl_gguf}")
+
+        # QVAC-19557: same EOS round-trip but with a q8_0 KV CACHE (not a quantized
+        # T3). The align-probe fix feeds the alignment mul_mat *dequantized* q8
+        # keys (ggml_cast), which could shift the argmax/softmax over text columns
+        # and thus EOS timing — something the WAV-sanity mtl-synth-q8-* tests can't
+        # see. This drives the CER/ramble guards under q8 KV to catch alignment
+        # drift (clipping / early-cutoff / rambling) vs the f32 baseline above.
+        tts_cpp_register_test(test-eos-roundtrip-q8-kv
+            EXE test-eos-roundtrip
+            LABEL "fixture;asr"
+            ARGS --tts-cli "$<TARGET_FILE:tts-cli>"
+                 --t3 "${_tcb_t3_mtl_gguf}"
+                 --s3gen "${_tcb_s3gen_mtl_gguf}"
+                 --whisper-cli "${WHISPER_CLI}"
+                 --whisper-model "${WHISPER_MODEL}"
+                 --lang "en"
+                 --kv-cache-type q8_0
+                 --tmp "${_mtl_out_dir}"
+            REQUIRES "${_tcb_t3_mtl_gguf}" "${_tcb_s3gen_mtl_gguf}")
     else()
         message(STATUS "tts-cpp: test-eos-roundtrip built but not registered (needs whisper-cli + model + tts-cli)")
     endif()
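
The `SKIP_RETURN_CODE 77` property only helps if the test executable actually exits with 77 when its model files are absent. The commit does not show that part of test-multilingual-synth, so the following is only a sketch of the convention under that assumption; `kExitSkip` and `check_required_files` are hypothetical names, not functions from this repository.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Exit code that CTest treats as "skipped" when SKIP_RETURN_CODE 77 is set.
constexpr int kExitSkip = 77;

// Hypothetical helper: returns kExitSkip if any required file cannot be
// opened for reading, otherwise 0 so the caller can go on to the real test.
int check_required_files(const std::vector<std::string> & paths) {
    for (const std::string & p : paths) {
        std::ifstream f(p, std::ios::binary);
        if (!f.good()) {
            return kExitSkip;  // missing model -> report SKIP, not FAIL
        }
    }
    return 0;
}
```

A test's `main` would call this first with its GGUF paths and `return` the result immediately when it is non-zero, so a runner without the models reports the test as skipped instead of failed.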

tts-cpp/src/chatterbox_t3_internal.h

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -119,6 +119,21 @@ ggml_type chatterbox_kv_type_from_str(const std::string & s);
119119
ggml_type chatterbox_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
120120
int head_dim, int n_head, int n_kv_head);
121121

122+
// MTL-variant resolve: chatterbox_resolve_kv_type plus a probe of the extra
123+
// quantized-cache op the multilingual decode graph emits — the alignment
124+
// probe's dequantizing cast of a strided q8 K-cache view to f32
125+
// (build_llama_block). Returns f32 when the backend can't encode that cast, so
126+
// q8 KV stays enabled on backends that support it (Metal) and safely degrades on
127+
// those that don't, without the single-backend MTL graph SIGABRT'ing at compute.
128+
ggml_type chatterbox_mtl_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
129+
int head_dim, int n_head, int n_kv_head);
130+
131+
// Pure decision behind chatterbox_mtl_resolve_kv_type's cross-backend safety net,
132+
// exposed for unit testing: returns GGML_TYPE_F32 when `resolved` is quantized
133+
// and the backend cannot encode the align-probe's strided q8->f32 cast, else
134+
// returns `resolved` unchanged.
135+
ggml_type chatterbox_mtl_kv_type_for_cast_support(ggml_type resolved, bool cast_supported);
136+
122137
struct gpt2_layer {
123138
ggml_tensor * ln_1_g = nullptr;
124139
ggml_tensor * ln_1_b = nullptr;

tts-cpp/src/main.cpp

Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -402,6 +402,54 @@ ggml_type chatterbox_resolve_kv_type(ggml_backend_t backend, ggml_type requested
402402
return requested;
403403
}
404404

405+
ggml_type chatterbox_mtl_kv_type_for_cast_support(ggml_type resolved, bool cast_supported) {
406+
// Pure decision split out of chatterbox_mtl_resolve_kv_type so the
407+
// cross-backend safety net is unit-testable without a GPU that actually
408+
// lacks the cast: a quantized KV type is downgraded to f32 when the backend
409+
// cannot encode the align-probe's strided q8->f32 cast; non-quantized types
410+
// (and the cast-supported case) pass through unchanged.
411+
if (ggml_is_quantized(resolved) && !cast_supported) return GGML_TYPE_F32;
412+
return resolved;
413+
}
414+
415+
ggml_type chatterbox_mtl_resolve_kv_type(ggml_backend_t backend, ggml_type requested,
416+
int head_dim, int n_head, int n_kv_head) {
417+
// Start from the shared resolve (flash_attn_ext probe + Vulkan coopmat2
418+
// force-f32). The MTL decode graph emits one MORE quantized-cache op the
419+
// shared probe doesn't cover: the per-(layer,head) alignment probe
420+
// dequantizes a STRIDED view of the quantized K cache via ggml_cast(...->f32)
421+
// (build_llama_block). ggml-metal supports that cast (which is why q8 KV now
422+
// runs on Metal), but a GPU backend with thinner op coverage
423+
// (e.g. some OpenCL/Adreno or Mali-Vulkan builds) can advertise q8 flash-attn
424+
// yet be unable to encode the strided q8->f32 cast — and the MTL path runs a
425+
// single-backend graph_compute with no scheduler fallback, so that would
426+
// SIGABRT at compute. Probe the cast op directly and fall back to f32 when
427+
// the backend can't encode it, instead of the old blanket "force f32 on any
428+
// non-CPU backend" guard (which also blocked Metal, the whole bug).
429+
ggml_type t = chatterbox_resolve_kv_type(backend, requested, head_dim, n_head, n_kv_head);
430+
if (!ggml_is_quantized(t) || !backend) return t;
431+
432+
bool cast_ok = false;
433+
ggml_init_params pp = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
434+
if (ggml_context * pc = ggml_init(pp)) {
435+
// Mirror the align probe: a strided [head_dim, k] view of the token-major
436+
// q8 cache, cast to f32. Strides come from ggml_row_size so the view is
437+
// block-aligned exactly as build_llama_block builds it.
438+
const size_t tok_row = ggml_row_size(t, (size_t) head_dim * n_kv_head);
439+
ggml_tensor * cache = ggml_new_tensor_1d(pc, t, (int64_t) head_dim * n_kv_head * 8);
440+
ggml_tensor * view = ggml_view_2d(pc, cache, head_dim, 4, tok_row, 0);
441+
ggml_tensor * cast = ggml_cast(pc, view, GGML_TYPE_F32);
442+
cast_ok = (cast != nullptr) && ggml_backend_supports_op(backend, cast);
443+
ggml_free(pc);
444+
}
445+
const ggml_type eff = chatterbox_mtl_kv_type_for_cast_support(t, cast_ok);
446+
if (eff != t) {
447+
fprintf(stderr, "chatterbox(mtl): backend cannot encode the quantized-KV alignment "
448+
"cast (%s strided -> f32); using f32 KV cache\n", ggml_type_name(t));
449+
}
450+
return eff;
451+
}
452+
405453
bool load_model_gguf(const std::string & path, chatterbox_model & model, int requested_ctx, int n_gpu_layers, ggml_type kv_type) {
406454
{
407455
gguf_init_params peek_params = { /*.no_alloc=*/ true, /*.ctx=*/ nullptr };
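
The probe's token-row stride comes from `ggml_row_size`, which for a block-quantized type is "bytes per block times number of blocks". As a worked example of that arithmetic, the sketch below hard-codes the q8_0 layout (32 int8 quants plus one 2-byte fp16 scale, 34 bytes per block) rather than calling ggml; `q8_row_bytes` and `f32_row_bytes` are hypothetical helpers, and the 64 x 16 shape is the `HD = 64, NKV = 16` shape used in test_metal_ops.cpp, not necessarily the real model's.

```cpp
#include <cstddef>

// ASSUMPTION: q8_0 layout as in ggml: 32 int8 quants + one 2-byte fp16 scale.
constexpr size_t kQ8BlockElems = 32;
constexpr size_t kQ8BlockBytes = 34;

// Bytes occupied by one row of `n_elems` q8_0 values.
// `n_elems` must be a multiple of the block size (32).
constexpr size_t q8_row_bytes(size_t n_elems) {
    return n_elems / kQ8BlockElems * kQ8BlockBytes;
}

// Bytes occupied by the same row stored as f32.
constexpr size_t f32_row_bytes(size_t n_elems) {
    return n_elems * sizeof(float);
}

// One token row of head_dim * n_kv_head = 64 * 16 = 1024 elements:
// 32 blocks * 34 bytes = 1088 bytes as q8_0, versus 4096 bytes as f32.
static_assert(q8_row_bytes(64 * 16) == 1088, "q8_0 token row");
static_assert(f32_row_bytes(64 * 16) == 4096, "f32 token row");
```

Because a head of 64 elements is exactly two q8_0 blocks (68 bytes), a per-head view that starts at a multiple of the head size stays block-aligned, which is what lets the probe and the real graph address the cache with plain byte strides.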

tts-cpp/src/t3_mtl.cpp

Lines changed: 22 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -714,7 +714,16 @@ ggml_tensor * build_llama_block(ggml_context * ctx, ggml_cgraph * gf,
714714
layer_off + (size_t) probe_head * kv_head_row
715715
+ (size_t) ti * kv_tok_row);
716716
}
717-
k_text = ggml_cont(ctx, k_text);
717+
// Dequantize-and-pack the text-key slice to contiguous f32 for the
718+
// mul_mat. When the KV cache is quantized (q8_0), a plain ggml_cont
719+
// would produce a quantized CONT, which ggml-metal can't encode (no
720+
// quantized CONT kernel) — it SIGABRTs the whole decode. ggml_cast
721+
// to f32 is a dequantizing copy that Metal *does* support for a
722+
// strided quantized view (QVAC-19557), and for an f32/f16 cache it
723+
// degrades to a cheap cont/upcast. This slice is tiny (HD × n_text
724+
// for one head) and off the hot path, so the dequant cost is
725+
// negligible.
726+
k_text = ggml_cast(ctx, k_text, GGML_TYPE_F32);
718727
ggml_tensor * scores = ggml_mul_mat(ctx, k_text, q_h); // (n_text, N)
719728
scores = ggml_scale(ctx, scores, 1.0f / std::sqrt((float) HD));
720729
ggml_tensor * aprobs = ggml_soft_max(ctx, scores); // softmax over n_text
@@ -1828,8 +1837,18 @@ bool load_model_gguf_mtl(const std::string & path,
18281837
// kv_layer_elems * sizeof(float).
18291838
// Fall back to F32 KV if the resolved backend can't run flash
18301839
// attention with the requested quantized/f16 K/V.
1831-
hp.kv_type = chatterbox_resolve_kv_type(model.backend, kv_type,
1832-
hp.head_dim, hp.n_head, hp.n_kv_head);
1840+
// QVAC-19557: a quantized (q8_0) KV cache used to SIGABRT on Metal
1841+
// ("unsupported op 'CONT'"). The cause was NOT flash-attention (which
1842+
// reads the q8 strided cache fine on Metal) but the per-(layer,head)
1843+
// alignment probe in build_llama_block, which ggml_cont'd a strided view
1844+
// of the quantized K cache to feed a mul_mat — and ggml-metal has no CONT
1845+
// kernel for quantized tensors. That cont is now a dequantizing
1846+
// ggml_cast to f32 (Metal-supported), so quantized K/V runs on the GPU.
1847+
// chatterbox_mtl_resolve_kv_type probes that cast per-backend and falls
1848+
// back to f32 on any GPU backend that can't encode it (Vulkan coopmat2 is
1849+
// separately force-f32'd inside the shared resolve).
1850+
hp.kv_type = chatterbox_mtl_resolve_kv_type(model.backend, kv_type,
1851+
hp.head_dim, hp.n_head, hp.n_kv_head);
18331852
ggml_init_params kv_params = { ggml_tensor_overhead() * 4, nullptr, true };
18341853
model.ctx_kv = ggml_init(kv_params);
18351854
const int64_t kv_elements_b2 =
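
For readers who want to see what the alignment probe computes once the keys are in f32, here is a plain C++ sketch of the three graph ops in the first hunk (mul_mat, scale by 1/sqrt(HD), softmax over the text positions) for a single query vector. It is an illustration only: `align_probs` is a hypothetical function, it works on `std::vector<float>` instead of ggml tensors, and it handles one query column where the real graph handles N.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// k_text: n_text packed rows of `hd` floats (the dequantized text keys).
// q:      one query vector of `hd` floats.
// Returns softmax(k_text * q / sqrt(hd)) over the n_text positions.
std::vector<float> align_probs(const std::vector<float> & k_text,
                               const std::vector<float> & q, int hd) {
    const size_t n_text = k_text.size() / (size_t) hd;
    const float  scale  = 1.0f / std::sqrt((float) hd);
    std::vector<float> p(n_text);

    // Scaled dot products, tracking the maximum for a numerically stable softmax.
    float mx = -std::numeric_limits<float>::infinity();
    for (size_t t = 0; t < n_text; ++t) {
        float dot = 0.0f;
        for (int i = 0; i < hd; ++i) dot += k_text[t * hd + i] * q[i];
        p[t] = dot * scale;
        mx = std::max(mx, p[t]);
    }

    // Softmax over the text positions.
    float sum = 0.0f;
    for (size_t t = 0; t < n_text; ++t) { p[t] = std::exp(p[t] - mx); sum += p[t]; }
    for (size_t t = 0; t < n_text; ++t) p[t] /= sum;
    return p;
}
```

Since the probabilities depend on the key values, feeding dequantized q8 keys instead of f32 keys can shift them slightly, which is the drift the q8 EOS round-trip test is meant to catch.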

tts-cpp/test/test_eos_roundtrip.cpp

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,7 @@ size_t levenshtein(const std::vector<char> & a, const std::vector<char> & b) {
8181
struct Args {
8282
std::string tts_cli, t3, s3gen, ref_dir, whisper_cli, whisper_model;
8383
std::string lang = "en";
84+
std::string kv_cache_type; // "" -> tts-cli default (f32); "q8_0"/"f16" stress the quantized-KV align path
8485
std::string tmp = "/tmp";
8586
int gpu_layers = 0;
8687
int seed = 0;
@@ -107,6 +108,7 @@ bool parse_args(int argc, char ** argv, Args & a) {
107108
else if (k == "--seed") { auto v = val(k.c_str()); if (!v) return false; a.seed = std::atoi(v); }
108109
else if (k == "--max-cer") { auto v = val(k.c_str()); if (!v) return false; a.max_cer = std::atof(v); }
109110
else if (k == "--max-ramble") { auto v = val(k.c_str()); if (!v) return false; a.max_ramble = std::atof(v); }
111+
else if (k == "--kv-cache-type") { auto v = val(k.c_str()); if (!v) return false; a.kv_cache_type = v; }
110112
else { fprintf(stderr, "unknown arg: %s\n", k.c_str()); return false; }
111113
}
112114
return !a.tts_cli.empty() && !a.t3.empty() && !a.s3gen.empty() &&
@@ -136,6 +138,7 @@ int synth(const Args & a, const std::string & text, const std::string & wav) {
136138
+ " --threads 16";
137139
if (!a.ref_dir.empty()) cmd += " --ref-dir " + sh_quote(a.ref_dir);
138140
if (a.gpu_layers > 0) cmd += " --n-gpu-layers " + std::to_string(a.gpu_layers);
141+
if (!a.kv_cache_type.empty()) cmd += " --kv-cache-type " + sh_quote(a.kv_cache_type);
139142
cmd += " >/dev/null 2>&1";
140143
return std::system(cmd.c_str());
141144
}
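
The CER guard this test relies on is a character error rate built on an edit distance. The diff only shows the signature of the file's own `levenshtein`, so the sketch below is an independent illustration rather than the test's implementation: `edit_distance` and `cer` are hypothetical names, and normalising by the reference length is an assumption about how the guard is defined.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance with unit costs, using two rolling rows.
size_t edit_distance(const std::string & a, const std::string & b) {
    std::vector<size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (size_t j = 1; j <= b.size(); ++j) {
            const size_t sub = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min(std::min(prev[j] + 1, cur[j - 1] + 1), sub);
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Character error rate of a transcription `hyp` against the input text `ref`.
// ASSUMPTION: normalised by the reference length; an empty reference yields
// 0 when the hypothesis is also empty and 1 otherwise.
double cer(const std::string & ref, const std::string & hyp) {
    if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;
    return (double) edit_distance(ref, hyp) / (double) ref.size();
}
```

A run would pass when `cer(input_text, transcription)` stays at or below the `--max-cer` threshold; an early cutoff or a rambling tail both show up as a large edit distance.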

tts-cpp/test/test_kv_cache_type.cpp

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,30 @@ int main() {
6666
CHECK(chatterbox_resolve_kv_type(cpu, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
6767
== GGML_TYPE_Q8_0, "cpu retains q8_0 KV");
6868

69+
// ---- MTL resolve (QVAC-19557): also probes the align-probe cast(q8->f32) ----
70+
// The CPU backend supports the strided q8->f32 cast, so q8 is retained; a
71+
// backend lacking that cast kernel would be downgraded to f32 (the branch
72+
// that stops the single-backend MTL graph SIGABRT'ing at compute). f32
73+
// requests are unaffected.
74+
CHECK(chatterbox_mtl_resolve_kv_type(cpu, GGML_TYPE_F32, head_dim, n_head, n_kv_head)
75+
== GGML_TYPE_F32, "mtl resolve: f32 stays f32 on cpu");
76+
CHECK(chatterbox_mtl_resolve_kv_type(cpu, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
77+
== GGML_TYPE_Q8_0, "mtl resolve: cpu retains q8_0 (supports the cast)");
78+
CHECK(chatterbox_mtl_resolve_kv_type(nullptr, GGML_TYPE_Q8_0, head_dim, n_head, n_kv_head)
79+
== GGML_TYPE_F32, "mtl resolve: null backend -> f32");
80+
81+
// The cross-backend safety net — quantized + backend-can't-encode-the-cast ->
82+
// f32 — can't be reached with a CPU/Metal backend (both encode the cast), so
83+
// exercise the pure decision directly to cover the OpenCL/Adreno/Mali case.
84+
CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_Q8_0, /*cast_supported=*/false)
85+
== GGML_TYPE_F32, "mtl cast-fallback: q8 + no cast -> f32 (the non-Metal-GPU net)");
86+
CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_Q8_0, /*cast_supported=*/true)
87+
== GGML_TYPE_Q8_0, "mtl cast-fallback: q8 + cast supported -> q8");
88+
CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_F16, /*cast_supported=*/false)
89+
== GGML_TYPE_F16, "mtl cast-fallback: f16 (non-quantized) unaffected");
90+
CHECK(chatterbox_mtl_kv_type_for_cast_support(GGML_TYPE_F32, /*cast_supported=*/false)
91+
== GGML_TYPE_F32, "mtl cast-fallback: f32 unaffected");
92+
6993
ggml_backend_free(cpu);
7094

7195
if (g_failures) {

tts-cpp/test/test_metal_ops.cpp

Lines changed: 68 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -335,6 +335,73 @@ static int test_mul_mm_fused(ggml_backend_t cpu, ggml_backend_t gpu,
335335
return 1;
336336
}
337337

338+
// QVAC-19557: regression sentinel for the MTL Metal q8-KV SIGABRT. With a
339+
// quantized KV cache, the multilingual Chatterbox variant's per-(layer,head)
340+
// alignment probe (build_llama_block) read a strided view of the q8 K cache and
341+
// CONT'd it to feed a mul_mat. ggml-metal has no CONT kernel for quantized
342+
// tensors, so that op is unsupported on Metal — and because the MTL path runs a
343+
// single-backend graph_compute (no scheduler fallback) it crashed at encode
344+
// time. The fix replaced that ggml_cont with a dequantizing ggml_cast to f32.
345+
// This test pins the two ggml facts the fix depends on:
346+
// 1. CONT(q8_0 strided) is STILL unsupported on Metal — i.e. the plain cont we
347+
// removed really would crash (if this ever flips, the cast can become a
348+
// cheaper cont again).
349+
// 2. CAST(q8_0 strided -> f32) IS supported on Metal — the op the fix relies
350+
// on. If this ever regresses, the align probe would crash again, so the
351+
// test must fail loudly.
352+
// CPU must support both (the MTL variant also runs on CPU).
353+
static int test_quantized_cont_unsupported(ggml_backend_t cpu, ggml_backend_t gpu) {
354+
fprintf(stderr, "[quantized_cont] ");
355+
// Mirror the REAL op exactly: the align probe (build_llama_block) and the
356+
// chatterbox_mtl_resolve_kv_type probe both build a strided 2D
357+
// [head_dim, n_text] view of the token-major K cache, stride = one token row
358+
// (ggml_row_size(t, head_dim * n_kv_head)). Keep all three shapes identical.
359+
auto make_view = [](ggml_context * ctx, ggml_type t) {
360+
const int HD = 64, NKV = 16;
361+
const size_t tok_row = ggml_row_size(t, (size_t) HD * NKV);
362+
ggml_tensor * cache = ggml_new_tensor_1d(ctx, t, (int64_t) HD * NKV * 8);
363+
return ggml_view_2d(ctx, cache, HD, 4, tok_row, 0);
364+
};
365+
auto supports_cont = [&](ggml_backend_t b, ggml_type t) {
366+
ggml_init_params p = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
367+
ggml_context * ctx = ggml_init(p);
368+
bool sup = ggml_backend_supports_op(b, ggml_cont(ctx, make_view(ctx, t)));
369+
ggml_free(ctx);
370+
return sup;
371+
};
372+
auto supports_cast_f32 = [&](ggml_backend_t b, ggml_type t) {
373+
ggml_init_params p = { ggml_tensor_overhead() * 8, nullptr, /*no_alloc=*/true };
374+
ggml_context * ctx = ggml_init(p);
375+
bool sup = ggml_backend_supports_op(b, ggml_cast(ctx, make_view(ctx, t), GGML_TYPE_F32));
376+
ggml_free(ctx);
377+
return sup;
378+
};
379+
int fails = 0;
380+
if (supports_cont(gpu, GGML_TYPE_Q8_0)) {
381+
fprintf(stderr, "\n NOTE: Metal now advertises CONT(q8_0) — the align-probe cast "
382+
"could be simplified back to a cont (not a failure, but revisit)\n");
383+
// informational only; not a hard failure
384+
}
385+
if (!supports_cast_f32(gpu, GGML_TYPE_Q8_0)) {
386+
fprintf(stderr, "\n FAIL: Metal CAST(q8_0 strided -> f32) unsupported — the align-probe "
387+
"dequant fix (build_llama_block) would SIGABRT again\n");
388+
++fails;
389+
}
390+
if (!supports_cast_f32(cpu, GGML_TYPE_Q8_0)) {
391+
fprintf(stderr, "\n FAIL: CPU CAST(q8_0 strided -> f32) unsupported — MTL on CPU would break\n");
392+
++fails;
393+
}
394+
if (!supports_cont(cpu, GGML_TYPE_Q8_0)) {
395+
fprintf(stderr, "\n FAIL: CPU CONT(q8_0) unsupported (unexpected)\n");
396+
++fails;
397+
}
398+
if (!fails) {
399+
fprintf(stderr, "ok (Metal CAST(q8_0->f32) supported; the align-probe dequant fix holds)\n");
400+
return 0;
401+
}
402+
return 1;
403+
}
404+
338405
int main() {
339406
ggml_backend_t cpu = ggml_backend_cpu_init();
340407
if (!cpu) { fprintf(stderr, "CPU backend init failed\n"); return 1; }
@@ -350,6 +417,7 @@ int main() {
350417
}
351418

352419
int rc = 0;
420+
rc |= test_quantized_cont_unsupported(cpu, gpu);
353421
rc |= test_diag_mask_inf(cpu, gpu);
354422
rc |= test_pad_ext(cpu, gpu);
355423
// HiFT-sized shapes:
