tts-cpp: chatterbox-mtl — probe the align-cast op per-backend (close guard-removal gap)
An adversarial audit of PR #71 flagged that fully removing chatterbox_mtl_guard_kv_type
also removed the blanket "force f32 on any non-CPU backend" safety net, so a quantized
KV request now reaches every GPU backend for the MTL variant. The shared
chatterbox_resolve_kv_type probes only flash_attn_ext, not the dequantizing
ggml_cast(q8_0 strided -> f32) that the alignment probe emits on every decode step.
A GPU backend with thin op coverage (e.g. some OpenCL/Adreno or Mali-Vulkan builds)
can advertise q8 flash-attn yet be unable to encode that cast. Because the MTL path
runs a single-backend graph_compute (no scheduler fallback), such a backend would
SIGABRT at compute time. In other words, removing the guard could trade the Metal
crash for a crash on another backend.
Fix: chatterbox_mtl_resolve_kv_type wraps the shared resolve and additionally
probes the strided q8->f32 cast via ggml_backend_supports_op, falling back to f32
only when the backend can't encode it. The result is correct per backend: Metal,
which supports the cast (verified), keeps q8 on the GPU, and any backend lacking
the kernel degrades safely to f32 instead of crashing. This replaces the blunt
"non-CPU -> f32" guard, which also blocked Metal (the original bug).
Validated (stock ggml Metal, M2): q8 MTL on Metal still retains q8 (no fallback,
no crash, byte-identical sample count). test_kv_cache_type is extended to cover the
new resolve (cpu retains q8 / null -> f32 / f32 stays f32).
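
For readers without the diff at hand, the decision logic described above can be
sketched roughly as follows. This is a simplified model, not the code in this
commit: KvType, Backend, its two capability flags, and both function names are
hypothetical stand-ins. The real implementation works on ggml types and asks the
backend through ggml_backend_supports_op rather than reading stored booleans, and
"null" is assumed here to mean a null backend handle.

```cpp
// Hypothetical stand-ins for illustration only; the real code uses ggml types.
enum class KvType { F32, Q8_0 };

struct Backend {
    bool supports_q8_flash_attn;   // models what the shared resolve probes
    bool supports_q8_to_f32_cast;  // models the extra MTL-specific probe
};

// Stand-in for the shared resolve: q8 survives only if the backend exists
// and reports flash-attn support for it.
KvType resolve_kv_type(const Backend* backend, KvType requested) {
    if (backend == nullptr) {
        return KvType::F32;  // assumed behaviour for a missing backend
    }
    if (requested == KvType::Q8_0 && !backend->supports_q8_flash_attn) {
        return KvType::F32;
    }
    return requested;
}

// Stand-in for the MTL wrapper: run the shared resolve first, then also
// require the strided q8 -> f32 cast before keeping a quantized KV cache.
KvType mtl_resolve_kv_type(const Backend* backend, KvType requested) {
    const KvType resolved = resolve_kv_type(backend, requested);
    if (resolved == KvType::F32) {
        return resolved;  // nothing to dequantize, so the cast is irrelevant
    }
    // resolved is Q8_0 here, which implies backend is non-null.
    if (!backend->supports_q8_to_f32_cast) {
        return KvType::F32;  // degrade instead of aborting at compute time
    }
    return resolved;
}
```

Under this model, a backend with both capabilities keeps q8, a backend that
advertises q8 flash-attn but lacks the cast falls back to f32, and a null backend
or an f32 request always yields f32.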
Refs QVAC-19557
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>