Commit 5b398bd
Fix palettize_weights with enable_per_channel_scale=True crashing on ANE (macOS 26)
When OpPalettizerConfig is configured with enable_per_channel_scale=True,
palettize_weights wraps the constexpr_lut_to_dense output in a
constexpr_blockwise_shift_scale op (data=<dense fp16 weight>,
scale=<per-channel fp16>). On macOS 26, the MPSGraph backend lowering for that
constexpr op fails verification when targeting the Apple Neural Engine:

    'mps.dequantize' op operand #2 must be tensor of quantized values, but got 'tensor<1xf16>'
    ...
    failed assertion `original module failed verification'

The MPSGraph lowering of constexpr_blockwise_shift_scale assumes the data
operand is a quantized integer tensor (it lowers to mps.dequantize); with
enable_per_channel_scale=True, the data is the dense fp16 weight, which fails
that assumption. CPU and GPU compute units accept the wrapper and predict
correctly; only the ANE-targeted MIL -> MPSGraph dispatch is broken.

Fix: bake per_channel_scale into the LUT entries at compile time and re-emit
constexpr_lut_to_dense, instead of leaving the scale as a runtime constexpr.
Both data and scale are fp16 and the wrapper's only effect is data * scale, so
the fold is mathematically identical. The failing MPSGraph dispatch is
eliminated entirely, and CPU / GPU numerics stay bit-identical with the prior
behavior. The resulting graph also has one fewer runtime constexpr per
palettized const.

Test updated: TestPalettizeWeights::test_palettization_pcs previously asserted
that the constexpr_blockwise_shift_scale wrapper was emitted; it now asserts
the wrapper is absent (the LUT is pre-scaled). Numerical equivalence vs the
unpalettized model is verified by the existing verify_model_outputs call on
macOS 15+.

Tested:
- test_palettization_pcs: PASS
- All 155 TestPalettizeWeights / TestJointCompressWeights: PASS
- Manual: Qwen3-VL 2B stateful chunk on macOS 26 + M4 ANE: MPSGraph
  verification crash gone (was reproducible at every load).
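[Editor's note, not part of the commit: a minimal numpy sketch of why the fold
is exact. The shapes, the seed, and the lut_to_dense helper are illustrative
assumptions; only the reshape-and-multiply fold mirrors the diff below.]

import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: a [4, 8] weight, per-channel 4-bit palettes
# (vector_size=1), and a per-output-channel scale of shape [4, 1].
indices = rng.integers(0, 16, size=(4, 8))                    # uint4 palette indices
lut = rng.standard_normal((4, 1, 16, 1)).astype(np.float16)   # [groups..., num_palettes, vector_size]
per_channel_scale = rng.standard_normal((4, 1)).astype(np.float16)

def lut_to_dense(lut, indices):
    # Scalar-palette decode: dense[i, j] = lut[i, 0, indices[i, j], 0].
    rows = np.arange(indices.shape[0])[:, None]
    return lut[rows, 0, indices, 0]

# Old graph: constexpr_lut_to_dense, then the wrapper's effect (data * scale).
old = (
    lut_to_dense(lut, indices).astype(np.float32)
    * per_channel_scale.astype(np.float32)
).astype(np.float16)

# New graph: fold the scale into the LUT entries, then decode.
pcs_bcast = per_channel_scale.reshape(
    per_channel_scale.shape + (1,) * (lut.ndim - per_channel_scale.ndim)
)
lut_folded = (lut.astype(np.float32) * pcs_bcast.astype(np.float32)).astype(lut.dtype)
new = lut_to_dense(lut_folded, indices)

# Bit-identical: an fp16*fp16 product is exact in fp32, so the only rounding
# is the final fp16 cast, which both paths apply to the same fp32 product.
np.testing.assert_array_equal(old, new)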
1 parent e95804f commit 5b398bd

2 files changed: 41 additions & 9 deletions

coremltools/optimize/coreml/_quantization_passes.py

Lines changed: 32 additions & 5 deletions
@@ -1139,12 +1139,39 @@ def transform_op(self, op: Operation):
                     "Palettization with per-channel-scale is only supported since "
                     "iOS18. Please set minimum_deployment_target accordingly."
                 )
-            new_var = mb.constexpr_blockwise_shift_scale(
-                data=new_var,
-                scale=per_channel_scale,
-                offset=None,
-                before_op=op,
+            # Bake per_channel_scale into the LUT entries at compile time
+            # and re-emit constexpr_lut_to_dense, rather than wrapping the
+            # dense fp16 weight in a runtime constexpr_blockwise_shift_scale.
+            # The wrapper makes the MPSGraph backend on macOS 26 fail
+            # MPSGraph verification at model load time on the Apple Neural
+            # Engine ("'mps.dequantize' op operand #2 must be tensor of
+            # quantized values, but got 'tensor<1xf16>'"), because the MPS
+            # lowering of that constexpr expects the data operand to be a
+            # quantized integer tensor; with per_channel_scale the data is
+            # the dense fp16 weight produced by constexpr_lut_to_dense.
+            # Folding scale into the LUT entries is mathematically identical
+            # (both data and scale are fp16, output is data * scale), so
+            # numerics are preserved while the failing runtime op is
+            # eliminated. CPU and GPU compute units are unaffected by the
+            # previous wrapper, so this also keeps their numerics
+            # bit-identical with the prior behavior.
+            lut = lut_params.lut.copy()
+            # per_channel_scale rank matches the original weight rank; LUT
+            # has additional trailing dims for [group, num_palette,
+            # vector_size]. Broadcast scale across those trailing dims.
+            pcs_bcast = per_channel_scale.reshape(
+                per_channel_scale.shape
+                + (1,) * (lut.ndim - per_channel_scale.ndim)
+            )
+            lut = (
+                lut.astype(np.float32) * pcs_bcast.astype(np.float32)
+            ).astype(lut.dtype)
+            new_var = frontend_utils._construct_constexpr_lut_op(
+                lut_params.indices,
+                lut,
+                lut_params.vector_axis,
                 name=op.name + "_palettized_pcs",
+                before_op=op,
             )
         else:
             decompressed_val = self.decompress(lut_params)
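[Editor's note, not part of the commit: a usage sketch of the code path this
hunk patches. The model path and the kmeans/nbits settings are assumptions;
OpPalettizerConfig, OptimizationConfig, palettize_weights, and
enable_per_channel_scale are the coremltools.optimize.coreml API named in the
commit message.]

import coremltools as ct
import coremltools.optimize.coreml as cto

# The model must have been converted with minimum_deployment_target >= iOS18,
# per the raise guarding this branch.
mlmodel = ct.models.MLModel("model.mlpackage")

config = cto.OptimizationConfig(
    global_config=cto.OpPalettizerConfig(
        mode="kmeans",
        nbits=4,
        enable_per_channel_scale=True,  # exercises the branch patched above
    )
)

# After this commit, the palettized program carries pre-scaled
# constexpr_lut_to_dense ops and no constexpr_blockwise_shift_scale wrapper.
mlmodel_palettized = cto.palettize_weights(mlmodel, config)
mlmodel_palettized.save("model_palettized.mlpackage")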

coremltools/test/optimize/coreml/test_post_training_quantization.py

Lines changed: 9 additions & 4 deletions
@@ -1683,13 +1683,18 @@ def test_palettization_pcs(self, compute_unit, backend):
             op_type="constexpr_lut_to_dense"
         )[0]
         assert types.builtin_to_string(palettize_op.indices.dtype) == "uint4"
-        # The per-channel-scale is represented by a quant op to do scaling.
+        # The per-channel-scale is folded into the LUT entries at compile time
+        # (rather than emitted as a runtime constexpr_blockwise_shift_scale
+        # wrapper); see the comment in
+        # coremltools/optimize/coreml/_quantization_passes.py for the rationale
+        # (the wrapper was rejected by the MPSGraph lowering on Apple Neural
+        # Engine on macOS 26 because mps.dequantize expects a quantized integer
+        # data input but received the dense fp16 weight). Folding into the LUT
+        # is mathematically identical and produces a smaller graph.
         quantize_ops = mlmodel_palettized._mil_program.functions["main"].find_ops(
             op_type="constexpr_blockwise_shift_scale"
         )
-        assert len(quantize_ops) > 0
-        # Order of quant and lut op is determined by canonicalize_quantized_lut_pattern graph pass.
-        assert quantize_ops[0].outputs[0].child_ops[0].op_type == "constexpr_lut_to_dense"
+        assert len(quantize_ops) == 0

         if _macos_version() >= (15, 0):
             verify_model_outputs(mlmodel, mlmodel_palettized, coreml_input_values)
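[Editor's note, not part of the commit: a minimal load check matching the
manual repro in the commit message; the .mlpackage path is an assumption
carried over from the sketch above. With the wrapper present, loading with the
Neural Engine enabled crashed in MPSGraph verification on macOS 26; with the
pre-scaled LUT it loads cleanly.]

import coremltools as ct

model = ct.models.MLModel(
    "model_palettized.mlpackage",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # include the ANE in dispatch
)
print("Loaded with compute units:", model.compute_unit)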
