Describing the bug
Bug. softplus exhibits a hard, single-step output collapse on Apple Neural Engine in fp16: at x ≈ 10.4 the output drops from ≈ 10.4 to 0.0 across one grid point, not through gradual precision loss. (A fine sweep places the transition between x = 10.394 and x = 10.395; the "10.4077" quoted throughout this report is the first grid point of the 2000-point sweep that lies past the transition.)
Affected. Models using softplus (or nn.Mish, which is x * tanh(softplus(x))) routed to NE in fp16. CPU and GPU compute units are unaffected. The cliff appears well below the asymptotic regime (in fp32 terms) where softplus(x) ≈ x would justify any approximation.
Discovered while debugging fp16 precision in a KataGo-style network's nn.Mish activations; isolated to softplus via op-attribution probing (every elementary op of Mish was forced to NE; only softplus exhibits the discontinuity, while tanh and mul propagate the broken upstream value correctly).
Stack Trace
To Reproduce
Save the following script as repro_softplus_ne_cliff.py and run with python repro_softplus_ne_cliff.py on an Apple-silicon Mac.
"""Minimal repro: softplus has a hard fp16 cliff on Apple Neural Engine.
Reproduces a hard, single-step output collapse at x ≈ 10.408.
Requires macOS with Apple Neural Engine (M1/M2/M3/M4).
Run: python repro_softplus_ne_cliff.py
"""
import os
import tempfile

import coremltools as ct
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from coremltools.models.compute_device import MLNeuralEngineComputeDevice
from coremltools.models.compute_plan import MLComputePlan

SPATIAL = 8
CHANNELS = 32
FLAT_DIM = SPATIAL * SPATIAL * CHANNELS  # 8 * 8 * 32 = 2048


class M(nn.Module):
    """conv -> softplus -> flatten -> linear (pick-element head).

    Topology chosen to attract NE routing for softplus on macOS14+ targets:
    Conv2d(1->C, k=3, padding=same) followed by a Linear head with NE-friendly
    shapes. Smaller topologies (e.g. 1x1 conv with no Linear) compile but
    softplus stays on CPU.
    """

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, CHANNELS, kernel_size=3, padding="same")
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(FLAT_DIM, 16)
        with torch.no_grad():
            # Delta conv: conv output[k,i,j] = input[0,i,j] (kernel center only)
            self.conv.weight.zero_()
            self.conv.weight[:, 0, 1, 1] = 1.0
            self.conv.bias.zero_()
            # Pick-element head: fc.out[k] = flat[k] for k in 0..15
            self.fc.weight.zero_()
            self.fc.weight.fill_diagonal_(1.0)
            self.fc.bias.zero_()

    def forward(self, x):
        return self.fc(self.flatten(F.softplus(self.conv(x))))


def _op_kind(operator_name):
    """Strip 'iosXX.' / 'macOSXX.' namespace prefix from a MIL op_type."""
    return operator_name.split(".", 1)[1] if "." in operator_name else operator_name


def main():
    model = M().eval()
    # CPU sanity: forward computes softplus(x_test) at every output element.
    x_test = 2.5
    test_input = torch.full((1, 1, SPATIAL, SPATIAL), x_test, dtype=torch.float32)
    with torch.no_grad():
        cpu_out = model(test_input).numpy().flatten()
    sp_expected = float(F.softplus(torch.tensor(x_test)).item())
    assert np.allclose(cpu_out, sp_expected, atol=1e-5), (
        f"CPU sanity failed: cpu_out[:4]={cpu_out[:4]} != softplus({x_test})={sp_expected}"
    )
    with tempfile.TemporaryDirectory() as d:
        traced = torch.jit.trace(model, test_input)
        mlm = ct.convert(
            traced,
            convert_to="mlprogram",
            inputs=[ct.TensorType(name="x", shape=test_input.shape)],
            minimum_deployment_target=ct.target.macOS14,
            compute_precision=ct.precision.FLOAT16,
        )
        pkg = os.path.join(d, "m.mlpackage")
        mlm.save(pkg)
        loaded = ct.models.MLModel(pkg, compute_units=ct.ComputeUnit.CPU_AND_NE)
        # Routing check: assert softplus dispatched to NE.
        plan = MLComputePlan.load_from_path(
            loaded.get_compiled_model_path(),
            compute_units=ct.ComputeUnit.CPU_AND_NE,
        )
        (fn,) = plan.model_structure.program.functions.values()
        ops = list(fn.block.operations)
        sp_ops = [op for op in ops if _op_kind(op.operator_name) == "softplus"]
        assert len(sp_ops) == 1, f"expected exactly 1 softplus op, got {len(sp_ops)}"
        usage = plan.get_compute_device_usage_for_mlprogram_operation(sp_ops[0])
        device_name = (
            type(usage.preferred_compute_device).__name__ if usage else "unknown"
        )
        assert usage is not None and isinstance(
            usage.preferred_compute_device, MLNeuralEngineComputeDevice
        ), f"softplus routed to {device_name}; this repro requires NE"
        # Sweep x in [-15, 15] @ 2000 points, capture softplus output at each x.
        out_name = loaded.get_spec().description.output[0].name
        N = 2000
        xs = np.linspace(-15.0, 15.0, N, dtype=np.float32)
        ys = np.empty(N, dtype=np.float32)
        for i, xi in enumerate(xs):
            inp = np.full((1, 1, SPATIAL, SPATIAL), float(xi), dtype=np.float32)
            ys[i] = loaded.predict({"x": inp})[out_name].flat[0]
        sp_ref = np.maximum(xs, 0) + np.log1p(np.exp(-np.abs(xs)))
        # Cliff: NE output collapses to ~0 while fp32 ref is large.
        cliff_idx = np.where((sp_ref > 5.0) & (ys < 1.0))[0]
        if cliff_idx.size:
            i = cliff_idx[0]
            print(
                f"CLIFF: x={xs[i]:.4f} ne_out={ys[i]:.4f} fp32_ref={sp_ref[i]:.4f}"
            )
        else:
            print("No cliff observed; please check NE actually engaged.")


if __name__ == "__main__":
    main()
Expected output on an Apple-silicon Mac:
CLIFF: x=10.4077 ne_out=0.0000 fp32_ref=10.4077
Expected behavior
softplus(x) = log(1 + exp(x)) is mathematically continuous and monotonically increasing. For x in [10, 11], the fp32 reference values run from ≈ 10.000 to ≈ 11.000 (already in the regime where softplus(x) ≈ x to fp16 precision). Output should remain a smooth monotonic function across the full input range.
Actual behavior
Output drops from ≈ 10.39 to 0.0 in a single grid step, with the first collapsed point at x ≈ 10.4077, where the fp32 reference is ≈ 10.408 (cliff observed on a 2000-point linear sweep over [-15, 15], so step size ≈ 0.015). The transition is hard: adjacent grid points show ≈ 10.39 → 0.00. Increasing sweep resolution does not soften the transition; it locates it more precisely.
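The grid arithmetic behind the "10.4077" figure can be checked without any hardware. The sketch below, in plain Python, recomputes the 2000-point grid over [-15, 15] and finds the two grid points that bracket the finely located transition. The constant `TRANSITION = 10.395` is taken from the fine sweep reported in this issue; the variable names are illustrative only.

```python
import math

N = 2000
LO, HI = -15.0, 15.0
TRANSITION = 10.395  # upper bracket of the transition, from the fine sweep in this report

# Same spacing np.linspace(LO, HI, N) uses.
step = (HI - LO) / (N - 1)

# Index of the first grid point at or past the transition.
first_bad = math.ceil((TRANSITION - LO) / step)
x_last_good = LO + (first_bad - 1) * step
x_first_bad = LO + first_bad * step

print(f"step        = {step:.6f}")
print(f"last good x = {x_last_good:.4f}")
print(f"first bad x = {x_first_bad:.4f}")
```

Under these assumptions the coarse sweep can only report the first grid point beyond the transition, which is why its figure sits about one grid step above the finely measured location.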
The cliff is specific to softplus, not the surrounding ops. Figure: each panel plots an op's NE-fp16 output against the fp32 reference computed on the actual NE upstream input that op received (not on a clean fp32 chain). This isolates "introduces error" from "propagates upstream error":
- Panel 1: softplus. Reference: fp32 softplus(x). Visible gap at x ≈ 10.4: the NE line collapses to 0 while the fp32 line continues to ≈ 10.4. Inset zooms x ∈ [10, 11] showing the discontinuity sharply.
- Panel 2: tanh. Reference: np.tanh(NE_softplus_output). Byte-for-byte overlap across the full range, including past x = 10.4 where both lines go to 0 together.
- Panel 3: mul (Mish output). Reference: x · NE_tanh_output. Byte-for-byte overlap across the full range.
Tanh and mul therefore compute correctly given the broken inputs they receive from softplus. They propagate the cliff; they do not introduce it.
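The difference between "introduces error" and "propagates upstream error" can be illustrated with a small sketch. It is not the probing code used for the figure: the helper `attribute` is hypothetical, and the per-stage outputs passed to it are synthetic stand-ins (0.0 everywhere) that model the collapse described above rather than measured NE values.

```python
import math


def softplus_ref(x):
    """Numerically stable double-precision softplus used as the clean reference."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def attribute(x, sp_out, tanh_out, mish_out):
    """Return {op: (clean_chain_error, upstream_aware_error)}.

    clean_chain_error compares against an all-reference pipeline;
    upstream_aware_error compares against a reference fed the op's actual input.
    """
    clean_sp = softplus_ref(x)
    clean_tanh = math.tanh(clean_sp)
    return {
        "softplus": (abs(sp_out - clean_sp), abs(sp_out - clean_sp)),
        "tanh": (abs(tanh_out - clean_tanh), abs(tanh_out - math.tanh(sp_out))),
        "mul": (abs(mish_out - x * clean_tanh), abs(mish_out - x * tanh_out)),
    }


# Synthetic stage outputs past the cliff: softplus collapsed to 0, so
# tanh(0) = 0 and x * 0 = 0 follow if the downstream ops are correct.
errs = attribute(10.5, sp_out=0.0, tanh_out=0.0, mish_out=0.0)
for op, (clean_err, aware_err) in errs.items():
    print(f"{op:8s} clean-chain={clean_err:.4f} upstream-aware={aware_err:.4f}")
```

Against a clean chain all three ops look wrong; against their actual inputs only softplus does, which is the pattern the three panels show.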
System environment (please complete the following information):
- macOS: 26.3.1
- Hardware: Apple M3 Max
- coremltools: 9.0
- PyTorch: 2.11.0 (note: coremltools 9.0 documents tested-up-to torch 2.7.0; the cliff appears unrelated to this, as the same behavior was reproduced under older toolchain versions during research)
- NumPy: 2.1.3
- Python: 3.11.15
Additional context
Workaround
A drop-in replacement using a numerically-stable softplus identity eliminates the cliff while keeping every elementary op on NE:
import torch


def softplus_safe(x):
    # softplus(x) = relu(x) + log1p(exp(-|x|))
    return torch.relu(x) + torch.log1p(torch.exp(-torch.abs(x)))
All sub-ops (relu, abs, mul, exp, log lowered with epsilon=1.0 for log1p, add) route to NE and produce continuous output across [-15, 15]. The cliff at x ≈ 10.4 disappears.
Accuracy caveat. Errors are highly asymmetric in x (measured on a 4001-point sweep):
- Positive tail (x ∈ [3, 7]): max relative error ≈ 9.5 × 10⁻⁴, effectively exact.
- Moderate negative (x ∈ [−7, −3]): max relative error ≈ 0.40 at x ≈ −6.5, with NE over-estimating softplus (e.g. x = −6: ne = 3.05 × 10⁻³ vs ref = 2.48 × 10⁻³). Mechanism is fp16 ULP rounding of 1 + small to the nearest representable value above 1 (i.e. 1 + k·2⁻¹⁰ for some k ≥ 1) inside log1p, not subnormal flushing.
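- Far negative (x ≲ −8): NE returns hard 0. This is consistent with 1 + exp(-|x|) rounding to exactly 1 in fp16 once exp(-|x|) drops below half an ULP of 1 (2⁻¹¹ ≈ 4.9 × 10⁻⁴, reached near x ≈ −7.62), rather than with exp underflow, since exp(−8) ≈ 3.4 × 10⁻⁴ is still a normal fp16 value. Relative error is 1 by definition, but absolute error remains < 3.4 × 10⁻⁴.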
- Peak relative error overall: 1.12 at x ≈ −7.63 (transition between the two negative regimes).
Net: absolute error stays bounded (≤ 4.2 × 10⁻³) across the full range. The positive tail is essentially exact; the negative tail loses precision but the absolute miss is small enough for activation use. If you need precision at moderately negative x, prefer a different identity.
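The rounding behavior described in this caveat can be explored on any machine with a CPU-side emulation. The sketch below rounds to IEEE half precision after every elementary op using the standard library's `struct` "e" format. It is an assumption that this approximates the NE's arithmetic at all: it reproduces the qualitative regimes (over-estimate at moderately negative x, hard 0 further out) but is not expected to match the measured NE values digit for digit. `to_fp16` and `softplus_safe_fp16` are illustrative names, not part of any library.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def softplus_safe_fp16(x):
    """relu(x) + log(1 + exp(-|x|)) with fp16 rounding after each op (emulation only)."""
    x = to_fp16(x)
    e = to_fp16(math.exp(-abs(x)))      # exp output, bounded in (0, 1]
    one_plus = to_fp16(1.0 + e)         # this add is where small terms get rounded
    log_term = to_fp16(math.log(one_plus))
    return to_fp16(max(x, 0.0) + log_term)


def softplus_ref(x):
    """Double-precision reference."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


for x in (5.0, -6.0, -8.0):
    emu, ref = softplus_safe_fp16(x), softplus_ref(x)
    print(f"x={x:+.1f} emulated={emu:.6e} ref={ref:.6e}")
```

In this emulation the add `1 + exp(-|x|)` is the step that loses information: the result snaps to a multiple of 2⁻¹⁰ above 1, and to exactly 1 once the small term falls below half that spacing.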
Why this works. The cliff appears to be a dynamic-range failure inside the NE lowering of softplus. We cannot confirm the exact mechanism without access to the internals, but the cliff position is highly suggestive: a fine 251-point sweep on [10.30, 10.55] places the transition between x = 10.394 (ne ≈ 10.39) and x = 10.395 (ne = 0.0). The fp16-rounded value of 10.395 is 10.3984, which sits just above log(2^15) = 10.39721, while the fp16-rounded value of 10.394 is 10.3906, just below it. This points to a 2^15-bounded internal representation rather than fp16's full 2^16 range (naïve log(1 + exp(x)) would not overflow fp16 until x ≳ log(65504) ≈ 11.09). The repro's 2000-point sweep reports x ≈ 10.4077, but that is a grid artifact; the underlying transition lies between 10.394 and 10.395. Either way, the safe identity uses exp(-|x|), which is bounded in (0, 1] and cannot overflow regardless of the internal precision used, so its correctness does not depend on the exact mechanism.
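The arithmetic in the paragraph above can be verified with the standard library alone. This is only a check of the numbers quoted in the report, not evidence about the NE internals; `to_fp16` is an illustrative helper that rounds through the `struct` half-precision format.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


below = to_fp16(10.394)        # last input with a correct NE output
above = to_fp16(10.395)        # first input whose NE output is 0
log_2_15 = math.log(2.0 ** 15)
log_fp16_max = math.log(65504.0)  # largest finite fp16 value

print(f"fp16(10.394)  = {below}")
print(f"log(2^15)     = {log_2_15:.5f}")
print(f"fp16(10.395)  = {above}")
print(f"log(fp16 max) = {log_fp16_max:.4f}")
print("log(2^15) lies between the two rounded inputs:", below < log_2_15 < above)
```

The two sweep inputs land on adjacent fp16 values (spacing 2⁻⁷ in this range), and log(2^15) falls strictly between them.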
A converter-side fix could lower nn.Softplus / the MIL softplus op to this identity when targeting NE. We have not validated that across all softplus call sites; a maintainer should confirm scope.
Reproducibility on other Apple Silicon
The repro above was developed on M3 Max. NE op-acceptance rules differ across silicon generations, so the routing precondition does not hold uniformly.
M5 (community report, 2026-05-07): This repro does not trigger the NE precondition on M5. There, softplus is routed to MLCPUComputeDevice, so the assertion fails before the cliff scan runs:
AssertionError: softplus routed to MLCPUComputeDevice; this repro requires NE
However, an M5 user confirmed mish (which internally uses softplus) does route to NE at specific shapes. Sweeping a (spatial, channel) matrix found ALL_NE routing at:
s=16, c=64
s=32, c ∈ {16, 32, 64}
s=64, c ∈ {4, 8, 16}
At s=32, c=32, the M5 NE produces an analogous fp16 cliff:
CLIFF: x=10.4077 ne_out=0.0000 fp32_ref=10.4077
This suggests the underlying fp16 limit in the NE softplus kernel at x ≈ 10.4 is consistent across silicon generations, but whether the Core ML compiler routes a standalone softplus to NE differs between M3 Max and M5.
Note on the workaround: The relu(x) + log1p(exp(-|x|)) identity above was validated on M3 Max. Because NE routing rules differ on M5 (and likely on future silicon), the workaround's NE coverage and continuity should be re-tested per chip generation, confirming both that all sub-ops still route to NE on the target chip and that no analogous cliff appears in the rewritten graph.
Related
Drafted by Claude Opus 4.7 and reviewed, confirmed, rephrased, and edited by Me.