Softplus on Apple Neural Engine has a hard fp16 discontinuity at x ≈ 10.4 (output drops to 0) #2687

@ChinChangYang

🐞Describing the bug

Bug. softplus exhibits a hard, single-step output collapse on Apple Neural Engine in fp16: at x ≈ 10.4 the output drops from ≈10.4 to 0.0 across one grid point, rather than degrading gradually. (A fine sweep places the transition between x = 10.394 and x = 10.395; the "10.4077" quoted elsewhere in this report is the nearest grid point of the 2000-point sweep.)

Affected. Models using softplus (or nn.Mish, which is x * tanh(softplus(x))) routed to NE in fp16. CPU and GPU compute units are unaffected. The asymptotic approximation softplus(x) ≈ x does not explain the cliff: where it holds, the correct output is ≈ x, not 0.

Discovered while debugging fp16 precision in a KataGo-style network's nn.Mish activations; isolated to softplus via op-attribution probing (every elementary op of Mish was forced to NE; only softplus exhibits the discontinuity, while tanh and mul propagate the broken upstream value correctly).

Stack Trace

  • N/A

To Reproduce

Save the following script as repro_softplus_ne_cliff.py and run with python repro_softplus_ne_cliff.py on an Apple-silicon Mac.

"""Minimal repro: softplus has a hard fp16 cliff on Apple Neural Engine.

Reproduces a hard, single-step output collapse at x ≈ 10.408.
Requires macOS with Apple Neural Engine (M1/M2/M3/M4).

Run: python repro_softplus_ne_cliff.py
"""
import os
import tempfile

import coremltools as ct
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from coremltools.models.compute_device import MLNeuralEngineComputeDevice
from coremltools.models.compute_plan import MLComputePlan


SPATIAL = 8
CHANNELS = 32
FLAT_DIM = SPATIAL * SPATIAL * CHANNELS  # 8 * 8 * 32 = 2048


class M(nn.Module):
    """conv -> softplus -> flatten -> linear (pick-element head).

    Topology chosen to attract NE routing for softplus on macOS14+ targets:
    Conv2d(1->C, k=3, padding=same) followed by a Linear head with NE-friendly
    shapes. Smaller topologies (e.g. 1x1 conv with no Linear) compile but
    softplus stays on CPU.
    """

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, CHANNELS, kernel_size=3, padding="same")
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(FLAT_DIM, 16)
        with torch.no_grad():
            # Delta conv: conv output[k,i,j] = input[0,i,j] (kernel center only)
            self.conv.weight.zero_()
            self.conv.weight[:, 0, 1, 1] = 1.0
            self.conv.bias.zero_()
            # Pick-element head: fc.out[k] = flat[k] for k in 0..15
            self.fc.weight.zero_()
            self.fc.weight.fill_diagonal_(1.0)
            self.fc.bias.zero_()

    def forward(self, x):
        return self.fc(self.flatten(F.softplus(self.conv(x))))


def _op_kind(operator_name):
    """Strip 'iosXX.' / 'macOSXX.' namespace prefix from a MIL op_type."""
    return operator_name.split(".", 1)[1] if "." in operator_name else operator_name


def main():
    model = M().eval()

    # CPU sanity: forward computes softplus(x_test) at every output element.
    x_test = 2.5
    test_input = torch.full((1, 1, SPATIAL, SPATIAL), x_test, dtype=torch.float32)
    with torch.no_grad():
        cpu_out = model(test_input).numpy().flatten()
    sp_expected = float(F.softplus(torch.tensor(x_test)).item())
    assert np.allclose(cpu_out, sp_expected, atol=1e-5), (
        f"CPU sanity failed: cpu_out[:4]={cpu_out[:4]} != softplus({x_test})={sp_expected}"
    )

    with tempfile.TemporaryDirectory() as d:
        traced = torch.jit.trace(model, test_input)
        mlm = ct.convert(
            traced,
            convert_to="mlprogram",
            inputs=[ct.TensorType(name="x", shape=test_input.shape)],
            minimum_deployment_target=ct.target.macOS14,
            compute_precision=ct.precision.FLOAT16,
        )
        pkg = os.path.join(d, "m.mlpackage")
        mlm.save(pkg)

        loaded = ct.models.MLModel(pkg, compute_units=ct.ComputeUnit.CPU_AND_NE)

        # Routing check: assert softplus dispatched to NE.
        plan = MLComputePlan.load_from_path(
            loaded.get_compiled_model_path(),
            compute_units=ct.ComputeUnit.CPU_AND_NE,
        )
        (fn,) = plan.model_structure.program.functions.values()
        ops = list(fn.block.operations)
        sp_ops = [op for op in ops if _op_kind(op.operator_name) == "softplus"]
        assert len(sp_ops) == 1, f"expected exactly 1 softplus op, got {len(sp_ops)}"
        usage = plan.get_compute_device_usage_for_mlprogram_operation(sp_ops[0])
        device_name = (
            type(usage.preferred_compute_device).__name__ if usage else "unknown"
        )
        assert usage is not None and isinstance(
            usage.preferred_compute_device, MLNeuralEngineComputeDevice
        ), f"softplus routed to {device_name}; this repro requires NE"

        # Sweep x in [-15, 15] @ 2000 points, capture softplus output at each x.
        out_name = loaded.get_spec().description.output[0].name
        N = 2000
        xs = np.linspace(-15.0, 15.0, N, dtype=np.float32)
        ys = np.empty(N, dtype=np.float32)
        for i, xi in enumerate(xs):
            inp = np.full((1, 1, SPATIAL, SPATIAL), float(xi), dtype=np.float32)
            ys[i] = loaded.predict({"x": inp})[out_name].flat[0]

        sp_ref = np.maximum(xs, 0) + np.log1p(np.exp(-np.abs(xs)))
        # Cliff: NE output collapses to ~0 while fp32 ref is large.
        cliff_idx = np.where((sp_ref > 5.0) & (ys < 1.0))[0]
        if cliff_idx.size:
            i = cliff_idx[0]
            print(
                f"CLIFF: x={xs[i]:.4f}  ne_out={ys[i]:.4f}  fp32_ref={sp_ref[i]:.4f}"
            )
        else:
            print("No cliff observed; please check that NE actually engaged.")


if __name__ == "__main__":
    main()

Expected output on an Apple-silicon Mac:

CLIFF: x=10.4077  ne_out=0.0000  fp32_ref=10.4077

Expected behavior

softplus(x) = log(1 + exp(x)) is mathematically continuous and monotonically increasing. For x in [10, 11], the fp32 reference values are ≈ 10.000–11.000 (already in the asymptotic regime where softplus(x) ≈ x to within fp16 resolution). Output should remain a smooth monotonic function across the full input range.

Actual behavior

Output drops from the expected ≈10.408 to 0.0 in a single grid step at x ≈ 10.4077 (observed on a 2000-point linear sweep over [-15, 15], so step size ≈ 0.015). The transition is hard: adjacent grid points show ≈10.39 → 0.00. Increasing sweep resolution does not soften the transition; it only locates it more precisely.
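Because the transition is a single hard step, it can be localized by bisection instead of a dense sweep. The following is a minimal sketch under stated assumptions: `locate_cliff` and `fake_ne_softplus` are hypothetical names that do not appear in the repro script, and the stand-in function simply collapses above 10.3945 to imitate the reported behavior. On a real device, `predict` would wrap `loaded.predict(...)` from the repro.

```python
import math


def softplus_ref(x):
    """Numerically stable fp64 reference: max(x, 0) + log1p(exp(-|x|))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def locate_cliff(predict, lo, hi, tol=1e-3):
    """Bisect [lo, hi] for the point where `predict` collapses.

    Assumes predict(lo) still tracks the reference, predict(hi) has already
    collapsed, and there is exactly one transition in between.
    """
    def collapsed(x):
        # "Collapsed" means the output is less than half the reference value.
        return predict(x) < 0.5 * softplus_ref(x)

    if collapsed(lo) or not collapsed(hi):
        raise ValueError("bracket must straddle the transition")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if collapsed(mid):
            hi = mid
        else:
            lo = mid
    return lo, hi


def fake_ne_softplus(x):
    """Illustrative stand-in for the NE model, NOT measured data."""
    return 0.0 if x > 10.3945 else softplus_ref(x)


lo, hi = locate_cliff(fake_ne_softplus, 10.0, 11.0)
print(f"transition bracketed in [{lo:.4f}, {hi:.4f}]")
```

If the model casts its input to fp16, a bracket narrower than the fp16 spacing near 10.4 (2⁻⁷ ≈ 0.0078) mostly resolves where the input rounding boundary lies, so a tighter `tol` adds little.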

The cliff is specific to softplus, not the surrounding ops. Figure: each panel plots an op's NE-fp16 output against the fp32 reference computed on the actual NE upstream input that op received (not on a clean fp32 chain). This separates "introduces error" from "propagates upstream error":

  • Panel 1: softplus. Reference: fp32 softplus(x). Visible gap at x ≈ 10.4, where the NE line collapses to 0 while the fp32 line continues to ≈ 10.4. Inset zooms x ∈ [10, 11], showing the discontinuity sharply.
  • Panel 2: tanh. Reference: np.tanh(NE_softplus_output). Byte-for-byte overlap across the full range, including past x = 10.4 where both lines go to 0 together.
  • Panel 3: mul (Mish output). Reference: x · NE_tanh_output. Byte-for-byte overlap across the full range.

Tanh and mul therefore compute correctly given the broken inputs they receive from softplus. They propagate the cliff; they do not introduce it.

[Image: three-panel plot of NE-fp16 output vs. reference for softplus, tanh, and mul (Mish output)]
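The propagation argument can be illustrated off-device with plain Python. This is only a sketch: `broken_softplus` is a hypothetical stand-in that returns 0 above 10.3945 to imitate the reported collapse, and `mish` takes the softplus implementation as a parameter so both variants can be compared. Once softplus returns 0, tanh(0) = 0 and the final multiply yields 0 as well, even though tanh and mul themselves are computed correctly.

```python
import math


def softplus_ref(x):
    """Numerically stable fp64 reference softplus."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def broken_softplus(x):
    """Hypothetical stand-in for the faulty op: collapses to 0 above 10.3945."""
    return 0.0 if x > 10.3945 else softplus_ref(x)


def mish(x, softplus=softplus_ref):
    """Mish(x) = x * tanh(softplus(x)), with a pluggable softplus."""
    return x * math.tanh(softplus(x))


for x in (10.0, 10.39, 10.40, 11.0):
    print(f"x={x:6.2f}  mish_ref={mish(x):8.4f}  "
          f"mish_with_broken_softplus={mish(x, broken_softplus):8.4f}")
```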

System environment (please complete the following information):

  • macOS: 26.3.1
  • Hardware: Apple M3 Max
  • coremltools: 9.0
  • PyTorch: 2.11.0 (note: coremltools 9.0 documents tested-up-to torch 2.7.0; the cliff appears unrelated to this, as the same behavior was reproduced under older toolchain versions during research)
  • NumPy: 2.1.3
  • Python: 3.11.15

Additional context

Workaround

A drop-in replacement using a numerically-stable softplus identity eliminates the cliff while keeping every elementary op on NE:

def softplus_safe(x):
    # softplus(x) = relu(x) + log1p(exp(-|x|))
    return torch.relu(x) + torch.log1p(torch.exp(-torch.abs(x)))

All sub-ops (relu, abs, mul, exp, add, and log1p lowered to log with epsilon=1.0) route to NE and produce continuous output across [-15, 15]. The cliff at x ≈ 10.4 disappears:

[Image: softplus_safe output on NE across [-15, 15]]

Accuracy caveat. Errors are highly asymmetric in x (measured on a 4001-point sweep):

  • Positive tail (x ∈ [3, 7]): max relative error ≈ 9.5 × 10⁻⁴, effectively exact.
  • Moderate negative (x ∈ [−7, −3]): max relative error ≈ 0.40 at x ≈ −6.5, with NE over-estimating softplus (e.g. x = −6: ne = 3.05 × 10⁻³ vs ref = 2.48 × 10⁻³). The mechanism is fp16 ULP rounding of 1 + small to the nearest representable value above 1 (i.e. 1 + k·2⁻¹⁰ for some k ≥ 1) inside log1p, not subnormal flushing.
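  • Far negative (x ≲ −8): NE returns hard 0. Since exp(−8) ≈ 3.4 × 10⁻⁴ is still a normal fp16 value, the most likely cause is 1 + exp(-|x|) rounding to exactly 1 once exp(-|x|) < 2⁻¹¹ (|x| > 11·ln 2 ≈ 7.62), rather than exp itself underflowing. Relative error is 1 by definition, but absolute error remains < 3.4 × 10⁻⁴.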
  • Peak relative error overall: 1.12 at x ≈ −7.63 (transition between the two negative regimes).

Net: absolute error stays bounded (≤ 4.2 × 10⁻³) across the full range. The positive tail is essentially exact; the negative tail loses precision, but the absolute miss is small enough for activation use. If you need precision at moderately negative x, prefer a different identity.
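The rounding behavior behind the negative-tail errors can be imitated without an NE. The sketch below is an emulation under stated assumptions, not the NE kernel: it rounds every intermediate to IEEE half precision using Python's `struct` format code `"e"`, and the helper names `fp16` and `softplus_safe_fp16` are hypothetical. Its numbers are not expected to match the measured NE values above; it only shows 1 + small snapping to 1 + k·2⁻¹⁰, and to exactly 1 for sufficiently negative x.

```python
import math
import struct


def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def softplus_ref(x):
    """Numerically stable fp64 reference softplus."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_safe_fp16(x):
    """relu(x) + log(1 + exp(-|x|)) with each intermediate rounded to fp16."""
    x = fp16(x)
    e = fp16(math.exp(-abs(x)))      # exp(-|x|), always in (0, 1]
    s = fp16(1.0 + e)                # snaps to 1 + k * 2**-10 (k may be 0)
    log_term = fp16(math.log(s))
    return fp16(fp16(max(x, 0.0)) + log_term)


for x in (5.0, -6.0, -7.6, -8.0):
    emu, ref = softplus_safe_fp16(x), softplus_ref(x)
    print(f"x={x:5.1f}  emulated={emu:.6e}  ref={ref:.6e}")
```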

Why this works. The cliff appears to be a dynamic-range failure inside the NE lowering of softplus. We cannot confirm the exact mechanism without access to NE internals, but the cliff position is highly suggestive: a fine 251-point sweep on [10.30, 10.55] places the transition between x = 10.394 (ne ≈ 10.39) and x = 10.395 (ne = 0.0). The fp16-rounded value of 10.394 is 10.3906, just below log(2^15) = 10.39721, while the fp16-rounded value of 10.395 is 10.3984, just above it. This points to a 2^15-bounded internal representation rather than fp16's full 2^16 range (naïve log(1 + exp(x)) would not overflow fp16 until x ≳ log(65504) ≈ 11.09). The repro's 2000-point sweep reports x ≈ 10.4077, but that is a grid artifact; the underlying transition is at x ≈ 10.395. Either way, the safe identity uses exp(-|x|), which is bounded in (0, 1] and cannot overflow regardless of the internal precision used, so its correctness does not depend on the exact mechanism.
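The arithmetic behind this argument can be checked with a few lines of standard-library Python. This sketch only verifies the fp16 rounding of the two sweep points and the two logarithms quoted above; the `fp16` helper is a hypothetical name, and the check says nothing about what the NE does internally.

```python
import math
import struct


def fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


threshold = math.log(2 ** 15)    # candidate internal limit, ~10.39721
below = fp16(10.394)             # last sweep point with a correct output
above = fp16(10.395)             # first sweep point that collapsed to 0
fp16_overflow = math.log(65504)  # where exp(x) exceeds the largest fp16 value

print(f"log(2^15)    = {threshold:.5f}")
print(f"fp16(10.394) = {below}")   # 10.390625
print(f"fp16(10.395) = {above}")   # 10.3984375
print(f"log(65504)   = {fp16_overflow:.4f}")
print("rounded inputs straddle log(2^15):", below < threshold < above)
```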

A converter-side fix could lower nn.Softplus / the MIL softplus op to this identity when targeting NE. We have not validated that across all softplus call sites; a maintainer should confirm scope.

Reproducibility on other Apple Silicon

The repro above was developed on M3 Max. NE op-acceptance rules differ across silicon generations, so the routing precondition does not hold uniformly.

M5 (community report, 2026-05-07): This repro does not trigger the NE precondition on M5. softplus is routed to MLCPUComputeDevice, so the assertion fails before the cliff scan runs:

AssertionError: softplus routed to MLCPUComputeDevice; this repro requires NE

However, an M5 user confirmed mish (which internally uses softplus) does route to NE at specific shapes. Sweeping a (spatial, channel) matrix found ALL_NE routing at:

  • s=16, c=64
  • s=32, c ∈ {16, 32, 64}
  • s=64, c ∈ {4, 8, 16}

At s=32, c=32, the M5 NE produces an analogous fp16 cliff:

CLIFF: x=10.4077  ne_out=0.0000  fp32_ref=10.4077

This suggests the underlying fp16 limit in the NE softplus kernel at x ≈ 10.4 is consistent across silicon generations, but whether the Core ML compiler routes a standalone softplus to NE differs between M3 Max and M5.

Note on the workaround: The relu(x) + log1p(exp(-|x|)) identity above was validated on M3 Max. Because NE routing rules differ on M5 (and likely on future silicon), the workaround's NE coverage and continuity should be re-tested per chip generation, confirming both that all sub-ops still route to NE on the target chip and that no analogous cliff appears in the rewritten graph.
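For the continuity half of that re-test, a simple check on sweep output may be enough. The sketch below is one possible approach, not part of the original repro: `find_drops` is a hypothetical helper, the tolerance of 0.05 is an arbitrary illustrative choice, and the sweep data is synthetic. It relies only on softplus being monotonically increasing, so any sizeable drop between adjacent sweep points flags a discontinuity.

```python
def find_drops(xs, ys, tol=0.05):
    """Return (x_before, x_after, y_before, y_after) for each adjacent pair
    of sweep points where the output falls by more than `tol`.

    `tol` is meant to absorb fp16 rounding noise; 0.05 is an arbitrary
    illustrative value, not one taken from the report.
    """
    return [
        (xs[i - 1], xs[i], ys[i - 1], ys[i])
        for i in range(1, len(xs))
        if ys[i] < ys[i - 1] - tol
    ]


# Synthetic sweep for illustration only (NOT measured data): output tracks x
# and then collapses to 0 past x = 10.45, mimicking the reported cliff.
xs = [10.0 + 0.1 * i for i in range(11)]
ys = [x if x < 10.45 else 0.0 for x in xs]

drops = find_drops(xs, ys)
for x0, x1, y0, y1 in drops:
    print(f"drop between x={x0:.2f} (y={y0:.2f}) and x={x1:.2f} (y={y1:.2f})")
```

On a device, `xs` and `ys` would come from the same kind of sweep loop as in the repro script, run against the rewritten graph.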

Related


Drafted by Claude Opus 4.7 and reviewed, confirmed, rephrased, and edited by Me.
