Softplus on Apple Neural Engine has a hard fp16 discontinuity at x ≈ 10.4 (output drops to 0)

## 🐞Describing the bug
**Bug.** `softplus` exhibits a hard, single-step output collapse on Apple Neural Engine in fp16: at x ≈ 10.4 the output drops from ≈10.4 to 0.0 across one grid point — not gradual precision loss. (A fine sweep places the transition between x = 10.394 and x = 10.395; the report-wide "10.4077" is the 2000-point sweep's nearest grid point.)

**Affected.** Models using `softplus` (or `nn.Mish`, which is `x * tanh(softplus(x))`) routed to NE in fp16. CPU and GPU compute units are unaffected. The cliff sits in the range where `softplus(x) ≈ x` already holds to fp16 precision, so the correct output there is simply ≈ x and nothing about the function's shape justifies a collapse to 0.

**Discovered while** debugging fp16 precision in a KataGo-style network's `nn.Mish` activations; isolated to softplus via op-attribution probing (every elementary op of Mish was forced to NE; only softplus exhibits the discontinuity, while tanh and mul propagate the broken upstream value correctly).

## Stack Trace
- N/A

## To Reproduce

Save the following script as `repro_softplus_ne_cliff.py` and run with `python repro_softplus_ne_cliff.py` on an Apple-silicon Mac.

```python
"""Minimal repro: softplus has a hard fp16 cliff on Apple Neural Engine.

Reproduces a hard, single-step output collapse at x ≈ 10.408.
Requires macOS with Apple Neural Engine (M1/M2/M3/M4).

Run: python repro_softplus_ne_cliff.py
"""
import os
import tempfile

import coremltools as ct
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from coremltools.models.compute_device import MLNeuralEngineComputeDevice
from coremltools.models.compute_plan import MLComputePlan


SPATIAL = 8
CHANNELS = 32
FLAT_DIM = SPATIAL * SPATIAL * CHANNELS  # 8 * 8 * 32 = 2048


class M(nn.Module):
    """conv -> softplus -> flatten -> linear (pick-element head).

    Topology chosen to attract NE routing for softplus on macOS14+ targets:
    Conv2d(1->C, k=3, padding=same) followed by a Linear head with NE-friendly
    shapes. Smaller topologies (e.g. 1x1 conv with no Linear) compile but
    softplus stays on CPU.
    """

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, CHANNELS, kernel_size=3, padding="same")
        self.flatten = nn.Flatten()
        self.fc = nn.Linear(FLAT_DIM, 16)
        with torch.no_grad():
            # Delta conv: conv output[k,i,j] = input[0,i,j] (kernel center only)
            self.conv.weight.zero_()
            self.conv.weight[:, 0, 1, 1] = 1.0
            self.conv.bias.zero_()
            # Pick-element head: fc.out[k] = flat[k] for k in 0..15
            self.fc.weight.zero_()
            self.fc.weight.fill_diagonal_(1.0)
            self.fc.bias.zero_()

    def forward(self, x):
        return self.fc(self.flatten(F.softplus(self.conv(x))))


def _op_kind(operator_name):
    """Strip 'iosXX.' / 'macOSXX.' namespace prefix from a MIL op_type."""
    return operator_name.split(".", 1)[1] if "." in operator_name else operator_name


def main():
    model = M().eval()

    # CPU sanity: forward computes softplus(x_test) at every output element.
    x_test = 2.5
    test_input = torch.full((1, 1, SPATIAL, SPATIAL), x_test, dtype=torch.float32)
    with torch.no_grad():
        cpu_out = model(test_input).numpy().flatten()
    sp_expected = float(F.softplus(torch.tensor(x_test)).item())
    assert np.allclose(cpu_out, sp_expected, atol=1e-5), (
        f"CPU sanity failed: cpu_out[:4]={cpu_out[:4]} != softplus({x_test})={sp_expected}"
    )

    with tempfile.TemporaryDirectory() as d:
        traced = torch.jit.trace(model, test_input)
        mlm = ct.convert(
            traced,
            convert_to="mlprogram",
            inputs=[ct.TensorType(name="x", shape=test_input.shape)],
            minimum_deployment_target=ct.target.macOS14,
            compute_precision=ct.precision.FLOAT16,
        )
        pkg = os.path.join(d, "m.mlpackage")
        mlm.save(pkg)

        loaded = ct.models.MLModel(pkg, compute_units=ct.ComputeUnit.CPU_AND_NE)

        # Routing check — assert softplus dispatched to NE.
        plan = MLComputePlan.load_from_path(
            loaded.get_compiled_model_path(),
            compute_units=ct.ComputeUnit.CPU_AND_NE,
        )
        (fn,) = plan.model_structure.program.functions.values()
        ops = list(fn.block.operations)
        sp_ops = [op for op in ops if _op_kind(op.operator_name) == "softplus"]
        assert len(sp_ops) == 1, f"expected exactly 1 softplus op, got {len(sp_ops)}"
        usage = plan.get_compute_device_usage_for_mlprogram_operation(sp_ops[0])
        device_name = (
            type(usage.preferred_compute_device).__name__ if usage else "unknown"
        )
        assert usage is not None and isinstance(
            usage.preferred_compute_device, MLNeuralEngineComputeDevice
        ), f"softplus routed to {device_name}; this repro requires NE"

        # Sweep x in [-15, 15] @ 2000 points, capture softplus output at each x.
        out_name = loaded.get_spec().description.output[0].name
        N = 2000
        xs = np.linspace(-15.0, 15.0, N, dtype=np.float32)
        ys = np.empty(N, dtype=np.float32)
        for i, xi in enumerate(xs):
            inp = np.full((1, 1, SPATIAL, SPATIAL), float(xi), dtype=np.float32)
            ys[i] = loaded.predict({"x": inp})[out_name].flat[0]

        sp_ref = np.maximum(xs, 0) + np.log1p(np.exp(-np.abs(xs)))
        # Cliff: NE output collapses to ~0 while fp32 ref is large.
        cliff_idx = np.where((sp_ref > 5.0) & (ys < 1.0))[0]
        if cliff_idx.size:
            i = cliff_idx[0]
            print(
                f"CLIFF: x={xs[i]:.4f}  ne_out={ys[i]:.4f}  fp32_ref={sp_ref[i]:.4f}"
            )
        else:
            print("No cliff observed — please check NE actually engaged.")


if __name__ == "__main__":
    main()
```

Expected output on an Apple-silicon Mac:

```
CLIFF: x=10.4077  ne_out=0.0000  fp32_ref=10.4077
```

**Expected behavior**

`softplus(x) = log(1 + exp(x))` is mathematically continuous and monotonically increasing. For x in [10, 11], the fp32 reference values are ≈ 10.000–11.000 (already in the asymptotic regime where `softplus(x) ≈ x`: the difference `log1p(exp(-x))` is below 5 × 10⁻⁵ there, far smaller than the fp16 spacing of 2⁻⁷ ≈ 0.0078 at these magnitudes). Output should remain a smooth monotonic function across the full input range.

**Actual behavior**

Output drops from ≈10.39 to 0.0 in a single grid step: the first collapsed grid point is x ≈ 10.4077 (2000-point linear sweep over [-15, 15], so step size ≈ 0.015). The transition is hard: adjacent grid points show ≈10.39 → 0.00. Increasing sweep resolution does not soften the transition; it only locates it more precisely.
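
Because the transition is a single hard step, it can be located by bisection instead of a dense sweep. The sketch below is illustrative only: `locate_cliff` and `fake_ne_softplus` are hypothetical names, and the stand-in function merely imitates the observed behavior so the example runs without NE hardware. On a real device, `f` would wrap a call such as `loaded.predict(...)` from the repro.

```python
from typing import Callable, Tuple


def locate_cliff(
    f: Callable[[float], float],
    lo: float,
    hi: float,
    tol: float = 1e-3,
    collapsed_below: float = 1.0,
) -> Tuple[float, float]:
    """Shrink [lo, hi] to width <= tol around a single healthy -> collapsed step.

    Assumes f(lo) is healthy (>= collapsed_below), f(hi) is collapsed, and
    there is exactly one transition inside the bracket.
    """
    if f(lo) < collapsed_below or f(hi) >= collapsed_below:
        raise ValueError("bracket must straddle the cliff")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < collapsed_below:
            hi = mid  # mid is already collapsed: cliff is at or below mid
        else:
            lo = mid  # mid is still healthy: cliff is above mid
    return lo, hi


def fake_ne_softplus(x: float) -> float:
    """Hypothetical stand-in for the NE output: ~x, then 0 past the cliff."""
    return 0.0 if x >= 10.3945 else x


lo, hi = locate_cliff(fake_ne_softplus, 10.0, 11.0)
print(f"cliff bracketed in [{lo:.4f}, {hi:.4f}]")
```

Inputs are rounded to fp16 before reaching the op, and fp16 spacing near 10.4 is 2⁻⁷ ≈ 0.0078, so a bracket narrower than that mostly reflects how the input rounds rather than extra resolution in the kernel.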

The cliff is specific to softplus, not the surrounding ops. Figure: each panel plots an op's NE-fp16 output against the fp32 reference computed on the *actual NE upstream input that op received* (not on a clean fp32 chain). This isolates "introduces error" from "propagates upstream error":

- **Panel 1 — softplus.** Reference: `fp32 softplus(x)`. Visible gap at x ≈ 10.4 — the NE line collapses to 0 while the fp32 line continues to ≈ 10.4. Inset zooms x ∈ [10, 11] showing the discontinuity sharply.
- **Panel 2 — tanh.** Reference: `np.tanh(NE_softplus_output)`. Byte-for-byte overlap across the full range — including past x = 10.4 where both lines go to 0 together.
- **Panel 3 — mul (Mish output).** Reference: `x · NE_tanh_output`. Byte-for-byte overlap across the full range.

Tanh and mul therefore compute correctly *given the broken inputs they receive* from softplus. They propagate the cliff; they do not introduce it.

<img width="2250" height="750" alt="Image" src="https://github.com/user-attachments/assets/fd24e139-1f69-407e-bdab-a5e55b70b81e" />
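
The attribution rule behind the three panels can be sketched in NumPy. This is a minimal illustration, not the script used to produce the figure: `attribute_errors` and the array names are hypothetical, and the "NE" arrays below are synthetic (a softplus that collapses to 0 past x = 10.395, followed by exact tanh and mul) so the example runs anywhere.

```python
import numpy as np


def attribute_errors(x, ne_softplus, ne_tanh, ne_mul):
    """Max abs error of each op against a reference built from the input
    that op actually received, so upstream errors are not double-counted."""
    ref_softplus = np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))
    return {
        "softplus": float(np.max(np.abs(ne_softplus - ref_softplus))),
        "tanh": float(np.max(np.abs(ne_tanh - np.tanh(ne_softplus)))),
        "mul": float(np.max(np.abs(ne_mul - x * ne_tanh))),
    }


# Synthetic stand-in data (ASSUMPTION: not real NE measurements).
x = np.linspace(-15.0, 15.0, 2001)
clean = np.maximum(x, 0) + np.log1p(np.exp(-np.abs(x)))
ne_softplus = np.where(x >= 10.395, 0.0, clean)  # injected cliff
ne_tanh = np.tanh(ne_softplus)                   # correct given its input
ne_mul = x * ne_tanh                             # correct given its input

errs = attribute_errors(x, ne_softplus, ne_tanh, ne_mul)
print(errs)
```

With this construction only the `softplus` entry is large; `tanh` and `mul` report zero error even though their outputs are wrong in absolute terms, which is the distinction between introducing and propagating the cliff.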

## System environment (please complete the following information):
- **macOS:** 26.3.1
- **Hardware:** Apple M3 Max
- **coremltools:** 9.0
- **PyTorch:** 2.11.0 *(note: coremltools 9.0 documents tested-up-to torch 2.7.0; cliff appears unrelated to this — same behavior was reproduced under older toolchain versions during research)*
- **NumPy:** 2.1.3
- **Python:** 3.11.15

## Additional context

**Workaround**

A drop-in replacement using a numerically-stable softplus identity eliminates the cliff while keeping every elementary op on NE:

```python
def softplus_safe(x):
    # softplus(x) = relu(x) + log1p(exp(-|x|))
    return torch.relu(x) + torch.log1p(torch.exp(-torch.abs(x)))
```

All sub-ops (`relu`, `abs`, `mul`, `exp`, `add`, and `log` with `epsilon=1.0`, which is how `log1p` is lowered) route to NE and produce continuous output across [-15, 15]. The cliff at x ≈ 10.4 disappears:

<img width="1500" height="750" alt="Image" src="https://github.com/user-attachments/assets/882a28c4-85c1-4279-8fdb-8b9f1b226f22" />
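
For models that use `nn.Mish` as a module, one way to apply the identity is to swap those modules before conversion. The sketch below is an assumption about how that could be wired up, not something validated in this report: `SafeMish` and `replace_mish` are hypothetical names, only `nn.Mish` module instances are handled (functional `F.mish` / `F.softplus` calls are not), and NE routing of the rewritten graph still has to be checked on the target chip.

```python
import torch
import torch.nn as nn


def softplus_safe(x):
    # softplus(x) = relu(x) + log1p(exp(-|x|))
    return torch.relu(x) + torch.log1p(torch.exp(-torch.abs(x)))


class SafeMish(nn.Module):
    """Mish built on softplus_safe: x * tanh(softplus(x))."""

    def forward(self, x):
        return x * torch.tanh(softplus_safe(x))


def replace_mish(module: nn.Module) -> nn.Module:
    """Recursively replace nn.Mish children with SafeMish, in place."""
    for name, child in list(module.named_children()):
        if isinstance(child, nn.Mish):
            setattr(module, name, SafeMish())
        else:
            replace_mish(child)
    return module


net = replace_mish(nn.Sequential(nn.Linear(4, 4), nn.Mish()))
xs = torch.linspace(-15.0, 15.0, 101)
max_diff = (SafeMish()(xs) - nn.Mish()(xs)).abs().max().item()
print(f"max |SafeMish - nn.Mish| in fp32: {max_diff:.2e}")
```

The fp32 comparison only confirms the two formulations agree mathematically on CPU; it says nothing about fp16 behavior on NE.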


**Accuracy caveat.** Errors are highly asymmetric in x (measured on a 4001-point sweep):

- **Positive tail (x ∈ [3, 7]):** max relative error ≈ 9.5 × 10⁻⁴ — effectively exact.
- **Moderate negative (x ∈ [−7, −3]):** max relative error ≈ 0.40 at x ≈ −6.5, with NE *over*-estimating softplus (e.g. x = −6: ne = 3.05 × 10⁻³ vs ref = 2.48 × 10⁻³). Mechanism is fp16 ULP rounding of `1 + small` to the nearest representable value above 1 (i.e. `1 + k·2⁻¹⁰` for some k ≥ 1) inside `log1p`, not subnormal flushing.
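- **Far negative (x ≲ −8):** NE returns hard 0. This is consistent with the same rounding mechanism rather than with `exp` underflow: once `exp(-|x|)` drops below half an fp16 ULP of 1 (2⁻¹¹ ≈ 4.9 × 10⁻⁴, i.e. |x| ≳ 7.62), `1 + exp(-|x|)` rounds to exactly 1 and `log` returns 0, whereas `exp(-|x|)` itself stays a normal fp16 value until |x| ≈ 9.7. Relative error is 1 by definition, but absolute error remains < 3.4 × 10⁻⁴.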
- **Peak relative error overall:** 1.12 at x ≈ −7.63 (transition between the two negative regimes).

Net: absolute error stays bounded (≤ 4.2 × 10⁻³) across the full range. The positive tail is essentially exact; the negative tail loses precision but the absolute miss is small enough for activation use. If you need precision at moderately negative x, prefer a different identity.
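
The negative-tail behavior can be explored on CPU by emulating the identity with every intermediate rounded to fp16. This is a sketch under a stated assumption (plain IEEE round-to-nearest fp16 after each op); the real NE kernels may round differently, so the numbers are not expected to match the measurements above digit for digit. The function names are hypothetical.

```python
import numpy as np


def softplus_ref(x: float) -> float:
    """Float64 reference via the same stable identity."""
    return max(x, 0.0) + float(np.log1p(np.exp(-abs(x))))


def softplus_safe_fp16(x: float) -> float:
    """relu(x) + log(1 + exp(-|x|)) with each op's result rounded to fp16.

    ASSUMPTION: IEEE fp16 round-to-nearest per op; not a model of NE internals.
    """
    xh = np.float16(x)
    e = np.float16(np.exp(-abs(float(xh))))  # exp(-|x|), bounded in (0, 1]
    s = np.float16(1.0 + float(e))           # lossy step: 1 + small
    log_s = np.float16(np.log(float(s)))     # exactly 0 once s rounds to 1
    return float(np.float16(max(float(xh), 0.0) + float(log_s)))


for x in (-6.0, -7.5, -8.0, 5.0):
    print(
        f"x={x:+.1f}  fp16_emulated={softplus_safe_fp16(x):.3e}  "
        f"ref={softplus_ref(x):.3e}"
    )
```

Under this assumption the emulation shows the same qualitative regimes: an over-estimate at moderately negative x (where `1 + small` rounds up to the next fp16 value above 1) and a hard 0 once `exp(-|x|)` falls below 2⁻¹¹.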

**Why this works.** The cliff is a dynamic-range failure inside the NE lowering of `softplus`. We cannot confirm the exact mechanism without access to NE internals, but the cliff position is highly suggestive: a fine 251-point sweep on [10.30, 10.55] places the transition between x = 10.394 (ne ≈ 10.39) and x = 10.395 (ne = 0.0). In fp16, 10.394 rounds to 10.3906 and 10.395 rounds to 10.3984, and `log(2^15) = 10.39721` falls between the two. That points to a 2^15-bounded internal representation rather than fp16's full 2^16 range (naïve `log(1 + exp(x))` would not overflow fp16 until x ≳ `log(65504) ≈ 11.09`). The repro's 2000-point sweep reports x ≈ 10.4077, but that is a grid artifact; the underlying transition is at ≈ 10.395. Either way, the safe identity uses `exp(-|x|)`, which is bounded in (0, 1] and cannot overflow regardless of the internal precision used, so its correctness does not depend on the exact mechanism.
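
The arithmetic behind the 2^15 observation can be checked with NumPy's `float16`. This only verifies how the two sweep inputs round and where they sit relative to `log(2^15)`; it does not demonstrate anything about NE internals.

```python
import math

import numpy as np

LOG_2_15 = 15 * math.log(2.0)     # ≈ 10.39721
LOG_FP16_MAX = math.log(65504.0)  # ≈ 11.09, where exp(x) exceeds fp16 max

# The two neighboring fine-sweep inputs, as fp16 would represent them.
last_good = float(np.float16(10.394))  # 10.390625
first_bad = float(np.float16(10.395))  # 10.3984375

print(f"fp16(10.394) = {last_good}")
print(f"log(2^15)    = {LOG_2_15:.5f}")
print(f"fp16(10.395) = {first_bad}")
print(f"straddles log(2^15): {last_good < LOG_2_15 < first_bad}")
print(f"log(65504)   = {LOG_FP16_MAX:.4f}")
```

The two inputs land on adjacent fp16 values (spacing 2⁻⁷), so `exp(x)` crossing 2^15 is the only power-of-two boundary between the last healthy and first collapsed input.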

A converter-side fix could lower `nn.Softplus` / the MIL `softplus` op to this identity when targeting NE. We have not validated that across all softplus call sites; a maintainer should confirm scope.

**Reproducibility on other Apple Silicon**

The repro above was developed on M3 Max. NE op-acceptance rules differ across silicon generations, so the routing precondition does not hold uniformly.

**M5 (community report, 2026-05-07):** This repro does **not** trigger the NE precondition on M5 — `softplus` is routed to `MLCPUComputeDevice`, so the assertion fails before the cliff scan runs:

```
AssertionError: softplus routed to MLCPUComputeDevice; this repro requires NE
```

However, an M5 user confirmed `mish` (which internally uses `softplus`) does route to NE at specific shapes. Sweeping a `(spatial, channel)` matrix found ALL_NE routing at:

*   `s=16, c=64`
*   `s=32, c ∈ {16, 32, 64}`
*   `s=64, c ∈ {4, 8, 16}`

At `s=32, c=32`, the M5 NE produces an analogous fp16 cliff:

```
CLIFF: x=10.4077  ne_out=0.0000  fp32_ref=10.4077
```

This suggests the underlying fp16 limit in the NE softplus kernel at x ≈ 10.4 is **consistent across silicon generations**, but whether the Core ML compiler routes a standalone `softplus` to NE differs between M3 Max and M5.

> **Note on the workaround:** The `relu(x) + log1p(exp(-|x|))` identity above was validated on M3 Max. Because NE routing rules differ on M5 (and likely on future silicon), the workaround's NE coverage and continuity should be **re-tested per chip generation** — confirming both that all sub-ops still route to NE on the target chip, and that no analogous cliff appears in the rewritten graph.

## Related

- #2618 — a separate PR proposing a fix for softplus precision on NE. An attribution-data comment supporting its premise has also been drafted.

---

Drafted by Claude Opus 4.7 and reviewed, confirmed, rephrased, and edited by Me.
