feat: ANE for Mac/iPhone support by Dorianhgn · Pull Request #943 · state-spaces/mamba

Dorianhgn · 2026-05-07T10:00:57Z

feat: ANE for Mac/iPhone support

Summary

Introduces ANE (Apple Neural Engine) support for Mamba-3 SISO on Mac and iPhone. This PR includes:

mamba_ane package: ANE-native Mamba3 port optimized for Apple Silicon
Parity tests: Comprehensive GPU and ANE numerical equivalence validation
CoreML integration: Clean PyTorch → CoreML conversion with verified numerical stability
Documentation: Interactive visualization and usage guides

Key Metrics

Metric	Value	Status
ANE Parity Test	PASS	✅
max_abs error	2.102e-04	✅ (< 0.03)
cosine_sim_min	0.999998	✅ (> 0.999)
Model size (StatefulMambaHybrid1D)	16.99M params	✅
Neural Engine util	~100%	✅
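
The two numeric rows above can be reproduced with a few lines of NumPy. This is a minimal sketch, not the code in `tests/parity_tests/parity_lib.py`: the function names `parity_metrics` and `passes` are hypothetical, and it assumes `cosine_sim_min` means the minimum per-sample cosine similarity over the batch. The thresholds (0.03 and 0.999) are the ones quoted in the table.

```python
import numpy as np


def parity_metrics(reference, candidate):
    """Return (max_abs, cosine_sim_min) for two arrays shaped (batch, ...)."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    # Flatten everything except the batch axis so each sample is one vector.
    ref = ref.reshape(ref.shape[0], -1)
    cand = cand.reshape(cand.shape[0], -1)

    # Largest element-wise deviation across the whole batch.
    max_abs = float(np.max(np.abs(ref - cand)))

    # Per-sample cosine similarity; the guard avoids dividing by zero.
    dot = np.sum(ref * cand, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(cand, axis=1)
    cosine = dot / np.maximum(norms, 1e-12)
    return max_abs, float(cosine.min())


def passes(max_abs, cosine_sim_min, max_abs_tol=0.03, cosine_tol=0.999):
    """Apply the thresholds quoted in the Key Metrics table."""
    return max_abs < max_abs_tol and cosine_sim_min > cosine_tol


# Synthetic stand-in for model outputs: an FP32 tensor and its FP16 round trip.
rng = np.random.default_rng(0)
reference = rng.standard_normal((4, 256)).astype(np.float32)
candidate = reference.astype(np.float16).astype(np.float32)

max_abs, cos_min = parity_metrics(reference, candidate)
print(f"max_abs={max_abs:.3e} cosine_sim_min={cos_min:.6f} pass={passes(max_abs, cos_min)}")
```

Taking the minimum rather than the mean cosine similarity means a single badly diverging sample fails the check even when the rest of the batch matches.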

Changes

mamba_ane/: ANE-native Mamba3 implementation
tests/parity_tests/: GPU and ANE parity validation
mamba_ane/README.md: Installation, architecture, and usage guide
mamba_ane/docs/mamba3_siso_viz.html: Interactive architecture visualization

Testing

All parity tests pass with excellent numerical margins:

GPU ↔ ANE equivalence confirmed (FP32, FP16)
PyTorch ↔ CoreML equivalence confirmed (FP16)
Numerical differences at float32 rounding noise level

References

See #942 for Neural Engine utilization findings and performance observations.

- mamba_ane/modules/mamba3.py: RMSNormANE + MambaBlock, ANE-compatible ops (conv1d, einsum) replacing Triton/CUDA kernels
- mamba_ane/models/hybrid1d.py: Hybrid1DBackbone + StatefulMambaHybrid1D with stateful CoreML-friendly inference
- mamba_ane/utils/export.py: CoreML export utility with stateful I/O
- mamba_ane/requirements.txt, README.md: package docs and deps

- tests/parity_tests/og_model.py: OGStatefulMambaHybrid1D golden reference model loading weights from mamba-ssm
- tests/parity_tests/parity_lib.py: shared metrics (max_abs, cosine sim) and markdown report generation
- tests/parity_tests/test_impl_gpu.py: GPU parity (OG Mamba3 CUDA vs MambaBlock FP32/FP16); PASS, max_abs=6e-06
- tests/parity_tests/test_ane_mac.py: ANE parity (PyTorch vs CoreML CPU_AND_NE); PASS, max_abs=2.1e-04 (threshold 3e-02)
- tests/parity_tests/export_for_parity.py: CoreML export script
- parity_report_gpu.md, parity_report_ane.md: recorded parity results

- README covers env setup, architecture, parity test results
- References mamba3_siso_viz.html for detailed pipeline walkthrough
- Includes GPU→ANE and PyTorch→CoreML numerical equivalence results
- Documents module structure, usage, and testing procedures

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- IMPL (OG FP32 → Portable FP32): PASS, max_abs=4.69e-06
- PREC (FP32 → FP16): PASS, max_abs=3.55e-05
- MPS FP16 (vs OG FP32 golden): PASS, max_abs=3.63e-05
- CoreML CPU_AND_NE: PASS, max_abs=5.78e-04
- NaN check (64 inputs): PASS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…tion traps:

1. The cumsum trap: ANE does not have a native cumulative sum operator. Replacing it with a lower-triangular causal matrix multiplication is mathematically equivalent and uses the ANE's heavily optimized matmul engine.
2. The in-place assignment trap: ANE does not handle tensor mutation well (tensor[:, :-1] = ...). F.pad is a clean, functional alternative.
3. The FP16 overflow trap: -1e9 evaluates to -Inf in FP16, which turns exp(-Inf) into NaN on some Apple Silicon targets due to how zero-multiplication is handled. -1e4 stays finite in FP16, and exp(-1e4) safely underflows to 0.0.
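
The three workarounds can be sketched in a few lines. NumPy is used here as a stand-in for PyTorch so the example runs anywhere. The helper names (`cumsum_via_matmul`, `shift_right`, `fp16_mask_values`) are hypothetical and do not come from `mamba_ane`. The shift direction in the second helper is an assumption, since the commit message elides the right-hand side of the assignment. The last helper only illustrates IEEE half-precision arithmetic and makes no claim about how the ANE evaluates `exp` internally.

```python
import numpy as np


def cumsum_via_matmul(x):
    """Cumulative sum along the last axis, expressed as a single matmul."""
    length = x.shape[-1]
    # causal[i, j] = 1 when j <= i (lower-triangular matrix of ones).
    causal = np.tril(np.ones((length, length), dtype=x.dtype))
    # out[..., i] = sum_j causal[i, j] * x[..., j] = sum of x[..., :i + 1]
    return x @ causal.T


def shift_right(x):
    """Shift a (batch, length) array one step right with zero fill, no mutation.

    Replaces the in-place pattern `out[:, 1:] = x[:, :-1]`.
    The PyTorch equivalent would be F.pad(x[:, :-1], (1, 0)).
    """
    return np.pad(x[:, :-1], ((0, 0), (1, 0)))


def fp16_mask_values():
    """Compare -1e9 and -1e4 as additive mask constants in half precision."""
    with np.errstate(over="ignore", invalid="ignore"):
        big = np.array(-1e9).astype(np.float16)   # exceeds the FP16 range (max 65504)
        safe = np.array(-1e4).astype(np.float16)  # exactly representable in FP16
        nan_product = big * np.float16(0.0)       # -Inf * 0 is NaN under IEEE 754
        exp_safe = np.exp(safe)                   # underflows to zero
    return big, safe, nan_product, exp_safe


x = np.arange(1, 6, dtype=np.float32).reshape(1, 5)
print(cumsum_via_matmul(x))
print(shift_right(x))
print(fp16_mask_values())
```

The matmul form costs O(L²) multiply-adds for a sequence of length L instead of O(L) for a scan, so it trades arithmetic for staying on an operator the accelerator supports.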

Dorianhgn and others added 12 commits May 7, 2026 11:08

feat(ane): add Mamba3ParallelPortable — unchunked SSD parallel forward

14db7e4

feat(ane): add ParallelHybridBackbone + StatefulMambaParallelHybrid

40025aa

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(ane): export Mamba3ParallelPortable and StatefulMambaParallelHy…

3e7baa2

…brid

feat(ane): add export_parallel.py for StatefulMambaParallelHybrid Cor…

8a6d65f

…eML export

test: GPU parity test for StatefulMambaParallelHybrid

c047b0b

test: Mac/ANE CoreML parity test for StatefulMambaParallelHybrid

611b49d

fix(test): move og_raw to DEVICE in golden saving loop (CPU tensor → …

be43b03

…Triton crash)

Dorianhgn wants to merge 12 commits into state-spaces:main from Dorianhgn:mamba-ane
