feat: ANE for Mac/iPhone support #943

Draft
Dorianhgn wants to merge 12 commits into state-spaces:main from Dorianhgn:mamba-ane

@Dorianhgn

feat: ANE for Mac/iPhone support

Summary

Introduces ANE (Apple Neural Engine) support for Mamba-3 SISO on Mac and iPhone. This PR includes:

  • mamba_ane package: ANE-native Mamba3 port optimized for Apple Silicon
  • Parity tests: Comprehensive GPU and ANE numerical equivalence validation
  • CoreML integration: Clean PyTorch → CoreML conversion with verified numerical stability
  • Documentation: Interactive visualization and usage guides

Key Metrics

| Metric | Value | Status |
|---|---|---|
| ANE Parity Test | | PASS |
| max_abs error | 2.102e-04 | ✅ (< 0.03) |
| cosine_sim_min | 0.999998 | ✅ (> 0.999) |
| Model size (StatefulMambaHybrid1D) | 16.99M params | |
| Neural Engine util | ~100% | |
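
For readers unfamiliar with the two parity metrics, here is a minimal sketch of how they could be computed. It is not the code in tests/parity_tests/parity_lib.py: the names `parity_metrics` and `passes` are hypothetical, and taking the cosine similarity per batch element before reducing with a minimum is an assumption. Only the thresholds (0.03 and 0.999) come from the table above.

```python
import torch
import torch.nn.functional as F


def parity_metrics(reference: torch.Tensor, candidate: torch.Tensor) -> dict:
    """Compare two output tensors of shape (batch, ...) in float32."""
    ref = reference.detach().to(torch.float32)
    cand = candidate.detach().to(torch.float32)

    # Largest element-wise absolute difference over the whole tensor.
    max_abs = (ref - cand).abs().max().item()

    # Assumption: one cosine similarity per batch element, then the minimum.
    cos = F.cosine_similarity(ref.flatten(1), cand.flatten(1), dim=1)
    return {"max_abs": max_abs, "cosine_sim_min": cos.min().item()}


def passes(metrics: dict, max_abs_tol: float = 0.03, cosine_min: float = 0.999) -> bool:
    """Apply the thresholds quoted in the Key Metrics table."""
    return metrics["max_abs"] < max_abs_tol and metrics["cosine_sim_min"] > cosine_min


# Toy usage: a reference output and a copy perturbed by 1e-4.
ref = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
metrics = parity_metrics(ref, ref + 1e-4)
print(metrics, passes(metrics))
```

Comparing in float32 regardless of the input dtype keeps the metric itself from adding FP16 rounding on top of the difference being measured.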

Changes

  • mamba_ane/: ANE-native Mamba3 implementation
  • tests/parity_tests/: GPU and ANE parity validation
  • mamba_ane/README.md: Installation, architecture, and usage guide
  • mamba_ane/docs/mamba3_siso_viz.html: Interactive architecture visualization

Testing

All parity tests pass with excellent numerical margins:

  • GPU ↔ ANE equivalence confirmed (FP32, FP16)
  • PyTorch ↔ CoreML equivalence confirmed (FP16)
  • Numerical differences are at rounding-noise level for the precision in use: max_abs = 6e-06 for GPU parity and 2.1e-04 for ANE parity (threshold 3e-02)

References

See #942 for Neural Engine utilization findings and performance observations.

Dorianhgn and others added 12 commits May 7, 2026 11:08
- mamba_ane/modules/mamba3.py: RMSNormANE + MambaBlock, ANE-compatible
  ops (conv1d, einsum) replacing Triton/CUDA kernels
- mamba_ane/models/hybrid1d.py: Hybrid1DBackbone + StatefulMambaHybrid1D
  with stateful CoreML-friendly inference
- mamba_ane/utils/export.py: CoreML export utility with stateful I/O
- mamba_ane/requirements.txt, README.md: package docs and deps
- tests/parity_tests/og_model.py: OGStatefulMambaHybrid1D golden
  reference model loading weights from mamba-ssm
- tests/parity_tests/parity_lib.py: shared metrics (max_abs, cosine sim)
  and markdown report generation
- tests/parity_tests/test_impl_gpu.py: GPU parity (OG Mamba3 CUDA vs
  MambaBlock FP32/FP16) — PASS, max_abs=6e-06
- tests/parity_tests/test_ane_mac.py: ANE parity (PyTorch vs CoreML
  CPU_AND_NE) — PASS, max_abs=2.1e-04 (threshold 3e-02)
- tests/parity_tests/export_for_parity.py: CoreML export script
- parity_report_gpu.md, parity_report_ane.md: recorded parity results
- README covers env setup, architecture, parity test results
- References mamba3_siso_viz.html for detailed pipeline walkthrough
- Includes GPU→ANE and PyTorch→CoreML numerical equivalence results
- Documents module structure, usage, and testing procedures

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
IMPL (OG FP32 → Portable FP32): PASS  max_abs=4.69e-06
PREC (FP32 → FP16):              PASS  max_abs=3.55e-05
MPS FP16 (vs OG FP32 golden):   PASS  max_abs=3.63e-05
CoreML CPU_AND_NE:               PASS  max_abs=5.78e-04
NaN check (64 inputs):           PASS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion traps:

1. The cumsum trap: ANE does not have a native cumulative sum operator. Replacing it with a lower-triangular causal matrix multiplication is the exact mathematical equivalent and uses the ANE's heavily optimized matmul engine.

2. The in-place assignment trap: ANE hates tensor mutation (tensor[:, :-1] = ...). Using F.pad is a perfectly clean, functional alternative.

3. The FP16 overflow trap: -1e9 lies outside the FP16 range (largest finite magnitude 65504), so it becomes -Inf. exp(-Inf) can then turn into NaN on some Apple Silicon targets because of how zero-multiplication is handled. A mask value of -1e4 is representable in FP16, and exp(-1e4) underflows cleanly to 0.0 without leaving the finite range.
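
The three workarounds above can be sketched in plain PyTorch as follows. This is an illustration under stated assumptions, not the code from mamba_ane: the helper names `causal_cumsum`, `shift_right`, and `to_fp16_mask` are hypothetical, the right-shift direction is chosen only for the example, and everything runs on CPU, so it shows the arithmetic rather than actual ANE behaviour.

```python
import torch
import torch.nn.functional as F


def causal_cumsum(x: torch.Tensor) -> torch.Tensor:
    """Cumulative sum over the last axis, expressed as a matmul."""
    length = x.shape[-1]
    # tril[t, j] == 1 when j <= t, so row t selects every step up to t.
    tril = torch.tril(torch.ones(length, length, dtype=x.dtype, device=x.device))
    # (x @ tril.T)[..., t] = sum over j <= t of x[..., j]
    return x @ tril.transpose(0, 1)


def shift_right(x: torch.Tensor) -> torch.Tensor:
    """Delay the last axis by one step with zero fill, without in-place writes."""
    # Mutating form this replaces:
    #   out = torch.zeros_like(x); out[..., 1:] = x[..., :-1]
    return F.pad(x[..., :-1], (1, 0))


def to_fp16_mask(value: float) -> torch.Tensor:
    """Cast a float32 mask constant to FP16, as a half-precision export would."""
    return torch.tensor([value], dtype=torch.float32).to(torch.float16)


x = torch.arange(1.0, 7.0).reshape(2, 3)

# Trap 1: the matmul form matches torch.cumsum.
print(torch.allclose(causal_cumsum(x), torch.cumsum(x, dim=-1)))

# Trap 2: functional shift, no tensor is written in place.
print(shift_right(x))

# Trap 3: -1e9 overflows to -inf in FP16, while -1e4 stays finite.
print(to_fp16_mask(-1e9), to_fp16_mask(-1e4))

# An infinite mask value multiplied by zero is NaN under IEEE 754 rules.
print(to_fp16_mask(-1e9).float() * 0.0)
```

The triangular matrix costs O(L²) memory in the sequence length L, so this trade is most attractive when the sum runs over short chunks rather than a full long sequence.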