feat: ANE for Mac/iPhone support #943
Draft
Dorianhgn wants to merge 12 commits into
- mamba_ane/modules/mamba3.py: RMSNormANE + MambaBlock, ANE-compatible ops (conv1d, einsum) replacing Triton/CUDA kernels
- mamba_ane/models/hybrid1d.py: Hybrid1DBackbone + StatefulMambaHybrid1D with stateful CoreML-friendly inference
- mamba_ane/utils/export.py: CoreML export utility with stateful I/O
- mamba_ane/requirements.txt, README.md: package docs and deps
- tests/parity_tests/og_model.py: OGStatefulMambaHybrid1D golden reference model loading weights from mamba-ssm
- tests/parity_tests/parity_lib.py: shared metrics (max_abs, cosine sim) and markdown report generation
- tests/parity_tests/test_impl_gpu.py: GPU parity (OG Mamba3 CUDA vs MambaBlock FP32/FP16): PASS, max_abs=6e-06
- tests/parity_tests/test_ane_mac.py: ANE parity (PyTorch vs CoreML CPU_AND_NE): PASS, max_abs=2.1e-04 (threshold 3e-02)
- tests/parity_tests/export_for_parity.py: CoreML export script
- parity_report_gpu.md, parity_report_ane.md: recorded parity results
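The two shared metrics named above can be sketched as follows. This is a minimal, dependency-free illustration and not the contents of parity_lib.py: the function names, the flat-list inputs, and the `check_parity` helper are assumptions made for the example. Only the metric names and the 3e-02 threshold come from the commit message.

```python
import math

def max_abs(ref, test):
    """Largest element-wise absolute difference between two equal-length sequences."""
    return max(abs(r - t) for r, t in zip(ref, test))

def cosine_sim(ref, test):
    """Cosine similarity of two equal-length sequences treated as vectors."""
    dot = sum(r * t for r, t in zip(ref, test))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(t * t for t in test))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm

def check_parity(ref, test, threshold=3e-2):
    """Hypothetical helper: compare outputs and report PASS/FAIL against a max_abs threshold."""
    err = max_abs(ref, test)
    return {"max_abs": err, "cosine": cosine_sim(ref, test), "passed": err <= threshold}

if __name__ == "__main__":
    golden = [1.0, 2.0, 3.0]       # stand-in for reference model outputs
    candidate = [1.0, 2.0002, 3.0]  # stand-in for the ported model's outputs
    print(check_parity(golden, candidate))
```

In a real parity test the two sequences would be the flattened outputs of the reference and ported models on the same input.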
- README covers env setup, architecture, parity test results
- References mamba3_siso_viz.html for detailed pipeline walkthrough
- Includes GPU→ANE and PyTorch→CoreML numerical equivalence results
- Documents module structure, usage, and testing procedures

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- IMPL (OG FP32 → Portable FP32): PASS max_abs=4.69e-06
- PREC (FP32 → FP16): PASS max_abs=3.55e-05
- MPS FP16 (vs OG FP32 golden): PASS max_abs=3.63e-05
- CoreML CPU_AND_NE: PASS max_abs=5.78e-04
- NaN check (64 inputs): PASS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion traps:
1. The cumsum trap: ANE has no native cumulative-sum operator. Multiplying by a lower-triangular causal matrix of ones is mathematically equivalent and runs on the ANE's heavily optimized matmul engine.
2. The in-place assignment trap: ANE does not handle tensor mutation well (tensor[:, :-1] = ...). F.pad is a clean, functional alternative.
3. The FP16 overflow trap: -1e9 is outside the FP16 range and becomes -Inf, and the resulting exp(-Inf) turns into NaN on some Apple Silicon targets due to how zero-multiplication is handled. With -1e4, which FP16 can represent, exp(-1e4) underflows cleanly to 0.0.
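The three workarounds can be illustrated in plain Python, kept dependency-free so the arithmetic is easy to check. This is a sketch under stated assumptions, not the code from the PR: the function names are hypothetical, the shift direction is chosen arbitrarily for the example, and lists stand in for tensors. In PyTorch the triangular matrix would typically come from something like `torch.tril(torch.ones(T, T))`, and the shift from `F.pad`.

```python
import math
import struct

def causal_cumsum(x):
    """Cumulative sum expressed as L @ x, where L is a lower-triangular matrix of ones."""
    n = len(x)
    lower = [[1.0 if j <= i else 0.0 for j in range(n)] for i in range(n)]
    # Row i of L sums x[0..i], which is exactly the i-th cumulative sum.
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in lower]

def shift_right(x, fill=0.0):
    """Functional shift: build a new padded sequence instead of writing into a slice."""
    if not x:
        return []
    return [fill] + list(x[:-1])

def fits_fp16(value):
    """True if value lies in the finite FP16 range (struct's 'e' format is IEEE half precision)."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        return False

if __name__ == "__main__":
    print(causal_cumsum([1.0, 2.0, 3.0, 4.0]))  # [1.0, 3.0, 6.0, 10.0]
    print(shift_right([1.0, 2.0, 3.0]))         # [0.0, 1.0, 2.0]
    print(fits_fp16(-1e9), fits_fp16(-1e4))     # False True
    print(math.exp(-1e4))                       # 0.0
```

The matrix form costs O(T²) multiply-adds instead of O(T) additions, which is the trade the commit accepts in exchange for staying on the matmul path. `fits_fp16` only checks the finite range; the NaN behaviour described in item 3 is hardware-specific and is not reproduced here.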
feat: ANE for Mac/iPhone support
Summary
Introduces ANE (Apple Neural Engine) support for Mamba-3 SISO on Mac and iPhone. This PR includes:
Key Metrics
Changes
Testing
All parity tests pass well inside their thresholds (for example, ANE parity max_abs=2.1e-04 against a 3e-02 threshold):
References
See #942 for Neural Engine utilization findings and performance observations.