## Summary
Closes most of the LFM2.5-8B-A1B quantized single-stream decode gap to
oMLX by fixing the default decode path and extending the compiled C++
path to quantized weights. Two commits:
- **`fb234575`: quantized defaults to flat decode (~1.84×) + paged
prefill-tps telemetry fix**
- **`46760077`: quantized compiled flat+paged decode (compiled-paged
~2.34× over eager-paged)**
### #1 — quantized single-stream defaults to FLAT decode (~1.84×)
Quantized `lfm2`/`lfm2_moe` was silently defaulting to the
**eager-PAGED** loop (~12 `synchronize_mlx()`/token + blocking
`y.eval()` + no async double-buffering), ~1.84× slower than FLAT on the
measured mxfp8 8B-A1B (74 → 131 tok/s, M5 Max). The default is now keyed
on the authoritative `.scales` tensor signal: quantized → FLAT, bf16 →
PAGED (unchanged). Explicit `use_block_paged_cache` in `config.json`
always wins.
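
As a rough illustration of the precedence described above, here is a minimal sketch. The names `DecodeMode`, `has_scales_tensor`, and `resolve_decode_mode` are hypothetical and are not taken from the diff; the sketch also assumes that an explicit `use_block_paged_cache: false` selects FLAT.

```rust
/// Decode loop variants discussed in this PR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DecodeMode {
    Flat,
    Paged,
}

/// Hypothetical helper: treat a checkpoint as quantized when any tensor
/// name ends in `.scales`.
pub fn has_scales_tensor<'a>(tensor_names: impl IntoIterator<Item = &'a str>) -> bool {
    tensor_names
        .into_iter()
        .any(|name| name.ends_with(".scales"))
}

/// Resolve the decode mode. An explicit `use_block_paged_cache` value
/// always wins; otherwise quantized -> Flat and bf16 -> Paged.
pub fn resolve_decode_mode(use_block_paged_cache: Option<bool>, quantized: bool) -> DecodeMode {
    match use_block_paged_cache {
        Some(true) => DecodeMode::Paged,
        Some(false) => DecodeMode::Flat, // assumption: explicit false means flat
        None if quantized => DecodeMode::Flat,
        None => DecodeMode::Paged,
    }
}
```

Keeping the tensor scan separate from the resolution step means the same quantized signal can feed both the default and any registration gate, which is the consistency this change is after.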
### #4 — paged prefill-tps telemetry fix
The paged path reported a bogus ~37 `prefillTokensPerSecond`: on warm
prefix-cache hits it divided the attention *suffix* token count by the
full-prompt TTFT. It now uses the full-prompt count as the numerator;
guarded by `lfm2_paged_prefill_tps_is_full_prompt_scale_on_warm_reuse`.
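
The corrected metric can be sketched as below. This is a simplified stand-in rather than the code in the diff: `prefill_tokens_per_second` is a hypothetical name, and the token counts used in the note that follows are made up for illustration.

```rust
use std::time::Duration;

/// Prefill throughput in tokens per second.
///
/// The numerator is the full prompt token count, so it is on the same
/// scale as a TTFT that covers the full prompt. Returns `None` for a
/// zero duration to avoid dividing by zero.
pub fn prefill_tokens_per_second(full_prompt_tokens: usize, ttft: Duration) -> Option<f64> {
    let secs = ttft.as_secs_f64();
    if secs <= 0.0 {
        return None;
    }
    Some(full_prompt_tokens as f64 / secs)
}
```

With an illustrative 1000-token prompt and a 500 ms TTFT this yields 2000 tok/s. Passing only a 20-token attention suffix with the same TTFT would yield 40 tok/s, which is the kind of scale mismatch the fix removes.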
### #2 — quantized compiled flat+paged decode (~2.34× over eager-paged)
Extends the compiled C++ decode path (previously bf16-only) to quantized
`lfm2`/`lfm2_moe`. A per-projection quant-info registry
(`mlx_store_quant_info`, keyed on each `.scales` prefix) makes the C++
`(mode, bits, group_size)` dispatch **authoritative** instead of the
companion-tensor heuristic (which mislabels mxfp4/nvfp4 as mxfp8); the
heuristic is retained only as a fallback. Compiled-PAGED is ~2.34× over
eager-PAGED, rescuing the pinned-paged quant path (e.g. server/batched).
A packed embedding (`embed_tokens.scales`) bars the compiled path (C++
does a dense `take`). Env escape hatch:
`MLX_LFM2_DISABLE_QUANT_COMPILED`.
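
The registry-first dispatch and the gate can be pictured with the Rust-side sketch below. It is not the actual `mlx_store_quant_info` FFI surface: `QuantMode`, `QuantInfo`, `QuantRegistry`, and `quant_compiled_allowed` are hypothetical names, and treating any set value of the env var as "disabled" is an assumption.

```rust
use std::collections::HashMap;

/// Quantization modes mentioned in this PR (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantMode {
    Affine,
    Mxfp4,
    Mxfp8,
    Nvfp4,
}

/// The `(mode, bits, group_size)` triple used for dispatch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QuantInfo {
    pub mode: QuantMode,
    pub bits: u8,
    pub group_size: u32,
}

/// Per-projection registry keyed on the prefix of each `.scales` tensor.
#[derive(Default)]
pub struct QuantRegistry {
    by_prefix: HashMap<String, QuantInfo>,
}

impl QuantRegistry {
    /// Record quant info for a tensor named `<prefix>.scales`.
    /// Returns false (and stores nothing) for any other tensor name.
    pub fn store(&mut self, scales_name: &str, info: QuantInfo) -> bool {
        match scales_name.strip_suffix(".scales") {
            Some(prefix) => {
                self.by_prefix.insert(prefix.to_string(), info);
                true
            }
            None => false,
        }
    }

    /// Registry entry wins; the heuristic closure runs only on a miss.
    pub fn resolve(&self, prefix: &str, heuristic: impl FnOnce() -> QuantInfo) -> QuantInfo {
        self.by_prefix.get(prefix).copied().unwrap_or_else(heuristic)
    }
}

/// Gate for the quantized compiled path: a packed embedding or a set
/// escape-hatch env var (passed in as `disable_env`) blocks it.
pub fn quant_compiled_allowed(has_packed_embedding: bool, disable_env: Option<&str>) -> bool {
    !has_packed_embedding && disable_env.is_none()
}
```

A caller would pass `std::env::var("MLX_LFM2_DISABLE_QUANT_COMPILED").ok().as_deref()` as `disable_env`. Taking the value as a parameter instead of reading the environment inside the function keeps the gate easy to unit test.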
## Correctness
Byte-identical to the pure-Rust eager path across **{mxfp8, 4-bit
affine} × {flat, paged}**, proven via the model-id **eviction oracle**
in `lfm2_compiled_e2e.rs` (`quant_compiled_vs_eager_parity`): loading
the compiled model evicts the eager-ref's process-global weights, so the
eager-ref runs the *independent*
`QuantizedLinear`/`QuantizedSwitchLinear` modules — a C++ dispatch
mislabel would diverge early. This is stronger than a same-graph
`MLX_NO_COMPILE` reference.
## Perf context
This is the **quantized** path — the relevant one for oMLX's 8-bit
headline. Separately verified this session: for **bf16**, our decode
(~110 tok/s) is at **exact op-for-op parity with mlx-lm** and is
**memory-bandwidth-bound** (MoE gather already saturates ~404 GB/s at
the k=4 decode shape, ~80% of the M5 Max ceiling); the residual bf16 gap
to oMLX is host/measurement, not software. The real lever for absolute
decode speed is reducing bytes-per-token (quantization) — which is
exactly what these changes make fast.
## Test plan
- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] 30 unit tests pass (`cargo test -p mlx-core`, incl. the
compiled-registration gate tests)
- [x] Byte-identical parity matrix (mxfp8/4-bit × flat/paged) via the
eviction oracle (opt-in: `LFM2_COMPILED_E2E=1` +
`LFM2_QUANT_MODEL_PATH`)
- [x] `yarn build:native` clean; no `index.d.cts` drift
## Review status
The mandated `codex:adversarial-review` runtime **hung twice** mid
quant-dispatch cross-reference (a codex-runtime issue, not a code
signal). A thorough Claude-subagent adversarial review cleared it **SHIP
/ no blocking bug** — verifying dispatch parity for every projection
class (MoE experts, router gate, dense-MLP, attention q/k/v/out, conv,
untied lm_head) and ruling out the truncated codex concern on all three
plausible completions (packed-embedding guard, registry-authoritative
quant modes, pre-existing flat bf16 invariant).
**Deferred follow-ups (non-blocking):**
- [Medium] Synthetic non-gated quantized parity test (parity is
currently operator-verified via `LFM2_COMPILED_E2E=1`; the synthetic
harness only generates bf16 weights, and the completeness
`debug_assert_eq!` is compiled out in release).
- [Low] `mlx_store_weight` transposes packed 2D quant `.weight` tensors
into `g_weight_transposes`, which is never read (pre-existing waste that
this change surfaced, not introduced).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes default decode routing and global compiled weight registration
for quantized LFM2, where incorrect quant dispatch or gating would
affect correctness and performance; mitigated by expanded unit tests and
documented escape hatches.
>
> **Overview**
> **LFM2 load and decode routing** now treat quantized checkpoints
differently: when `use_block_paged_cache` is unset, presence of
`.scales` tensors defaults to **flat** decode (instead of paged), with
resolution moved from `parse_config` to `load_from_dir` so it matches
the registration gate. Explicit `config.json` values still win.
>
> **Quantized models can use the compiled C++ path** (flat and paged):
registration publishes per-projection quant info via
`mlx_store_quant_info`, `should_register_compiled` and
`paged_compiled_decode_setup` use `non_quant_floats_bf16` plus
`MLX_LFM2_DISABLE_QUANT_COMPILED`, and packed `embed_tokens` blocks
compiled registration because the C++ path does a dense embedding
lookup.
>
> **Paged chat performance metrics** use the full prompt token count for
prefill throughput (conv layers re-run the full prompt), fixing inflated
TTFT/prefill-tps on warm prefix-cache hits.
>
> Most other diff hunks are **comment and docstring cleanup**
(phase/W6/PR ticket references removed); behavior in convert, MTP,
Qwen3, and banded-attention modules is unchanged aside from wording.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
a4a760d. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>