chore: upgrade vite-plus to 0.1.19-alpha.3 #1
Code Review
This pull request updates the versions of vite, vite-plus, and vitest to 0.1.19-alpha.3 across the project's configuration files. The review feedback recommends adopting the Yarn 4 catalog: protocol in package.json for these dependencies and their resolutions to ensure consistency with the definitions in .yarnrc.yml and simplify future updates.
      "prettier": "^3.8.3",
      "typescript": "^6.0.3",
    - "vite-plus": "^0.1.18",
    + "vite-plus": "0.1.19-alpha.3",
Since the project uses Yarn 4 Catalogs (as defined in .yarnrc.yml), it is recommended to use the catalog: protocol here. This ensures that the version is managed in a single place and remains consistent across all workspaces that depend on vite-plus.
Suggested change:

    - "vite-plus": "0.1.19-alpha.3",
    + "vite-plus": "catalog:",
    "vite": "npm:@voidzero-dev/vite-plus-core@0.1.19-alpha.3",
    "vitest": "npm:@voidzero-dev/vite-plus-test@0.1.19-alpha.3"
These resolutions can also leverage the Yarn 4 Catalog. Using catalog: here keeps the resolutions in sync with the versions defined in .yarnrc.yml, improving maintainability and reducing duplication.
Suggested change:

    - "vite": "npm:@voidzero-dev/vite-plus-core@0.1.19-alpha.3",
    - "vitest": "npm:@voidzero-dev/vite-plus-test@0.1.19-alpha.3"
    + "vite": "catalog:",
    + "vitest": "catalog:"
…ode (~2.34×) + prefill-tps telemetry fix (mlx-node#67)

## Summary

Closes most of the LFM2.5-8B-A1B quantized single-stream decode gap to oMLX by fixing the default decode path and extending the compiled C++ path to quantized weights. Two commits:

- **`fb234575`: quantized → flat default (~1.84×) + paged prefill-tps telemetry fix**
- **`46760077`: quantized compiled flat+paged decode (~2.34× paged over eager-paged)**

### #1: quantized single-stream defaults to FLAT decode (~1.84×)

Quantized `lfm2`/`lfm2_moe` was silently defaulting to the **eager-PAGED** loop (~12 `synchronize_mlx()`/token + blocking `y.eval()` + no async double-buffering), ~1.84× slower than FLAT on the measured mxfp8 8B-A1B (74 → 131 tok/s, M5 Max). The default is now keyed on the authoritative `.scales` tensor signal: quantized → FLAT, bf16 → PAGED (unchanged). Explicit `use_block_paged_cache` in `config.json` always wins.

### #4: paged prefill-tps telemetry fix

The paged path reported a bogus ~37 `prefillTokensPerSecond` (it divided full-prompt ttft by the attention *suffix* count on warm prefix-cache hits). Now uses the full-prompt count as the numerator; guarded by `lfm2_paged_prefill_tps_is_full_prompt_scale_on_warm_reuse`.

### #2: quantized compiled flat+paged decode (~2.34× over eager-paged)

Extends the compiled C++ decode path (previously bf16-only) to quantized `lfm2`/`lfm2_moe`. A per-projection quant-info registry (`mlx_store_quant_info`, keyed on each `.scales` prefix) makes the C++ `(mode, bits, group_size)` dispatch **authoritative** instead of the companion-tensor heuristic (which mislabels mxfp4/nvfp4 as mxfp8); the heuristic is retained only as a fallback. Compiled-PAGED is ~2.34× over eager-PAGED, rescuing the pinned-paged quant path (e.g. server/batched). A packed embedding (`embed_tokens.scales`) bars the compiled path (C++ does a dense `take`). Env escape hatch: `MLX_LFM2_DISABLE_QUANT_COMPILED`.
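The default-routing rule and the telemetry fix described in this summary can be sketched in a few lines of plain Rust. This is only an illustration under stated assumptions: `DecodePath`, `resolve_decode_path`, and `prefill_tokens_per_second` are hypothetical names, not the functions in the crate, and the tensor-name check stands in for whatever `.scales` detection the loader actually performs.

```rust
/// Decode loop chosen at load time (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DecodePath {
    Flat,
    Paged,
}

/// Hypothetical resolver: an explicit `use_block_paged_cache` value always wins;
/// otherwise any tensor name ending in `.scales` marks the checkpoint as
/// quantized and routes it to the flat loop, while bf16 stays on paged.
fn resolve_decode_path(explicit_paged: Option<bool>, tensor_names: &[&str]) -> DecodePath {
    if let Some(paged) = explicit_paged {
        return if paged { DecodePath::Paged } else { DecodePath::Flat };
    }
    let quantized = tensor_names.iter().any(|name| name.ends_with(".scales"));
    if quantized {
        DecodePath::Flat
    } else {
        DecodePath::Paged
    }
}

/// Prefill throughput on the full-prompt scale: the full prompt token count is
/// the numerator, even when a warm prefix-cache hit shrinks the attention suffix.
/// Returns `None` when no positive time-to-first-token is available.
fn prefill_tokens_per_second(full_prompt_tokens: usize, ttft_seconds: f64) -> Option<f64> {
    if ttft_seconds > 0.0 {
        Some(full_prompt_tokens as f64 / ttft_seconds)
    } else {
        None
    }
}
```

Passing the explicit setting as an `Option<bool>` keeps "unset" distinct from "set to false", which is what lets an explicit `config.json` value override the `.scales` signal in either direction.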
## Correctness

Byte-identical to the pure-Rust eager path across **{mxfp8, 4-bit affine} × {flat, paged}**, proven via the model-id **eviction oracle** in `lfm2_compiled_e2e.rs` (`quant_compiled_vs_eager_parity`): loading the compiled model evicts the eager-ref's process-global weights, so the eager-ref runs the *independent* `QuantizedLinear`/`QuantizedSwitchLinear` modules, and a C++ dispatch mislabel would diverge early. This is stronger than a same-graph `MLX_NO_COMPILE` reference.

## Perf context

This is the **quantized** path, the relevant one for oMLX's 8-bit headline. Separately verified this session: for **bf16**, our decode (~110 tok/s) is at **exact op-for-op parity with mlx-lm** and is **memory-bandwidth-bound** (MoE gather already saturates ~404 GB/s at the k=4 decode shape, ~80% of the M5 Max ceiling); the residual bf16 gap to oMLX is host/measurement, not software. The real lever for absolute decode speed is reducing bytes-per-token (quantization), which is exactly what these changes make fast.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings`: clean
- [x] `cargo fmt --check`: clean
- [x] 30 unit tests pass (`cargo test -p mlx-core`, incl. the compiled-registration gate tests)
- [x] Byte-identical parity matrix (mxfp8/4-bit × flat/paged) via the eviction oracle (opt-in: `LFM2_COMPILED_E2E=1` + `LFM2_QUANT_MODEL_PATH`)
- [x] `yarn build:native` clean; no `index.d.cts` drift

## Review status

The mandated `codex:adversarial-review` runtime **hung twice** mid quant-dispatch cross-reference (a codex-runtime issue, not a code signal). A thorough Claude-subagent adversarial review cleared it **SHIP / no blocking bug**, verifying dispatch parity for every projection class (MoE experts, router gate, dense-MLP, attention q/k/v/out, conv, untied lm_head) and ruling out the truncated codex concern on all three plausible completions (packed-embedding guard, registry-authoritative quant modes, pre-existing flat bf16 invariant).
**Deferred follow-ups (non-blocking):**

- [Medium] Synthetic non-gated quantized parity test (parity is currently operator-verified via `LFM2_COMPILED_E2E=1`; the synthetic harness only generates bf16 weights, and the completeness `debug_assert_eq!` is compiled out in release).
- [Low] `mlx_store_weight` transposes packed 2D quant `.weight` into `g_weight_transposes` that's never read (pre-existing waste, surfaced not introduced).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes default decode routing and global compiled weight registration for quantized LFM2, where incorrect quant dispatch or gating would affect correctness and performance; mitigated by expanded unit tests and documented escape hatches.
>
> **Overview**
> **LFM2 load and decode routing** now treat quantized checkpoints differently: when `use_block_paged_cache` is unset, presence of `.scales` tensors defaults to **flat** decode (instead of paged), with resolution moved from `parse_config` to `load_from_dir` so it matches the registration gate. Explicit `config.json` values still win.
>
> **Quantized models can use the compiled C++ path** (flat and paged): registration publishes per-projection quant info via `mlx_store_quant_info`, `should_register_compiled` and `paged_compiled_decode_setup` use `non_quant_floats_bf16` plus `MLX_LFM2_DISABLE_QUANT_COMPILED`, and packed `embed_tokens` blocks compiled registration because the C++ path does a dense embedding lookup.
>
> **Paged chat performance metrics** use the full prompt token count for prefill throughput (conv layers re-run the full prompt), fixing inflated TTFT/prefill-tps on warm prefix-cache hits.
>
> Most other diff hunks are **comment and docstring cleanup** (phase/W6/PR ticket references removed); behavior in convert, MTP, Qwen3, and banded-attention modules is unchanged aside from wording.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit a4a760d. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n3.6 classes) (mlx-node#72)

## Summary

Collapses the **O(T) per-step gated-delta (GDN) recurrence** on the CUDA (non-Metal) prefill path into **O(T/BT) chunk-serial batched matmuls** (cuBLAS / tensor cores). It is a device-agnostic, pure-`MxArray` port of the in-tree Metal chunked kernel (`crates/mlx-sys/src/metal/gated_delta_chunked.metal.inc`).

- **Zero build changes** (no nvcc/NVRTC): matmuls route through cuBLAS.
- **Default-on** for the CUDA ops path; `MLX_GDN_KERNEL=perstep` reverts for same-binary A/B.
- **No Metal impact**: `use_kernel=true` never reaches this path; the Mac/Metal production path is byte-identical.

This attacks the *"GDN per-step recurrence is the prefill floor"* bottleneck that the PR mlx-node#71 benchmark flagged as the **#1 dense lever**.

## Measured (GB10 / DGX Spark, Qwen3.6, warm, prefill TTFT vs per-step)

| model       | 1577-tok speedup | parity                      |
|-------------|------------------|-----------------------------|
| dense-Q4    | **1.62×**        | byte-identical / late-drift |
| dense-NVFP4 | 1.33×            | identical                   |
| MoE-Q4      | **1.75×**        | coherent                    |
| MoE-NVFP4   | 1.40×            | coherent                    |

Win **grows with prompt length** (dense-Q4: 1.06×@200 → ~1.58×@1577+; the chunked inverse is fixed-cost while per-step is O(T)). Chunked prefill tok/s climbs to ~242 vs per-step's flat ~150. MoE wins more (GDN is a larger prefill fraction); NVFP4 less (dequant dominates its prefill).

## Numerical-stability fixes (each with a Mac regression test)

1. **Triangular inverse overflow.** `M = (I+A)⁻¹` by repeated squaring overflows f32 at `BT=64` (`N³² ≈ 4e57` before nilpotency zeroes it at `N⁶⁴`), producing garbage. Replaced with **row-iterative forward substitution** (FLA / vLLM `solve_tril`): `M[i,:] = eᵢ − A[i,:]·M`, with no powers of A, stable for any `‖A‖`, serial depth independent of T. Test: `chunked_ops_stable_with_correlated_unit_norm_keys`.
2. **Gate underflow (MoE-only garbage).** `g_log = g.log()` round-trips through the exp-space gate; strong decay (which MoE has, dense doesn't) underflows `g` to 0 → `log(0) = -inf` → chunked `gcum_i − gcum_j = inf − inf = NaN`. Now compute `g_log = -exp(a_log)·softplus(a + dt_bias)` **directly in log-space** (matches the native `g_log` the fused Metal gating returns). Test: `compute_g_log_finite_under_strong_decay`.

## Validation

- 6/6 `gated_delta` Rust unit tests, `cargo clippy -p mlx-core --all-targets`, `cargo fmt`: green.
- Correctness validated on the DGX across **all four Qwen3.6 classes** (dense/MoE × Q4/NVFP4): per-step vs chunked greedy A/B, coherent output everywhere (garbage only before the two fixes above).
- Algorithm derivation in `docs/gdn-chunked-ops-spec.md`.

## Follow-ups (not blocking)

- FLA 16-block row-iterative + block-merge inverse to cut short-prompt inverse depth 63→~16 (long-prompt asymptote is carry-bound, won't move).
- Runtime non-finite guard → fall back to per-step (overflow is currently silent; the `Err` fallback doesn't catch `Inf`/`NaN`).
- Possibly raise `CHUNK_THRESHOLD`→256 (the 200-tok win is only 1.06×).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core Qwen3.5 inference recurrence and output numerics on CUDA, though Metal is gated off, per-step fallback exists, and behavior is covered by parity/stability tests.
>
> **Overview**
> Adds a **default-on CUDA prefill fast path** for Qwen gated-delta (GDN): long, unmasked sequences on the non-Metal ops branch now run **`gated_delta_chunked_ops`**, a pure `MxArray` chunk-parallel port (BT=64) that replaces the O(T) per-token loop with O(T/BT) chunk carries and batched matmuls. Metal/`use_kernel=true` routing is unchanged; decode and masked calls still use per-step ops. **`MLX_GDN_KERNEL=perstep`** (and **`ForceChunkedOps`** / `chunked_ops` aliases) support same-binary A/B.
>
> Two **numerical fixes** ship with the chunked path: **`compute_g_log`** computes the decay gate in log-space (avoids `log(0)` → NaN on strong MoE decay), and **`invert_i_plus_strict_lower`** builds `(I+A)⁻¹` via forward substitution instead of f32 power squaring that overflows at BT=64. Chunked ops errors fall back to per-step with a stderr warning.
>
> Adds **`docs/gdn-chunked-ops-spec.md`** plus unit tests for env parsing, chunked vs per-step parity across chunk boundaries, correlated-key inverse stability, and strong-decay gating.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit efe9c8f. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
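The two numerical fixes in the commit message above can be illustrated on scalars and small `Vec` matrices. This is a minimal sketch, not the crate's `MxArray` code: `forward_substitution_inverse`, `softplus`, `g_log_direct`, and `g_log_via_exp` are hypothetical names chosen for this example, and the matrix is assumed to be square and strictly lower-triangular.

```rust
/// Inverts (I + A) for a strictly lower-triangular `a` (n x n, row-major)
/// by row-iterative forward substitution: M[i,:] = e_i - A[i,:] * M.
/// Row i only needs rows j < i of M, so no powers of A are ever formed.
fn forward_substitution_inverse(a: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let n = a.len();
    let mut m = vec![vec![0.0f32; n]; n];
    for i in 0..n {
        // Start row i from the unit vector e_i.
        m[i][i] = 1.0;
        // Subtract A[i][j] * M[j,:] for every earlier row j, which is already final.
        for j in 0..i {
            let a_ij = a[i][j];
            // M is lower-triangular, so M[j][k] is zero for k > j.
            for k in 0..=j {
                let m_jk = m[j][k];
                m[i][k] -= a_ij * m_jk;
            }
        }
    }
    m
}

/// softplus(x) = ln(1 + e^x), returning x directly for large inputs
/// where the two agree to f32 precision.
fn softplus(x: f32) -> f32 {
    if x > 20.0 {
        x
    } else {
        x.exp().ln_1p()
    }
}

/// Log-space gate, computed directly: g_log = -exp(a_log) * softplus(a + dt_bias).
fn g_log_direct(a_log: f32, a: f32, dt_bias: f32) -> f32 {
    -a_log.exp() * softplus(a + dt_bias)
}

/// The round trip that the fix removes: form g in exp-space, then take its log.
/// Under strong decay g underflows to 0 in f32 and the log becomes -inf.
fn g_log_via_exp(a_log: f32, a: f32, dt_bias: f32) -> f32 {
    let g = g_log_direct(a_log, a, dt_bias).exp();
    g.ln()
}
```

With `a_log = 3.0` and `a + dt_bias = 10.0` the direct form gives a finite value near -200, while the round trip yields `-inf`, and the difference of two such values is NaN, which is the failure mode the commit message describes for chunked cumulative gates.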
Upgrade vite-plus and related packages to the 0.1.19-alpha.3 alpha release.