Skip to content

chore: upgrade vite-plus to 0.1.19-alpha.3#1

Draft
fengmk2 wants to merge 1 commit into
mainfrom
update-vite-plus-alpha-0.1.19-alpha.3
Draft

chore: upgrade vite-plus to 0.1.19-alpha.3#1
fengmk2 wants to merge 1 commit into
mainfrom
update-vite-plus-alpha-0.1.19-alpha.3

Conversation

@fengmk2

@fengmk2 fengmk2 commented Apr 21, 2026

Copy link
Copy Markdown
Owner

Upgrade vite-plus and related packages to 0.1.19-alpha.3 alpha version.

@fengmk2 fengmk2 self-assigned this Apr 21, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the versions of vite, vite-plus, and vitest to 0.1.19-alpha.3 across the project's configuration files. The review feedback recommends adopting the Yarn 4 catalog: protocol in package.json for these dependencies and their resolutions to ensure consistency with the definitions in .yarnrc.yml and simplify future updates.

Comment thread package.json
"prettier": "^3.8.3",
"typescript": "^6.0.3",
"vite-plus": "^0.1.18",
"vite-plus": "0.1.19-alpha.3",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Since the project uses Yarn 4 Catalogs (as defined in .yarnrc.yml), it is recommended to use the catalog: protocol here. This ensures that the version is managed in a single place and remains consistent across all workspaces that depend on vite-plus.

Suggested change
"vite-plus": "0.1.19-alpha.3",
"vite-plus": "catalog:",

Comment thread package.json
Comment on lines +44 to +45
"vite": "npm:@voidzero-dev/vite-plus-core@0.1.19-alpha.3",
"vitest": "npm:@voidzero-dev/vite-plus-test@0.1.19-alpha.3"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These resolutions can also leverage the Yarn 4 Catalog. Using catalog: here keeps the resolutions in sync with the versions defined in .yarnrc.yml, improving maintainability and reducing duplication.

Suggested change
"vite": "npm:@voidzero-dev/vite-plus-core@0.1.19-alpha.3",
"vitest": "npm:@voidzero-dev/vite-plus-test@0.1.19-alpha.3"
"vite": "catalog:",
"vitest": "catalog:"

fengmk2 pushed a commit that referenced this pull request Jun 17, 2026
…ode (~2.34×) + prefill-tps telemetry fix (mlx-node#67)

## Summary

Closes most of the LFM2.5-8B-A1B quantized single-stream decode gap to
oMLX by fixing the default decode path and extending the compiled C++
path to quantized weights. Two commits:

- **`fb234575` — quantized → flat default (~1.84×) + paged prefill-tps
telemetry fix**
- **`46760077` — quantized compiled flat+paged decode (~2.34× paged over
eager-paged)**

### #1 — quantized single-stream defaults to FLAT decode (~1.84×)
Quantized `lfm2`/`lfm2_moe` was silently defaulting to the
**eager-PAGED** loop (~12 `synchronize_mlx()`/token + blocking
`y.eval()` + no async double-buffering), ~1.84× slower than FLAT on the
measured mxfp8 8B-A1B (74 → 131 tok/s, M5 Max). The default is now keyed
on the authoritative `.scales` tensor signal: quantized → FLAT, bf16 →
PAGED (unchanged). Explicit `use_block_paged_cache` in `config.json`
always wins.

### #4 — paged prefill-tps telemetry fix
The paged path reported a bogus ~37 `prefillTokensPerSecond` (it divided
full-prompt ttft by the attention *suffix* count on warm prefix-cache
hits). Now uses the full-prompt count as the numerator; guarded by
`lfm2_paged_prefill_tps_is_full_prompt_scale_on_warm_reuse`.

### #2 — quantized compiled flat+paged decode (~2.34× over eager-paged)
Extends the compiled C++ decode path (previously bf16-only) to quantized
`lfm2`/`lfm2_moe`. A per-projection quant-info registry
(`mlx_store_quant_info`, keyed on each `.scales` prefix) makes the C++
`(mode, bits, group_size)` dispatch **authoritative** instead of the
companion-tensor heuristic (which mislabels mxfp4/nvfp4 as mxfp8); the
heuristic is retained only as a fallback. Compiled-PAGED is ~2.34× over
eager-PAGED, rescuing the pinned-paged quant path (e.g. server/batched).
A packed embedding (`embed_tokens.scales`) bars the compiled path (C++
does a dense `take`). Env escape hatch:
`MLX_LFM2_DISABLE_QUANT_COMPILED`.

## Correctness

Byte-identical to the pure-Rust eager path across **{mxfp8, 4-bit
affine} × {flat, paged}**, proven via the model-id **eviction oracle**
in `lfm2_compiled_e2e.rs` (`quant_compiled_vs_eager_parity`): loading
the compiled model evicts the eager-ref's process-global weights, so the
eager-ref runs the *independent*
`QuantizedLinear`/`QuantizedSwitchLinear` modules — a C++ dispatch
mislabel would diverge early. This is stronger than a same-graph
`MLX_NO_COMPILE` reference.

## Perf context

This is the **quantized** path — the relevant one for oMLX's 8-bit
headline. Separately verified this session: for **bf16**, our decode
(~110 tok/s) is at **exact op-for-op parity with mlx-lm** and is
**memory-bandwidth-bound** (MoE gather already saturates ~404 GB/s at
the k=4 decode shape, ~80% of the M5 Max ceiling); the residual bf16 gap
to oMLX is host/measurement, not software. The real lever for absolute
decode speed is reducing bytes-per-token (quantization) — which is
exactly what these changes make fast.

## Test plan

- [x] `cargo clippy --all-targets -- -D warnings` — clean
- [x] `cargo fmt --check` — clean
- [x] 30 unit tests pass (`cargo test -p mlx-core`, incl. the
compiled-registration gate tests)
- [x] Byte-identical parity matrix (mxfp8/4-bit × flat/paged) via the
eviction oracle (opt-in: `LFM2_COMPILED_E2E=1` +
`LFM2_QUANT_MODEL_PATH`)
- [x] `yarn build:native` clean; no `index.d.cts` drift

## Review status

The mandated `codex:adversarial-review` runtime **hung twice** mid
quant-dispatch cross-reference (a codex-runtime issue, not a code
signal). A thorough Claude-subagent adversarial review cleared it **SHIP
/ no blocking bug** — verifying dispatch parity for every projection
class (MoE experts, router gate, dense-MLP, attention q/k/v/out, conv,
untied lm_head) and ruling out the truncated codex concern on all three
plausible completions (packed-embedding guard, registry-authoritative
quant modes, pre-existing flat bf16 invariant).

**Deferred follow-ups (non-blocking):**
- [Medium] Synthetic non-gated quantized parity test (parity is
currently operator-verified via `LFM2_COMPILED_E2E=1`; the synthetic
harness only generates bf16 weights, and the completeness
`debug_assert_eq!` is compiled out in release).
- [Low] `mlx_store_weight` transposes packed 2D quant `.weight` into
`g_weight_transposes` that's never read (pre-existing waste, surfaced
not introduced).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Changes default decode routing and global compiled weight registration
for quantized LFM2, where incorrect quant dispatch or gating would
affect correctness and performance; mitigated by expanded unit tests and
documented escape hatches.
> 
> **Overview**
> **LFM2 load and decode routing** now treat quantized checkpoints
differently: when `use_block_paged_cache` is unset, presence of
`.scales` tensors defaults to **flat** decode (instead of paged), with
resolution moved from `parse_config` to `load_from_dir` so it matches
the registration gate. Explicit `config.json` values still win.
> 
> **Quantized models can use the compiled C++ path** (flat and paged):
registration publishes per-projection quant info via
`mlx_store_quant_info`, `should_register_compiled` and
`paged_compiled_decode_setup` use `non_quant_floats_bf16` plus
`MLX_LFM2_DISABLE_QUANT_COMPILED`, and packed `embed_tokens` blocks
compiled registration because the C++ path does a dense embedding
lookup.
> 
> **Paged chat performance metrics** use the full prompt token count for
prefill throughput (conv layers re-run the full prompt), fixing inflated
TTFT/prefill-tps on warm prefix-cache hits.
> 
> Most other diff hunks are **comment and docstring cleanup**
(phase/W6/PR ticket references removed); behavior in convert, MTP,
Qwen3, and banded-attention modules is unchanged aside from wording.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
a4a760d. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fengmk2 pushed a commit that referenced this pull request Jun 17, 2026
…n3.6 classes) (mlx-node#72)

## Summary

Collapses the **O(T) per-step gated-delta (GDN) recurrence** on the CUDA
(non-Metal) prefill path into **O(T/BT) chunk-serial batched matmuls**
(cuBLAS / tensor cores) — a device-agnostic, pure-`MxArray` port of the
in-tree Metal chunked kernel
(`crates/mlx-sys/src/metal/gated_delta_chunked.metal.inc`).

- **Zero build changes** (no nvcc/NVRTC) — matmuls route through cuBLAS.
- **Default-on** for the CUDA ops path; `MLX_GDN_KERNEL=perstep` reverts
for same-binary A/B.
- **No Metal impact** — `use_kernel=true` never reaches this path; the
Mac/Metal production path is byte-identical.

This attacks the *"GDN per-step recurrence is the prefill floor"*
bottleneck that [PR mlx-node#71](mlx-node#71
benchmark flagged as the **#1 dense lever**.

## Measured (GB10 / DGX Spark, Qwen3.6, warm, prefill TTFT vs per-step)

| model        | 1577-tok speedup | parity |
|--------------|------------------|--------|
| dense-Q4     | **1.62×**        | byte-identical / late-drift |
| dense-NVFP4  | 1.33×            | identical |
| MoE-Q4       | **1.75×**        | coherent |
| MoE-NVFP4    | 1.40×            | coherent |

Win **grows with prompt length** (dense-Q4: 1.06×@200 → ~1.58×@1577+;
the chunked inverse is fixed-cost while per-step is O(T)). chunked
prefill tok/s climbs to ~242 vs per-step's flat ~150. MoE wins more (GDN
is a larger prefill fraction); NVFP4 less (dequant dominates its
prefill).

## Numerical-stability fixes (each with a Mac regression test)

1. **Triangular inverse overflow.** `M = (I+A)⁻¹` by repeated squaring
overflows f32 at `BT=64` (`N³² ≈ 4e57` before nilpotency zeroes it at
`N⁶⁴`), producing garbage. Replaced with **row-iterative forward
substitution** (FLA / vLLM `solve_tril`): `M[i,:] = eᵢ − A[i,:]·M` — no
powers of A, stable for any `‖A‖`, serial depth independent of T. Test:
`chunked_ops_stable_with_correlated_unit_norm_keys`.
2. **Gate underflow (MoE-only garbage).** `g_log = g.log()` round-trips
through the exp-space gate; strong decay (which MoE has, dense doesn't)
underflows `g` to 0 → `log(0) = -inf` → chunked `gcum_i − gcum_j = inf −
inf = NaN`. Now compute `g_log = -exp(a_log)·softplus(a + dt_bias)`
**directly in log-space** (matches the native `g_log` the fused Metal
gating returns). Test: `compute_g_log_finite_under_strong_decay`.

## Validation

- 6/6 `gated_delta` Rust unit tests, `cargo clippy -p mlx-core
--all-targets`, `cargo fmt` — green.
- Correctness validated on the DGX across **all four Qwen3.6 classes**
(dense/MoE × Q4/NVFP4) — per-step vs chunked greedy A/B, coherent output
everywhere (garbage only before the two fixes above).
- Algorithm derivation in `docs/gdn-chunked-ops-spec.md`.

## Follow-ups (not blocking)

- FLA 16-block row-iterative + block-merge inverse to cut short-prompt
inverse depth 63→~16 (long-prompt asymptote is carry-bound, won't move).
- Runtime non-finite guard → fall back to per-step (overflow is
currently silent; the `Err` fallback doesn't catch `Inf`/`NaN`).
- Possibly raise `CHUNK_THRESHOLD`→256 (the 200-tok win is only 1.06×).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core Qwen3.5 inference recurrence and output numerics on CUDA,
though Metal is gated off, per-step fallback exists, and behavior is
covered by parity/stability tests.
> 
> **Overview**
> Adds a **default-on CUDA prefill fast path** for Qwen gated-delta
(GDN): long, unmasked sequences on the non-Metal ops branch now run
**`gated_delta_chunked_ops`**, a pure `MxArray` chunk-parallel port
(BT=64) that replaces the O(T) per-token loop with O(T/BT) chunk carries
and batched matmuls. Metal/`use_kernel=true` routing is unchanged;
decode and masked calls still use per-step ops.
**`MLX_GDN_KERNEL=perstep`** (and **`ForceChunkedOps`** / `chunked_ops`
aliases) support same-binary A/B.
> 
> Two **numerical fixes** ship with the chunked path:
**`compute_g_log`** computes the decay gate in log-space (avoids
`log(0)` → NaN on strong MoE decay), and
**`invert_i_plus_strict_lower`** builds `(I+A)⁻¹` via forward
substitution instead of f32 power squaring that overflows at BT=64.
Chunked ops errors fall back to per-step with a stderr warning.
> 
> Adds **`docs/gdn-chunked-ops-spec.md`** plus unit tests for env
parsing, chunked vs per-step parity across chunk boundaries,
correlated-key inverse stability, and strong-decay gating.
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
efe9c8f. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant