fengmk2
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 83 additions & 453 deletions
diff --git a/docs/architecture.md b/docs/architecture.md
Lines changed: 103 additions & 0 deletions
diff --git a/docs/cli.md b/docs/cli.md
Lines changed: 85 additions & 0 deletions
diff --git a/docs/ffi-cpp.md b/docs/ffi-cpp.md
Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,103 @@
+# Architecture
+
+```
+┌──────────────────────────────────────────────────────────┐
+│  TypeScript layer — 6 packages                           │
+│  @mlx-node/lm      Inference, ChatSession, streaming      │
+│  @mlx-node/trl     GRPO/SFT training, datasets            │
+│  @mlx-node/vlm     VLM, OCR, document pipelines           │
+│  @mlx-node/server  HTTP server (/v1/responses, /v1/messages)│
+│  @mlx-node/cli     mlx download, mlx convert, mlx launch  │
+│  @mlx-node/core    Native addon (NAPI bindings)           │
+├──────────────────────────────────────────────────────────┤
+│  Rust compute layer — 5 workspace crates                 │
+│  mlx-core        Models, training, ops, vision (all NAPI) │
+│  mlx-paged-attn  PagedAttention + Metal kernels           │
+│  mlx-sys         Low-level MLX FFI bridge (cpp + headers) │
+│  mlx-db          SQLite training persistence              │
+│  mlx-tui         mlx-train Ratatui binary (standalone)    │
+├──────────────────────────────────────────────────────────┤
+│  C++ bridge → Compiled forward paths                      │
+│  ~300 FFI declarations, compiled decode via mlx::compile  │
+├──────────────────────────────────────────────────────────┤
+│  MLX → Metal / Accelerate GPU backend                     │
+└──────────────────────────────────────────────────────────┘
+```
+
+## Package dependency chain
+
+```
+@mlx-node/core (Rust/NAPI native addon)
+    ├── @mlx-node/lm        inference, models, streaming, tools, profiling
+    │     ├── @mlx-node/trl    training (GRPO, SFT, datasets, rewards)
+    │     ├── @mlx-node/vlm    vision (VLM, OCR, document pipeline)
+    │     └── @mlx-node/server HTTP server (SessionRegistry, /v1/* endpoints)
+    └── @mlx-node/cli       depends on core + lm + server
+```
+
+`mlx-tui` is the workspace binary crate (Ratatui-based `mlx-train` TUI) — it's a workspace member but no other crate depends on it, so it's built separately via `cargo build -p mlx-tui`. `@mlx-node/internal-tools` lives in root `devDependencies` and is not part of the runtime chain.
+
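As a rough illustration of the dependency chain above, the sketch below transcribes the tree into a plain map and derives an order in which every package comes after its dependencies. The `deps` map and the `buildOrder` helper are hypothetical names for this note; the repository itself relies on `tsc -b` project references for ordering.

```typescript
// Dependency edges transcribed from the chain above: package -> packages it depends on.
const deps: Record<string, string[]> = {
  "@mlx-node/core": [],
  "@mlx-node/lm": ["@mlx-node/core"],
  "@mlx-node/trl": ["@mlx-node/lm"],
  "@mlx-node/vlm": ["@mlx-node/lm"],
  "@mlx-node/server": ["@mlx-node/lm"],
  "@mlx-node/cli": ["@mlx-node/core", "@mlx-node/lm", "@mlx-node/server"],
};

// Depth-first topological sort: each package is emitted after all of its dependencies.
function buildOrder(graph: Record<string, string[]>): string[] {
  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string): void => {
    if (state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`dependency cycle at ${name}`);
    }
    state.set(name, "visiting");
    for (const dep of graph[name] ?? []) visit(dep);
    state.set(name, "done");
    order.push(name);
  };

  for (const name of Object.keys(graph)) visit(name);
  return order;
}

const order = buildOrder(deps);
```

With the map as written, `order` starts with `@mlx-node/core` and ends with `@mlx-node/cli`.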
+## Repository layout
+
+```
+mlx-node/
+├── Cargo.toml                  workspace manifest (5 crates)
+├── package.json                npm workspaces (6 packages + examples)
+├── vite.config.ts              Vitest + Oxlint + Oxfmt config
+├── tsconfig.json               TypeScript project references
+│
+├── crates/
+│   ├── mlx-sys/                MLX C/C++ FFI bridge — see ffi-cpp.md
+│   ├── mlx-core/               All NAPI exports: models, training, ops, vision
+│   ├── mlx-paged-attn/         PagedAttention + Metal shaders — see paged-cache.md
+│   ├── mlx-db/                 SQLite training persistence
+│   └── mlx-tui/                mlx-train Ratatui binary (standalone)
+│
+├── packages/
+│   ├── core/                   @mlx-node/core (native addon + .d.cts)
+│   ├── lm/                     @mlx-node/lm
+│   │   └── src/
+│   │       ├── chat-session.ts   ChatSession<M> cross-model wrapper
+│   │       ├── stream.ts         Session-aware models + callback→AsyncGenerator bridge
+│   │       ├── profiling.ts      JS profiling API
+│   │       ├── models/           loadModel, loadSession, configs
+│   │       └── tools/            Tool definition types
+│   ├── trl/                    @mlx-node/trl (trainers/, data/, utils/)
+│   ├── vlm/                    @mlx-node/vlm (models/, pipeline/)
+│   ├── server/                 @mlx-node/server
+│   │   └── src/
+│   │       ├── endpoints/        /v1/responses, /v1/messages
+│   │       └── session-registry.ts  SessionRegistry — owns ChatSession lifetimes
+│   └── cli/                    @mlx-node/cli — see cli.md
+│
+├── __test__/                   TypeScript tests
+└── examples/                   lm.ts, vlm-inference.ts, paddle-ocr-pipeline.ts, tool-use-example.ts, grpo/, sft/
+```
+
+## Build flow
+
+| Command                            | Output                                                                                         |
+| ---------------------------------- | ---------------------------------------------------------------------------------------------- |
+| `yarn build`                       | `yarn build:native && yarn build:ts`                                                           |
+| `yarn build:native`                | `packages/core/index.cjs`, `mlx-core.darwin-arm64.node`, `mlx.metallib`, `paged_attn.metallib` |
+| `yarn build:ts`                    | `packages/*/dist/` via `tsc -b` (project references)                                           |
+| `yarn typecheck`                   | TypeScript type-check only                                                                     |
+| `cargo build --release -p mlx-tui` | `mlx-train` TUI binary                                                                         |
+
+`yarn build:native` is the **canonical native build** — runs the napi-rs pipeline through `packages/core/build.ts` (executed via `oxnode`). Running `cargo build` directly does **not** produce the `.node` addon.
+
+## Adding a new native operation
+
+1. Add FFI declaration in `crates/mlx-sys/src/lib.rs`.
+2. Add C++ bridge function in the appropriate `crates/mlx-sys/src/mlx_*.cpp` file (see [ffi-cpp.md](ffi-cpp.md) for which file owns what).
+3. Add a Rust wrapper in `crates/mlx-core/src/` with `#[napi]` exports.
+4. Run `yarn build:native` to regenerate NAPI bindings and `packages/core/index.d.cts`.
+5. Add tests using TypedArray helpers.
+
+If you added a **new** `.cpp` file, run `rm -rf target/release/build/mlx-sys-*` once — the `cc` crate caches the source-file list across builds and won't pick up new files otherwise.
+
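Step 5 mentions TypedArray helpers without showing one. A minimal sketch of what such a helper could look like follows; `maxAbsDiff` and `allClose` are hypothetical names, not the repository's actual test utilities, and `fromNative` stands in for the output of a newly added native op.

```typescript
// Hypothetical helper: largest element-wise absolute difference between two arrays.
function maxAbsDiff(actual: Float32Array, expected: Float32Array): number {
  if (actual.length !== expected.length) {
    throw new Error(`length mismatch: ${actual.length} vs ${expected.length}`);
  }
  let worst = 0;
  for (let i = 0; i < actual.length; i++) {
    worst = Math.max(worst, Math.abs(actual[i] - expected[i]));
  }
  return worst;
}

// Hypothetical helper: true when every element matches within `tol`.
function allClose(actual: Float32Array, expected: Float32Array, tol = 1e-5): boolean {
  return maxAbsDiff(actual, expected) <= tol;
}

// Stand-in for the Float32Array a new native op would return (here: x * 2).
const fromNative = Float32Array.from([1, 2, 3].map((x) => x * 2));
const reference = new Float32Array([2, 4, 6]);
const matches = allClose(fromNative, reference);
```

Comparing against a tolerance rather than exact equality is the usual choice for GPU results, where reduced-precision dtypes make bitwise equality unreliable.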
+## Adding a TypeScript utility
+
+1. Pick the package by responsibility: `lm` (inference), `trl` (training), `vlm` (vision), `server` (HTTP), `cli` (CLI).
+2. Add to `packages/<pkg>/src/`, export from `packages/<pkg>/src/index.ts`.
+3. Run `yarn build:ts && yarn typecheck`.
@@ -0,0 +1,85 @@
+# CLI (`@mlx-node/cli`)
+
+The `mlx` binary is built from `packages/cli/` and exposes three top-level commands: `download`, `convert`, and `launch`.
+
+## `mlx download`
+
+### Models
+
+```bash
+mlx download model --model Qwen/Qwen3-0.6B
+```
+
+| Flag             | Default           | Purpose                                                |
+| ---------------- | ----------------- | ------------------------------------------------------ |
+| `-m`, `--model`  | `Qwen/Qwen3-0.6B` | HuggingFace model id                                   |
+| `-g`, `--glob`   | —                 | Filename pattern filter (download only matching files) |
+| `--set-token`    | —                 | Store HuggingFace credentials                          |
+| `-o`, `--output` | —                 | Output directory                                       |
+
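To make the `--glob` filter concrete, here is a small sketch of filename filtering. The glob dialect is an assumption (`*` matches any run of characters, `?` matches exactly one); the CLI's real matcher may behave differently, and `globToRegExp` / `filterFiles` are hypothetical names.

```typescript
// Assumed glob dialect: `*` = any run of characters, `?` = exactly one character.
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters, leaving `*` and `?` for the wildcard pass below.
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
  return new RegExp(`^${pattern}$`);
}

// Keep only the files that match the pattern; no pattern means keep everything.
function filterFiles(files: string[], glob?: string): string[] {
  if (glob === undefined) return files;
  const re = globToRegExp(glob);
  return files.filter((file) => re.test(file));
}

// Illustrative file listing, not taken from a real repository.
const repoFiles = [
  "config.json",
  "model-00001-of-00002.safetensors",
  "model-00002-of-00002.safetensors",
  "tokenizer.json",
];
const selected = filterFiles(repoFiles, "*.safetensors");
```

Here `selected` holds only the two `.safetensors` entries.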
+### Datasets
+
+```bash
+mlx download dataset
+```
+
+Default dataset: `openai/gsm8k`. Parquet inputs are automatically converted to JSONL via `convertParquetToJsonl()`.
+
+| Flag               | Default        | Purpose                |
+| ------------------ | -------------- | ---------------------- |
+| `-d`, `--dataset`  | `openai/gsm8k` | HuggingFace dataset id |
+| `-r`, `--revision` | —              | Dataset revision       |
+| `-o`, `--output`   | —              | Output directory       |
+
+## `mlx convert`
+
+The convert command uses `--input` / `--output` (not `--model`).
+
+### Dtype conversion
+
+```bash
+mlx convert --input ./model --output ./model-bf16 --dtype bf16
+```
+
+### Quantization (affine, default)
+
+```bash
+mlx convert --input ./model --output ./model-q --quantize --q-recipe mixed_4_6
+```
+
+| Flag               | Purpose                                                                         |
+| ------------------ | ------------------------------------------------------------------------------- |
+| `-i`, `--input`    | Source model directory (required)                                               |
+| `-o`, `--output`   | Output directory (required)                                                     |
+| `-d`, `--dtype`    | Target dtype: `float32` / `float16` / `bfloat16`                                |
+| `-q`, `--quantize` | Enable quantization                                                             |
+| `--q-recipe`       | One of `mixed_2_6`, `mixed_3_4`, `mixed_3_6`, `mixed_4_6`, `qwen3_5`, `unsloth` |
+| `--q-mode`         | `affine` (default) or `mxfp8`                                                   |
+| `--imatrix-path`   | Path to imatrix file for AWQ pre-scaling                                        |
+| `--mmproj`         | Vision-encoder conversion path                                                  |
+| `-v`, `--verbose`  | Verbose logging                                                                 |
+
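The flag table can be read as a small validation contract. The sketch below checks an already-parsed options object against the values listed in the table; `ConvertOptions` and `validateConvertOptions` are hypothetical names, and the real CLI may validate differently.

```typescript
// Allowed values copied from the flag table above.
const Q_RECIPES: readonly string[] = [
  "mixed_2_6", "mixed_3_4", "mixed_3_6", "mixed_4_6", "qwen3_5", "unsloth",
];
const Q_MODES: readonly string[] = ["affine", "mxfp8"];

// Hypothetical shape of the parsed flags.
interface ConvertOptions {
  input?: string;
  output?: string;
  quantize?: boolean;
  qRecipe?: string;
  qMode?: string;
}

// Returns a list of problems; an empty list means the options pass these checks.
function validateConvertOptions(opts: ConvertOptions): string[] {
  const errors: string[] = [];
  if (!opts.input) errors.push("--input is required");
  if (!opts.output) errors.push("--output is required");
  if (opts.qRecipe !== undefined && !Q_RECIPES.includes(opts.qRecipe)) {
    errors.push(`unknown --q-recipe: ${opts.qRecipe}`);
  }
  if (opts.qMode !== undefined && !Q_MODES.includes(opts.qMode)) {
    errors.push(`unknown --q-mode: ${opts.qMode}`);
  }
  return errors;
}

const ok = validateConvertOptions({
  input: "./model",
  output: "./model-q",
  quantize: true,
  qRecipe: "mixed_4_6",
});
const bad = validateConvertOptions({ output: "./out", qRecipe: "bogus" });
```

Collecting every problem instead of throwing on the first one lets a CLI report all bad flags in a single run.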
+### GGUF → SafeTensors
+
+```bash
+mlx convert --input ./model.gguf --output ./model-mlx
+```
+
+GGUF input is auto-detected by the `.gguf` extension. Supported source tensor types: BF16, F16, F32, Q4_0, Q4_1, and Q8_0.
+
+### Model-type auto-detection
+
+The converter auto-detects model families and applies family-specific sanitization passes:
+
+- `qwen3_5`, `qwen3_5_moe`
+- `gemma4`
+- `paddleocr-vl`, `qianfan-ocr`
+- `pp-lcnet-ori`, `uvdoc`
+
+Sharded models are also supported (parses `model.safetensors.index.json`).
+
+Supported foreign weight formats: Paddle `.pdiparams` and PyTorch `.pkl`.
+
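The detection rules in this section can be sketched as two small functions: one picks an input format from the file extension, and one lists the shard files named in a `model.safetensors.index.json`. Both function names are hypothetical, and treating every other path as a SafeTensors directory is an assumption. The `weight_map` field is the standard key in Hugging Face sharded-checkpoint index files.

```typescript
type InputFormat = "gguf" | "paddle" | "pytorch-pickle" | "safetensors";

// Extension-based detection, following the formats named above.
function detectInputFormat(inputPath: string): InputFormat {
  if (inputPath.endsWith(".gguf")) return "gguf";
  if (inputPath.endsWith(".pdiparams")) return "paddle";
  if (inputPath.endsWith(".pkl")) return "pytorch-pickle";
  // Assumption: anything else is a SafeTensors model directory.
  return "safetensors";
}

// Sharded checkpoints map each tensor name to the shard file that stores it.
interface SafetensorsIndex {
  weight_map: Record<string, string>;
}

// Unique shard filenames, sorted for a stable load order.
function shardFiles(index: SafetensorsIndex): string[] {
  return Array.from(new Set(Object.values(index.weight_map))).sort();
}

// Illustrative index content, not taken from a real model.
const exampleIndex: SafetensorsIndex = {
  weight_map: {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.mlp.up_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors",
  },
};
const shards = shardFiles(exampleIndex);
```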
+## `mlx launch claude`
+
+Launches the local `@mlx-node/server` and spawns Claude Code against it — the entry point for using MLX-Node as a Claude Code backend. The "serve" terminology in commit messages refers to internal server components only; there is no `mlx serve` command.
@@ -0,0 +1,81 @@
+# C++ FFI bridge
+
+The bridge between MLX (C++) and the NAPI/Rust layer lives in `crates/mlx-sys/`. The Rust side declares the FFI surface in `lib.rs`; the C++ side implements each declaration across topical `.cpp` files compiled by the `cc` crate.
+
+## File inventory
+
+`crates/mlx-sys/src/`:
+
+| File                     | Purpose                                                                                                                                               |
+| ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `mlx_array_ops.cpp`      | Array construction, arithmetic, indexing, dtype-safe scalar ops                                                                                       |
+| `mlx_advanced_ops.cpp`   | quantized_matmul, gather_qmm, conv2d, FP8 dequant, PaddleOCR forward                                                                                  |
+| `mlx_nn_ops.cpp`         | NN ops, data extraction, random, math                                                                                                                 |
+| `mlx_fused_ops.cpp`      | Fused SwiGLU MLP and supporting ops                                                                                                                   |
+| `mlx_misc_ops.cpp`       | Synchronization, compiled sampling helpers                                                                                                            |
+| `mlx_stream.cpp`         | Stream/device management, memory limits                                                                                                               |
+| `mlx_autograd.cpp`       | `value_and_grad` integration                                                                                                                          |
+| `mlx_gated_delta.cpp`    | Metal GDN kernel opaque handles and shader indexing                                                                                                   |
+| `mlx_qwen35.cpp`         | Compiled Qwen3.5 dense forward (uses `mlx::core::compile`)                                                                                            |
+| `mlx_qwen35_moe.cpp`     | Compiled Qwen3.5 MoE forward with expert routing (uses `mlx::core::compile`)                                                                          |
+| `mlx_qwen35_vlm.cpp`     | Qwen3.5 VLM prefill — runs the full LM forward over text+vision embeddings and stores caches; the compiled decode path then resumes from those caches |
+| `mlx_qwen35_common.h`    | Shared compiled-forward helpers — linear_proj, attn, GDN, RoPE                                                                                        |
+| `mlx_common.h`           | FFI macros, error handling, array conversion                                                                                                          |
+| `mlx_common_weights.cpp` | Common weight storage for compiled forward passes                                                                                                     |
+| `mlx_paged_dispatch.cpp` | C++ paged-attention kernel dispatch                                                                                                                   |
+| `mlx_paged_ops.cpp`      | `PagedKVWrite` / `PagedAttention` custom MLX ops (largest file in the bridge)                                                                         |
+| `mlx_paged_profile.cpp`  | Profile-run helpers for auto-sizing the block pool                                                                                                    |
+
+`crates/mlx-sys/src/lib.rs` is the FFI declaration root (~300 `pub fn` wrappers around `unsafe extern "C-unwind"` blocks).
+
+## Compiled forward paths
+
+Qwen3.5 dense + MoE decode use `mlx::core::compile` to cache the forward graph: trace once, reuse via `compile_replace`. Key design points:
+
+- Pre-allocated KV caches passed in as compile inputs
+- `fast::rope` invoked with an array-valued offset
+- `slice_update` invoked with an array start index
+- Path only enabled when `mlx_qwen35_weight_count() > 0`
+
+```
+mlx_qwen35.cpp        dense compiled decode (mlx::core::compile)
+mlx_qwen35_moe.cpp    MoE compiled decode + expert routing (mlx::core::compile)
+mlx_qwen35_vlm.cpp    VLM prefill — stores caches that the compiled decode path resumes from
+mlx_qwen35_common.h   shared helpers (linear_proj, attn, GDN, RoPE)
+```
+
+### Pitfalls
+
+- `mlx::core::array` has **no default constructor** — initialize via `mlx_array_from_scalar(...)` or other helpers.
+- `int32` is not in scope inside inner namespaces — use `mlx::core::int32`.
+- Adding a **new** `.cpp` file requires `rm -rf target/release/build/mlx-sys-*` once; the `cc` crate caches its source-file list across builds and won't pick up new files otherwise.
+
+### Env vars
+
+| Var                     | Effect                                                                  |
+| ----------------------- | ----------------------------------------------------------------------- |
+| `MLX_NO_COMPILE=1`      | Disables the compiled forward path; falls back to per-step Rust forward |
+| `MLX_EVAL_ALL_CACHES=1` | Reverts to eval-all-caches strategy (vs. the default token-only eval)   |
+
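The gating rules for the compiled path reduce to two small predicates. The real checks live in the Rust and C++ layers; the TypeScript below only restates the documented rules, and `useCompiledPath` and `evalStrategy` are hypothetical names.

```typescript
type Env = Record<string, string | undefined>;

// Compiled path is used only when weights are registered and MLX_NO_COMPILE is not "1".
function useCompiledPath(env: Env, compiledWeightCount: number): boolean {
  if (env["MLX_NO_COMPILE"] === "1") return false;
  return compiledWeightCount > 0;
}

// Token-only eval is the default; MLX_EVAL_ALL_CACHES=1 reverts to evaluating all caches.
function evalStrategy(env: Env): "token-only" | "all-caches" {
  return env["MLX_EVAL_ALL_CACHES"] === "1" ? "all-caches" : "token-only";
}

const defaultPath = useCompiledPath({}, 310);
const disabledPath = useCompiledPath({ MLX_NO_COMPILE: "1" }, 310);
const noWeightsPath = useCompiledPath({}, 0);
```

Only the exact value `1` is treated as set here, matching the table; whether the real implementation accepts other truthy values is not stated in the source.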
+## Process-wide globals
+
+Compiled paths use process-wide globals in `crates/mlx-core/src/models/qwen3_5/model.rs`:
+
+- `DENSE_COMPILED_MUTEX: std::sync::Mutex<()>` — serializes dense compiled-path access
+- `COMPILED_WEIGHTS_RWLOCK: std::sync::RwLock<()>` — read locks during compiled forward, write locks during weight load
+
+The paged-cache code path bypasses both locks entirely (see [paged-cache.md](paged-cache.md) for the compile-lockout contract).
+
+## Metal shaders
+
+`crates/mlx-paged-attn/metal/`:
+
+| File                              | Purpose                               |
+| --------------------------------- | ------------------------------------- |
+| `attention/paged_attention.metal` | Paged-attention kernel                |
+| `cache/reshape_and_cache.metal`   | KV cache reshape operations           |
+| `cache/copy_blocks.metal`         | Block copy for paged cache management |
+| `float8.metal`                    | FP8 type conversions and helpers      |
+| `utils.metal`                     | Common Metal utilities                |
+
+`crates/mlx-sys/build.rs` compiles `.metal` sources into `paged_attn.metallib` and copies both `paged_attn.metallib` and `mlx.metallib` into `target/<profile>/` and `target/<profile>/deps/` so integration tests discover them.