fix(convert): unblock 250GB+ MoE conversions (CPU stream + drain on write) (mlx-node#63)

Brooooooklyn · claude · web-flow · commit cf654d432519 · 2026-05-27T16:53:10.000+08:00
## Summary

`mlx convert` on a 250 GB Qwen3.5 122B-A10B checkpoint (256 experts × 48 layers) failed two different ways:

1. **macOS Metal watchdog kill** (~5 s) when materializing a 1.6 GB sliced view of a fused `experts.gate_up_proj` backed by a cold mmap'd HF shard; this surfaces as `kIOGPUCommandBufferCallbackErrorTimeout` mid-shard.
2. **Silent OOM-kill at shard 35/49** with the MLX allocator at 162 GB active memory: each materialized contiguous backing buffer stayed live in the in-memory `HashMap<String, MxArray>` for the entire sharded save, blowing through 128 GB RAM.

Both fixes are required to convert this checkpoint at all.

### Fix 1: CPU device + stream for convert

Conversion does only slice / reshape / dtype-cast (no real math), so the CPU is semantically correct and immune to the Metal watchdog. A new RAII guard (`CpuConvertGuard`) flips both `set_default_device(CPU)` AND `set_default_stream(cpu_default)` at the start of `convert_model` / `convert_gguf_to_safetensors` and restores the previous values on drop. Setting the stream alone is NOT enough: MLX dispatches stream-less ops via `default_stream(default_device())`, so the device pin is load-bearing. New FFI shims `mlx_default_device` / `mlx_set_default_device` are added to `mlx-sys`.

### Fix 2: Drain the tensor map as each tensor is written

`save_safetensors_single` / `_sharded` / `save_safetensors` now take `&mut HashMap<String, MxArray>` and call `.remove(name)` after each tensor's bytes hit disk. This releases the MLX backing buffer immediately and keeps MLX active memory bounded at ~4.6 GB peak instead of growing unbounded. All callers updated: `convert.rs`, `training_state.rs`, `gguf.rs`, `foreign_weights.rs`, `qwen3/qwen3_5/qwen3_5_moe/model.rs`.

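The drain-on-write idea in Fix 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `safetensors.rs`: `Tensor` is a hypothetical stand-in for `MxArray`, `save_draining` is a made-up name, and the output is raw concatenated bytes rather than the SafeTensors format.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Hypothetical stand-in for `MxArray`: owns the bytes that would
/// otherwise live in an MLX backing buffer.
struct Tensor {
    bytes: Vec<u8>,
}

/// Writes every tensor in sorted-name order and removes it from the map
/// as soon as its bytes have been written, so at most one tensor's
/// buffer needs to stay alive on behalf of the writer.
/// Returns the total number of bytes written.
fn save_draining<W: Write>(
    out: &mut W,
    tensors: &mut HashMap<String, Tensor>,
) -> io::Result<usize> {
    // Snapshot the names first: the map is mutated while we iterate.
    let mut names: Vec<String> = tensors.keys().cloned().collect();
    names.sort();

    let mut written = 0;
    for name in &names {
        if let Some(tensor) = tensors.get(name) {
            out.write_all(&tensor.bytes)?;
            written += tensor.bytes.len();
        }
        // Drop the entry (and its buffer) right after the write succeeds.
        tensors.remove(name);
    }
    Ok(written)
}
```

Because the map is empty once the save returns, a caller that still needs the tensor names (for example to report a count) has to collect them before calling the writer.
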
### Production logs

`info!` level now exposes:

- convert begin/end with structured fields (`input_dir`, `output_dir`, `model_type`, `quantize`, `total_seconds`, `num_tensors`, `num_parameters`)
- per-shard timing, MB, avg MB/s, MLX `active_mb` / `peak_mb` / `cache_mb`
- any single-tensor materialization ≥ 2 s (watchdog / cold-mmap signal)

`debug!` level keeps the full per-tensor trace for deep debugging via `MLX_NODE_LOG="mlx_core::utils::safetensors=debug"`.

## Verification (Qwen3.5 122B-A10B, 250 GB → bf16 MLX)

| Metric | Before | After |
|---|---|---|
| Result | Died at shard 3/49 (Metal watchdog), then shard 35/49 (OOM-kill) | ✓ 49/49 in 11:40 |
| MLX peak memory | 162 GB | **4.6 GB** |
| MLX active (steady-state) | growing unbounded | 0 MB |
| Avg throughput | n/a (crash) | 334 MB/s sustained |

`cargo clippy --all-targets -- -D warnings` and `cargo fmt --check` both clean.

## Test plan

- [x] Qwen3.5 122B-A10B full bf16 conversion completes end-to-end (49 shards + index)
- [x] `cargo clippy --all-targets -- -D warnings`
- [x] `cargo fmt --check`
- [ ] Spot-check that small / already-working conversions (Qwen3 0.6B, smaller MoE) still work; the same code path now uses the CPU stream and is expected to be a no-op or trivially faster
- [ ] Spot-check that GGUF→SafeTensors path is unaffected

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---

> [!NOTE]
> **Medium Risk**
> Changes process-wide MLX default device/stream during convert (documented inference overlap risk) and mutates save APIs site-wide; behavior is intentional for CLI convert but embedders must serialize inference.
>
> **Overview**
> Large HuggingFace / GGUF → MLX conversions are made reliable on huge MoE checkpoints by **routing convert work on CPU** and **releasing MLX memory as each tensor is written**.
>
> A new **`CpuConvertGuard`** temporarily sets MLX's default **device and stream** to CPU for `convert_model` and `convert_gguf_to_safetensors`, then restores them on drop, avoiding Metal watchdog timeouts when materializing multi-GB mmap-backed expert slices. A process-wide **`convert_mutex`** serializes conversions so global MLX defaults aren't raced. **`mlx_default_device` / `mlx_set_default_device`** are added in `mlx-sys` to support this.
>
> **SafeTensors writers** now take `&mut HashMap<String, MxArray>` and **`.remove` each tensor after it's serialized**, so backing buffers don't accumulate through 49-shard saves (fixes silent OOM on ~250 GB models). Call sites in convert, GGUF, foreign weights, Qwen saves, and optimizer state were updated; GGUF/foreign paths **snapshot tensor names before save** because the map may be drained.
>
> **Structured logging** was added for convert start/end, sharded save duration, per-shard throughput, MLX active/peak/cache MB, and slow (≥2 s) tensor materializations.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit 4ba9c3e. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

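The save-and-restore behavior of the guard described in the overview can be illustrated with a small synchronous sketch. Everything here is an assumption made for illustration: `DEFAULT_DEVICE` is a stand-in for MLX's process-wide default (no real FFI is called), `GPU = 1` is an assumed id (the diff only shows `0` being used for CPU), and the lock is bundled into the guard, whereas the change itself keeps a separate `tokio::sync::Mutex` because its entrypoints are async.

```rust
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::{Mutex, MutexGuard};

const CPU: i32 = 0;
const GPU: i32 = 1; // assumed id, for illustration only

/// Hypothetical stand-in for MLX's process-wide default device.
static DEFAULT_DEVICE: AtomicI32 = AtomicI32::new(GPU);

/// Serializes guard holders so two of them cannot observe each other's
/// already-flipped state as the "original" value.
static CONVERT_LOCK: Mutex<()> = Mutex::new(());

/// RAII guard: pins the default device to CPU, restores it on drop.
struct CpuGuard {
    saved_device: i32,
    // Declared last so it is released only after `drop` has restored
    // the saved device.
    _lock: MutexGuard<'static, ()>,
}

impl CpuGuard {
    fn enter() -> Self {
        // Take the lock first, then read-and-replace the global default.
        let lock = CONVERT_LOCK.lock().unwrap_or_else(|e| e.into_inner());
        let saved_device = DEFAULT_DEVICE.swap(CPU, Ordering::SeqCst);
        CpuGuard { saved_device, _lock: lock }
    }
}

impl Drop for CpuGuard {
    fn drop(&mut self) {
        // Runs on every exit path, including early returns via `?`.
        DEFAULT_DEVICE.store(self.saved_device, Ordering::SeqCst);
    }
}

fn current_device() -> i32 {
    DEFAULT_DEVICE.load(Ordering::SeqCst)
}
```

Binding the guard to a named local (such as `_stream_guard` in the diff) keeps it alive until the end of the enclosing scope; binding it to a bare `_` would drop it immediately and restore the default too early.
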
diff --git a/crates/mlx-core/src/convert.rs b/crates/mlx-core/src/convert.rs
@@ -20,6 +20,81 @@ use crate::models::paddleocr_vl::persistence::load_paddleocr_vl_weights;
 use crate::models::qianfan_ocr::persistence::load_qianfan_ocr_weights;
 use crate::utils::safetensors::load_safetensors_lazy;
 
+/// RAII guard that pins the MLX default device + stream to CPU for one
+/// conversion call, then restores the previous values on drop.
+///
+/// Used by the conversion path to temporarily route every MLX op through
+/// CPU for the duration of one `convert_model` /
+/// `convert_gguf_to_safetensors` call. Both the default *device* and the
+/// default *stream* must be switched: MLX dispatches stream-less ops via
+/// `default_stream(default_device())`, so flipping the stream alone is
+/// not enough — the device must be CPU too. On drop, the previous
+/// device and stream are restored so subsequent inference / training
+/// calls keep using the GPU. See the call sites for the rationale.
+///
+/// MUST be acquired while holding `CONVERT_MUTEX`'s lock — otherwise two
+/// overlapping conversions can race on the process-wide MLX defaults and
+/// restore each other's `saved_*` fields incorrectly (e.g. both observe
+/// the already-flipped CPU device as "original", then both restore to
+/// CPU, leaving the process pinned to CPU for the next inference call).
+///
+/// **Concurrent-inference limitation (intentional):** `convert_mutex`
+/// only serializes convert-vs-convert. It does NOT block inference /
+/// training entrypoints. If a Node process runs `convert_model` while
+/// also serving inference, those inference ops resolve their stream via
+/// `default_stream(default_device())` and will be silently routed to
+/// CPU until the conversion finishes — typically minutes to hours on
+/// large MoE checkpoints, with severe latency degradation. The
+/// architecturally correct fix is to plumb explicit `Stream` arguments
+/// through every convert-used MLX FFI op so the global default is never
+/// touched; that's a substantial refactor outside the scope of this
+/// change. For the supported usage today (the `mlx convert` CLI exits
+/// after conversion; no other entrypoint in this codebase invokes
+/// convert), this is a non-issue. Callers who embed convert inside a
+/// long-lived multi-tenant Node process should serialize their own
+/// inference against convert externally.
+pub(crate) struct CpuConvertGuard {
+    saved_device: i32,
+    saved_stream: mlx_sys::mlx_stream,
+}
+
+impl CpuConvertGuard {
+    /// Enter the CPU device + stream. The caller is responsible for holding
+    /// `CONVERT_MUTEX` for the lifetime of the returned guard.
+    pub(crate) fn enter_cpu() -> Self {
+        let saved_device = unsafe { mlx_sys::mlx_default_device() };
+        let saved_stream = unsafe { mlx_sys::mlx_default_stream(saved_device) };
+        unsafe { mlx_sys::mlx_set_default_device(0) };
+        let cpu_stream = unsafe { mlx_sys::mlx_default_stream(0) };
+        unsafe { mlx_sys::mlx_set_default_stream(cpu_stream) };
+        Self {
+            saved_device,
+            saved_stream,
+        }
+    }
+}
+
+impl Drop for CpuConvertGuard {
+    fn drop(&mut self) {
+        unsafe { mlx_sys::mlx_set_default_stream(self.saved_stream) };
+        unsafe { mlx_sys::mlx_set_default_device(self.saved_device) };
+    }
+}
+
+/// Process-wide async mutex serializing all conversion calls.
+///
+/// `convert_model` and `convert_gguf_to_safetensors` mutate MLX's
+/// process-wide default device + default stream via `CpuConvertGuard`,
+/// which is unsafe under concurrency: two overlapping conversions (or a
+/// convert during inference that depends on the GPU default) can race on
+/// the global state. Both NAPI entrypoints `.await` this mutex before
+/// constructing a `CpuConvertGuard`, so only one conversion runs at a
+/// time across the entire Node process.
+pub(crate) fn convert_mutex() -> &'static tokio::sync::Mutex<()> {
+    static CONVERT_MUTEX: std::sync::OnceLock<tokio::sync::Mutex<()>> = std::sync::OnceLock::new();
+    CONVERT_MUTEX.get_or_init(|| tokio::sync::Mutex::new(()))
+}
+
 /// Structure for parsing model.safetensors.index.json
 #[derive(Debug, Deserialize)]
 struct ShardedModelIndex {
@@ -115,6 +190,39 @@ pub struct ConversionResult {
 /// ```
 #[napi]
 pub async fn convert_model(options: ConversionOptions) -> Result<ConversionResult> {
+    let _convert_start = std::time::Instant::now();
+    info!(
+        target = "mlx_core::convert",
+        input_dir = %options.input_dir,
+        output_dir = %options.output_dir,
+        dtype = ?options.dtype,
+        model_type = ?options.model_type,
+        quantize = options.quantize.unwrap_or(false),
+        quant_mode = ?options.quant_mode,
+        quant_recipe = ?options.quant_recipe,
+        "convert_model start"
+    );
+    let result = convert_model_inner(options).await;
+    match &result {
+        Ok(r) => info!(
+            target = "mlx_core::convert",
+            total_seconds = _convert_start.elapsed().as_secs_f64(),
+            num_tensors = r.num_tensors,
+            num_parameters = r.num_parameters,
+            output_path = %r.output_path,
+            "convert_model finished"
+        ),
+        Err(e) => tracing::error!(
+            target = "mlx_core::convert",
+            total_seconds = _convert_start.elapsed().as_secs_f64(),
+            error = %e,
+            "convert_model failed"
+        ),
+    }
+    result
+}
+
+async fn convert_model_inner(options: ConversionOptions) -> Result<ConversionResult> {
     let input_dir = PathBuf::from(&options.input_dir);
     let output_dir = PathBuf::from(&options.output_dir);
     let target_dtype = options.dtype.unwrap_or_else(|| "float32".to_string());
@@ -249,6 +357,23 @@ pub async fn convert_model(options: ConversionOptions) -> Result<ConversionResul
         )));
     }
 
+    // Serialize all conversions process-wide before touching MLX's default
+    // device + stream — see `convert_mutex` and `CpuConvertGuard` docs for
+    // the race this avoids.
+    let _convert_lock = convert_mutex().lock().await;
+
+    // Route every MLX op in this conversion through the CPU device + stream.
+    //
+    // The conversion path is slice / reshape / dtype-cast only — no real math.
+    // On GPU, materializing a 1.6 GB sliced view of a fused expert tensor backed
+    // by a 250 GB mmap'd source can stall a Metal command buffer past the macOS
+    // GPU watchdog (~5 s), surfacing as
+    // `kIOGPUCommandBufferCallbackErrorTimeout` mid-shard for large MoE models
+    // (e.g. Qwen3.5 122B-A10B with 256 experts × 48 layers). CPU has direct
+    // access to the mmap'd pages and is immune to the watchdog. `_stream_guard`
+    // restores the prior default device + stream when convert_model returns.
+    let _stream_guard = CpuConvertGuard::enter_cpu();
+
     // Check for required files
     let config_path = input_dir.join("config.json");
     if !config_path.exists() {
@@ -660,9 +785,20 @@ pub async fn convert_model(options: ConversionOptions) -> Result<ConversionResul
     tensor_names.sort();
 
     // Save converted model — sharded output with index file (mlx-lm/mlx-vlm compatible)
-    info!("Saving converted model to: {}", output_dir.display());
+    info!(
+        target = "mlx_core::convert",
+        output_dir = %output_dir.display(),
+        num_tensors = converted_tensors.len(),
+        "starting sharded save"
+    );
 
-    crate::utils::safetensors::save_safetensors_sharded(&output_dir, &converted_tensors)?;
+    let save_start = std::time::Instant::now();
+    crate::utils::safetensors::save_safetensors_sharded(&output_dir, &mut converted_tensors)?;
+    info!(
+        target = "mlx_core::convert",
+        save_seconds = save_start.elapsed().as_secs_f64(),
+        "sharded save complete"
+    );
 
     // Write config.json — clean and sort keys to match mlx-lm/mlx-vlm save_config
     let output_config_path = output_dir.join("config.json");
diff --git a/crates/mlx-core/src/models/qwen3/model.rs b/crates/mlx-core/src/models/qwen3/model.rs
@@ -4479,7 +4479,7 @@ impl Qwen3Inner {
             }
         }
 
-        let params_clone: HashMap<String, MxArray> =
+        let mut params_clone: HashMap<String, MxArray> =
             params.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
 
         // Build weights.mlx metadata (shape + dtype only; full data is in safetensors).
@@ -4528,7 +4528,11 @@ impl Qwen3Inner {
             "format": "mlx-node",
             "version": "1.0"
         }));
-        crate::utils::safetensors::save_safetensors(&safetensors_path, &params_clone, metadata)?;
+        crate::utils::safetensors::save_safetensors(
+            &safetensors_path,
+            &mut params_clone,
+            metadata,
+        )?;
         info!("Saved weights.safetensors");
 
         let weights_str = serde_json::to_string_pretty(&weights_json)?;
diff --git a/crates/mlx-core/src/models/qwen3_5/model.rs b/crates/mlx-core/src/models/qwen3_5/model.rs
@@ -1353,7 +1353,7 @@ impl Qwen35Inner {
             }
         }
 
-        let params_clone: HashMap<String, MxArray> =
+        let mut params_clone: HashMap<String, MxArray> =
             params.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
 
         // Weights metadata
@@ -1401,7 +1401,11 @@ impl Qwen35Inner {
             "format": "mlx-node",
             "version": "1.0"
         }));
-        crate::utils::safetensors::save_safetensors(&safetensors_path, &params_clone, metadata)?;
+        crate::utils::safetensors::save_safetensors(
+            &safetensors_path,
+            &mut params_clone,
+            metadata,
+        )?;
         info!("Saved weights.safetensors");
 
         let weights_str = serde_json::to_string_pretty(&weights_json)?;
diff --git a/crates/mlx-core/src/models/qwen3_5_moe/model.rs b/crates/mlx-core/src/models/qwen3_5_moe/model.rs
@@ -5303,7 +5303,7 @@ impl Qwen35MoeInner {
             }
         }
 
-        let params_clone: HashMap<String, MxArray> =
+        let mut params_clone: HashMap<String, MxArray> =
             params.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
 
         // Weights metadata (reference sidecar)
@@ -5351,7 +5351,11 @@ impl Qwen35MoeInner {
             "format": "mlx-node",
             "version": "1.0"
         }));
-        crate::utils::safetensors::save_safetensors(&safetensors_path, &params_clone, metadata)?;
+        crate::utils::safetensors::save_safetensors(
+            &safetensors_path,
+            &mut params_clone,
+            metadata,
+        )?;
         info!("Saved weights.safetensors");
 
         let weights_str = serde_json::to_string_pretty(&weights_json)?;
diff --git a/crates/mlx-core/src/training_state.rs b/crates/mlx-core/src/training_state.rs
@@ -124,7 +124,7 @@ impl ModelThreadTrainingState {
             "step": step.to_string(),
             "format": "adamw_optimizer_state",
         });
-        crate::utils::safetensors::save_safetensors(path, &tensors, Some(metadata))
+        crate::utils::safetensors::save_safetensors(path, &mut tensors, Some(metadata))
     }
 
     /// Restore AdamW moment tensors + step from a SafeTensors file.
@@ -389,7 +389,7 @@ mod tests {
             let arr = MxArray::from_float32(&[*val], &[1]).unwrap();
             tensor_map.insert(key.to_string(), arr);
         }
-        crate::utils::safetensors::save_safetensors(path, &tensor_map, metadata).unwrap();
+        crate::utils::safetensors::save_safetensors(path, &mut tensor_map, metadata).unwrap();
     }
 
     // =========================================================================
diff --git a/crates/mlx-core/src/utils/foreign_weights.rs b/crates/mlx-core/src/utils/foreign_weights.rs
@@ -69,7 +69,7 @@ pub fn convert_foreign_weights(
         ))
     })?;
 
-    let (tensors, config_json) = match options.model_type.as_str() {
+    let (mut tensors, config_json) = match options.model_type.as_str() {
         "pp-lcnet-ori" => convert_pp_lcnet_ori(&input_path, verbose)?,
         "uvdoc" => convert_uvdoc(&input_path, verbose)?,
         other => {
@@ -81,9 +81,9 @@ pub fn convert_foreign_weights(
 
     // Save SafeTensors
     let weights_path = output_dir.join("model.safetensors");
-    save_safetensors(&weights_path, &tensors, None)?;
-
+    // Capture names BEFORE save (save drains the map for memory reasons).
     let mut tensor_names: Vec<String> = tensors.keys().cloned().collect();
+    save_safetensors(&weights_path, &mut tensors, None)?;
     tensor_names.sort();
     let num_tensors = tensor_names.len() as i32;
 
diff --git a/crates/mlx-core/src/utils/gguf.rs b/crates/mlx-core/src/utils/gguf.rs
@@ -1279,6 +1279,15 @@ pub async fn convert_gguf_to_safetensors(
         )));
     }
 
+    // Serialize all conversions process-wide before touching MLX's default
+    // device + stream. Then route every MLX op through CPU for the duration
+    // of this call. See `crate::convert::convert_mutex` and
+    // `crate::convert::CpuConvertGuard` for the full rationale — same
+    // reasoning applies here for GGUF→SafeTensors conversion of huge MoE
+    // checkpoints.
+    let _convert_lock = crate::convert::convert_mutex().lock().await;
+    let _stream_guard = crate::convert::CpuConvertGuard::enter_cpu();
+
     // Parse GGUF header and metadata
     info!("Parsing GGUF file: {}", input_path.display());
     let gguf = parse_gguf(&input_path)?;
@@ -1569,10 +1578,16 @@ pub async fn convert_gguf_to_safetensors(
         .unwrap_or("model.safetensors");
     let safetensors_path = output_dir.join(safetensors_filename);
     info!("Saving to {}", safetensors_path.display());
+    // Capture tensor names BEFORE `save_safetensors` — it drains `weights`
+    // as it streams each tensor to disk so MLX-allocated backing buffers
+    // can be released immediately on large MoE checkpoints. Reading
+    // `weights.keys()` after the save would return an empty list and the
+    // GgufConversionResult would report num_tensors = 0 to JS callers.
+    let tensor_names: Vec<String> = weights.keys().cloned().collect();
     // Add "format: mlx" metadata so loaders (e.g., mlx-vlm) know weights are
     // already in MLX layout and skip sanitize (which would double-apply +1.0 to norms).
     let st_metadata = serde_json::json!({ "format": "mlx" });
-    save_safetensors(&safetensors_path, &weights, Some(st_metadata))?;
+    save_safetensors(&safetensors_path, &mut weights, Some(st_metadata))?;
 
     // Only write config.json and tokenizer files for the primary model file.
     // Secondary files (e.g., vision.safetensors for mmproj) should not overwrite
@@ -1652,8 +1667,6 @@ pub async fn convert_gguf_to_safetensors(
         .collect::<Vec<_>>()
         .join(", ");
 
-    let tensor_names: Vec<String> = weights.keys().cloned().collect();
-
     Ok(GgufConversionResult {
         num_tensors: tensor_names.len() as i32,
         num_parameters,
diff --git a/crates/mlx-core/src/utils/safetensors.rs b/crates/mlx-core/src/utils/safetensors.rs
diff --git a/crates/mlx-sys/src/lib.rs b/crates/mlx-sys/src/lib.rs
diff --git a/crates/mlx-sys/src/mlx_stream.cpp b/crates/mlx-sys/src/mlx_stream.cpp
