## Summary
`mlx convert` on a 250 GB Qwen3.5 122B-A10B checkpoint (256 experts × 48
layers) failed in two different ways:
1. **macOS Metal watchdog kill** (~5 s) when materializing a 1.6 GB
sliced view of a fused `experts.gate_up_proj` backed by a cold mmap'd
HF shard; surfaces as `kIOGPUCommandBufferCallbackErrorTimeout`
mid-shard.
2. **Silent OOM-kill at shard 35/49** with the MLX allocator at 162 GB
active memory: each materialized contiguous backing buffer stayed live
in the in-memory `HashMap<String, MxArray>` for the entire sharded
save, blowing through 128 GB RAM.
Both fixes are required to convert this checkpoint at all.
### Fix 1: CPU device + stream for convert
Conversion only slices, reshapes, and casts dtypes (no real math), so
the CPU is semantically correct and immune to the Metal watchdog. A new
RAII guard (`ConvertDefaultStreamGuard` /
`ConvertGgufDefaultStreamGuard`) flips both
`set_default_device(CPU)` AND `set_default_stream(cpu_default)` at
the start of `convert_model` / `convert_gguf_to_safetensors` and
restores the previous values on drop.
Setting the stream alone is NOT enough: MLX dispatches stream-less ops
via `default_stream(default_device())`, so the device pin is
load-bearing. New FFI shims `mlx_default_device` /
`mlx_set_default_device` are added to `mlx-sys`.
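
The save-and-restore pattern described above can be sketched roughly as follows. This is an illustration only, not the code from this change: `Device`, `Stream`, `CpuPinGuard`, and the thread-local globals are hypothetical stand-ins for MLX's process-wide defaults and the real FFI shims, so that the sketch compiles without `mlx-sys`.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for an MLX device kind.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Device {
    Cpu,
    Gpu,
}

/// Hypothetical stand-in for an MLX stream handle.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Stream {
    pub device: Device,
    pub index: u32,
}

thread_local! {
    // Assumption: thread-locals model MLX's global defaults so the sketch is
    // self-contained. The real defaults are process-wide.
    static DEFAULT_DEVICE: Cell<Device> = Cell::new(Device::Gpu);
    static DEFAULT_STREAM: Cell<Stream> =
        Cell::new(Stream { device: Device::Gpu, index: 0 });
}

pub fn default_device() -> Device {
    DEFAULT_DEVICE.with(|d| d.get())
}

pub fn default_stream() -> Stream {
    DEFAULT_STREAM.with(|s| s.get())
}

fn set_default_device(device: Device) {
    DEFAULT_DEVICE.with(|d| d.set(device));
}

fn set_default_stream(stream: Stream) {
    DEFAULT_STREAM.with(|s| s.set(stream));
}

/// RAII guard: pins BOTH the default device and the default stream to CPU,
/// and restores the previous values when dropped.
pub struct CpuPinGuard {
    prev_device: Device,
    prev_stream: Stream,
}

impl CpuPinGuard {
    pub fn pin() -> Self {
        // Capture the previous defaults before overwriting them.
        let guard = CpuPinGuard {
            prev_device: default_device(),
            prev_stream: default_stream(),
        };
        set_default_device(Device::Cpu);
        set_default_stream(Stream { device: Device::Cpu, index: 0 });
        guard
    }
}

impl Drop for CpuPinGuard {
    fn drop(&mut self) {
        // Runs on normal scope exit, early return, and unwinding panics.
        set_default_stream(self.prev_stream);
        set_default_device(self.prev_device);
    }
}
```

Because restoration lives in `Drop`, an early `?` return from the conversion function still puts the previous defaults back.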
### Fix 2: Drain the tensor map as each tensor is written
`save_safetensors_single` / `_sharded` / `save_safetensors` now
take `&mut HashMap<String, MxArray>` and call `.remove(name)` after
each tensor's bytes hit disk. This releases the MLX backing buffer
immediately and keeps MLX active memory bounded at ~4.6 GB peak instead
of growing unbounded.
All callers updated: `convert.rs`, `training_state.rs`, `gguf.rs`,
`foreign_weights.rs`, `qwen3/qwen3_5/qwen3_5_moe/model.rs`.
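
A minimal sketch of the drain-as-you-write idea, assuming plain `Vec<u8>` buffers in place of `MxArray` and a generic `Write` sink in place of the real safetensors writer. The function name `write_and_drain` is hypothetical and does not appear in this change.

```rust
use std::collections::HashMap;
use std::io::{self, Write};

/// Writes every tensor's bytes to `out` in sorted name order, removing each
/// entry from `tensors` right after its bytes have been written so that its
/// buffer is freed immediately instead of living until the whole save ends.
///
/// Returns the names in the order they were written. Callers that need the
/// names afterwards must use this snapshot, because the map is empty on
/// success.
pub fn write_and_drain<W: Write>(
    tensors: &mut HashMap<String, Vec<u8>>,
    out: &mut W,
) -> io::Result<Vec<String>> {
    // Snapshot the keys first: the map is mutated inside the loop.
    let mut names: Vec<String> = tensors.keys().cloned().collect();
    names.sort();

    for name in &names {
        if let Some(bytes) = tensors.get(name) {
            out.write_all(bytes)?;
        }
        // Remove only after the write succeeded; the buffer is dropped here.
        tensors.remove(name);
    }
    Ok(names)
}
```

If a write fails midway, the tensors not yet written (including the one that failed) remain in the map, while those already written are gone. That is why the function takes `&mut` rather than a shared reference.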
### Production logs
`info!` level now exposes:
- convert begin/end with structured fields (`input_dir`,
`output_dir`, `model_type`, `quantize`, `total_seconds`,
`num_tensors`, `num_parameters`)
- per-shard timing, MB, avg MB/s, MLX `active_mb` / `peak_mb` /
`cache_mb`
- any single-tensor materialization ≥ 2 s (watchdog / cold-mmap signal)
`debug!` level keeps the full per-tensor trace for deep debugging via
`MLX_NODE_LOG="mlx_core::utils::safetensors=debug"`.
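
The slow-materialization check and per-shard throughput figure could be computed along these lines. This is a sketch under stated assumptions: the helper names (`timed`, `is_slow`, `avg_mb_per_s`) are hypothetical, and "MB" is assumed to mean 10^6 bytes, which the description does not specify.

```rust
use std::time::{Duration, Instant};

/// Threshold from the description: materializations taking 2 s or longer
/// are reported at info level.
pub const SLOW_THRESHOLD: Duration = Duration::from_secs(2);

/// Runs `f` and returns its result together with the wall-clock time it took.
pub fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

/// True when a single materialization is slow enough to be worth logging.
pub fn is_slow(elapsed: Duration) -> bool {
    elapsed >= SLOW_THRESHOLD
}

/// Average throughput in MB/s (assumption: 1 MB = 1_000_000 bytes).
/// Returns 0.0 for a zero-length interval to avoid dividing by zero.
pub fn avg_mb_per_s(bytes: u64, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0;
    }
    (bytes as f64 / 1_000_000.0) / secs
}
```

A caller would wrap each tensor materialization in `timed`, log at info level when `is_slow` returns true, and otherwise leave the measurement to the debug-level trace.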
## Verification (Qwen3.5 122B-A10B, 250 GB → bf16 MLX)
| Metric | Before | After |
|---|---|---|
| Result | Died at shard 3/49 (Metal watchdog), then shard 35/49 (OOM-kill) | ✓ 49/49 in 11:40 |
| MLX peak memory | 162 GB | **4.6 GB** |
| MLX active (steady-state) | growing unbounded | 0 MB |
| Avg throughput | n/a (crash) | 334 MB/s sustained |
`cargo clippy --all-targets -- -D warnings` and `cargo fmt --check`
both clean.
## Test plan
- [x] Qwen3.5 122B-A10B full bf16 conversion completes end-to-end (49
shards + index)
- [x] `cargo clippy --all-targets -- -D warnings`
- [x] `cargo fmt --check`
- [ ] Spot-check that small / already-working conversions (Qwen3 0.6B,
smaller MoE) still work. The same code path now uses the CPU stream, which
is expected to be a no-op or trivially faster
- [ ] Spot-check that GGUF→SafeTensors path is unaffected
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- CURSOR_SUMMARY -->
---
> [!NOTE]
> **Medium Risk**
> Changes process-wide MLX default device/stream during convert
(documented inference overlap risk) and mutates save APIs site-wide;
behavior is intentional for CLI convert but embedders must serialize
inference.
>
> **Overview**
> Large HuggingFace / GGUF → MLX conversions are made reliable on huge
MoE checkpoints by **routing convert work on CPU** and **releasing MLX
memory as each tensor is written**.
>
> A new **`CpuConvertGuard`** temporarily sets MLX’s default **device
and stream** to CPU for `convert_model` and
`convert_gguf_to_safetensors`, then restores them on drop—avoiding Metal
watchdog timeouts when materializing multi‑GB mmap-backed expert slices.
A process-wide **`convert_mutex`** serializes conversions so global MLX
defaults aren’t raced. **`mlx_default_device` /
`mlx_set_default_device`** are added in `mlx-sys` to support this.
>
> **SafeTensors writers** now take `&mut HashMap<String, MxArray>` and
**`.remove` each tensor after it’s serialized**, so backing buffers
don’t accumulate through 49‑shard saves (fixes silent OOM on ~250 GB
models). Call sites in convert, GGUF, foreign weights, Qwen saves, and
optimizer state were updated; GGUF/foreign paths **snapshot tensor names
before save** because the map may be drained.
>
> **Structured logging** was added for convert start/end, sharded save
duration, per-shard throughput, MLX active/peak/cache MB, and slow
(≥2 s) tensor materializations.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
4ba9c3e. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>