Skip to content

Metal: embed() hangs sporadically on Apple M5 Max (~10-30% of calls) #595

@jowitte

Description

@jowitte

Bug

embed() hangs sporadically on Apple M5 Max with Metal backend. The same query succeeds in ~0.7s on one call and hangs indefinitely (99% CPU on one core) on the next. The behavior is not query-dependent — identical inputs produce different outcomes. CPU-only mode (gpu: false) is 100% stable but ~50x slower.

Reproduction

Minimal test: run the same embedding call 10 times sequentially. Expect 1-3 hangs per 10 runs.

import { getLlama } from "node-llama-cpp";

const llama = await getLlama(); // Metal auto-detected
const model = await llama.loadModel({
  modelPath: "nomic-embed-text-v2-moe.Q8_0.gguf"  // 488 MB MoE embedding model
});
const ctx = await model.createEmbeddingContext();

for (let i = 0; i < 10; i++) {
  console.time(`run ${i}`);
  const vec = await ctx.getEmbeddingFor("test query about delegation");
  console.timeEnd(`run ${i}`);  // ~0.7s when it works, never completes when it hangs
}

In practice, observed via qmd CLI which calls embed() for vector search:

# Run 10 sequential vsearch calls, 8s timeout each:
export GGML_METAL_NO_RESIDENCY=1
for i in $(seq 1 10); do
  bun dist/cli/qmd.js vsearch "Delegieren" -n 3 &
  pid=$!; sleep 8
  kill -0 $pid 2>/dev/null && { kill $pid; echo "Run $i: HANG"; } || echo "Run $i: OK"
done

# Typical result: 9 OK, 1 HANG (without GGML_METAL_NO_RESIDENCY: 5-7 OK, 3-5 HANG)

Workaround

GGML_METAL_NO_RESIDENCY=1 reduces hang rate from ~30-50% to ~10% but does not eliminate it. This env var disables Metal residency sets for buffer allocation (related to llama.cpp autorelease/buffer management, see ggml-org/llama.cpp#18568).

What doesn't help

  • Upgrading llama.cpp: Rebuilt with b8783 (from b8390 in 3.18.1) — hang rate actually increased to ~40%
  • GGML_METAL_TENSOR_DISABLE=1: No improvement, possibly worse
  • Both flags combined: Worse than GGML_METAL_NO_RESIDENCY alone
  • AbortSignal: signal parameter on prompt() is not respected during native Metal compute — the call blocks the event loop

Observations

  • The hang occurs in the native Metal compute pipeline, not in JS
  • Process shows 99-100% CPU on one core during hang
  • No crash, no error, no output — just infinite computation
  • The model (nomic-embed-text-v2-moe) is a Mixture-of-Experts architecture — MoE routing may trigger different Metal kernel paths
  • First call after process start seems slightly more likely to hang (cold start)
  • Concurrent embed() calls guarantee deadlock (separate known issue)

Related issues

Environment

  • node-llama-cpp: 3.18.1
  • llama.cpp: b8390 (prebuilt)
  • Hardware: Apple M5 Max, 128 GB RAM
  • OS: macOS 26.4.1 (Tahoe)
  • Runtime: Bun 1.3.12 (also tested with Node 25.9.0 — same behavior)
  • Model: nomic-embed-text-v2-moe Q8_0 (488 MB, 768 dimensions, MoE)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions