Add best-stack follow-up bundle: MLX on M5, dream-server ROCm 7 on Strix #25

Merged
Lightheartdevs merged 5 commits into main from best-stack-followup-2026-05-17
May 17, 2026
Conversation

@Lightheartdevs
Contributor

This bundle answers the canonical study's audit ask for an Apple + AMD "best-stack matrix": does each vendor's productized inference stack accelerate the workload beyond upstream llama.cpp at the canonical pin?

Two opposite-direction findings:

  1. MLX on M5 Max IS a real productized lift over canonical llama.cpp Metal. Decode +6.0% (27B dense), +15.6% (35B-A3B MoE). Prefill +35.2% (27B dense), +53.6% (35B-A3B MoE). Tight SDs (±0.01-0.66 tok/s). Buyer-actionable: default to MLX on M5.

  2. AMD's productized Strix Halo Linux stack works but does not lift on prefill. dream-server ships a custom llama.cpp build at /opt/llama-custom/ linked against ROCm 7 (libamdhip64.so.7); we caught this while bringing the bench up. ROCm 7 resolves the canonical "ROCm broken on Strix Halo" loading bug — that finding was specific to v6.4.4 + b9151 — but its measured decode matches vanilla Vulkan within noise and its prefill is 2.4× slower because the bundled engine is at an older upstream commit (ff5ef82) than the canonical b9151. The actually-novel productized AMD path (Lemonade Server's OGA backend on Windows + DirectML + INT4 NPU) is not exercised here.
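A minimal sketch of how a percentage lift and a "within noise" call like the ones above could be derived from per-run throughputs. The run values are hypothetical placeholders, not measurements from this bundle, and the noise rule (means closer than `k` times the larger sample SD) is an assumption; the harness may define it differently.

```python
from statistics import mean, stdev


def relative_lift_pct(candidate, baseline):
    """Percent change of the candidate mean relative to the baseline mean."""
    return (mean(candidate) - mean(baseline)) / mean(baseline) * 100.0


def within_noise(candidate, baseline, k=2.0):
    """True if the means differ by less than k times the larger sample SD.

    This rule is an illustrative assumption, not the study's definition.
    """
    spread = max(stdev(candidate), stdev(baseline))
    return abs(mean(candidate) - mean(baseline)) < k * spread


# Hypothetical per-run decode throughputs in tok/s (placeholders only).
baseline_runs = [10.0, 10.2, 9.8]
candidate_runs = [11.0, 11.1, 10.9]

lift = relative_lift_pct(candidate_runs, baseline_runs)
noisy = within_noise(candidate_runs, baseline_runs)
print(f"lift = {lift:+.1f}%, within noise: {noisy}")
```

With tight SDs, even a single-digit lift clears a rule like this, which is why the SD range is reported next to the percentages.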

Status:

  • M5 MLX 27B grid: 12/12 conc=1 cells complete
  • M5 MLX 35B-A3B grid: 12/12 conc=1 cells complete
  • Strix dream-server-rocm7 grid: 6/12 conc=1 cells (ctx<=4K); ctx=16K and 32K cells still running, will land in follow-up commits

All cells use MMBT-canonical cell.json schema. Same prompt corpus as canonical study (SHA pinned). Harness vendored at harness/lib/.

🤖 Generated with Claude Code

Michael Bradley and others added 5 commits May 17, 2026 13:06
The ctx=16K cell reveals the older-engine cost dramatically: TTFT is 185.6 s
versus canonical Vulkan's ~30 s at the same cell. Decode at ctx=16K is 7.06
tok/s (vs canonical 7.50 tok/s, ~6% slower). Prefill at this cell is 84 tok/s.

The pattern mirrors MLX with the sign flipped: a small advantage or
disadvantage at short ctx grows at long ctx, where prefill accounts for a
larger share of total request time. For dream-server-rocm7 the gap runs in
the wrong direction, a cost of the older bundled engine.
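A rough sketch of why a fixed prefill gap matters more at long context. It assumes a simplified two-phase model (TTFT is roughly prompt tokens divided by prefill throughput, followed by decode at a constant rate) and uses hypothetical throughputs, not the measured cells.

```python
def request_seconds(prompt_tokens, gen_tokens, prefill_tps, decode_tps):
    """Approximate (TTFT, total) wall time under a two-phase model.

    Ignores prompt caching, scheduling overhead, and any decode slowdown
    at long context; those are simplifying assumptions.
    """
    ttft = prompt_tokens / prefill_tps
    return ttft, ttft + gen_tokens / decode_tps


def slowdown(prompt_tokens, gen_tokens=128):
    """End-to-end time ratio of a slow-prefill engine to a fast one.

    Hypothetical engines: identical decode speed, 2x difference in prefill.
    """
    _, fast_total = request_seconds(prompt_tokens, gen_tokens, 400.0, 8.0)
    _, slow_total = request_seconds(prompt_tokens, gen_tokens, 200.0, 8.0)
    return slow_total / fast_total


for ctx in (1024, 16384):
    print(f"ctx={ctx}: end-to-end slowdown {slowdown(ctx):.2f}x")
```

In this model the end-to-end ratio climbs toward the raw prefill ratio as the prompt grows, because decode time stays constant while prefill time scales with context.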

7/12 conc=1 cells now present. Remaining: ctx=16K gen=512/2048, ctx=32K all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le caveat

P1: New claims added to claims.yaml under "Best-stack follow-up" section:
- hw.best-stack.m5.mlx-beats-metal (+6% / +15.6% decode, +35.2% / +53.6%
  prefill on Apple's productized stack vs vanilla llama.cpp Metal)
- hw.best-stack.strix.rocm7-works (narrows canonical's
  hw.q8.engine.rocm-strix-halo-segfault to v6.4.4 specifically; v7
  loads and serves)
- hw.best-stack.strix.no-prefill-lift (decode matches Vulkan within
  noise, prefill ~2.4× slower due to engine vintage, not ROCm-vs-Vulkan)

P2: AUDIT.md added — locked vs varied inputs, what comparison can/cannot
support, 8-entry B-list (engine vintage, MLX quant scheme, MLX concurrency
semantics, cache_prompt handling, no power co-sample, cold-start
differences, the engine-identification gotcha that motivated this audit
doc, HF refs are floating).

P2: workloads/prompts.jsonl.sha256 fixed to valid sha256sum -c format
(hash + two spaces + filename). Verified: `shasum -a 256 -c` succeeds.
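For reference, a minimal Python sketch of the checksum layout described above (hex digest, two spaces, filename) plus a matching verifier. The helper names and the demo file are hypothetical; this is not the harness's code, only an illustration of the format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_hex(path):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_line(path):
    """One line in the described layout: hash, two spaces, filename."""
    return f"{sha256_hex(path)}  {Path(path).name}\n"


def verify(checksum_file):
    """Re-hash every listed file, resolved relative to the checksum file."""
    base = Path(checksum_file).parent
    for line in Path(checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split("  ", 1)
        if sha256_hex(base / name) != expected:
            return False
    return True


# Demo on a throwaway file; the name and contents are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"prompt": "hello"}\n')
    sums = Path(tmp) / "sample.jsonl.sha256"
    sums.write_text(checksum_line(sample))
    demo_ok = verify(sums)            # file unchanged
    sample.write_text("tampered\n")
    demo_tampered_ok = verify(sums)   # file modified after hashing
```

Storing only the basename keeps the file verifiable from its own directory.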

P2: workloads/mlx-models.sha256 added — pins every safetensors shard +
config + tokenizer + index file for both MLX models. README's reproducer
section updated to show the verify step.

P2: README contradiction about 35B ctx=32K gen=2048 cell fixed — both
locations now consistently report the grid as complete.

manifest.json see_also now points at this bundle's own AUDIT.md and at
claims.yaml claim IDs for cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P2: Strix status text now aligned with the data (7/12 cells, ctx=16K
gen=128 present). Fixed in:
- README.md status table row
- README.md headline at-a-glance (replaced "_ctx=16K cell pending_"
  with the actual decode_tps_at_ctx16k value 7.06)
- findings.md § Status of this PR
- manifest.json deferred_to_follow_up wording

P2: findings.md "weight-shard SHAs are not pinned" stale paragraph
rewritten to point readers at workloads/mlx-models.sha256 and give
the shasum -c verify command.

P3: AUDIT.md claim-id typo fixed — was hw.best-stack.strix-halo.rocm7-works,
actual id is hw.best-stack.strix.rocm7-works.
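A typo like this one could be caught mechanically by cross-checking the claim IDs a document mentions against the defined set. The sketch below is hypothetical: the regex is inferred only from the IDs quoted in this PR, and how the defined IDs are loaded from claims.yaml is left to the caller because its schema is not shown here.

```python
import re

# Pattern inferred from IDs quoted in this PR (e.g. hw.best-stack.strix.rocm7-works);
# the real ID grammar is an assumption.
CLAIM_ID = re.compile(r"\bhw\.[a-z0-9]+(?:[.-][a-z0-9]+)*")


def undefined_claim_ids(doc_text, defined_ids):
    """Return claim IDs mentioned in doc_text that are not in defined_ids."""
    mentioned = set(CLAIM_ID.findall(doc_text))
    return sorted(mentioned - set(defined_ids))


defined = {
    "hw.best-stack.m5.mlx-beats-metal",
    "hw.best-stack.strix.rocm7-works",
    "hw.best-stack.strix.no-prefill-lift",
}
audit_text = (
    "See hw.best-stack.strix-halo.rocm7-works and "
    "hw.best-stack.strix.no-prefill-lift."
)
missing = undefined_claim_ids(audit_text, defined)
print(missing)
```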

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Lightheartdevs Lightheartdevs merged commit 5ea9642 into main May 17, 2026
1 check passed