Add best-stack follow-up bundle: MLX on M5, dream-server ROCm 7 on Strix #25
Merged
This bundle answers the canonical study's audit ask for an Apple + AMD "best-stack matrix": does each vendor's productized inference stack accelerate the workload beyond upstream llama.cpp at the canonical pin?

Two opposite-direction findings:

1. MLX on M5 Max IS a real productized lift over canonical llama.cpp Metal. Decode +6.0% (27B dense), +15.6% (35B-A3B MoE). Prefill +35.2% (27B dense), +53.6% (35B-A3B MoE). Tight SDs (±0.01-0.66 tok/s). Buyer-actionable: default to MLX on M5.
2. AMD's productized Strix Halo Linux stack works but does not lift on prefill. dream-server ships a custom llama.cpp build at /opt/llama-custom/ linked against ROCm 7 (libamdhip64.so.7); we caught this while bringing the bench up. ROCm 7 resolves the canonical "ROCm broken on Strix Halo" loading bug (that finding was specific to v6.4.4 + b9151), but its measured decode matches vanilla Vulkan within noise and its prefill is 2.4× slower because the bundled engine is at an older upstream commit (ff5ef82) than the canonical b9151. The actually-novel productized AMD path (Lemonade Server's OGA backend on Windows + DirectML + INT4 NPU) is not exercised here.

Status:

- M5 MLX 27B grid: 12/12 conc=1 cells complete
- M5 MLX 35B-A3B grid: 12/12 conc=1 cells complete
- Strix dream-server-rocm7 grid: 6/12 conc=1 cells (ctx<=4K); ctx=16K and 32K cells still running, will land in follow-up commits

All cells use MMBT-canonical cell.json schema. Same prompt corpus as canonical study (SHA pinned). Harness vendored at harness/lib/.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ctx=16K cell shows the cost of the older engine clearly: TTFT is 185.6 s versus canonical Vulkan's ~30 s at the same cell. Decode at ctx=16K is 7.06 tok/s (vs canonical 7.50 tok/s, ~6% slower). Prefill at this cell is 84 tok/s.

The pattern matches MLX: a small advantage or disadvantage at short ctx grows at long ctx because prefill amortization shifts. For dream-server-rocm7 the shift goes in the wrong direction, and the cost comes from the older bundled engine.

7/12 conc=1 cells now present. Remaining: ctx=16K gen=512/2048, ctx=32K all.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
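The comparisons in this commit message are simple ratios, and a minimal Python sketch of the arithmetic follows. The function names are hypothetical and not part of the vendored harness. The canonical TTFT of 30 s is the rounded "~30 s" figure quoted above, so the resulting factor is approximate.

```python
def rel_change_pct(measured: float, baseline: float) -> float:
    """Signed percent change of `measured` relative to `baseline`."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (measured - baseline) / baseline * 100.0


def slowdown_factor(measured_s: float, baseline_s: float) -> float:
    """How many times longer `measured_s` is than `baseline_s`."""
    if baseline_s == 0:
        raise ValueError("baseline must be non-zero")
    return measured_s / baseline_s


# Figures quoted in the commit message for the ctx=16K cell.
decode_delta = rel_change_pct(7.06, 7.50)    # dream-server-rocm7 vs canonical Vulkan
ttft_factor = slowdown_factor(185.6, 30.0)   # 30.0 is the rounded "~30 s" value

print(f"decode: {decode_delta:+.1f}%")
print(f"TTFT:   {ttft_factor:.1f}x")
```

With these inputs the decode delta comes out at about -5.9%, which is the "~6% slower" quoted above, and TTFT is roughly 6.2× the canonical value.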
…le caveat

P1: New claims added to claims.yaml under "Best-stack follow-up" section:

- hw.best-stack.m5.mlx-beats-metal (+6% / +15.6% decode, +35.2% / +53.6% prefill on Apple's productized stack vs vanilla llama.cpp Metal)
- hw.best-stack.strix.rocm7-works (narrows canonical's hw.q8.engine.rocm-strix-halo-segfault to v6.4.4 specifically; v7 loads and serves)
- hw.best-stack.strix.no-prefill-lift (decode matches Vulkan within noise, prefill ~2.4× slower due to engine vintage, not ROCm-vs-Vulkan)

P2: AUDIT.md added. It covers locked vs varied inputs, what the comparison can and cannot support, and an 8-entry B-list (engine vintage, MLX quant scheme, MLX concurrency semantics, cache_prompt handling, no power co-sample, cold-start differences, the engine-identification gotcha that motivated this audit doc, HF refs are floating).

P2: workloads/prompts.jsonl.sha256 fixed to valid sha256sum -c format (hash + two spaces + filename). Verified: `shasum -a 256 -c` succeeds.

P2: workloads/mlx-models.sha256 added. It pins every safetensors shard + config + tokenizer + index file for both MLX models. README's reproducer section updated to show the verify step.

P2: README contradiction about 35B ctx=32K gen=2048 cell fixed. Both locations now consistently report the grid as complete.

manifest.json see_also now points at this bundle's own AUDIT.md and at claims.yaml claim IDs for cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
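For readers unfamiliar with the checksum-file layout mentioned above, here is a minimal Python sketch of the text-mode `sha256sum -c` line format (64 hex characters, two spaces, filename) and a matching check. The helper names are hypothetical; the bundle itself verifies with `shasum -a 256 -c`, and this sketch only mirrors what that check does for a single line.

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def checksum_line(path: Path) -> str:
    """Build one text-mode checksum line: hash + two spaces + filename."""
    return f"{sha256_hex(path)}  {path.name}"


def verify_line(line: str, base_dir: Path) -> bool:
    """Check one checksum line against the file it names under base_dir."""
    digest, sep, name = line.rstrip("\n").partition("  ")
    if not sep or len(digest) != 64:
        return False  # malformed: e.g. a single space instead of two
    return sha256_hex(base_dir / name) == digest.lower()
```

A line with a single space between hash and filename is rejected here as malformed, which is the kind of format error the fix above addresses.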
P2: Strix status text now aligned with the data (7/12 cells, ctx=16K gen=128 present). Fixed in:

- README.md status table row
- README.md headline at-a-glance (replaced "_ctx=16K cell pending_" with the actual decode_tps_at_ctx16k value 7.06)
- findings.md § Status of this PR
- manifest.json deferred_to_follow_up wording

P2: findings.md "weight-shard SHAs are not pinned" stale paragraph rewritten to point readers at workloads/mlx-models.sha256 and give the shasum -c verify command.

P3: AUDIT.md claim-id typo fixed. It was hw.best-stack.strix-halo.rocm7-works; the actual id is hw.best-stack.strix.rocm7-works.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>