Commit 5ea9642

Merge pull request #25 from Light-Heart-Labs/best-stack-followup-2026-05-17
Add best-stack follow-up bundle: MLX on M5, dream-server ROCm 7 on Strix
2 parents 9c4a8f3 + 2b0c1dd commit 5ea9642

171 files changed

Lines changed: 3521 additions & 0 deletions

claims.yaml

Lines changed: 61 additions & 0 deletions
@@ -171,6 +171,67 @@ claims:
      - "Strix Halo's measured row uses Vulkan; ROCm was broken in this snapshot."
      - "M5 Max's measured row uses Metal/macOS, not a CUDA/Linux serving stack."

  # ─── Best-stack follow-up: vendor productized stacks vs canonical llama.cpp
  # (hardware-tests/best-stack-followup-2026-05-17) ───

  - id: hw.best-stack.m5.mlx-beats-metal
    text: >
      On the M5 Max MacBook Pro 16", Apple's productized MLX stack
      (mlx-lm 0.31.3) running mlx-community Qwen3.6-27B-8bit and
      Qwen3.6-35B-A3B-8bit beats vanilla llama.cpp Metal at canonical
      pin b9151 across every measured single-user (conc=1) cell.
      Peak decode lift: +6.0% on the dense 27B (17.78 vs 16.78 tok/s),
      +15.6% on the 35B-A3B MoE (102.71 vs 88.87 tok/s). Peak prefill
      lift: +35.2% on 27B (773.2 vs 571.8 tok/s), +53.6% on 35B-A3B
      (4124.6 vs 2684.9 tok/s). The lift is larger on MoE than dense,
      consistent with MLX shipping more specialized sparse/batched
      kernels than b9151's generic Metal path.
    status: provisional
    scope: single-user, conc=1, Qwen3.6 dense + MoE, M5 Max, ~8-bit quant
    evidence: "hardware-tests/best-stack-followup-2026-05-17/findings.md § Finding 1 + aggregate/headline.csv"
    caveats:
      - "MLX 8-bit uses a different quantization scheme than GGUF Q8_0; same-precision-different-bytes. No quality diff measured."
      - "Vanilla llama.cpp Metal pin is b9151. A newer llama.cpp build may have caught up; not tested."
      - "Single study, single host. Independent reproducer welcomed."
    promote_to_strong_when:
      - "Independent M5-class reproducer publishes the same lift on the same model class"
      - "MLX-vs-newer-llama.cpp-Metal comparison still shows MLX ahead"

  - id: hw.best-stack.strix.rocm7-works
    text: >
      ROCm 7 (libamdhip64.so.7), as bundled by the dream-server
      dream-lemonade-server:latest container's custom llama.cpp build
      at /opt/llama-custom/llama-server, loads and serves
      Qwen3.6-27B-Q8 on Strix Halo. The canonical study's
      [hw.q8.engine.rocm-strix-halo-segfault] failure is specific to
      ROCm 6.4.4 + llama.cpp b9151; updating to ROCm 7 and dream-server's
      downstream patch set resolves the loading bug. Six cells at
      ctx≤4K plus ctx=16K gen=128 confirm: decode runs to completion,
      output content SHAs are non-empty, no crash.
    status: provisional
    scope: ROCm 7, Strix Halo gfx1151, dream-server's bundled llama.cpp (commit ff5ef82), Q8 GGUF
    evidence: "hardware-tests/best-stack-followup-2026-05-17/findings.md § Finding 2 + AUDIT.md BA7"
    caveats:
      - "We did not isolate whether the fix is in the ROCm 7 runtime itself or in dream-server's downstream patches. Vanilla llama.cpp b9151 against ROCm 7 is a queued follow-up."
      - "Bundled engine is older (commit ff5ef82) than canonical's b9151. This bundle's results therefore mix a ROCm-runtime-version effect with a llama.cpp-vintage effect."

  - id: hw.best-stack.strix.no-prefill-lift
    text: >
      AMD's productized Linux stack on Strix Halo (dream-server's
      bundled custom llama.cpp + ROCm 7, fronted by Lemonade Server)
      delivers no decode lift over vanilla llama.cpp Vulkan at b9151
      (7.67 vs 7.82 tok/s peak, within run-to-run noise) and is
      substantially slower on prefill (120 vs 292 tok/s peak,
      ~2.4× behind). The prefill cost is attributable to engine
      vintage (bundled build at upstream commit ff5ef82 vs canonical's
      b9151) rather than a ROCm-vs-Vulkan effect.
    status: provisional
    scope: conc=1, Qwen3.6-27B-Q8 GGUF (byte-identical to canonical), Strix Halo Linux
    evidence: "hardware-tests/best-stack-followup-2026-05-17/findings.md § Finding 2 + AUDIT.md BA1"
    caveats:
      - "Strix Halo's productized acceleration story is the Lemonade OGA/NPU path on Windows + DirectML + INT4, which we did NOT test (no Windows Strix host available)."
      - "Pure Lemonade SDK on Linux with its own bundled Vulkan binary (vs dream-server's custom downstream build) is a queued follow-up."

  # ─── Power-sweep studies (hardware-tests/vllm-power-sweep-2026-04-29,
  # hardware-tests/ltx23-power-sweep-2026-05-05, cpu-fullpower) ───

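The percentage lifts quoted in the performance claims above can be rechecked from the tok/s pairs they cite. The following is a minimal sketch of that arithmetic; `pct_lift` and `PAIRS` are illustrative names and are not part of the repository's harness. The numbers are copied from the claim text.

```python
# Illustrative recheck of the lifts quoted in the claim text.
# pct_lift and PAIRS are hypothetical names, not harness code.

def pct_lift(new: float, base: float) -> float:
    """Relative change of `new` over `base`, in percent."""
    return (new / base - 1.0) * 100.0

# (MLX tok/s, llama.cpp Metal tok/s), copied from hw.best-stack.m5.mlx-beats-metal
PAIRS = {
    "27B decode": (17.78, 16.78),
    "35B-A3B decode": (102.71, 88.87),
    "27B prefill": (773.2, 571.8),
    "35B-A3B prefill": (4124.6, 2684.9),
}

for label, (mlx, metal) in PAIRS.items():
    print(f"{label}: {pct_lift(mlx, metal):+.1f}%")

# Strix prefill gap from hw.best-stack.strix.no-prefill-lift:
# canonical Vulkan peak divided by the dream-server ROCm 7 peak.
strix_prefill_gap = 292 / 120
print(f"Strix prefill gap: {strix_prefill_gap:.1f}x")
```

A check like this only confirms that the quoted percentages follow from the quoted rates; it says nothing about how the rates themselves were measured.
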

Lines changed: 75 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,75 @@
# AUDIT: best-stack follow-up bundle (2026-05-17)

This bundle deliberately changes engines from the canonical [`qwen3.6-q8-fleet-2026-05-17`](../qwen3.6-q8-fleet-2026-05-17/) study to answer a specific vendor-stack question (does the productized stack accelerate the workload?). That makes the apples-to-apples boundary subtler than the canonical study's, so this audit doc walks through what's locked, what varies, and what the comparison can and cannot support.

## What is locked across the canonical study and this bundle

- **Prompt corpus**: byte-identical to canonical. SHA-pinned (`9a27eba85a8da9443d7fcf74e281b011831806c4b24aaaada3915463d5c13cd8`); verifiable via `workloads/prompts.jsonl.sha256`. The same 120 prompts at four context targets were sent to every host in both studies.
- **Generation parameters**: temperature=0, seed=42, max_tokens fixed per cell, `cache_prompt=false` for the Lemonade/dream-server runs (no warm-cache short-circuit) to match the canonical study's no-cache discipline.
- **Grid**: 4 ctx × 3 gen lengths × N=10 (2 warmup discarded) per cell. Same shape as the canonical conc=1 column.
- **Hosts**: physical machines are the same. M5 Max MBP (same chassis, same SoC, same room), EVO X2 / Strix Halo (same chassis, same SoC, same room).
- **Scope statement**: single-user (conc=1) only, exactly as in the canonical study's headline. Multi-user (conc≥4) is out of scope on purpose.

## What varies (the whole point of this bundle)

| axis | canonical | this bundle | why varied |
|---|---|---|---|
| Engine, M5 side | llama.cpp Metal at SHA `67b2b7f2f` (`b9151`) | Apple `mlx-lm` 0.31.3 Python API | Tests whether Apple's productized stack lifts. |
| Engine, Strix side | llama.cpp Vulkan at SHA `67b2b7f2f` (`b9151`) | dream-server bundled custom llama.cpp build at upstream commit `ff5ef82`, linked against ROCm 7 (`libamdhip64.so.7`) + custom `librocblas`/`libhipblaslt`. Fronted by Lemonade Server's OpenAI-compatible API. | Tests AMD's productized stack on Strix Halo. ROCm 7 (a newer runtime than the canonical's failing 6.4.4) is what dream-server ships in production. |
| Model bytes, M5 side | `Qwen3.6-27B-Q8_0.gguf` and `Qwen3.6-35B-A3B-Q8_0.gguf` (GGUF Q8_0) | `mlx-community/Qwen3.6-27B-8bit` and `mlx-community/Qwen3.6-35B-A3B-8bit` (MLX-native 8-bit) | MLX uses its own quantization format; same model concept, different bytes. SHA-pinned per shard in `workloads/mlx-models.sha256`. |
| Model bytes, Strix side | `Qwen3.6-27B-Q8_0.gguf` SHA `f93f517f…` | Same file, byte-identical | The GGUF path is one of the two factors that lets the Strix vs canonical comparison stay meaningful on file content. |
| Concurrency wrapper | `bench-cell.py` hitting llama-server `/completion` | `bench-cell-mlx.py` calling `mlx_lm.stream_generate` directly OR `bench-cell-lemonade.py` hitting Lemonade `/v1/completions`. | Different APIs. Schemas converge via `harness/lib/canon-backfill.py` so cell.json fields match canonical. |

## Apples-to-apples boundaries (what the comparison can and cannot support)

This bundle **can** support claims about:

- **MLX vs llama.cpp Metal** on the SAME M5 hardware for Qwen3.6 dense + MoE at ~8-bit quantization, single user. Decode and prefill numbers are directly comparable in tok/s, and the model is the same model concept at the same nominal precision.
- **dream-server ROCm 7 vs llama.cpp Vulkan b9151** on the SAME Strix Halo hardware for Qwen3.6-27B-Q8 GGUF (byte-identical file), single user. Decode is comparable directly; prefill cost can be attributed to engine vintage because the bundled custom llama.cpp build is older than canonical's pin.
- **ROCm 7 loads where ROCm 6.4.4 did not** for the canonical workload on Strix Halo. This narrows (does not invalidate) the canonical "ROCm broken" claim; see `[hw.best-stack.strix.rocm7-works]` in `claims.yaml`.

This bundle **cannot** support claims about:

- **Cross-host ranking.** The canonical study owns that and this bundle does not feed it.
- **Engine quality / output correctness differences.** Outputs are deterministic at temperature=0 / seed=42, but MLX uses a different quantization scheme than GGUF Q8. We do not run a semantic-equivalence diff between MLX 8-bit and GGUF Q8 outputs in this bundle (could be a follow-up).
- **NVIDIA productized stack** (vLLM / TensorRT-LLM / SGLang on Tower2). Canonical study's vLLM Tower2 row is the existing data point; not re-covered here.
- **Multi-user serving.** conc≥4 is held in canonical and held here.
- **Lemonade SDK's own bundled Vulkan binary.** We caught during bring-up that the Lemonade Server we were hitting (the one running in dream-server's `dream-llama-server` container) is fronting dream-server's custom `/opt/llama-custom/` binary, NOT Lemonade's own `/opt/lemonade/llama/vulkan/` binary (those dirs exist but are empty in this container; Lemonade's own binaries only get fetched on first use of those backends). To test pure Lemonade-Vulkan against the same model we would need to spin up a separate Lemonade Server with `--llamacpp vulkan` and let it download its own binary; that is a queued follow-up.
- **Strix Halo NPU acceleration.** The actually-novel Ryzen AI productized path is the Lemonade `oga-load` backend on Windows + DirectML + INT4 OGA models. We do not have a Windows Strix Halo host; that path is not exercised here.

## Specific biases (this study's B-list, complementing canonical's)

### BA1. Engine vintage cost on the Strix dream-server side
dream-server bundles llama.cpp at upstream commit `ff5ef82`. Canonical pin is `67b2b7f2f` (`b9151`), which is approximately 2641 commits ahead. Many of the Vulkan-and-shader optimizations between those commits land in the prefill path. The canonical Vulkan prefill of 292.3 tok/s at the peak cell vs this bundle's 120 tok/s is therefore primarily an engine-vintage delta, not a ROCm-vs-Vulkan delta. The decode-rate equivalence (within noise) suggests the decode path is similarly optimized in both versions or is bandwidth-limited rather than kernel-limited on this hardware.

### BA2. MLX quant scheme is not GGUF Q8
`mlx-community/Qwen3.6-27B-8bit` is MLX's own 8-bit affine quantization, applied at MLX-format conversion time. GGUF `Q8_0` is llama.cpp's 8-bit quantization. Both are "8 bits per weight" at coarse granularity, but the layouts, scale factors, and per-group bookkeeping differ. This is a same-precision-different-bytes story. Quality is not measured in this bundle.

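To make the "same precision, different bytes" point concrete, here is a simplified sketch of the two families of scheme: a symmetric per-block scale (Q8_0-style) and an affine per-group scale plus bias (MLX-style). This is an illustration under assumptions, not either library's implementation: the block size of 32 and group size of 64 reflect my understanding of the usual defaults, the real formats store scales in reduced precision, and the group size of the mlx-community checkpoints is not stated in this document. The function names are hypothetical.

```python
import random

def q8_0_style_roundtrip(block):
    """Symmetric 8-bit: one scale per block, integer codes in [-127, 127]."""
    amax = max(abs(x) for x in block)
    d = amax / 127.0 if amax > 0 else 1.0
    codes = [round(x / d) for x in block]
    return [c * d for c in codes], d

def affine8_style_roundtrip(group):
    """Affine 8-bit: scale and bias per group, integer codes in [0, 255]."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = [min(255, max(0, round((x - lo) / scale))) for x in group]
    return [c * scale + lo for c in codes], scale

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]

# Assumed layout: two symmetric blocks of 32 vs one affine group of 64.
sym_a, d_a = q8_0_style_roundtrip(weights[:32])
sym_b, d_b = q8_0_style_roundtrip(weights[32:])
aff, scale = affine8_style_roundtrip(weights)

sym = sym_a + sym_b
max_err_sym = max(abs(w - r) for w, r in zip(weights, sym))
max_err_aff = max(abs(w - r) for w, r in zip(weights, aff))
print(f"symmetric max abs error: {max_err_sym:.2e}")
print(f"affine    max abs error: {max_err_aff:.2e}")
```

Both round trips spend 8 bits per weight, yet they generally reconstruct slightly different values, which is why the bundle treats output quality as unmeasured rather than assumed equal.
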
### BA3. MLX concurrency semantics differ
`bench-cell-mlx.py` runs concurrent slots **sequentially** within a batch (not asyncio-parallel), unlike `bench-cell.py`'s `asyncio.gather` against llama-server's `--parallel N` slots. This bundle reports `conc=1` only, so the difference is moot. If a future commit adds `conc≥4` to MLX cells, however, the data must be tagged differently because the throughput semantics are NOT comparable to canonical's slot-parallel numbers.

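The difference between the two concurrency models can be seen in a toy example. The sketch below does not call either benchmark script; `fake_request` is a hypothetical stand-in that sleeps instead of running inference, and it only records when each slot starts and ends.

```python
import asyncio

async def fake_request(slot, log, delay=0.01):
    """Stand-in for one inference request; records start and end events."""
    log.append(("start", slot))
    await asyncio.sleep(delay)
    log.append(("end", slot))

async def run_sequential(n):
    """One slot at a time: each request finishes before the next begins."""
    log = []
    for slot in range(n):
        await fake_request(slot, log)
    return log

async def run_parallel(n):
    """Slot-parallel: all requests are in flight at once via asyncio.gather."""
    log = []
    await asyncio.gather(*(fake_request(slot, log) for slot in range(n)))
    return log

seq_log = asyncio.run(run_sequential(3))
par_log = asyncio.run(run_parallel(3))
print("sequential:", seq_log)
print("parallel:  ", par_log)
```

In the sequential log, starts and ends alternate, so aggregate throughput is just the single-stream rate. In the parallel log, every slot starts before any slot ends, so the server is batching across slots and the aggregate number means something different.
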
### BA4. `cache_prompt` flag handling between engines
We verified empirically that `cache_prompt=false` in the request body is honored by the Lemonade-fronted dream-server llama.cpp binary (the body field passes through). The MLX driver creates a fresh KV cache per `stream_generate` call by construction, so `cache_prompt` is not applicable. Both paths therefore do a full prefill per inference, matching the canonical study's discipline.

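As a rough picture of what "the body field passes through" means, the sketch below builds the two request bodies with the locked generation parameters. The field names `n_predict`, `temperature`, `seed`, and `cache_prompt` follow llama.cpp server's native `/completion` API as I understand it; the OpenAI-style body and the placeholder model name are assumptions, and the actual bodies sent by `bench-cell.py` and `bench-cell-lemonade.py` are not shown in this document.

```python
import json

def llama_server_body(prompt: str, max_tokens: int) -> dict:
    """Body for llama-server's native /completion endpoint (assumed field names)."""
    return {
        "prompt": prompt,
        "n_predict": max_tokens,   # llama-server's name for the generation cap
        "temperature": 0,
        "seed": 42,
        "cache_prompt": False,     # force a full prefill on every request
    }

def openai_style_body(model: str, prompt: str, max_tokens: int) -> dict:
    """Body for an OpenAI-compatible /v1/completions endpoint.

    cache_prompt is not an OpenAI field; it is included as an extra key
    on the assumption that the front end forwards it to llama-server.
    """
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "seed": 42,
        "cache_prompt": False,
    }

native = llama_server_body("hello", 128)
fronted = openai_style_body("placeholder-model-name", "hello", 128)
print(json.dumps(native, indent=2))
print(json.dumps(fronted, indent=2))
```

Whether an extra key survives a given front end is exactly the thing BA4 says was verified empirically; a sketch like this cannot establish it.
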
### BA5. Power / thermals not co-sampled
Canonical study has 1 Hz power + thermals CSVs per cell. This bundle does NOT; we wanted to ship the productized-stack-comparison data quickly. The thermal-class story from canonical (`hw.q8.chassis-thermal-class`) still applies to the same chassis under similar workload intensities; we do not re-measure.

### BA6. Cold-start exposure differs per engine
- MLX `bench-cell-mlx.py` records `cold_start.decode_tps` from batch 0's first stream tick. Model load time is reported separately as `load_time_s`.
- dream-server's Lemonade-fronted llama.cpp warms the model at `/api/v1/load` time (not at first `/v1/completions`); our `cold_start` for those cells therefore reflects "first request after API-level model load", not "first request after process start". Different from canonical's bench-host warmup; comparable across cells *within* this bundle.

### BA7. Engine identification was nontrivial; future readers should not trust the `engine` field blindly
We initially labeled the Strix cells `lemonade-llamacpp-vulkan` because the Lemonade Server CLI exposed `--llamacpp {vulkan,rocm,cpu}` and we assumed the productized SDK shipped the same Vulkan binary we expected. Inspecting the container's running process revealed it was actually `/opt/llama-custom/llama-server` (dream-server's downstream custom build) with `libamdhip64.so.7` (ROCm 7) linked. We relabeled to `dreamserver-llamacpp-rocm7` and added `engine_note` fields pointing at the binary path + library set. Future reviewers running similar bundles should always `ldd` the actual inference binary.

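The `ldd` check can be scripted so the label is derived from the binary rather than assumed. The sketch below is one possible shape for that check; the helper names are hypothetical, and `SAMPLE` is a hand-written illustration of `ldd`'s output format with placeholder paths, not output captured from the container.

```python
import subprocess

def linked_libraries(ldd_output: str) -> list[str]:
    """Return the library names from the first column of `ldd` output."""
    names = []
    for line in ldd_output.splitlines():
        parts = line.split()
        if parts:
            names.append(parts[0])
    return names

def links_rocm7(ldd_output: str) -> bool:
    """True if the HIP runtime with soname major version 7 is linked."""
    return any(n.startswith("libamdhip64.so.7") for n in linked_libraries(ldd_output))

def run_ldd(binary: str) -> str:
    """Run `ldd` on a binary (Linux only); raises if ldd fails."""
    result = subprocess.run(["ldd", binary], capture_output=True, text=True, check=True)
    return result.stdout

# Hand-written sample in ldd's format; paths and addresses are placeholders.
SAMPLE = (
    "\tlinux-vdso.so.1 (0x0000000000000000)\n"
    "\tlibamdhip64.so.7 => /example/lib/libamdhip64.so.7 (0x0000000000000000)\n"
    "\tlibc.so.6 => /example/lib/libc.so.6 (0x0000000000000000)\n"
)
print(links_rocm7(SAMPLE))
```

On a real host one would call `links_rocm7(run_ldd("/opt/llama-custom/llama-server"))` inside the container and record the result alongside the `engine` field.
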
### BA8. HF refs are floating; weight bytes are pinned only by `mlx-models.sha256`
The README's `hf download mlx-community/Qwen3.6-27B-8bit` command will pull whichever revision Hugging Face serves at the time. The actual bytes we ran against are captured in `workloads/mlx-models.sha256`. Reviewers verifying decode/prefill numbers should `shasum -c` against that file before quoting our deltas. If the HF revision moves the weights underneath, our numbers might not reproduce on a fresh pull.

## Reproduction checklist

1. `shasum -a 256 -c workloads/prompts.jsonl.sha256` → prompt corpus byte-identical to canonical.
2. `cd ~/models/mlx && shasum -a 256 -c <bundle>/workloads/mlx-models.sha256` → MLX weight shards byte-identical to ours.
3. `sha256sum Qwen3.6-27B-Q8_0.gguf` → `f93f517f38e696d35a1a7df2c0e3155a64f4c4dcd662107a146ae263f7fb14ce` (the canonical study's pin; same file for Strix side here).
4. MLX engine: `pip install mlx-lm==0.31.3` (or `>=0.31.3`; record version in `cell.json.engine_version`).
5. dream-server engine: pull the `dream-lemonade-server:latest` container image; binary at `/opt/llama-custom/llama-server`. `ldd` should show `libamdhip64.so.7`.
6. Run `harness/lib/run-mlx-grid.sh` on M5 and `harness/lib/run-lemonade-grid.sh` on Strix.
7. Verify `aggregate/headline.csv` recomputes via `python3 harness/lib/build-headline.py --bundle . --out /tmp/check.csv && diff aggregate/headline.csv /tmp/check.csv`.
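
Steps 1 to 3 are checksum comparisons, and they can also be done without `shasum` on hosts where it is missing. The following is a minimal sketch, assuming the manifest files use the usual `<hex digest>  <relative path>` line format that `shasum -c` reads; `sha256_of` and `check_manifest` are hypothetical helper names, and the demo at the end uses a throwaway temporary file rather than the bundle's data.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20) -> str:
    """Stream a file through SHA-256 so large weight shards fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_manifest(manifest, root) -> dict:
    """Map each path listed in a shasum-style manifest to True/False."""
    results = {}
    for line in Path(manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum marks binary mode with a leading '*'
        results[name] = sha256_of(Path(root) / name) == expected.lower()
    return results

# Self-contained demo on a temporary file (not the bundle's data).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo.bin").write_bytes(b"hello")
    demo_digest = hashlib.sha256(b"hello").hexdigest()
    (Path(tmp) / "demo.sha256").write_text(f"{demo_digest}  demo.bin\n")
    demo_result = check_manifest(Path(tmp) / "demo.sha256", tmp)

print(demo_result)
```

For the bundle itself the call would look like `check_manifest("workloads/mlx-models.sha256", Path.home() / "models/mlx")`, with any `False` entry meaning the local weights differ from the ones measured here.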
