|
| 1 | +# RDNA 4 Windows — Gemma 4 26B-A4B @ 256K |
| 2 | + |
| 3 | +Target hardware: AMD Ryzen 7 9800X3D + 32 GB DDR5 + RX 9700 XT (gfx1201, 16 GB). |
| 4 | +Target model: `unsloth/gemma-4-26B-A4B-it-GGUF` at `UD-Q4_K_XL` (17.1 GB). |
| 5 | +Target context: 256 K tokens. |
| 6 | + |
| 7 | +## Bring-up |
| 8 | + |
| 9 | +1. Install prerequisites (one-time): |
| 10 | + - **Visual Studio 2022 Build Tools** with "Desktop development with C++" workload |
| 11 | + - **CMake ≥ 3.25**: `winget install Kitware.CMake` |
| 12 | + - **Git**: `winget install Git.Git` |
| 13 | + - **Vulkan SDK**: https://vulkan.lunarg.com (installer sets `VULKAN_SDK`) |
| 14 | + - **AMD Adrenalin 25.x** driver: https://amd.com/support |
| 15 | + - **Node.js LTS** (for Claude Code): `winget install OpenJS.NodeJS.LTS` |
| 16 | + |
| 17 | + Reboot after the driver install. |
| 18 | + |
| 19 | +2. Open **"x64 Native Tools Command Prompt for VS 2022"**, launch PowerShell inside it (`pwsh` or `powershell`), then: |
| 20 | + |
| 21 | + ```powershell |
| 22 | + cd path\to\llama-cpp-turboquant |
| 23 | + .\scripts\rdna4\windows\check-prereqs.ps1 # verifies tools, driver, GPU, RAM, model |
| 24 | + .\scripts\rdna4\windows\build.ps1 # configures + builds Vulkan backend |
| 25 | + ``` |
| 26 | + |
| 27 | +3. Download the model GGUF (17.1 GB) to `C:\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` or set `TQ_MODEL_PATH`. |
| 28 | + |
| 29 | +4. Launch the server: |
| 30 | + ```powershell |
| 31 | + .\scripts\rdna4\windows\run-gemma4-256k.ps1 |
| 32 | + ``` |
| 33 | + |
| 34 | +5. In a second shell, launch Claude Code: |
| 35 | + ```powershell |
| 36 | + .\scripts\rdna4\windows\run-claude-code.ps1 |
| 37 | + ``` |
| 38 | + |
| 39 | +## Memory budget (why these flags matter) |
| 40 | + |
| 41 | +### GPU VRAM — 16 GB target |
| 42 | + |
| 43 | +| Component | Size | Notes | |
| 44 | +|---|---|---| |
| 45 | +| Non-expert model tensors | ~2.7 GB | attention Q/K/V/O, embeddings, lm_head, norms, MoE routers | |
| 46 | +| KV cache turbo3 @ 256K | ~1.05 GB | 1.0 GB global (5 layers × 256K × D=512) + 49 MB SWA (25 layers × 1280 cells) | |
| 47 | +| Compute buffer (`-b 128 -ub 64`) | ~8.3 GB | Batch-driven; does NOT scale with context. Default `-b 2048` allocates ~32 GB and instant-OOMs | |
| 48 | +| Vulkan driver overhead | ~0.5 GB | | |
| 49 | +| **Total** | **~12.5 GB** | ~3.5 GB headroom | |
| 50 | + |
| 51 | +### System RAM — 32 GB target |
| 52 | + |
| 53 | +| Component | Size | Notes | |
| 54 | +|---|---|---| |
| 55 | +| Expert REPACK (CPU backend) | ~8.2 GB | Q4_K expert weights reformatted for CPU-efficient access | |
| 56 | +| mmap hot working set | ~9 GB | File-backed; OS pages cold experts | |
| 57 | +| OS + drivers + desktop | ~5 GB | | |
| 58 | +| Claude Code / VSCode / browser | ~3 GB | Close heavy Electron apps during runs | |
| 59 | +| **Total active** | **~25 GB** | Tight. `--mlock` keeps experts resident | |
| 60 | + |
| 61 | +## Critical flags — do not change without understanding |
| 62 | + |
| 63 | +- `-b 128 -ub 64`: dropping below is fine, raising breaks VRAM budget. At `-b 256` compute buffer grows to ~16 GB (OOM). |
| 64 | +- `-ctk turbo3 -ctv turbo3`: Vulkan FA path rejects mixed K/V types (`op->src[1]->type != op->src[2]->type` → false). You cannot use asymmetric q8_0/turbo3 on Vulkan — that only works on Metal/CUDA. |
| 65 | +- `-ncmoe 30`: all 30 layers' experts go to CPU. Without this, the 14 GB of expert FFN weights force-push model into VRAM and OOM. |
| 66 | +- `--mlock`: mandatory on Windows at 32 GB. Prevents pagefile swaps of expert weights. Without it, a single page fault during decode can stall the token by seconds. |
| 67 | +- `--no-warmup`: prevents an empty-batch allocation spike that can OOM at tight VRAM. |
| 68 | + |
| 69 | +## Expected performance (projected) |
| 70 | + |
| 71 | +Decode: 25-50 tok/s (CPU MoE bandwidth-bound, DDR5-6000 ~80 GB/s). |
| 72 | +Prefill at 256K: 1-3 minutes (first-token latency, cached afterward). |
| 73 | + |
| 74 | +Reference Mac M4 Max same config (full GPU, no CPU MoE offload): |
| 75 | +- pp2048: 710 tok/s |
| 76 | +- tg64 @ d=0: 49 tok/s |
| 77 | +- tg64 @ d=128K: 21 tok/s |
| 78 | + |
| 79 | +The Mac numbers set a ceiling. Windows Vulkan scalar-path performance will be lower than Metal, and CPU MoE adds bandwidth overhead, so expect 40-70 % of Mac decode speed in practice. |
| 80 | + |
| 81 | +## Known risks |
| 82 | + |
| 83 | +1. **Vulkan turbo3 at D=256 / D=512 is untested on Gemma 4.** The code paths exist (`flash_attn.comp`, `flash_attn_cm1.comp`, `flash_attn_cm2.comp` all compile turbo3_0 pipeline variants via `DATA_A_TURBO3_0`; FA dispatch accepts `GGML_TYPE_TURBO3_0` for any `HSK % 8 == 0`). Metal was explicitly validated by commit `716dd77`. Vulkan was not. First smoke test after build should verify generation quality; garbled output means fall back to q8_0 and accept smaller context. |
| 84 | + |
| 85 | +2. **MoE CPU offload on AMD Vulkan is unmeasured.** The `-ncmoe` flag is backend-agnostic in theory. On Mac it works functionally but regresses speed badly because Metal GPU > Mac CPU for FFN. On Windows the opposite should hold — Vulkan GPU cannot hold the experts anyway, so CPU is the only option, and Zen 5 + DDR5 + 3D V-Cache is a strong CPU MoE target. |
| 86 | + |
| 87 | +3. **32 GB RAM is tight.** Active working set projects to ~25 GB. Close VSCode, Chrome, Slack, etc. before runs. A desktop with ~8 GB wired for OS leaves you at 24 GB for the inference process. |
| 88 | + |
| 89 | +## Troubleshooting |
| 90 | + |
| 91 | +- **OOM at load time**: check `-b / -ub`. Default is way too aggressive. |
| 92 | +- **OOM during warmup**: add `--no-warmup` (already on by default in our script). |
| 93 | +- **Garbled output at turbo3**: verify with `-ctk q8_0 -ctv q8_0` and smaller context. If q8_0 works and turbo3 doesn't, the Vulkan D=256/512 turbo3 path has a correctness issue — report upstream. |
| 94 | +- **Slow decode (< 10 tok/s)**: check `--mlock` took effect (Task Manager → Performance → Memory → Committed). If pagefile is growing, `--mlock` is silently failing due to quota; grant "Lock pages in memory" user right via `gpedit.msc` → Computer Configuration → Windows Settings → Security Settings → Local Policies → User Rights Assignment. |
| 95 | +- **Claude Code KV cache thrashing**: verify `CLAUDE_CODE_ATTRIBUTION_HEADER: "0"` is in `%USERPROFILE%\.claude\settings.json`. Without this, decode speed drops ~90 %. |
0 commit comments