scripts/rdna4/windows: Gemma 4 26B-A4B + 256K + Vulkan + MoE offload

icex · icex · commit 3bdf4b4d4858 · 2026-04-13T00:41:06.000+03:00
Adds the Windows bring-up scripts for running Gemma 4 26B-A4B-it at full
256K context on an AMD Ryzen 7 9800X3D + RX 9700 XT (gfx1201, 16 GB).

The model cannot fit fully on 16 GB VRAM (UD-Q4_K_XL is 17.1 GB). The
strategy is to push MoE expert FFN tensors (~14 GB of ~17 GB) to CPU via
-ncmoe 30, leaving only ~2.7 GB of non-expert weights on GPU. With the
compute buffer (~8.3 GB at -b 128 -ub 64) and turbo3 KV cache (~1.05 GB
at 256K), total GPU occupation lands at ~12.5 GB.

The non-obvious flag is -b 128 -ub 64. The compute buffer is batch-driven,
not context-driven — the default -b 2048 allocates a ~32 GB Metal buffer
at 256K and instantly OOMs. Validated empirically on M4 Max Mac: 128K
at -b 256 ub=128 takes 8.4 GB, 256K at -b 128 ub=64 takes 8.3 GB. Batch
size dominates, context does not.

Vulkan FA path requires symmetric K/V (op-&gt;src[1]-&gt;type ==
op-&gt;src[2]-&gt;type in ggml-vulkan.cpp:15371), so asymmetric q8_0/turbo3
is not available on this backend — only symmetric turbo3/turbo3.

Scripts:
  build.ps1          — cmake configure + build with -DGGML_VULKAN=ON
  check-prereqs.ps1  — MSVC, Vulkan SDK, driver, GPU, RAM, model file
  run-gemma4-256k.ps1 — llama-server with the locked-in flag set
  run-claude-code.ps1 — Claude Code wrapper with env + KV-cache guardrail
  README.md          — budget analysis, risks, troubleshooting
diff --git a/scripts/rdna4/windows/README.md b/scripts/rdna4/windows/README.md
@@ -0,0 +1,95 @@
+# RDNA 4 Windows — Gemma 4 26B-A4B @ 256K
+
+Target hardware: AMD Ryzen 7 9800X3D + 32 GB DDR5 + RX 9700 XT (gfx1201, 16 GB).
+Target model: `unsloth/gemma-4-26B-A4B-it-GGUF` at `UD-Q4_K_XL` (17.1 GB).
+Target context: 256 K tokens.
+
+## Bring-up
+
+1. Install prerequisites (one-time):
+   - **Visual Studio 2022 Build Tools** with "Desktop development with C++" workload
+   - **CMake ≥ 3.25**: `winget install Kitware.CMake`
+   - **Git**: `winget install Git.Git`
+   - **Vulkan SDK**: https://vulkan.lunarg.com (installer sets `VULKAN_SDK`)
+   - **AMD Adrenalin 25.x** driver: https://amd.com/support
+   - **Node.js LTS** (for Claude Code): `winget install OpenJS.NodeJS.LTS`
+
+   Reboot after the driver install.
+
+2. Open **"x64 Native Tools Command Prompt for VS 2022"**, launch PowerShell inside it (`pwsh` or `powershell`), then:
+
+   ```powershell
+   cd path\to\llama-cpp-turboquant
+   .\scripts\rdna4\windows\check-prereqs.ps1   # verifies tools, driver, GPU, RAM, model
+   .\scripts\rdna4\windows\build.ps1           # configures + builds Vulkan backend
+   ```
+
+3. Download the model GGUF (17.1 GB) to `C:\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf` or set `TQ_MODEL_PATH`.
+
+4. Launch the server:
+   ```powershell
+   .\scripts\rdna4\windows\run-gemma4-256k.ps1
+   ```
+
+5. In a second shell, launch Claude Code:
+   ```powershell
+   .\scripts\rdna4\windows\run-claude-code.ps1
+   ```
+
+## Memory budget (why these flags matter)
+
+### GPU VRAM — 16 GB target
+
+| Component | Size | Notes |
+|---|---|---|
+| Non-expert model tensors | ~2.7 GB | attention Q/K/V/O, embeddings, lm_head, norms, MoE routers |
+| KV cache turbo3 @ 256K | ~1.05 GB | 1.0 GB global (5 layers × 256K × D=512) + 49 MB SWA (25 layers × 1280 cells) |
+| Compute buffer (`-b 128 -ub 64`) | ~8.3 GB | Batch-driven; does NOT scale with context. Default `-b 2048` allocates ~32 GB and instant-OOMs |
+| Vulkan driver overhead | ~0.5 GB | |
+| **Total** | **~12.5 GB** | ~3.5 GB headroom |
+
+### System RAM — 32 GB target
+
+| Component | Size | Notes |
+|---|---|---|
+| Expert REPACK (CPU backend) | ~8.2 GB | Q4_K expert weights reformatted for CPU-efficient access |
+| mmap hot working set | ~9 GB | File-backed; OS pages cold experts |
+| OS + drivers + desktop | ~5 GB | |
+| Claude Code / VSCode / browser | ~3 GB | Close heavy Electron apps during runs |
+| **Total active** | **~25 GB** | Tight. `--mlock` keeps experts resident |
+
+## Critical flags — do not change without understanding
+
+- `-b 128 -ub 64`: dropping below is fine, raising breaks VRAM budget. At `-b 256` compute buffer grows to ~16 GB (OOM).
+- `-ctk turbo3 -ctv turbo3`: Vulkan FA path rejects mixed K/V types (`op->src[1]->type != op->src[2]->type` → false). You cannot use asymmetric q8_0/turbo3 on Vulkan — that only works on Metal/CUDA.
+- `-ncmoe 30`: all 30 layers' experts go to CPU. Without this, the 14 GB of expert FFN weights force-push model into VRAM and OOM.
+- `--mlock`: mandatory on Windows at 32 GB. Prevents pagefile swaps of expert weights. Without it, a single page fault during decode can stall the token by seconds.
+- `--no-warmup`: prevents an empty-batch allocation spike that can OOM at tight VRAM.
+
+## Expected performance (projected)
+
+Decode: 25-50 tok/s (CPU MoE bandwidth-bound, DDR5-6000 ~80 GB/s).
+Prefill at 256K: 1-3 minutes (first-token latency, cached afterward).
+
+Reference Mac M4 Max same config (full GPU, no CPU MoE offload):
+- pp2048: 710 tok/s
+- tg64 @ d=0: 49 tok/s
+- tg64 @ d=128K: 21 tok/s
+
+The Mac numbers set a ceiling. Windows Vulkan scalar-path performance will be lower than Metal, and CPU MoE adds bandwidth overhead, so expect 40-70 % of Mac decode speed in practice.
+
+## Known risks
+
+1. **Vulkan turbo3 at D=256 / D=512 is untested on Gemma 4.** The code paths exist (`flash_attn.comp`, `flash_attn_cm1.comp`, `flash_attn_cm2.comp` all compile turbo3_0 pipeline variants via `DATA_A_TURBO3_0`; FA dispatch accepts `GGML_TYPE_TURBO3_0` for any `HSK % 8 == 0`). Metal was explicitly validated by commit `716dd77`. Vulkan was not. First smoke test after build should verify generation quality; garbled output means fall back to q8_0 and accept smaller context.
+
+2. **MoE CPU offload on AMD Vulkan is unmeasured.** The `-ncmoe` flag is backend-agnostic in theory. On Mac it works functionally but regresses speed badly because Metal GPU > Mac CPU for FFN. On Windows the opposite should hold — Vulkan GPU cannot hold the experts anyway, so CPU is the only option, and Zen 5 + DDR5 + 3D V-Cache is a strong CPU MoE target.
+
+3. **32 GB RAM is tight.** Active working set projects to ~25 GB. Close VSCode, Chrome, Slack, etc. before runs. A desktop with ~8 GB wired for OS leaves you at 24 GB for the inference process.
+
+## Troubleshooting
+
+- **OOM at load time**: check `-b / -ub`. Default is way too aggressive.
+- **OOM during warmup**: add `--no-warmup` (already on by default in our script).
+- **Garbled output at turbo3**: verify with `-ctk q8_0 -ctv q8_0` and smaller context. If q8_0 works and turbo3 doesn't, the Vulkan D=256/512 turbo3 path has a correctness issue — report upstream.
+- **Slow decode (< 10 tok/s)**: check `--mlock` took effect (Task Manager → Performance → Memory → Committed). If pagefile is growing, `--mlock` is silently failing due to quota; grant "Lock pages in memory" user right via `gpedit.msc` → Computer Configuration → Windows Settings → Security Settings → Local Policies → User Rights Assignment.
+- **Claude Code KV cache thrashing**: verify `CLAUDE_CODE_ATTRIBUTION_HEADER: "0"` is in `%USERPROFILE%\.claude\settings.json`. Without this, decode speed drops ~90 %.
diff --git a/scripts/rdna4/windows/build.ps1 b/scripts/rdna4/windows/build.ps1
@@ -0,0 +1,38 @@
+#!/usr/bin/env pwsh
+# Build llama-cpp-turboquant with Vulkan backend for RDNA 4 (RX 9700 XT / gfx1201).
+# Run from "x64 Native Tools Command Prompt for VS 2022".
+
+$ErrorActionPreference = "Stop"
+
+$repoRoot  = Resolve-Path (Join-Path $PSScriptRoot "..\..\..")
+$buildDir  = Join-Path $repoRoot "build-vulkan"
+
+if (-not $env:VULKAN_SDK -or -not (Test-Path $env:VULKAN_SDK)) {
+    Write-Host "VULKAN_SDK not set or path missing." -ForegroundColor Red
+    Write-Host "Install from https://vulkan.lunarg.com and restart the shell." -ForegroundColor Yellow
+    exit 1
+}
+
+Write-Host "Configuring cmake (Vulkan, Release)..." -ForegroundColor Cyan
+cmake -B $buildDir -S $repoRoot `
+    -DCMAKE_BUILD_TYPE=Release `
+    -DGGML_VULKAN=ON `
+    -DGGML_NATIVE=ON `
+    -DLLAMA_BUILD_TESTS=OFF `
+    -DLLAMA_BUILD_EXAMPLES=OFF `
+    -DLLAMA_BUILD_TOOLS=ON
+if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
+
+Write-Host "Building..." -ForegroundColor Cyan
+cmake --build $buildDir --config Release --target llama-server llama-cli llama-bench --parallel
+if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
+
+$llamaServer = Join-Path $buildDir "bin\Release\llama-server.exe"
+if (Test-Path $llamaServer) {
+    Write-Host ""
+    Write-Host "Built: $llamaServer" -ForegroundColor Green
+    & $llamaServer --version 2>&1 | Select-Object -First 3
+} else {
+    Write-Host "Build succeeded but llama-server.exe not found at expected path." -ForegroundColor Red
+    exit 1
+}
diff --git a/scripts/rdna4/windows/check-prereqs.ps1 b/scripts/rdna4/windows/check-prereqs.ps1
@@ -0,0 +1,74 @@
+#!/usr/bin/env pwsh
+# Preflight check for RDNA 4 Vulkan build + Gemma 4 run.
+# Verifies toolchain, driver, VRAM, model file, system RAM.
+# Run from "x64 Native Tools Command Prompt for VS 2022" to get MSVC env.
+
+$ErrorActionPreference = "Stop"
+
+$ok = $true
+function Check($label, $test, $fix) {
+    Write-Host -NoNewline "  $label ... "
+    try {
+        $result = & $test
+        if ($result) {
+            Write-Host "ok" -ForegroundColor Green
+            return $true
+        } else {
+            Write-Host "MISSING" -ForegroundColor Red
+            Write-Host "    fix: $fix" -ForegroundColor Yellow
+            $script:ok = $false
+            return $false
+        }
+    } catch {
+        Write-Host "ERROR: $_" -ForegroundColor Red
+        Write-Host "    fix: $fix" -ForegroundColor Yellow
+        $script:ok = $false
+        return $false
+    }
+}
+
+Write-Host "=== RDNA 4 / Vulkan / Gemma 4 preflight ===" -ForegroundColor Cyan
+Write-Host ""
+
+Write-Host "Toolchain:"
+Check "MSVC cl.exe" { (Get-Command cl -ErrorAction SilentlyContinue) -ne $null } "Open 'x64 Native Tools Command Prompt for VS 2022', or install VS 2022 Build Tools with C++ workload"
+Check "cmake >= 3.25" {
+    $v = (cmake --version 2>&1 | Select-Object -First 1) -replace 'cmake version ', ''
+    [version]$v -ge [version]"3.25.0"
+} "winget install Kitware.CMake (then restart shell)"
+Check "git" { (Get-Command git -ErrorAction SilentlyContinue) -ne $null } "winget install Git.Git"
+Check "VULKAN_SDK env" { $env:VULKAN_SDK -ne $null -and (Test-Path $env:VULKAN_SDK) } "Install from https://vulkan.lunarg.com (Windows installer sets VULKAN_SDK automatically)"
+Check "glslc shader compiler" { (Get-Command glslc -ErrorAction SilentlyContinue) -ne $null } "Reinstall Vulkan SDK and ensure %VULKAN_SDK%\Bin is in PATH"
+
+Write-Host ""
+Write-Host "GPU / Driver:"
+Check "vulkaninfo" { (Get-Command vulkaninfo -ErrorAction SilentlyContinue) -ne $null } "Install AMD Adrenalin 25.x driver from amd.com/support"
+Check "RX 9700 XT detected" {
+    $info = vulkaninfo --summary 2>$null | Out-String
+    $info -match "Radeon RX 9700 XT" -or $info -match "gfx1201"
+} "Check AMD Adrenalin driver is installed and GPU is seated"
+
+Write-Host ""
+Write-Host "System:"
+$totalRam = [math]::Round((Get-CimInstance Win32_ComputerSystem).TotalPhysicalMemory / 1GB, 1)
+Check "RAM >= 32 GB (have: $totalRam GB)" { $totalRam -ge 31 } "Gemma 4 26B expert offload needs ~22 GB working set"
+
+Write-Host ""
+Write-Host "Model file:"
+$defaultModel = if ($env:TQ_MODEL_PATH) { $env:TQ_MODEL_PATH } else { "C:\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf" }
+Check "GGUF present ($defaultModel)" { Test-Path $defaultModel } "Download from https://huggingface.co/unsloth/gemma-4-26B-A4B-it-GGUF (UD-Q4_K_XL, ~17.1 GB) and place at the path above, or set TQ_MODEL_PATH"
+
+Write-Host ""
+Write-Host "Build output:"
+$repoRoot = Resolve-Path (Join-Path $PSScriptRoot "..\..\..")
+$llamaServer = Join-Path $repoRoot "build-vulkan\bin\Release\llama-server.exe"
+Check "llama-server.exe built" { Test-Path $llamaServer } "Run .\scripts\rdna4\windows\build.ps1 first"
+
+Write-Host ""
+if ($ok) {
+    Write-Host "All checks passed." -ForegroundColor Green
+    exit 0
+} else {
+    Write-Host "One or more checks failed. See messages above." -ForegroundColor Red
+    exit 1
+}
diff --git a/scripts/rdna4/windows/run-claude-code.ps1 b/scripts/rdna4/windows/run-claude-code.ps1
@@ -0,0 +1,49 @@
+#!/usr/bin/env pwsh
+# Start Claude Code CLI against the local llama-server with Gemma 4.
+# Assumes run-gemma4-256k.ps1 is already running on the configured port.
+
+$ErrorActionPreference = "Stop"
+
+$port    = if ($env:TQ_PORT) { $env:TQ_PORT } else { "8001" }
+$host_ip = if ($env:TQ_HOST) { $env:TQ_HOST } else { "127.0.0.1" }
+$server  = "http://$host_ip`:$port"
+
+# --- Sanity check server is up ---
+try {
+    $null = Invoke-WebRequest -Uri "$server/health" -TimeoutSec 2 -UseBasicParsing
+    Write-Host "llama-server reachable at $server" -ForegroundColor Green
+} catch {
+    Write-Host "Cannot reach llama-server at $server" -ForegroundColor Red
+    Write-Host "Start it first: .\scripts\rdna4\windows\run-gemma4-256k.ps1" -ForegroundColor Yellow
+    exit 1
+}
+
+# --- Client env ---
+$env:ANTHROPIC_BASE_URL  = $server
+$env:ANTHROPIC_AUTH_TOKEN = "local"
+$env:ANTHROPIC_API_KEY   = ""
+$env:ANTHROPIC_MODEL     = "gemma-4-26b"
+
+# --- KV cache preservation ---
+# Claude Code adds an attribution header by default. Every request with a new
+# header permutation invalidates the prompt cache on the server side. Setting
+# this to 0 prevents a ~90% decode-speed regression.
+# (Also needs CLAUDE_CODE_ATTRIBUTION_HEADER=0 in %USERPROFILE%\.claude\settings.json.)
+$env:CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = "1"
+
+$settingsPath = Join-Path $env:USERPROFILE ".claude\settings.json"
+if (-not (Test-Path $settingsPath)) {
+    Write-Host "WARNING: $settingsPath missing." -ForegroundColor Yellow
+    Write-Host "  Create it with: { `"env`": { `"CLAUDE_CODE_ATTRIBUTION_HEADER`": `"0`" } }" -ForegroundColor Yellow
+} else {
+    $settings = Get-Content $settingsPath -Raw | ConvertFrom-Json
+    $attrib = $settings.env.CLAUDE_CODE_ATTRIBUTION_HEADER
+    if ($attrib -ne "0") {
+        Write-Host "WARNING: CLAUDE_CODE_ATTRIBUTION_HEADER not set to '0' in settings.json" -ForegroundColor Yellow
+        Write-Host "  Without this, Claude Code invalidates the KV cache on every request." -ForegroundColor Yellow
+    }
+}
+
+# --- Launch ---
+Write-Host "Launching Claude Code..." -ForegroundColor Cyan
+npx -y "@anthropic-ai/claude-code@latest" @args
diff --git a/scripts/rdna4/windows/run-gemma4-256k.ps1 b/scripts/rdna4/windows/run-gemma4-256k.ps1
@@ -0,0 +1,109 @@
+#!/usr/bin/env pwsh
+# Launch llama-server with Gemma 4 26B-A4B-it at 256K context on RDNA 4 Vulkan.
+#
+# Config rationale (see scripts/rdna4/windows/README.md for full analysis):
+#
+#   -ctk turbo3 -ctv turbo3   Symmetric required — Vulkan FA path does not
+#                             support mixed K/V quant types. turbo3 is the
+#                             only turbo flavor with Vulkan shaders.
+#
+#   -fa on                    Flash attention. Scalar + cm1 + cm2 FA shaders
+#                             all ship turbo3 pipeline variants.
+#
+#   -ngl 99                   Offload all layers' structural tensors to GPU
+#                             (attention, embeddings, norms).
+#
+#   -ncmoe 30                 Override expert FFN tensors on all 30 MoE
+#                             layers back to CPU. Gemma 4 26B-A4B has ~14 GB
+#                             of expert weights, which blows the 16 GB VRAM
+#                             budget. Experts compute on CPU, attention on
+#                             GPU, activations cross via small PCIe copies.
+#
+#   -c 262144                 Full 256K context.
+#
+#   -b 128 -ub 64             CRITICAL. Compute buffer is batch-driven, not
+#                             ctx-driven. Default (-b 2048 -ub 512) allocates
+#                             ~32 GB of compute buffer and instant-OOMs on
+#                             16 GB VRAM. b=128 ub=64 keeps it at ~8 GB.
+#
+#   --mlock                   Pin expert weights in RAM. Prevents Windows
+#                             from paging experts to disk under memory
+#                             pressure. Without this, per-token decode can
+#                             stall for seconds.
+#
+#   --no-warmup               Skip the empty-batch warmup pass. Warmup can
+#                             trigger OOM on tight VRAM budgets.
+#
+#   --threads 16              Zen 5 9800X3D = 8P/16T. CPU-side expert FFN
+#                             is memory-bandwidth bound; more threads past
+#                             8 rarely help but don't hurt.
+#
+# VRAM budget target (16 GB RX 9700 XT):
+#   non-expert model   ~2.7 GB
+#   KV cache turbo3    ~1.0 GB  (1 GB global + 49 MB SWA)
+#   compute buffer     ~8.3 GB
+#   driver overhead    ~0.5 GB
+#   TOTAL              ~12.5 GB   (~3.5 GB headroom)
+#
+# System RAM target (32 GB):
+#   expert REPACK      ~8.2 GB   (CPU backend reformats experts)
+#   mmap hot set       ~9.0 GB
+#   OS + apps          ~8.0 GB
+#   TOTAL              ~25.2 GB  (close Electron apps during runs)
+
+$ErrorActionPreference = "Stop"
+
+# --- Paths ---
+$repoRoot     = Resolve-Path (Join-Path $PSScriptRoot "..\..\..")
+$llamaServer  = Join-Path $repoRoot "build-vulkan\bin\Release\llama-server.exe"
+$defaultModel = "C:\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"
+
+# --- Config (override via env vars) ---
+$model   = if ($env:TQ_MODEL_PATH) { $env:TQ_MODEL_PATH } else { $defaultModel }
+$port    = if ($env:TQ_PORT)       { $env:TQ_PORT }       else { "8001" }
+$host_ip = if ($env:TQ_HOST)       { $env:TQ_HOST }       else { "127.0.0.1" }
+$ctx     = if ($env:TQ_CTX)        { $env:TQ_CTX }        else { "262144" }
+$batch   = if ($env:TQ_BATCH)      { $env:TQ_BATCH }      else { "128" }
+$ubatch  = if ($env:TQ_UBATCH)     { $env:TQ_UBATCH }     else { "64" }
+$ctk     = if ($env:TQ_CTK)        { $env:TQ_CTK }        else { "turbo3" }
+$ctv     = if ($env:TQ_CTV)        { $env:TQ_CTV }        else { "turbo3" }
+$threads = if ($env:TQ_THREADS)    { $env:TQ_THREADS }    else { "16" }
+
+# --- Preflight ---
+if (-not (Test-Path $llamaServer)) {
+    Write-Host "llama-server.exe not found at: $llamaServer" -ForegroundColor Red
+    Write-Host "Run .\scripts\rdna4\windows\build.ps1 first." -ForegroundColor Yellow
+    exit 1
+}
+if (-not (Test-Path $model)) {
+    Write-Host "Model not found at: $model" -ForegroundColor Red
+    Write-Host "Download the GGUF and set TQ_MODEL_PATH, or place at default path." -ForegroundColor Yellow
+    exit 1
+}
+
+# --- Summary ---
+Write-Host "=== Gemma 4 26B-A4B @ 256K (Vulkan + MoE CPU offload) ===" -ForegroundColor Cyan
+Write-Host "  model        : $model"
+Write-Host "  context      : $ctx tokens"
+Write-Host "  KV cache     : K=$ctk V=$ctv (symmetric required for Vulkan FA)"
+Write-Host "  batch/ubatch : $batch / $ubatch   (DO NOT raise — compute buffer OOMs)"
+Write-Host "  MoE offload  : all 30 expert layers on CPU"
+Write-Host "  listen       : http://$host_ip`:$port"
+Write-Host ""
+
+# --- Launch ---
+& $llamaServer `
+    -m $model `
+    --alias "gemma-4-26b" `
+    -ctk $ctk -ctv $ctv `
+    -fa on `
+    -ngl 99 `
+    -ncmoe 30 `
+    -c $ctx `
+    -b $batch -ub $ubatch `
+    --mlock `
+    --no-warmup `
+    --threads $threads `
+    -np 1 `
+    --host $host_ip --port $port `
+    @args